We Built an AI Code Review Bot — Here's What It Actually Catches (And What It Misses)
The Experiment
Six months ago, we built an AI code review bot for a client's engineering team. Not a wrapper around ChatGPT — a purpose-built system that integrates with their GitHub workflow, understands their codebase, and provides structured feedback on every pull request.
The team was skeptical. "AI can't understand our code" was the polite version of the pushback. After six months and 3,400 pull requests, we have real data.
What We Built
The bot runs on every PR and provides five categories of feedback, each with a confidence score:
```typescript
interface CodeReviewResult {
  bugs: Finding[];         // Potential bugs and logic errors
  style: Finding[];        // Style and convention violations
  security: Finding[];     // Security vulnerabilities
  performance: Finding[];  // Performance concerns
  suggestions: Finding[];  // Improvement suggestions
  confidence: number;      // 0-1, how confident the bot is
}

// Only comments with confidence > 0.7 are posted to the PR.
// Lower-confidence findings go to a dashboard for human review.
```

The architecture uses a RAG pipeline with the codebase as context:
PR diff → Chunk into logical changes
→ Retrieve relevant codebase context (similar files, tests, types)
→ Generate structured review with citations
→ Filter by confidence threshold
→ Post to GitHub PR as inline comments
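The confidence-filtering step is the one concrete rule the article spells out, so it's worth sketching. This is illustrative code, not the client's actual implementation; `Finding` and `partitionFindings` are hypothetical names:

```typescript
// Hypothetical sketch of the post-vs-dashboard split described above.
interface Finding {
  message: string;
  category: "bugs" | "style" | "security" | "performance" | "suggestions";
  confidence: number; // 0-1, the model's self-reported confidence
}

const POST_THRESHOLD = 0.7;

// High-confidence findings become inline PR comments;
// the rest go to a dashboard for optional human triage.
function partitionFindings(findings: Finding[]): {
  toPost: Finding[];
  toDashboard: Finding[];
} {
  const toPost = findings.filter((f) => f.confidence > POST_THRESHOLD);
  const toDashboard = findings.filter((f) => f.confidence <= POST_THRESHOLD);
  return { toPost, toDashboard };
}
```

The threshold is the main tuning knob: raise it and the bot gets quieter but misses more; lower it and the false-positive noise discussed below gets worse.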
The Data: What It Catches
After 3,400 PRs, here's the breakdown of findings that humans confirmed as valid:
| Category | Findings | True Positive Rate | Examples |
|---|---|---|---|
| Missing null checks | 312 | 89% | Unhandled optional chaining, missing undefined guards |
| Type mismatches | 187 | 94% | Wrong argument types, missing type assertions |
| Unused imports/vars | 456 | 98% | Dead code that linters sometimes miss in complex cases |
| SQL injection risks | 23 | 78% | String concatenation in queries, missing parameterization |
| Race conditions | 41 | 62% | Async operations without proper locks or ordering |
| Error handling gaps | 198 | 85% | Missing try/catch, swallowed errors, missing error types |
| API contract violations | 89 | 71% | Response shapes that don't match API specs |
| Test coverage gaps | 267 | 82% | Missing edge case tests, untested error paths |
Overall true positive rate: 83%. That means 17% of the bot's comments were false positives — noise that developers had to read and dismiss.
What It Misses
The bot consistently fails at:
1. Business Logic Errors: The bot can't understand that a discount shouldn't apply to already-discounted items because that's a business rule, not a code pattern.
2. Architectural Concerns: "This service is doing too much" or "this should be a separate module" requires understanding system design intent that the bot doesn't have.
3. Performance at Scale: The bot catches obvious N+1 queries but misses subtler issues like "this works fine at 100 records but will time out at 100K."
4. UX Implications: Code that's technically correct but creates a poor user experience (loading states, error messages, accessibility) is invisible to the bot.
5. Context-Dependent Decisions: "We chose this approach because of X constraint" — the bot often suggests refactors that ignore historical context or business constraints.
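The N+1 distinction in point 3 is worth making concrete. Below is the classic shape the bot flags reliably, next to the batched version it suggests — using in-memory stand-ins for the database, so all names and data here are illustrative. What the bot can't tell you is whether the "fixed" version still falls over at 100K rows:

```typescript
// Illustrative N+1 pattern. In real code the bot flags this in ORM/SQL usage.
type User = { id: number; teamId: number };
type Team = { id: number; name: string };

const teams: Team[] = [{ id: 1, name: "core" }, { id: 2, name: "infra" }];
let queryCount = 0; // stand-in for actual database round trips

function fetchTeam(id: number): Team | undefined {
  queryCount++; // one "query" per call
  return teams.find((t) => t.id === id);
}

function fetchTeamsByIds(ids: number[]): Team[] {
  queryCount++; // one batched "query" for the whole list
  return teams.filter((t) => ids.includes(t.id));
}

const users: User[] = [
  { id: 1, teamId: 1 },
  { id: 2, teamId: 2 },
  { id: 3, teamId: 1 },
];

// N+1: one query per user — fine at 3 users, painful at 100K.
const perUser = users.map((u) => fetchTeam(u.teamId));

// Batched: one query for all distinct team ids.
const batched = fetchTeamsByIds([...new Set(users.map((u) => u.teamId))]);
```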
The Impact on Team Velocity
Before AI Code Review:
Average PR review time: 4.2 hours (time to first human review)
Average review cycles: 2.3 rounds
PRs merged per dev per week: 3.1
Bug escape rate: 8.2% (bugs found in staging/production)
After AI Code Review (6 months):
Average PR review time: 1.8 hours (-57%)
Average review cycles: 1.6 rounds (-30%)
PRs merged per dev per week: 4.7 (+52%)
Bug escape rate: 5.1% (-38%)
The biggest win wasn't catching bugs — it was reducing the first review cycle. The bot catches the mechanical issues (null checks, types, error handling) so human reviewers can focus on architecture, logic, and design.
The Cost
Monthly Cost Breakdown:
LLM API calls (Claude/GPT-4): $1,200
Embedding/RAG infrastructure: $300
GitHub Actions compute: $150
Engineering maintenance: ~8 hours/month
Total: ~$2,400/month
ROI Calculation:
Developer time saved: ~40 hours/month (across 12-person team)
At $100/hour fully loaded: $4,000/month saved
Bug escape reduction: ~$2,000/month (estimated incident cost savings)
Net value: ~$3,600/month positive
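The net-value figure is straightforward arithmetic on the rounded numbers above (maintenance hours are folded into the ~$2,400 total):

```typescript
// Back-of-envelope ROI check using the article's rounded figures.
const monthlyCost = 2400;     // API + infra + CI + maintenance, per the breakdown
const timeSavings = 40 * 100; // 40 dev-hours/month saved at $100/hour fully loaded
const incidentSavings = 2000; // estimated cost of bugs that no longer escape
const netValue = timeSavings + incidentSavings - monthlyCost; // $3,600/month
```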
The Honest Assessment
Worth building? Yes, but only if your team is large enough (8+ engineers) to justify the maintenance overhead.
Replace human reviewers? No. The bot handles ~40% of review feedback (the mechanical stuff). Humans are still essential for the 60% that requires judgment.
Build or buy? For most teams: buy. Tools like CodeRabbit, Sourcery, and GitHub Copilot code review have gotten good. We built custom because the client needed deep codebase context and custom rules. Unless you have specific requirements, start with an off-the-shelf solution.
The 17% false positive problem: This is the biggest risk. If the false positive rate creeps above 20%, developers start ignoring all bot comments. You need active tuning and a feedback loop where developers can thumbs-down bad suggestions.
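One way to operationalize that feedback loop is to track thumbs-up/thumbs-down verdicts and alert when the rolling false-positive rate crosses the 20% danger line. A minimal sketch — the `FeedbackTracker` class and its method names are hypothetical, only the 20% threshold comes from the article:

```typescript
// Hypothetical tracker for the developer thumbs-up/thumbs-down loop.
// If the false-positive rate crosses 20%, flag the bot for re-tuning
// before developers start ignoring its comments wholesale.
class FeedbackTracker {
  private confirmed = 0; // thumbs-up: finding was real
  private dismissed = 0; // thumbs-down: false positive

  recordFeedback(wasValid: boolean): void {
    if (wasValid) this.confirmed++;
    else this.dismissed++;
  }

  falsePositiveRate(): number {
    const total = this.confirmed + this.dismissed;
    return total === 0 ? 0 : this.dismissed / total;
  }

  needsRetuning(threshold = 0.2): boolean {
    return this.falsePositiveRate() > threshold;
  }
}
```

In practice you would want this per category: a 17% false-positive rate overall can hide a 38% rate in one category (race conditions, in our data) that poisons trust in the rest.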
The AI code review bot isn't a replacement for engineering culture. It's a force multiplier for teams that already have good review practices. If your team doesn't review code at all, a bot won't fix that. If your team reviews code well but slowly, a bot can make them faster.
Build the culture first. Then automate the mechanical parts.