We Built an AI Code Review Bot — Here's What It Actually Catches (And What It Misses)
The Experiment
Six months ago, we built an AI code review bot for a client's engineering team. Not a wrapper around ChatGPT — a purpose-built system that integrates with their GitHub workflow, understands their codebase, and provides structured feedback on every pull request.
The team was skeptical. "AI can't understand our code" was the polite version of the pushback. After six months and 3,400 pull requests, we have real data.
What We Built
The bot runs on every PR and provides five categories of feedback, each with a confidence score:
```typescript
interface CodeReviewResult {
  bugs: Finding[];         // Potential bugs and logic errors
  style: Finding[];        // Style and convention violations
  security: Finding[];     // Security vulnerabilities
  performance: Finding[];  // Performance concerns
  suggestions: Finding[];  // Improvement suggestions
  confidence: number;      // 0-1, how confident the bot is
}

// Only comments with confidence > 0.7 are posted to the PR.
// Lower-confidence findings go to a dashboard for human review.
```

The architecture uses a RAG pipeline with the codebase as context:
PR diff → Chunk into logical changes
→ Retrieve relevant codebase context (similar files, tests, types)
→ Generate structured review with citations
→ Filter by confidence threshold
→ Post to GitHub PR as inline comments
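The confidence-filtering step is the one concrete rule the article spells out, so it's worth sketching. This is illustrative code, not the client's actual implementation; `Finding` and `partitionFindings` are hypothetical names:

```typescript
// Hypothetical sketch of the post-vs-dashboard split described above.
interface Finding {
  message: string;
  category: "bugs" | "style" | "security" | "performance" | "suggestions";
  confidence: number; // 0-1, the model's self-reported confidence
}

const POST_THRESHOLD = 0.7;

// High-confidence findings become inline PR comments;
// the rest go to a dashboard for optional human triage.
function partitionFindings(findings: Finding[]): {
  toPost: Finding[];
  toDashboard: Finding[];
} {
  const toPost = findings.filter((f) => f.confidence > POST_THRESHOLD);
  const toDashboard = findings.filter((f) => f.confidence <= POST_THRESHOLD);
  return { toPost, toDashboard };
}
```

The threshold is the main tuning knob: raise it and the bot gets quieter but misses more; lower it and the false-positive noise discussed below gets worse.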
The Data: What It Catches
After 3,400 PRs, here's the breakdown of findings that humans confirmed as valid:
| Category | Findings | True Positive Rate | Examples |
|---|---|---|---|
| Missing null checks | 312 | 89% | Unhandled optional chaining, missing undefined guards |
| Type mismatches | 187 | 94% | Wrong argument types, missing type assertions |
| Unused imports/vars | 456 | 98% | Dead code that linters sometimes miss in complex cases |
| SQL injection risks | 23 | 78% | String concatenation in queries, missing parameterization |
| Race conditions | 41 | 62% | Async operations without proper locks or ordering |
| Error handling gaps | 198 | 85% | Missing try/catch, swallowed errors, missing error types |
| API contract violations | 89 | 71% | Response shapes that don't match API specs |
| Test coverage gaps | 267 | 82% | Missing edge case tests, untested error paths |
Overall true positive rate: 83%. That means 17% of the bot's comments were false positives — noise that developers had to read and dismiss.
What It Misses
The bot consistently fails at:
1. Business Logic Errors: The bot can't understand that a discount shouldn't apply to already-discounted items because that's a business rule, not a code pattern.
2. Architectural Concerns: "This service is doing too much" or "this should be a separate module" requires understanding system design intent that the bot doesn't have.
3. Performance at Scale: The bot catches obvious N+1 queries but misses subtler issues like "this works fine at 100 records but will time out at 100K."
4. UX Implications: Code that's technically correct but creates a poor user experience (loading states, error messages, accessibility) is invisible to the bot.
5. Context-Dependent Decisions: "We chose this approach because of X constraint" — the bot often suggests refactors that ignore historical context or business constraints.
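The N+1 distinction in point 3 is worth making concrete. Below is the classic shape the bot flags reliably, next to the batched version it suggests — using in-memory stand-ins for the database, so all names and data here are illustrative. What the bot can't tell you is whether the "fixed" version still falls over at 100K rows:

```typescript
// Illustrative N+1 pattern. In real code the bot flags this in ORM/SQL usage.
type User = { id: number; teamId: number };
type Team = { id: number; name: string };

const teams: Team[] = [{ id: 1, name: "core" }, { id: 2, name: "infra" }];
let queryCount = 0; // stand-in for actual database round trips

function fetchTeam(id: number): Team | undefined {
  queryCount++; // one "query" per call
  return teams.find((t) => t.id === id);
}

function fetchTeamsByIds(ids: number[]): Team[] {
  queryCount++; // one batched "query" for the whole list
  return teams.filter((t) => ids.includes(t.id));
}

const users: User[] = [
  { id: 1, teamId: 1 },
  { id: 2, teamId: 2 },
  { id: 3, teamId: 1 },
];

// N+1: one query per user — fine at 3 users, painful at 100K.
const perUser = users.map((u) => fetchTeam(u.teamId));

// Batched: one query for all distinct team ids.
const batched = fetchTeamsByIds([...new Set(users.map((u) => u.teamId))]);
```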
The Impact on Team Velocity
Before AI Code Review:
Average PR review time: 4.2 hours (time to first human review)
Average review cycles: 2.3 rounds
PRs merged per dev per week: 3.1
Bug escape rate: 8.2% (bugs found in staging/production)
After AI Code Review (6 months):
Average PR review time: 1.8 hours (-57%)
Average review cycles: 1.6 rounds (-30%)
PRs merged per dev per week: 4.7 (+52%)
Bug escape rate: 5.1% (-38%)
The biggest win wasn't catching bugs — it was reducing the first review cycle. The bot catches the mechanical issues (null checks, types, error handling) so human reviewers can focus on architecture, logic, and design.
The Cost
Monthly Cost Breakdown:
LLM API calls (Claude/GPT-4): $1,200
Embedding/RAG infrastructure: $300
GitHub Actions compute: $150
Engineering maintenance: ~8 hours/month
Total: ~$2,400/month
ROI Calculation:
Developer time saved: ~40 hours/month (across 12-person team)
At $100/hour fully loaded: $4,000/month saved
Bug escape reduction: ~$2,000/month (estimated incident cost savings)
Net value: ~$3,600/month positive
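The net-value figure is straightforward arithmetic on the rounded numbers above (maintenance hours are folded into the ~$2,400 total):

```typescript
// Back-of-envelope ROI check using the article's rounded figures.
const monthlyCost = 2400;     // API + infra + CI + maintenance, per the breakdown
const timeSavings = 40 * 100; // 40 dev-hours/month saved at $100/hour fully loaded
const incidentSavings = 2000; // estimated cost of bugs that no longer escape
const netValue = timeSavings + incidentSavings - monthlyCost; // $3,600/month
```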
The Honest Assessment
Worth building? Yes, but only if your team is large enough (8+ engineers) to justify the maintenance overhead.
Replace human reviewers? No. The bot handles ~40% of review feedback (the mechanical stuff). Humans are still essential for the 60% that requires judgment.
Build or buy? For most teams: buy. Tools like CodeRabbit, Sourcery, and GitHub Copilot code review have gotten good. We built custom because the client needed deep codebase context and custom rules. Unless you have specific requirements, start with an off-the-shelf solution.
The 17% false positive problem: This is the biggest risk. If the false positive rate creeps above 20%, developers start ignoring all bot comments. You need active tuning and a feedback loop where developers can thumbs-down bad suggestions.
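One way to operationalize that feedback loop is to track thumbs-up/thumbs-down verdicts and alert when the rolling false-positive rate crosses the 20% danger line. A minimal sketch — the `FeedbackTracker` class and its method names are hypothetical, only the 20% threshold comes from the article:

```typescript
// Hypothetical tracker for the developer thumbs-up/thumbs-down loop.
// If the false-positive rate crosses 20%, flag the bot for re-tuning
// before developers start ignoring its comments wholesale.
class FeedbackTracker {
  private confirmed = 0; // thumbs-up: finding was real
  private dismissed = 0; // thumbs-down: false positive

  recordFeedback(wasValid: boolean): void {
    if (wasValid) this.confirmed++;
    else this.dismissed++;
  }

  falsePositiveRate(): number {
    const total = this.confirmed + this.dismissed;
    return total === 0 ? 0 : this.dismissed / total;
  }

  needsRetuning(threshold = 0.2): boolean {
    return this.falsePositiveRate() > threshold;
  }
}
```

In practice you would want this per category: a 17% false-positive rate overall can hide a 38% rate in one category (race conditions, in our data) that poisons trust in the rest.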
The AI code review bot isn't a replacement for engineering culture. It's a force multiplier for teams that already have good review practices. If your team doesn't review code at all, a bot won't fix that. If your team reviews code well but slowly, a bot can make them faster.
Build the culture first. Then automate the mechanical parts.