Code review agents went from demos to a default CI check in 18 months. Eight tools now compete for a spot in your pull-request pipeline. We ran each one against 200 PRs with known bugs from a real codebase. Here is how they scored.
How we tested
200 real merged PRs from a mid-size Node/TypeScript codebase. Each PR carried at least one known issue we had caught in manual review. We measured:
- True positive rate: the share of known issues the tool flagged.
- Noise: tool comments per PR, excluding the true positives.
- Large-diff behaviour: did it give up, truncate, or chunk properly?
- Cost per PR.
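As a rough illustration of the first two metrics, here is a minimal sketch of how they could be computed from a labelled test set. The `ReviewedPR` record and its field names are hypothetical, and the exact formulas are one plausible reading of the definitions above rather than the scoring script used for the scorecard.

```python
from dataclasses import dataclass


@dataclass
class ReviewedPR:
    """One PR in the test set (hypothetical field names)."""
    known_issues: int            # issues caught earlier in manual review
    issues_flagged: int          # known issues the tool also flagged
    comments: int                # total comments the tool left on the PR
    true_positive_comments: int  # comments that pointed at a real issue


def true_positive_rate(prs: list[ReviewedPR]) -> float:
    """Share of known issues that the tool flagged, across all PRs."""
    total_known = sum(pr.known_issues for pr in prs)
    if total_known == 0:
        return 0.0
    return sum(pr.issues_flagged for pr in prs) / total_known


def noise_per_pr(prs: list[ReviewedPR]) -> float:
    """Average number of comments per PR that were not true positives."""
    if not prs:
        return 0.0
    noisy = sum(pr.comments - pr.true_positive_comments for pr in prs)
    return noisy / len(prs)


# Made-up sample data, only to show the arithmetic.
sample = [
    ReviewedPR(known_issues=2, issues_flagged=1, comments=5, true_positive_comments=1),
    ReviewedPR(known_issues=1, issues_flagged=1, comments=2, true_positive_comments=1),
]
print(f"TP rate: {true_positive_rate(sample):.0%}")  # 2 of 3 known issues flagged
print(f"Noise per PR: {noise_per_pr(sample):.1f}")   # (4 + 1) noisy comments over 2 PRs
```

Aggregating across all PRs, rather than averaging per-PR rates, keeps a PR with many known issues from counting the same as a PR with one.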
The 8 tools, ranked
1. CodeRabbit
Top true-positive rate in our test. Handles large diffs by chunking. Comments feel like a thoughtful reviewer, not a linter.
- Strengths: signal-to-noise, diff-aware, rich CLI.
- Weaknesses: can be chatty on design PRs.
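For readers unfamiliar with the term, "chunking" a large diff can be sketched roughly as follows. This is a generic illustration, not CodeRabbit's actual implementation, which is not described here. The function names and the character budget are assumptions; the only thing taken as given is that Git's unified diff output starts each file section with a `diff --git` line.

```python
def split_into_file_sections(diff_text: str) -> list[str]:
    """Split a unified diff into one string per file section."""
    sections: list[str] = []
    current: list[str] = []
    for line in diff_text.splitlines(keepends=True):
        # A new "diff --git" header closes the previous file section.
        if line.startswith("diff --git ") and current:
            sections.append("".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("".join(current))
    return sections


def chunk_diff(diff_text: str, max_chars: int = 12_000) -> list[str]:
    """Group whole file sections into chunks of at most max_chars.

    A single file section larger than the budget becomes its own
    oversized chunk instead of being cut in the middle.
    """
    chunks: list[str] = []
    current = ""
    for section in split_into_file_sections(diff_text):
        if current and len(current) + len(section) > max_chars:
            chunks.append(current)
            current = ""
        current += section
    if current:
        chunks.append(current)
    return chunks
```

Splitting on file boundaries means each chunk can be reviewed on its own without losing hunks from the same file, which is the behaviour the large-diff criterion was probing for.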
2. Greptile
Whole-repo context lets it catch cross-file issues others miss. It is slower on the first PR of a repo because it has to index the codebase first.
- Strengths: architectural nits, cross-file reasoning.
- Weaknesses: indexing cost; smaller IDE footprint.
3. Claude GitHub App
First-party from Anthropic. Good default prompts, integrates with Claude Code. Tuned for explainer-style feedback.
- Strengths: quality of explanation, no vendor overlay.
- Weaknesses: fewer guardrails than purpose-built SaaS.
4. Graphite Reviewer
Best PR UX overall. The review sits alongside Graphite’s stacked-diff interface, which agents love.
- Strengths: stacked diffs, fast UI.
- Weaknesses: coupled to Graphite’s workflow.
5. Cursor review
Local agent that reviews before you push. Zero-latency feedback, natural for teams already on Cursor.
- Strengths: pre-push feedback, IDE-native.
- Weaknesses: not a gate in CI by default.
6. Sourcery
Leans lint-like. Strong on refactor suggestions, weaker on architectural issues.
- Strengths: consistent style suggestions.
- Weaknesses: comment fatigue on large PRs.
7. Copilot Review
GitHub-native, frictionless for teams already on GitHub Copilot.
- Strengths: zero-setup if you already have Copilot.
- Weaknesses: middling signal-to-noise.
8. DeepCode (Snyk Code)
Security-first. Not a general reviewer, but it catches vulnerability-class bugs that others miss.
- Strengths: known-vuln patterns.
- Weaknesses: narrow scope on general reviews.
Scorecard
| Tool | TP rate | Noise | Large diff | Cost/PR |
|---|---|---|---|---|
| CodeRabbit | 71% | Low | Good | $$ |
| Greptile | 68% | Low | Excellent | $$$ |
| Claude GH App | 64% | Medium | Good | $$ |
| Graphite Reviewer | 62% | Low | Good | $$ |
| Cursor review | 58% | Medium | Fair | $ |
| Sourcery | 54% | High | Fair | $ |
| Copilot Review | 50% | Medium | Good | $$ |
| DeepCode | 47%* | Low | Good | $$ |
*DeepCode’s TP rate is high inside its security niche, but it does not engage with non-security issues.
What separates the top 3
- Whole-repo context, not just the diff.
- Model choice aligned with task complexity (Opus for architecture, Haiku for style).
- Actionable comments with suggested fix, not questions.
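The model-routing point can be pictured with a small sketch. Everything here is an assumption for illustration: the `PullRequest` fields, the size thresholds, and the classification heuristic are made up, and `"opus"` and `"haiku"` are used as tier labels only, not as API model identifiers. None of the tools above is known to route this way.

```python
from dataclasses import dataclass


@dataclass
class PullRequest:
    """Hypothetical summary of a PR's size."""
    files_changed: int
    lines_changed: int


# Tier labels only; these are not API model identifiers.
TIER_FOR_TASK = {
    "architecture": "opus",  # heavier reasoning across files
    "style": "haiku",        # cheap, fast, local checks
}


def classify_task(pr: PullRequest, max_files: int = 3, max_lines: int = 80) -> str:
    """Crude heuristic (an assumption): wide or large PRs get an architecture pass."""
    if pr.files_changed > max_files or pr.lines_changed > max_lines:
        return "architecture"
    return "style"


def pick_tier(pr: PullRequest) -> str:
    """Map a PR to the model tier that matches its task complexity."""
    return TIER_FOR_TASK[classify_task(pr)]


print(pick_tier(PullRequest(files_changed=12, lines_changed=640)))  # opus
print(pick_tier(PullRequest(files_changed=1, lines_changed=14)))    # haiku
```

A real router would likely look at what changed, not just how much, but the cost logic is the same: spend the expensive model only where the cheap one is likely to miss something.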
Rolling out without reviewer fatigue
- Start in advisory mode, not gating.
- Track comment accuracy by author thumbs-up rate; alert when it drops below 60%.
- Give authors a "dismiss all" button; the agent must earn the next comment on the next PR.
- Review rules every quarter; trim what consistently gets dismissed.
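The tracking and trimming steps above can be sketched in a few lines. The `CommentFeedback` record, the reaction labels, and the trimming thresholds are hypothetical; only the 60% alert level comes from the list. The rate here divides thumbs-up reactions by all agent comments, which is one possible definition among several.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class CommentFeedback:
    """One agent comment and how the PR author reacted (hypothetical shape)."""
    rule: str      # identifier of the review rule that produced the comment
    reaction: str  # "up", "dismissed", or "none"


ALERT_THRESHOLD = 0.60  # alert level taken from the rollout list


def thumbs_up_rate(feedback: list[CommentFeedback]) -> Optional[float]:
    """Thumbs-up reactions divided by all comments; None if there is no data."""
    if not feedback:
        return None
    ups = sum(1 for f in feedback if f.reaction == "up")
    return ups / len(feedback)


def should_alert(feedback: list[CommentFeedback]) -> bool:
    """True when there is data and the thumbs-up rate is below the threshold."""
    rate = thumbs_up_rate(feedback)
    return rate is not None and rate < ALERT_THRESHOLD


def rules_to_trim(
    feedback: list[CommentFeedback],
    min_comments: int = 5,      # assumption: ignore rules with too little data
    dismiss_rate: float = 0.5,  # assumption: "consistently" means at least half
) -> list[str]:
    """Rules whose comments are dismissed at or above dismiss_rate."""
    totals = Counter(f.rule for f in feedback)
    dismissed = Counter(f.rule for f in feedback if f.reaction == "dismissed")
    return sorted(
        rule
        for rule, n in totals.items()
        if n >= min_comments and dismissed[rule] / n >= dismiss_rate
    )


# Made-up feedback for one review period.
feedback = (
    [CommentFeedback("naming", "dismissed")] * 5
    + [CommentFeedback("naming", "up")]
    + [CommentFeedback("null-check", "up")] * 4
)
print(thumbs_up_rate(feedback))  # 5 thumbs-up out of 10 comments
print(should_alert(feedback))    # below the 60% threshold
print(rules_to_trim(feedback))   # "naming" is dismissed 5 times out of 6
```

The minimum-sample guard keeps a rule from being trimmed on the strength of one or two dismissals, which matters most in the first weeks of advisory mode.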
What is coming next
Three shifts in 2027 worth watching: fix-generating agents (write the patch, not the comment), repo-wide refactor agents scheduled off-hours, and policy agents that block merges against compliance rules.