In the late 90s and 2000s, the open source movement — and Linux kernel development in particular — normalized the idea that all code should be read by someone else before it gets merged. Over the past two decades, code reviews became one of those table-stakes things that serious engineering teams just do. We wrote a complete guide to code reviews a few years ago, and the reasoning behind them felt settled: share knowledge, spread ownership, keep the codebase consistent, and catch issues early.
That reasoning assumed humans were writing all the code. Now, they’re not. And I’m encountering more and more organizations where at least some, if not most, of the code being committed has never been read by a person.
Some of these rather agentic organizations are running enterprise-scale production systems, passing SOC 2 audits, and iterating fast. Meanwhile, the teams still reviewing everything that comes out of the AI firehose are feeling the squeeze.
AI generates code faster than humans can read it, and the time you saved by not writing the code yourself may well get spent reviewing what the AI produced. More than that, though, AI code review tools are getting better in quality all the time, and especially when stacked together, they can catch most defects.
This post won’t argue that your team should drop code reviews entirely, at once. But it will ask which parts of the process are still doing what you think they’re doing, which are starting to break down, and what an alternative might look like. And it’ll keep coming back to one question that I think sits underneath this whole conversation:
If today’s AI is digging you into a hole, can tomorrow’s AI dig you out of it?
Before deciding what to change about your review process, you need to be specific about what code reviews are actually doing for your team. In our guide to code reviews, we broke it down into four goals: sharing knowledge, spreading ownership, unifying development practices, and quality control.
AI-generated code puts pressure on all four of these goals, though not equally and not in the same ways. Quality control is where most of the current conversation lives, and where the rest of this post will spend its time. But first, let’s go through the other three.
Knowledge sharing is one of the most-cited reasons for doing code reviews. When you review a colleague’s PR, you build a mental model of how the system works — where things live, why certain decisions were made, how the pieces connect. But if code changes faster than humans can review it, that model-building breaks down. A ThoughtWorks retreat on the future of software development raised several alternatives: weekly architecture retrospectives, ensemble programming, and AI-assisted comprehension tools that generate system overviews on demand. Whether any of those can replace the slow, ambient learning that happens through reviews is up for…review.
Spreading ownership gets less attention in the current debate, but it might be the hardest piece to replace. In a traditional review, the reviewer stamps “yep, good approach” and from that point on, both the reviewer and the author share responsibility for the decision.
It’s not the same with AI code review. The bot might flag bugs, but nobody is putting their name next to an opinion. And on the author side: when you write code yourself, your deliberate choices tie you to the outcome. If something’s poorly designed, that reflects on you, and that accountability is part of what drives quality.
When AI writes the code, you’re still responsible for shipping it, but you’re less likely to build the intuition that comes from working through the problem. That makes it harder to own the code over time, especially when something breaks six months later and it turns out that “it was something Claude implemented”. Team-level ownership (the kind that keeps people from siloing into “their” part of the codebase) is also harder to maintain when reviews aren’t spreading context across the team.
Unifying development practices is where things get extra interesting. Every engineer has their own tendencies, and code reviews have traditionally been how teams narrow the gap between individual styles. Can AGENTS.md files and shared coding agent skills do the same thing? Possibly, for low-level consistency like naming conventions and code structure. But reviews also served as a place where senior engineers pushed back on approaches that technically worked but would create problems down the line. Of course, AI can push back on architectural decisions too, if you’ve defined your architectural boundaries somewhere it can read them. But most teams haven’t, yet.
The volume of code being produced looks wildly different than it did even a year ago. Across a sample of 1,450+ engineering organizations in Swarmia, median batch size — the number of lines changed per PR — roughly doubled between Q1 2025 and Q1 2026. Among organizations with at least 1,000 PRs in each period, the median grew by 97.5%; include small teams (100+ PRs) and the number jumps to 109%. And the growth is accelerating: most of that increase happened in the last six months, with batch size climbing roughly 2.5x faster from October onward than it had in the months before. It’s hard not to connect the inflection point to the wave of AI coding tools that hit mainstream adoption in late 2025.
At the same time, CodeRabbit’s analysis of 470 open-source PRs found roughly 1.7 times more issues in AI-coauthored PRs than in human-written ones. So yes, there’s a whole lot more code, and there’s a whole load more bugs to catch.
Engineers also report that reviewing AI-generated code (today, it’s safe to assume that most PRs are at least partly, if not fully, AI-generated) takes more effort than reviewing a colleague’s work. When you review a PR from someone on your team, you have context: you know how they think, what they tend to get right, where they might cut corners or have weaknesses. It’s not like that with AI-generated code.
One of our engineers put it well: you used to be able to trust that a certain coworker had a quality stamp on their work. Now, even an exceptional programmer might use AI for something seemingly small. And since AI detaches the person from the output, it’s harder to know what you’re looking at — or how much attention the author paid to what the AI produced. Plus, the reviewer of your AI-generated PR can inadvertently end up doing a bunch of quality gate work, which is not their job. Or at least it shouldn’t be.
There’s already precedent for not reading machine-generated output. Nobody inspects the assembly code a compiler produces, because compilers are deterministic. They do the same thing every time, and we’ve had decades to verify that. But AI-generated code isn’t there yet, not by a long shot. So perhaps at this point, we should be trying to figure out what needs to be true before humans can stop reviewing code.
When people say code reviews catch quality issues, it helps to be specific about what kind of quality they mean. There are two dimensions to it, and they’re very different problems.
The first dimension is whether the code works today. Does the code have bugs? Does it pass the tests? Does it handle edge cases? Are there security vulnerabilities? AI code review tools are already decent at catching many of these, and arguably better than human reviewers for certain classes of issues. In practice, most teams already rely on CI to catch the things that actually break production. Investing in better continuous integration that catches more of those issues might be a better use of time than reviewing every line.
The second dimension, whether the code will stay easy to live with, is the harder question and the one with higher long-term stakes. Even if something technically works today, will it come back to haunt you in six months? Catching that kind of problem has traditionally required experience, context, and taste in the form of a senior engineer who looks at a PR and says “yeah, this works, but it’s going to make our lives harder later.”
There’s some proper evidence that AI is making this problem worse too. A Carnegie Mellon study of 807 GitHub repos found that Cursor adoption increased cognitive complexity by roughly 41% and static analysis warnings by about 30%. The complexity stuck around even as teams got more familiar with the tools.
And this is where that central question of this post comes into focus. If AI-generated code is accumulating complexity faster than your team can manage it, you’re making a bet: that future models will be capable enough to untangle whatever the current ones are creating. Maybe they will be. But if your codebase becomes an irrecoverable mess, that’s a business problem and not just a technical one. How much of that bet are you comfortable with — and whose fault is it if the bet doesn’t pay off?
Regardless of where you land, one thing doesn’t change: your job is to deliver code you’ve proven works, and to include proof that it works. Almost anyone can prompt an LLM to produce a thousand-line patch. What’s important is whether the person submitting it can demonstrate that it does what it’s supposed to do. The proof doesn’t have to be a human reading every line, but it has to be something.
When you want to ask “should a human still review this?”, you need to think about what “this” is. Changing the color of a button is different from modifying authentication logic. A bug in a mobile game, while bad for the user experience, is not the same as a bug in medical device software.
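The distinction between a button color and authentication logic can be written down as a rule instead of being re-decided on every PR. Here is a minimal sketch of path-based triage, assuming hypothetical path patterns and a hypothetical size threshold; none of these values come from a real policy, and each team would need to choose its own.

```python
from dataclasses import dataclass
from fnmatch import fnmatch

# Hypothetical patterns for illustration only; a real team defines its own.
HIGH_RISK_PATTERNS = ["*/auth/*", "*/billing/*", "*/migrations/*"]
LOW_RISK_PATTERNS = ["*.css", "*.md"]


@dataclass
class Change:
    path: str
    lines_changed: int


def needs_human_review(changes: list[Change], max_unreviewed_lines: int = 200) -> bool:
    """Decide whether a set of changes should be routed to a human reviewer.

    A human is required when any file matches a high-risk pattern, or when
    the change is not both clearly low-risk and small.
    """
    # Any touch on a high-risk area always goes to a person.
    if any(fnmatch(c.path, p) for c in changes for p in HIGH_RISK_PATTERNS):
        return True

    # Otherwise, skip human review only if every file is low-risk...
    all_low_risk = all(
        any(fnmatch(c.path, p) for p in LOW_RISK_PATTERNS) for c in changes
    )
    # ...and the total change is small enough to stay cheap to revert.
    total_lines = sum(c.lines_changed for c in changes)
    return not (all_low_risk and total_lines <= max_unreviewed_lines)
```

The rule defaults to human review for anything it does not recognize, so a new directory or file type is reviewed until someone deliberately classifies it as low-risk.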
A few questions can help you triage the problem:
If not human code review for every PR, then what? The Swiss cheese model, perhaps: no single gate catches everything, so you stack imperfect filters in different ways until each layer covers the holes in the others.
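The arithmetic behind the Swiss cheese model is easy to sketch. The miss rates below are made-up numbers for illustration, not measurements, and the calculation assumes each layer misses defects independently of the others, which is optimistic.

```python
from math import prod


def escape_rate(miss_rates: list[float]) -> float:
    """Probability that a defect slips past every layer.

    Assumes each layer misses defects independently of the others.
    """
    return prod(miss_rates)


# Hypothetical miss rates: the fraction of defects each layer fails to catch.
layers = {
    "static analysis": 0.6,
    "automated tests": 0.4,
    "AI code review": 0.5,
    "production monitoring": 0.7,
}

combined = escape_rate(list(layers.values()))
print(f"{combined:.3f}")  # 0.084
```

Under these assumptions, four layers that each miss 40% to 70% of defects let fewer than one in ten through when combined. The independence assumption is the weak point: layers that share blind spots, such as several AI reviewers built on similar models, have holes that line up, and the real escape rate will be higher than the product suggests.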
Some of this is table stakes already: static analysis (linters, type checkers, SonarQube, dependency scanners), automated tests, AI code review tools, and production monitoring. You really should already have these. They do matter more when AI is writing the code, but they’re not new.
Test-driven development (TDD) matters a lot more now, as well. When tests exist before the code, agents can’t cheat by writing a test that confirms whatever broken implementation they produced. If you’re not writing tests first, you should at least have your agent roll back the implementation to observe a failing test.
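The roll-back check can be sketched in a few lines. This is a toy illustration, assuming the old and new implementations are both available as plain functions; the function names and the price-rounding example are hypothetical.

```python
from typing import Callable


def fails(check: Callable[[Callable], None], impl: Callable) -> bool:
    """Return True if the check raises AssertionError against this implementation."""
    try:
        check(impl)
    except AssertionError:
        return True
    return False


def proves_the_change(check: Callable, old_impl: Callable, new_impl: Callable) -> bool:
    """A check is meaningful only if it fails on the rolled-back code
    and passes on the new code."""
    return fails(check, old_impl) and not fails(check, new_impl)


# Hypothetical change: prices should now be rounded to cents.
def old_round_price(x: float) -> float:
    return x  # behaviour before the change: no rounding


def new_round_price(x: float) -> float:
    return round(x, 2)


def check_rounds_to_cents(impl: Callable[[float], float]) -> None:
    assert impl(1.239) == 1.24


def vacuous_check(impl: Callable[[float], float]) -> None:
    assert impl(1.0) == 1.0  # passes with or without the change


print(proves_the_change(check_rounds_to_cents, old_round_price, new_round_price))  # True
print(proves_the_change(vacuous_check, old_round_price, new_round_price))  # False
```

In a repository, the same check amounts to running the new test against the code as it was before the change, where it should fail, and again after the change, where it should pass. A test that passes in both states says nothing about the change it claims to cover.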
On the AI code review side, running multiple reviewers in parallel helps — some teams run four or five on every PR. They’re good at catching defects. They’re not yet taking a stance on whether your abstractions make sense or your architecture is heading somewhere bad.
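Running several reviewers on one PR produces overlapping comments, so their output needs merging before a person looks at it. Below is a minimal sketch, assuming each reviewer's output has already been normalized into a simple record. The `Finding` shape and reviewer names are hypothetical and do not match any specific tool's API.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    reviewer: str  # hypothetical reviewer identifier
    path: str
    line: int
    category: str  # e.g. "bug", "security", "style"


def merge_findings(findings: list[Finding]) -> list[tuple[tuple[str, int, str], set[str]]]:
    """Group findings by (path, line, category) and rank by reviewer agreement.

    Returns (location, reviewers) pairs, with the most agreed-upon first.
    """
    by_location: dict[tuple[str, int, str], set[str]] = defaultdict(set)
    for f in findings:
        # A set means a reviewer repeating itself counts only once.
        by_location[(f.path, f.line, f.category)].add(f.reviewer)
    # Sort by descending agreement, then by location for a stable order.
    return sorted(by_location.items(), key=lambda kv: (-len(kv[1]), kv[0]))


findings = [
    Finding("reviewer_a", "src/auth/login.py", 42, "bug"),
    Finding("reviewer_b", "src/auth/login.py", 42, "bug"),
    Finding("reviewer_a", "src/ui/button.tsx", 7, "style"),
]
merged = merge_findings(findings)
```

Ranking by agreement is only one possible choice. A finding raised by a single reviewer can still be the one that matters, so this ordering is a way to prioritize attention, not a filter.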
Beyond the baseline, there’s also newer stuff:
These layers stacked together can do a lot of what code review was supposed to do — and for catching defects specifically, probably do it better.
The process we have today was designed for a world where humans wrote all the code and humans reviewed all the code. That world is going away, probably faster than most of us expected.
I don’t think the answer is to drop code reviews tomorrow. But I also don’t think the answer is to keep doing what we’ve been doing and hope the volume problem sorts itself out.
The teams I’ve been talking to who are handling this well are:
A year from now, these teams won’t have to regret taking on mountains of tech debt or complexity by dropping code reviews on a whim. Nor will they need to regret that competitors flew past them with the help of AI. Rather, they’ll get to maximize the benefits of AI code reviews while addressing the biggest risks.
None of us knows how much is left for humans to review once the tools are good enough to replace this part of the job. Perhaps less than most of us expect.
