Should humans still review all your code?

In the late 90s and 2000s, the open source movement — and Linux kernel development in particular — normalized the idea that all code should be read by someone else before it gets merged. Over the past two decades, code reviews became one of those table-stakes things that serious engineering teams just do. We wrote a complete guide to code reviews a few years ago, and the reasoning behind them felt settled: share knowledge, spread ownership, keep the codebase consistent, and catch issues early.

That reasoning assumed humans were writing all the code. Now, they’re not. And I’m encountering more and more organizations where at least some, if not most, of the code being committed has never been read by a person.

Some of these heavily agentic organizations are running enterprise-scale production systems, passing SOC 2 audits, and iterating fast. Meanwhile, the teams still reviewing everything that comes out of the AI firehose are feeling the squeeze.

AI generates code faster than humans can read it, and the time you saved by not writing the code yourself may well get spent reviewing what the AI produced. More than that, though, AI code review tools keep getting better, and when several are stacked together, they can catch most defects.

This post won’t argue that your team should drop code reviews entirely, all at once. But it will ask which parts of the process are still doing what you think they’re doing, which are starting to break down, and what an alternative might look like. And it’ll keep coming back to one question that I think sits underneath this whole conversation:

If today’s AI is digging you into a hole, can tomorrow’s AI dig you out of it?

Why do we do code reviews in the first place?

Before deciding what to change about your review process, you need to be specific about what code reviews are actually doing for your team. In our guide to code reviews, we broke it down into four goals: sharing knowledge, spreading ownership, unifying development practices, and quality control.

AI-generated code puts pressure on all four of these goals. Not equally, and not in the same ways. Quality control is where most of the current conversation lives, and where the rest of this post will spend its time. But first, let’s go through the other three.

Knowledge sharing

Knowledge sharing is one of the most-cited reasons for doing code reviews. When you review a colleague’s PR, you build a mental model of how the system works: where things live, why certain decisions were made, how the pieces connect. But if code changes faster than humans can review it, that model-building breaks down. A ThoughtWorks retreat on the future of software development raised several alternatives: weekly architecture retrospectives, ensemble programming, and AI-assisted comprehension tools that generate system overviews on demand. Whether any of those can replace the slow, ambient learning that happens when engineers read each other’s code is up for…review.

Spreading ownership

This one gets less attention in the current debate, but it might be the hardest piece to replace. In a traditional review, the reviewer stamps “yep, good approach” and from that point on, both the reviewer and the author share responsibility for the decision.

It’s not the same with AI code review. The bot might flag bugs, but nobody is putting their name next to an opinion. And on the author side: when you write code yourself, your deliberate choices tie you to the outcome. If something’s poorly designed, that reflects on you, and that accountability is part of what drives quality.

When AI writes the code, you’re still responsible for shipping it, but you’re less likely to build the intuition that comes from working through the problem. That makes it harder to own the code over time, especially when something breaks six months later and it turns out that “it was something Claude implemented”. Team-level ownership (the kind that keeps people from siloing into “their” part of the codebase) is also harder to maintain when reviews aren’t spreading context across the team.

Unifying development practices

This is where things get extra interesting. Every engineer has their own tendencies, and code reviews have traditionally been how teams narrow the gap between individual styles. Can AGENTS.md files and shared coding agent skills do the same thing? Possibly, for low-level consistency like naming conventions and code structure. But reviews also served as a place where senior engineers pushed back on approaches that technically worked but would create problems down the line. Of course, AI can push back on architectural decisions too, if you’ve defined your architectural boundaries somewhere it can read them. But most teams haven’t, yet.

Human code review was designed for human-speed output

The volume of code being produced looks wildly different than it did even a year ago. Across a sample of 1,450+ engineering organizations in Swarmia, median batch size — the number of lines changed per PR — roughly doubled between Q1 2025 and Q1 2026. Among organizations with at least 1,000 PRs in each period, the median grew by 97.5%; include small teams (100+ PRs) and the number jumps to 109%. And the growth is accelerating: most of that increase happened in the last six months, with batch size climbing roughly 2.5x faster from October onward than it had in the months before. It’s hard not to connect the inflection point to the wave of AI coding tools that hit mainstream adoption in late 2025.

At the same time, CodeRabbit’s analysis of 470 open-source PRs found roughly 1.7 times more issues in AI-coauthored PRs than in human-written ones. So yes, there’s a whole lot more code, and there’s a whole load more bugs to catch.

Engineers also report that reviewing AI-generated code (today, it’s safe to assume that most PRs are at least partly, if not fully, AI-generated) takes more effort than reviewing a colleague’s work. When you review a PR from someone on your team, you have context: you know how they think, what they tend to get right, where they might cut corners or have weaknesses. It’s not like that with AI-generated code.

One of our engineers put it well: you used to be able to trust that a certain coworker had a quality stamp on their work. Now, even an exceptional programmer might use AI for something seemingly small. And since AI detaches the person from the output, it’s harder to know what you’re looking at — or how much attention the author paid to what the AI produced. Plus, the reviewer of your AI-generated PR can inadvertently end up doing a bunch of quality gate work, which is not their job. Or at least it shouldn’t be.

There’s already precedent for not reading machine-generated output. Nobody inspects the assembly code a compiler produces, because compilers are deterministic. They do the same thing every time, and we’ve had decades to verify that. But AI-generated code isn’t there yet, not by a long shot. So perhaps at this point, we should be trying to figure out what needs to be true before humans can stop reviewing code.

What does “quality” mean, and can AI assure it?

When people say code reviews catch quality issues, it helps to be specific about what kind of quality they mean. There are two dimensions to it, and they’re very different problems.

Defects: does the thing work?

Does the code have bugs? Does it pass the tests? Does it handle edge cases? Are there security vulnerabilities? AI code review tools are already decent at catching many of these, and arguably better than human reviewers for certain classes of issues. In practice, most teams already rely on CI to catch the things that actually break production. Investing in better continuous integration that catches more of those issues might be a better use of time than reviewing every line.

Design: will it slow you down later?

This is the harder question, and the one with higher long-term stakes. Even if something technically works today, will it come back to haunt you in six months? Catching that kind of problem has traditionally required experience, context, and taste in the form of a senior engineer who looks at a PR and says “yeah, this works, but it’s going to make our lives harder later.”

There’s some proper evidence that AI is making this problem worse too. A Carnegie Mellon study of 807 GitHub repos found that Cursor adoption increased cognitive complexity by roughly 41% and static analysis warnings by about 30%. The complexity stuck around even as teams got more familiar with the tools.

And this is where the central question of this post comes into focus. If AI-generated code is accumulating complexity faster than your team can manage it, you’re making a bet: that future models will be capable enough to untangle whatever the current ones are creating. Maybe they will be. But if your codebase becomes an irrecoverable mess, that’s a business problem and not just a technical one. How much of that bet are you comfortable with, and whose fault is it if the bet doesn’t pay off?

Regardless of where you land, one thing doesn’t change: your job is to deliver code you’ve proven works, and to include proof that it works. Almost anyone can prompt an LLM to produce a thousand-line patch. What’s important is whether the person submitting it can demonstrate that it does what it’s supposed to do. The proof doesn’t have to be a human reading every line, but it has to be something.

Not all code carries the same risk

When you want to ask “should a human still review this?”, you need to think about what “this” is. Changing the color of a button is different from modifying authentication logic. A bug in a mobile game, while bad for the user experience, is not the same as a bug in medical device software.

A few questions can help you triage the problem:

What’s the cost of a bug slipping through? If you can deploy to a small percentage of users first, roll back in minutes, and detect issues through monitoring, the stakes of any individual PR are lower. But if a mistake means a compliance violation or a real-world safety issue, they’re not.
Are you comfortable with your team not fully understanding this part of the codebase? If AI is producing code faster than people can build a mental model of it, you may be able to ship just fine today but find it increasingly difficult to reason about the system six months from now. This is a bigger concern for core domains and long-lived systems than for prototypes or isolated services.
Can you assign different levels of scrutiny to different kinds of changes? Some teams are starting to do this, applying stricter review requirements to changes that touch sensitive areas (payments, auth, infrastructure) while allowing lower-risk changes to ship with only automated checks. This is a more realistic near-term step than treating every PR the same way or dropping review entirely.
How strong is your safety net below the review step? Code review was never great at catching the things that actually break production, like interactions between components, behavior under load, and timing issues. In practice, most teams rely on CI and monitoring for that. So it’s fair to say that the stronger your testing, observability, and deployment pipeline, the less weight the code review step needs to carry on its own.
Are you in a compliance environment? SOC 2 Type 2 requires that code changes are controlled, tested, and approved before they ship — but the standard doesn’t prescribe how. Automated static analysis, testing pipelines, and deployment gates can satisfy the requirement, as long as they’re deliberate and documented.
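The tiered-scrutiny question is the easiest one to turn into something concrete. Here is a minimal sketch, assuming a team decides that anything under a few sensitive directories needs a human and everything else can ship on automated checks alone. The path patterns and the function name are hypothetical; a real setup would more likely live in your CI or code-owners configuration.

```python
from fnmatch import fnmatchcase

# Hypothetical patterns for the sensitive areas named above (payments, auth,
# infrastructure). A real team would derive these from its own repo layout.
SENSITIVE_PATTERNS = ["payments/*", "auth/*", "infra/*"]


def required_review(changed_paths: list[str]) -> str:
    """Return "human" if any changed file matches a sensitive pattern,
    otherwise "automated"."""
    for path in changed_paths:
        # fnmatchcase's "*" also matches "/" so nested files are covered.
        if any(fnmatchcase(path, pattern) for pattern in SENSITIVE_PATTERNS):
            return "human"
    return "automated"


print(required_review(["ui/button.css"]))
print(required_review(["ui/button.css", "auth/session/tokens.py"]))
```

The point of the sketch is that the rule is explicit and versioned, which is also what makes it easy to revisit as you learn where bugs actually come from.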

What a new code review process could look like

If not human code review for every PR, then what? The Swiss cheese model, perhaps: no single gate catches everything, so you stack imperfect filters in different ways until each layer covers the holes in the others.

Some of this is table stakes already: static analysis (linters, type checkers, SonarQube, dependency scanners), automated tests, AI code review tools, and production monitoring. You really should already have these. They do matter more when AI is writing the code, but they’re not new.

Test-driven development (TDD) matters a lot more now, as well. When tests exist before the code, agents can’t cheat by writing a test that confirms whatever broken implementation they produced. If you’re not writing tests first, you should at least have your agent roll back the implementation to observe a failing test.
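To make the "observe a failing test" step concrete, here is a small in-process sketch. It treats a test as meaningful only if it fails against a stub and passes against the real implementation. In practice the rollback would happen in version control rather than by swapping functions; the names below (`is_meaningful`, `slugify`, and the two checks) are all hypothetical.

```python
from typing import Callable


def fails(check: Callable[[Callable], None], impl: Callable) -> bool:
    """Run a check against an implementation; True if an assertion fails."""
    try:
        check(impl)
    except AssertionError:
        return True
    return False


def is_meaningful(check: Callable, stub: Callable, real: Callable) -> bool:
    """A check counts only if it is red without the implementation
    and green with it."""
    return fails(check, stub) and not fails(check, real)


def slugify_stub(title: str) -> str:
    return ""  # stands in for "implementation rolled back"


def slugify(title: str) -> str:
    return "-".join(title.lower().split())


def check_slug_output(impl: Callable[[str], str]) -> None:
    assert impl("Hello World") == "hello-world"


def check_returns_string(impl: Callable[[str], str]) -> None:
    # Vacuous: passes for any function that returns a string.
    assert isinstance(impl("Hello World"), str)


print(is_meaningful(check_slug_output, slugify_stub, slugify))
print(is_meaningful(check_returns_string, slugify_stub, slugify))
```

The second check is the kind of test an agent can write to "confirm" almost any implementation, and it is exactly what the red-then-green requirement filters out.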

On the AI code review side, running multiple reviewers in parallel helps — some teams run four or five on every PR. They’re good at catching defects. They’re not yet taking a stance on whether your abstractions make sense or your architecture is heading somewhere bad.
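The Swiss cheese intuition can be put in rough numbers. The sketch below assumes each layer catches a given defect independently of the others, which is a strong assumption: reviewers built on the same underlying model will share blind spots, so treat the result as optimistic. The catch rates are illustrative, not measured.

```python
from math import prod


def combined_catch_rate(catch_rates: list[float]) -> float:
    """Probability that at least one layer catches a defect,
    ASSUMING the layers miss independently of each other."""
    miss_all = prod(1 - rate for rate in catch_rates)
    return 1 - miss_all


# Illustrative only: five reviewers that each catch half of all defects.
print(combined_catch_rate([0.5] * 5))  # 0.96875
```

Even mediocre layers add up quickly under independence. The practical lesson is to pick layers that fail in different ways, rather than adding a sixth copy of the same reviewer.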

Beyond the baseline, there’s also newer stuff:

Adversarial verification, or separating the agent doing the work from the agent judging it. The coding agent doesn’t know what the verification agent will check. The verification agent can’t modify the code to make its own job easier. Anthropic’s engineering team found that this works — agents left to evaluate their own output tend to confidently praise it, even when it’s mediocre. Separate the roles, and you get a much more honest assessment.
Specification-driven development (SDD) moves the human checkpoint upstream. Instead of reviewing code, you review the plan that produces it. If AI generates code from a spec, the spec is the thing to get right. Bad specs produce bad code at scale. Your AI review agents should have access to the spec, so that they can verify if the code matches it.
Black-box acceptance testing means testing what the system does, not how the code looks. You define expected behavior and verify it, full stop. StrongDM’s AI team has gone quite far with this — building full clones of third-party services to test against at scale, with test scenarios stored outside the codebase so coding agents can’t teach to the test.
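As a rough illustration of the black-box idea, the sketch below keeps expected behavior as plain data and runs it against a system it knows nothing about internally. Everything here is hypothetical: the scenario format, the `checkout` function, and the discount code. The scenarios are inlined only to keep the example runnable; in the setup described above they would live somewhere the coding agent cannot read.

```python
import json
from typing import Callable

# Hypothetical scenarios, inlined for the example. Amounts are integer cents
# to avoid floating-point comparisons.
SCENARIOS_JSON = """
[
  {"name": "applies percentage discount",
   "input": {"total_cents": 20000, "code": "SAVE10"},
   "expected": 18000},
  {"name": "ignores unknown code",
   "input": {"total_cents": 20000, "code": "BOGUS"},
   "expected": 20000}
]
"""


def run_scenarios(system: Callable[[dict], int], scenarios: list[dict]) -> list[str]:
    """Return the names of scenarios whose observed output differs
    from the expected output. The system is treated as opaque."""
    return [s["name"] for s in scenarios if system(s["input"]) != s["expected"]]


def checkout(request: dict) -> int:
    """Hypothetical system under test; the runner never looks inside it."""
    discount_percent = {"SAVE10": 10}.get(request["code"], 0)
    return request["total_cents"] * (100 - discount_percent) // 100


scenarios = json.loads(SCENARIOS_JSON)
failures = run_scenarios(checkout, scenarios)
print(failures or "all scenarios passed")
```

Because the runner only sees inputs and outputs, it does not care whether a human or an agent wrote `checkout`, or how the code looks.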

These layers stacked together can do a lot of what code review was supposed to do — and for catching defects specifically, probably do it better.

So what should you do?

The process we have today was designed for a world where humans wrote all the code and humans reviewed all the code. That world is going away, probably faster than most of us expected.

I don’t think the answer is to drop code reviews tomorrow. But I also don’t think the answer is to keep doing what we’ve been doing and hope the volume problem sorts itself out.

The teams I’ve been talking to who are handling this well are:

Identifying the types of PRs that do not benefit from human reviews
Building the capacity for agents to do better code reviews (e.g. writing high-quality Markdown instruction files, such as AGENTS.md)
Creating feedback loops that provide real data about how the process is working (e.g. with surveys, KTLO, DORA, bug metrics)
Updating code review rules regularly, based on the data above

A year from now, these teams won’t have to regret taking on mountains of tech debt or complexity by dropping code reviews on a whim. Nor will they need to regret that competitors flew past them with the help of AI. Rather, they’ll get to maximize the benefits of AI code reviews while addressing the biggest risks.

None of us knows how much is left for humans to review once the tools are good enough to replace this part of the job. Perhaps less than most of us expect.
