Let Claude Write the Tests, You Just Review: 1,562 Lines in Practice

The last post closed with "119 specs green" for TopicReview. The real next question: who wrote those tests?

Answer: Claude wrote nearly all 1,562 lines of test code. I only reviewed. In two-plus weeks in production, maintenance has meant only adding new tests, never rewriting old ones.

This post is about why tests are the single best thing to delegate to Claude, what to look at and what to ignore when reviewing, and how far this split actually goes.

Put the Numbers Down First

TopicReview's tests span 7 files:

spec/services/topic_review_service_spec.rb   760 lines (88 tests)
spec/requests/topic_reviews_spec.rb          281 lines (32 tests)
spec/requests/review_appeals_spec.rb         152 lines (16 tests)
spec/requests/review_votes_spec.rb           127 lines
spec/policies/topic_review_policy_spec.rb    109 lines
spec/jobs/close_topic_review_job_spec.rb      71 lines (7 tests)
spec/models/topic_review_spec.rb              62 lines
───────────────────────────────────────────
                                          1,562 lines

Five kinds of tests: service (business logic), request (controller + integration), policy (Pundit authz), job (scheduled jobs), and model.
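A tally like the listing above can be reproduced with a few lines of Ruby. This is a minimal sketch, not the script used for this post: spec_stats is a hypothetical helper, and counting lines that open with it only approximates the example count (it misses specify and examples generated in loops). Running rspec --dry-run reports the exact number.

```ruby
# Hypothetical helper: counts total lines and lines that open an `it` example.
def spec_stats(source)
  lines = source.lines
  examples = lines.count { |line| line.match?(/\A\s*it[\s({]/) }
  { lines: lines.size, examples: examples }
end

# Made-up sample source, standing in for the contents of a real spec file.
sample = <<~RUBY
  describe ".open!" do
    it "creates assignments" do
    end
    it { is_expected.to be_nil }
  end
RUBY

stats = spec_stats(sample)
puts "#{stats[:lines]} lines (#{stats[:examples]} tests)"
# prints "5 lines (2 tests)"
```

For real files, pass File.read(path) for each path in the listing.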

The initial commit d162f1e carries a Co-Authored-By: Claude Sonnet 4.6 trailer and lands 1,100+ of those lines in one go. The spec-related commits that followed either add a test or make a small update; none is a refactor or rewrite:

00393fc Add test for finalize! with zero votes (expired review)
3f53304 Add test for finalize! with legacy votes missing reasoning
3b185da Update specs to use PROVISIONAL_PENALTY constant

Patching gaps, not rework. This detail matters, and I'll come back to it.

Why Tests Are the Best Thing to Hand Off

Four hard reasons:

1. Inputs and outputs are explicit. A test is "given this state → expect this behavior." That's Claude's strongest translation task: turning a spec into an assertion. Business code sometimes has to weigh trade-offs; tests don't.

2. Mechanical × high volume. One describe .open! has to cover "has eligible jurors / no eligible jurors / no topic / already-active review" — four contexts, each with 2–5 it blocks. Humans start cutting corners by the third context. Claude writes the 88th it with the same care as the first.

3. Extremely short feedback loop. Write a test, run rspec, and know in seconds whether it passes. Business code can take days of real use to surface issues. A short loop means most of Claude's mistakes get caught by rspec on the spot, so you don't have to babysit.

4. Naturally parallel. Well-isolated it blocks are independent, with no hidden coupling, so they scale trivially. Generating dozens of isolated tests at once is exactly what Claude is good at.

What to Look at in Review, and What to Ignore

This is the pivot of the whole split.

Ignore:

Whether RSpec syntax is right: Claude almost never gets this wrong
Mock quality: unless there's obvious over-mocking, it's fine
Factory aesthetics: they don't matter; if it runs, it runs
Style consistency: if something's off, tell Claude once and it fixes everything

Look at:

Whether the edge cases are actually covered
Whether the test names describe the real intended behavior
Whether any tests that should exist are missing

The last one is review's real value. Claude covers the tests "it can think of," but the tests it can't think of don't get written automatically. That's exactly where human review fits — working backward from business rules to find missing coverage.

A Concrete Example: What Review Actually Catches

Here is the outline of describe ".open!" at the top of spec/services/topic_review_service_spec.rb:

describe ".open!" do
  context "when there are eligible jurors" do
    # review status ok / post under_review / assignments created / author notified / jurors notified / no double-open
  end
  context "when there are no eligible jurors" do
    # review created but no assignments
  end
  context "when post has no topic" do
    # returns nil
  end
end

Looks comprehensive. But the real eligible_jurors rule in the model excludes three groups:

def eligible_jurors
  # Excluded: the post author, anyone who reported the post, and initial-stage voters.
  excluded_ids = [ post.user_id ] +
                 post.reports.pluck(:user_id) +
                 review_votes.where(stage: :initial).pluck(:user_id)
  User.jurors_and_judges.where.not(id: excluded_ids.uniq)
end

Now look at the tests — which test asserts that "the post author is never selected as a juror"?

Search topic_review_service_spec.rb and topic_review_spec.rb: none. The model spec only tests a few cases of the pending_vote_by scope; it doesn't cover eligible_jurors at all. The service spec has only a comment, # Jurors must NOT be the post author, and that sits in setup, not in an assertion.

This is what review catches: three exclusion rules (author / reporter / already-voted-initial), and none of them is protected by a test. If someone later refactors eligible_jurors and accidentally drops post.user_id from the exclusion list, every existing test passes — and production quietly lets authors sit on their own jury.

Claude wasn't wrong — it tested what it was asked to test. It just didn't spontaneously ask, "do these three rules each need test coverage?" That question — working backward from rules to coverage — is what review is for.
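The missing coverage can be sketched as one test per exclusion rule. The sketch below is not the project's code: it swaps the ActiveRecord queries for a hypothetical plain-Ruby eligible_jurors function and a Candidate struct so it runs on its own, and the ids are made up. In the real spec the same three assertions would run against the model, with factories providing the users.

```ruby
# Hypothetical stand-in for the model's eligible_jurors rule, using plain
# Ruby values instead of ActiveRecord so the sketch is self-contained.
Candidate = Struct.new(:id, :name)

def eligible_jurors(candidates:, author_id:, reporter_ids:, initial_voter_ids:)
  excluded_ids = ([author_id] + reporter_ids + initial_voter_ids).uniq
  candidates.reject { |candidate| excluded_ids.include?(candidate.id) }
end

# Runs only under `rspec`; plain `ruby` skips this block.
if defined?(RSpec)
  RSpec.describe "eligible_jurors" do
    let(:candidates) { (1..5).map { |id| Candidate.new(id, "user#{id}") } }

    subject(:juror_ids) do
      eligible_jurors(candidates: candidates, author_id: 1,
                      reporter_ids: [2], initial_voter_ids: [3]).map(&:id)
    end

    it "never selects the post author" do
      expect(juror_ids).not_to include(1)
    end

    it "never selects a user who reported the post" do
      expect(juror_ids).not_to include(2)
    end

    it "never selects a user who already voted in the initial stage" do
      expect(juror_ids).not_to include(3)
    end

    it "keeps everyone else" do
      expect(juror_ids).to eq([4, 5])
    end
  end
end
```

With one it per rule, dropping any single term from the exclusion list fails exactly one named test, which is the protection the existing suite lacks.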

(Honesty: I missed this during the original review. I only spotted it during a second audit while writing this post. So review isn't a one-shot either, but it's still far better than no review.)

The Follow-up Commits Prove the Split Works in Practice

If "Claude writes + human reviews" were perfect, there'd be no new test commits after the initial one. What actually happened is more interesting — gap-patching without rewrites:

00393fc Add test for finalize! with zero votes (expired review)
3f53304 Add test for finalize! with legacy votes missing reasoning

The first is a regression test written after a bug: e8cb2db "Default to keep verdict when review expires with zero votes" was the fix, and 00393fc is the follow-up test. The second follows the same pattern, covering abaa22e "Fix CloseTopicReviewJob failing due to reasoning validation on old votes".
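The shape of the first regression test can be sketched as follows. This is an illustration under stated assumptions, not the code from 00393fc: verdict_for is a hypothetical function, and the vote symbols and strict-majority rule are invented. Only the "zero votes at expiry defaults to keep" behavior comes from the commit message.

```ruby
# Hypothetical verdict rule. Only "zero votes at expiry => keep" comes from
# commit e8cb2db; the vote symbols and strict-majority logic are assumptions.
def verdict_for(votes)
  return :keep if votes.empty?
  votes.count(:remove) > votes.size / 2.0 ? :remove : :keep
end

# Runs only under `rspec`; plain `ruby` skips this block.
if defined?(RSpec)
  RSpec.describe "verdict_for" do
    context "when the review expires with zero votes" do
      it "defaults to keep" do
        expect(verdict_for([])).to eq(:keep)
      end
    end

    context "when a strict majority votes remove" do
      it "returns remove" do
        expect(verdict_for([:remove, :remove, :keep])).to eq(:remove)
      end
    end
  end
end
```

The new context sits next to the existing ones, so the regression test is a pure addition and nothing already in the file has to change.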

These two commits prove two things at once:

Review didn't catch 100% of cases — production exposed two bugs
But the test architecture held up; we could keep adding tests without restructuring — that's why the commits say "Add test for..." and not "Rewrite ... spec"

"Good enough + able to keep patching" is a far more realistic bar than "perfect." Chasing perfect review is what stops you from handing tests off to Claude in the first place. Accepting "good enough" is what makes the split possible.

Tests You Shouldn't Fully Delegate to Claude

Not every test is right for full handoff:

E2E happy paths: these need a product lens. Claude can write them, but often covers only "technically completable" and misses "where a user actually gets stuck."
Security tests: these need an attacker's mindset. Claude tends to be conservative and misses non-standard attack surfaces (SQL keyword injection, oversized strings, alternate Unicode forms).
Performance baselines: these need real deployment numbers. Claude will guess thresholds.
Large-scale fixture / factory restructuring: this is architecture-level work that belongs back in plan mode, not something review catches.

In those cases, the human leads and Claude assists.
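As one illustration of the security case, here are the kinds of inputs a human reviewer might add by hand. Everything in this sketch is an assumption for illustration: normalize_reasoning, the 2,000-character cap, the NFKC normalization step, and the table name inside the SQL string are all made up, and this is not TopicReview's actual validation.

```ruby
# Hypothetical input guard for a free-text field such as a vote's reasoning.
# The name, the 2,000-character cap, and NFKC normalization are assumptions.
MAX_REASONING_LENGTH = 2_000

def normalize_reasoning(text)
  normalized = text.unicode_normalize(:nfkc).strip
  raise ArgumentError, "reasoning is too long" if normalized.length > MAX_REASONING_LENGTH
  normalized
end

# Inputs a human reviewer might add on top of the happy path.
HOSTILE_INPUTS = {
  sql_keywords: "'; DROP TABLE topic_reviews; --",
  oversized: "a" * (MAX_REASONING_LENGTH + 1),
  fullwidth: "\uFF4B\uFF45\uFF45\uFF50" # fullwidth letters that NFKC folds to "keep"
}.freeze

puts normalize_reasoning(HOSTILE_INPUTS[:fullwidth])
# prints "keep"
```

The SQL-looking string passes through unchanged here, on the assumption that parameterized queries handle it downstream. Deciding where each input should be stopped is the attacker-mindset judgment that stays with the human.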

The Default Configuration

To turn this split into a runnable default:

1. Before the feature starts, I explain the business rules (not the RSpec conventions).
2. Claude writes both implementation and tests.
3. Run the tests. Passing = continue. Failing = Claude fixes.
4. I review:
   - Not syntax / mock / factory
   - Coverage: is every business rule protected by at least one test?
   - Interrogate edge cases: "zero rows / null / concurrency / authz breach" (ask them one at a time).
   - Read test names: if the name doesn't tell me what it tests, have Claude rename.
5. Production-discovered bugs come back as regression tests. That's normal wear, not failure.
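The coverage question in the review step can be made semi-mechanical. The sketch below is a hypothetical review aid, not part of the project: the rule wording, the patterns, and uncovered_rules are all assumptions. A matching test name does not prove the assertion is right. The check only flags rules that no test even claims to cover.

```ruby
# Hypothetical review aid: each business rule is paired with a pattern that at
# least one test name should match. Rule wording and patterns are assumptions.
RULES = {
  "post author is excluded from the jury" => /excludes? the (post )?author/i,
  "reporters are excluded from the jury"  => /excludes? (the )?reporters?/i,
  "initial-stage voters are excluded"     => /excludes? .*initial/i
}.freeze

# Returns the rules that no test name appears to cover.
def uncovered_rules(rules, test_names)
  rules.reject { |_rule, pattern| test_names.any? { |name| name.match?(pattern) } }.keys
end

# Made-up test names, loosely following the .open! outline.
test_names = [
  "creates assignments for eligible jurors",
  "notifies the author",
  "notifies jurors",
  "does not open a second review"
]

puts uncovered_rules(RULES, test_names)
```

With these names all three rules come back as uncovered, which is the gap described earlier. The list of rules is the part only the reviewer can write.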

Closing

Programmers have less mental energy for reading tests than for reading code. Tests are repetitive, mechanical, draining, necessary. All of that describes Claude's sweet spot — it doesn't get bored, doesn't get tired, doesn't cut corners by the 50th it.

Your job isn't "writing tests" — it's "making sure every business rule is covered by a test." One is implementation, the other is judgment. Judgment stays with you; implementation goes to Claude.

119 specs and 1,562 lines, 1,100+ of them landing in one commit, survived two-plus weeks without rework. That happened not because I'm better at writing tests, but because I wrote almost none of them. I just did the one thing Claude doesn't: decide which business rules deserve protection.