MUTATION TESTING

FEBRUARY 19, 2026

The Problem with Code Coverage

You write tests. You get high code coverage. You feel great. But code coverage only tells you what code ran - not whether your tests actually verified anything. You could hit every line and still have assertions that would pass no matter what the code did.

Mutation testing attempts to address this. It introduces small code changes ("mutants") - swapping > to >=, replacing a parameter with None, and so on - and checks whether your tests catch them. If your tests still pass after one of these changes, that mutant "survived," meaning your tests didn't verify that behavior.
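
A tiny example of what a mutant looks like (the function is hypothetical; the mutant is a single mutmut-style operator swap):

```python
# Hypothetical function; the "mutant" is the same code with >= swapped to >.
def is_adult(age):
    return age >= 18  # original

def is_adult_mutant(age):
    return age > 18  # mutant: >= replaced with >

# A test that only checks a non-boundary value passes against both versions,
# so the mutant survives. A boundary-value test kills it.
mutant_survives_weak_test = is_adult(30) == is_adult_mutant(30)  # both True
mutant_killed_by_boundary = is_adult(18) != is_adult_mutant(18)  # 18 disagrees
```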

I'd been running mutmut on my small app, cat tracker. It worked, but my problem was scale: it generated 468 mutants, and 156 survived. The old workflow was manual: browse the mutants, pick one, write a test, rerun. Most survivors were noise, and filtering them out by hand was the bottleneck.

What Meta Did

While I was looking into this I found a paper Meta published describing their approach: Mutation-Guided LLM-based Test Generation. Their system, ACH (Automated Compliance Hardening), closes the loop entirely: mutate code, have an LLM decide if the mutation matters, have an LLM write the test, verify the test kills the mutant, repeat. Pretty cool.

The Closed Loop

To close that loop myself I built mutation-agent, a tool that automates the full pipeline. Here's what it does:

Mutate → Triage → If valid, Generate Test → Verify → (loop) Fix → Verify → ... → If successful, Commit
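
In code, that control flow looks roughly like this. A sketch only: the helper callables stand in for the LLM and pytest steps and are not mutation-agent's actual API.

```python
MAX_FIX_ATTEMPTS = 5  # assumption: mirrors the "up to five times" retry limit

def run_pipeline(mutants, triage, generate_test, verify, fix_test):
    """Return the (mutant, test) pairs that were successfully killed."""
    killed = []
    for mutant in mutants:
        if not triage(mutant):               # Step 2: skip noisy mutants
            continue
        test = generate_test(mutant)         # Step 3: draft a candidate test
        for _ in range(MAX_FIX_ATTEMPTS):    # Steps 4-5: verify, then fix
            if verify(test, mutant):
                killed.append((mutant, test))
                break
            test = fix_test(test, mutant)
        # If every attempt failed, nothing is committed for this mutant.
    return killed
```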

Step 1: Mutate Run a mutation engine against the source file. I support two: mutmut for fast, rule-based mutations and mutahunter for LLM-powered semantic mutations that produce more realistic bugs.

Step 2: Triage This is where the noise gets filtered out. Each surviving mutant's diff, surrounding source code, and the existing test file are sent to an LLM, which flags the mutants worth killing - the real oversights - as "interesting surviving mutants".

Step 3: Generate For each mutant that passes triage, the LLM gets the source file, existing tests, mutation diff, and specific instructions: write a test that passes against the original code and fails against the mutant.

Step 4: Verify First, pytest runs the new test against the original code to verify it passes. Then the source file is swapped with the mutant, and pytest runs again to confirm the test now fails. If both checks succeed, the mutant is killed.
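
The swap-and-check logic can be sketched like this, with run_tests standing in for the pytest invocation (the function name and signature are assumptions):

```python
from pathlib import Path

def verify_kills_mutant(source_file, mutant_file, run_tests):
    """run_tests() -> bool. True iff the test passes on the original source
    and fails once the mutant is swapped in. Always restores the original."""
    src = Path(source_file)
    original = src.read_bytes()
    if not run_tests():              # the new test must pass on the original
        return False
    try:
        src.write_bytes(Path(mutant_file).read_bytes())  # swap in the mutant
        return not run_tests()       # ...and must fail against the mutant
    finally:
        src.write_bytes(original)    # restore the source either way
```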

Step 5: Self-correct If verification fails, the LLM attempts to fix the test, and steps 4-5 repeat up to five times. If every attempt fails, the tool rolls back and leaves the code unchanged.

Step 6: Commit If verification succeeds, the tool can create a PR describing what changed and which mutant it addresses.

Using It

The tool is at mutation-agent. Point it at any Python file with an existing test file:

pip install -e .
export OPENAI_API_KEY=sk-...

# Interactive — approve each test
python -m mutation_agent --source-file my_module.py --test-file test_my_module.py

# Fully automated — generates, verifies, creates PR
python -m mutation_agent --source-file my_module.py --test-file test_my_module.py --auto

It works with any Python project. The prompts are generic — no project-specific knowledge baked in. The LLM reads your source code and existing tests and figures out what to mock and assert.

Should You Use It?

For code that is important to your system and already has decent test coverage, this is a powerful tool: it finds real gaps in your tests.

The cost is low. A full run against a small module costs about a dollar in API calls and takes a few hours.

The signal-to-noise ratio is still a bit annoying and could be improved with a better triage step. For now, I plan to use it as a periodic audit on codebases to strengthen my testing.