AI-Generated Unit Tests That Actually Catch Bugs: A Real Approach
The Problem with AI-Generated Tests
Let me be honest: most AI-generated tests are worthless. I have seen it dozens of times. A developer asks an AI tool to "write tests for this function," gets back 15 tests that all pass, and feels good about their coverage number. Then a bug ships to production that none of those tests would have caught.
The issue is not that AI cannot write tests. It is that developers ask for tests without specifying what those tests should actually validate. Without business context, edge cases, and failure modes, AI generates what I call "mirror tests" — tests that verify the code does exactly what the code already does. They test the implementation, not the behavior.
Here is how I use AI to generate tests that find real bugs.
The Framework: Behavior-Driven Test Specifications
Before writing a single test, I write a test specification. This is the step most developers skip, and it is the step that makes all the difference.
Read src/services/invoiceService.ts
Before writing any tests, analyze this service and generate a test specification:
For each public function, list:
1. Happy path scenarios (the expected use case)
2. Edge cases (boundaries, empty inputs, maximum values)
3. Error scenarios (what should happen when things fail)
4. Business rule validations (domain-specific logic that must be correct)
5. Integration boundaries (what happens at the seams with external services)
For each scenario, write a one-line description of WHAT BEHAVIOR
is being tested, not what code is being executed.
Format:
Function: createInvoice()
- HAPPY: creates invoice with valid line items and calculates correct total
- EDGE: handles invoice with 0 line items (should reject)
- EDGE: handles line item with quantity of 0 (should exclude from total)
- EDGE: handles maximum number of line items (1000)
- ERROR: returns meaningful error when customer does not exist
- BUSINESS: applies 10% discount when total exceeds $1000
- BUSINESS: rounds totals to 2 decimal places (not 3 or more)
- INTEGRATION: retries Stripe charge up to 3 times on network failure
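One optional way to keep a spec like this reviewable, and even lintable, is to capture it as data instead of prose. The sketch below is hypothetical (the `Scenario` type and `missingKinds` helper are mine, not part of any tool); the scenario strings are just the spec above.

```typescript
// Hypothetical encoding of the test spec as data, so a script can enforce
// coverage rules (e.g., every function's spec includes ERROR and BUSINESS).
type ScenarioKind = 'HAPPY' | 'EDGE' | 'ERROR' | 'BUSINESS' | 'INTEGRATION';

interface Scenario {
  kind: ScenarioKind;
  behavior: string; // what behavior is tested, not what code is executed
}

const createInvoiceSpec: Scenario[] = [
  { kind: 'HAPPY', behavior: 'creates invoice with valid line items and calculates correct total' },
  { kind: 'EDGE', behavior: 'handles invoice with 0 line items (should reject)' },
  { kind: 'EDGE', behavior: 'handles line item with quantity of 0 (should exclude from total)' },
  { kind: 'ERROR', behavior: 'returns meaningful error when customer does not exist' },
  { kind: 'BUSINESS', behavior: 'applies 10% discount when total exceeds $1000' },
];

// Simple guard: flag any function whose spec lacks ERROR or BUSINESS scenarios,
// since those are the categories where real bugs tend to live.
function missingKinds(spec: Scenario[]): ScenarioKind[] {
  const present = new Set(spec.map((s) => s.kind));
  return (['ERROR', 'BUSINESS'] as ScenarioKind[]).filter((k) => !present.has(k));
}

console.log(missingKinds(createInvoiceSpec)); // []
```

A check like this can run in CI so a new service cannot land with a spec that skips its error and business-rule scenarios.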
I review this spec, add any scenarios the AI missed, and remove any that are not relevant. Only then do I move to writing tests.
Step 1: Generate the Test Specification (5 Minutes)
Use the prompt above. Review the output carefully. The most valuable scenarios are usually the business rules and edge cases — those are where real bugs live.
Add your own knowledge. You know the bugs your team has shipped before. You know the edge cases that come up in production. Add those to the spec:
Add these additional scenarios to the test spec:
- REGRESSION: invoice created on Feb 29 should handle leap year correctly
- REGRESSION: customer with special characters in name (apostrophes, unicode)
- CONCURRENCY: two invoices created simultaneously for the same customer
- PERFORMANCE: creating 100 invoices in a batch should complete under 5 seconds
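The leap-year scenario is worth a concrete sketch, because JavaScript's `Date` silently rolls invalid dates forward rather than failing. The `addYearClamped` helper below is hypothetical; it only illustrates the behavior a regression test for that scenario needs to pin down.

```typescript
// Naive year arithmetic on Feb 29 silently rolls into March:
// new Date(2025, 1, 29) is actually March 1, 2025 (months are 0-based).
const naive = new Date(2025, 1, 29);
console.log(naive.getMonth()); // 2 (March)

// Hypothetical guarded helper: clamp to the last valid day of the target month.
function addYearClamped(d: Date, years: number): Date {
  const y = d.getFullYear() + years;
  const m = d.getMonth();
  // Day 0 of month m + 1 is the last day of month m.
  const lastDay = new Date(y, m + 1, 0).getDate();
  return new Date(y, m, Math.min(d.getDate(), lastDay));
}

const feb29 = new Date(2024, 1, 29); // leap day
const dueDate = addYearClamped(feb29, 1);
console.log(dueDate.getMonth(), dueDate.getDate()); // 1 28  (Feb 28, 2025)
```

A regression test asserting on `dueDate.getMonth()` catches the rollover bug; a test that only checks "a date came back" does not.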
Step 2: Write Tests for Critical Paths First (15 Minutes)
Now generate the actual tests, but do it in focused batches:
Using the test specification we created, write Vitest tests for the
createInvoice() function. Start with the BUSINESS and ERROR scenarios only.
Requirements:
- Use Vitest with describe/it blocks
- Mock external dependencies (database, Stripe) using vi.mock
- Each test name should describe the behavior, not the implementation
- Each test should have a comment explaining what real bug it would catch
- Use arrange/act/assert pattern
- Use realistic test data, not "test" or "foo"
- Assert on behavior (what the function returns/does), not on internals
Do NOT write snapshot tests. Do NOT write tests that just check if
a function was called. Test actual return values and side effects.
Here is what good AI-generated tests look like with this approach:
```typescript
import { describe, it, expect } from 'vitest';
// mockCustomerRepo and validLineItems come from the suite's shared setup,
// with external dependencies mocked via vi.mock.

describe('createInvoice', () => {
  // BUG THIS CATCHES: Invoice totals calculated with floating point
  // errors (e.g., $10.10 + $20.20 = $30.299999999 instead of $30.30)
  it('rounds invoice total to exactly 2 decimal places', async () => {
    const lineItems = [
      { description: 'Consulting', quantity: 1, unitPrice: 10.10 },
      { description: 'Development', quantity: 1, unitPrice: 20.20 },
    ];

    const invoice = await createInvoice({
      customerId: 'cust_123',
      lineItems,
    });

    expect(invoice.total).toBe(30.30);
    expect(invoice.total.toString().split('.')[1]?.length ?? 0).toBeLessThanOrEqual(2);
  });

  // BUG THIS CATCHES: Discount applied after tax instead of before,
  // resulting in the customer being overcharged
  it('applies discount before calculating tax', async () => {
    const lineItems = [
      { description: 'Annual Plan', quantity: 1, unitPrice: 1200 },
    ];

    const invoice = await createInvoice({
      customerId: 'cust_123',
      lineItems,
      discountPercent: 10,
      taxRate: 0.08,
    });

    // Discount: 1200 * 0.10 = 120, Subtotal: 1080, Tax: 86.40, Total: 1166.40
    expect(invoice.subtotal).toBe(1080);
    expect(invoice.tax).toBe(86.40);
    expect(invoice.total).toBe(1166.40);
  });

  // BUG THIS CATCHES: System creates invoice for deleted/deactivated customer,
  // leading to orphaned invoices and failed payment attempts
  it('rejects invoice creation for deactivated customer', async () => {
    mockCustomerRepo.findById.mockResolvedValue({
      id: 'cust_123',
      status: 'deactivated',
    });

    await expect(
      createInvoice({ customerId: 'cust_123', lineItems: validLineItems })
    ).rejects.toThrow('Cannot create invoice for deactivated customer');
  });
});
```
Notice how every test has a comment explaining what real-world bug it prevents. This makes the test suite a living document of your application's business rules.
Step 3: Edge Case Tests (10 Minutes)
Now write tests for the EDGE scenarios from our specification.
Focus on boundary conditions and unexpected inputs.
For each edge case, the test name should start with
"handles" or "rejects" to make the expected behavior clear.
Pay special attention to:
- Empty arrays and null values
- Numeric boundaries (0, negative, MAX_SAFE_INTEGER)
- String boundaries (empty, very long, special characters)
- Date boundaries (month/year boundaries, leap years, timezone edges)
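To show what the edge-case assertions boil down to, here is a self-contained sketch. The `calculateTotal` function is a minimal stand-in I wrote so the assertions run on their own; the real tests would target `createInvoice()` with its dependencies mocked.

```typescript
// Minimal stand-in implementation so the edge-case assertions are runnable.
interface LineItem {
  description: string;
  quantity: number;
  unitPrice: number;
}

function calculateTotal(lineItems: LineItem[]): number {
  if (lineItems.length === 0) {
    throw new Error('Invoice must have at least one line item');
  }
  const total = lineItems
    .filter((item) => item.quantity > 0) // quantity 0 is excluded from total
    .reduce((sum, item) => sum + item.quantity * item.unitPrice, 0);
  return Math.round(total * 100) / 100; // round to 2 decimal places
}

// handles line item with quantity of 0 (should exclude from total)
const total = calculateTotal([
  { description: 'Consulting retainer', quantity: 2, unitPrice: 150 },
  { description: 'Cancelled add-on', quantity: 0, unitPrice: 500 },
]);
console.log(total); // 300

// rejects invoice with 0 line items
try {
  calculateTotal([]);
} catch (e) {
  console.log((e as Error).message); // Invoice must have at least one line item
}
```

Note the test data: a real-looking cancelled add-on at a nonzero price, which is exactly the input that exposes the bug if quantity-0 items leak into the total.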
Step 4: Generate Regression Tests from Bug Reports (10 Minutes)
This is my favorite technique. If you have a bug tracker, feed past bugs to the AI:
Here are 5 bugs that shipped to production in the last 6 months:
1. BUG-234: Invoice total was $0 when all line items had quantity 0
2. BUG-267: Duplicate invoices created when user double-clicked submit
3. BUG-289: Tax calculation wrong for customers in tax-exempt states
4. BUG-301: Invoice PDF showed wrong date (UTC vs local timezone)
5. BUG-315: Discount code applied twice when editing an existing invoice
For each bug, write a regression test that would have caught it.
Include the bug ID in the test name for traceability.
These regression tests are the most valuable tests in your suite because they protect against bugs you have already seen in production.
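As a sketch of what a regression test for BUG-267 (duplicate invoices on double-click) needs to pin down: the second submission must resolve to the same invoice, not create a new one. The in-memory idempotency guard below is hypothetical; substitute whatever dedupe mechanism your service actually uses, such as an idempotency-key column.

```typescript
// Hypothetical in-memory idempotency guard, standing in for the real
// service's dedupe mechanism (e.g., a unique idempotency-key constraint).
const createdInvoices = new Map<string, { id: string }>();
let nextId = 1;

function createInvoiceIdempotent(idempotencyKey: string): { id: string } {
  const existing = createdInvoices.get(idempotencyKey);
  if (existing) return existing; // the second click returns the same invoice
  const invoice = { id: `inv_${nextId++}` };
  createdInvoices.set(idempotencyKey, invoice);
  return invoice;
}

// BUG-267: duplicate invoices created when user double-clicked submit.
// Two submissions with the same key must yield exactly one invoice.
const first = createInvoiceIdempotent('form-submit-abc');
const second = createInvoiceIdempotent('form-submit-abc');
console.log(first.id === second.id); // true
console.log(createdInvoices.size); // 1
```

The valuable assertion is on the count of created invoices, a side effect, rather than on whether some submit handler was called.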
Step 5: Validate Test Quality (5 Minutes)
After generating all tests, run a quality check:
Review the tests we just wrote. For each test, verify:
1. Does it test behavior or implementation? (Only behavior is acceptable)
2. Would the test still pass if we refactored the internals? (It should)
3. Does the test have exactly ONE assertion focus? (It should)
4. Is the test name clear enough to understand without reading the code?
5. Could this test catch a real bug, or is it testing the obvious?
Flag any tests that are "mirror tests" (just verifying the code does
what the code does) and suggest replacements.
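To make the "mirror test" distinction concrete, here is a hypothetical contrast; `applyDiscount` is an illustrative stand-in, not code from the service above.

```typescript
// A stand-in for the code under test.
function applyDiscount(total: number, discountPercent: number): number {
  return Math.round(total * (1 - discountPercent / 100) * 100) / 100;
}

// MIRROR TEST (bad): the expected value is derived with the same formula
// as the implementation, so the test passes even if the formula is wrong.
const mirrorExpected = Math.round(1200 * (1 - 10 / 100) * 100) / 100;
console.log(applyDiscount(1200, 10) === mirrorExpected); // true, but proves nothing

// BEHAVIORAL TEST (good): the expected value is computed independently,
// by hand from the business rule: 10% off $1200 is $1080.
console.log(applyDiscount(1200, 10) === 1080); // true only if the rule holds
```

The quality-check prompt is essentially asking the AI to find assertions of the first kind and replace them with assertions of the second kind.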
The Metrics That Matter
After implementing this approach on three client projects, here are the numbers:
- Test coverage went from 45% to 85% in two days (not because of more tests, but because of better-targeted tests)
- The tests caught 3 real bugs during the first week that existing tests had missed
- Time to write a comprehensive test suite for a new service: 45 minutes instead of 4 hours
The key insight: AI is excellent at writing test code. It is terrible at deciding what to test. That decision is yours. Give AI a specific test specification, and the generated tests are genuinely useful.