Are These AI-Generated Tests Good?
I use AI to generate a lot of tests now. It saves time, but fast output can create false confidence.
Even if AI-generated code is 99% accurate, we still need tests for the same reason as always: catching regressions before users do.
From experience, I can usually spot missing cases during review. But while writing this down, I wanted more than instinct: a framework that is easy to explain and repeat. ISTQB's definitions align with how I already think, so I use them as the foundation.
Test Level vs Test Type
ISTQB defines test level as:
“A group of test activities that are organized and managed together.”
ISTQB defines test type as:
“A group of test activities aimed at testing specific characteristics.”
That gives me a clear way to evaluate AI-generated tests: classify by level, classify by type, then check coverage and signal.
Running Example
Feature: password reset
Expected behavior:
- User requests reset link
- Email is sent with token
- Token expires
- Token is single-use
- Password updates
- Old password no longer works
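Before looking at the tests, here is a minimal plain-Ruby sketch of the token lifecycle those behaviors imply. The class, its `TTL`, and its method names are illustrative stand-ins, not the app's real model:

```ruby
require "securerandom"

# Hypothetical stand-in for a password reset token.
class PasswordResetToken
  TTL = 15 * 60 # assumed expiry window, in seconds

  attr_reader :value, :created_at, :used_at

  def initialize(now: Time.now)
    @value = SecureRandom.hex(16)
    @created_at = now
    @used_at = nil
  end

  def expired?(now: Time.now)
    now - @created_at > TTL
  end

  # Single-use: consuming a token twice must fail.
  def consume!(now: Time.now)
    raise "token expired" if expired?(now: now)
    raise "token already used" if @used_at
    @used_at = now
  end
end
```

The tests worth having are the ones that pin down exactly these transitions: expiry, single use, and the state change on consumption.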
Low Signal vs High Signal
Low-signal test:
it "returns success" do
  post :create, params: { email: user.email }
  expect(response).to have_http_status(:ok)
end
Higher-signal test:
it "creates a single-use token and sends reset email" do
  expect {
    post :create, params: { email: user.email }
  }.to change(PasswordResetToken, :count).by(1)

  token = PasswordResetToken.last
  expect(token.used_at).to be_nil
  expect(ActionMailer::Base.deliveries.last.to).to include(user.email)
end
The second test asserts what changed — token count, token state, email delivery — instead of only that the controller returned 200. That distinction matters when reviewing AI output: AI tends to generate the first kind because it is easy to write and almost never fails.
Actionable Review Algorithm
When I review a PR with AI-generated tests, I run through this sequence:
- Collect all new and changed tests in the PR.
- Classify each test by ISTQB test level.
- Classify each test by ISTQB test type.
- Map tests to Jira acceptance criteria.
- Cross-check requirement docs for edge cases and failure paths.
- Check external impact coverage:
  - API contracts
  - Downstream consumers
  - DB side effects
  - Jobs and events
  - Third-party boundaries
- Weight confidence:
  - Highest: system + acceptance + functional
  - Medium: integration + change-related regression
  - Lower: component-only checks
  - Lowest: smoke/startup/load-only checks
- Publish a short gap summary in the PR.
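The weighting step can be sketched as a lookup from a test's level and type tags to a numeric confidence. The tag names and weights below are my own illustration, not an ISTQB scale:

```ruby
# Illustrative weights: deeper levels count more toward shipping confidence.
LEVEL_WEIGHT = {
  acceptance: 3, system: 3, system_integration: 2,
  component_integration: 2, component: 1
}.freeze

# Type adjustments: functional and change-related tests add signal,
# smoke-only checks subtract it.
TYPE_BONUS = { functional: 1, change_related: 1, smoke: -1 }.freeze

def confidence_weight(level:, type:)
  (LEVEL_WEIGHT.fetch(level, 0) + TYPE_BONUS.fetch(type, 0)).clamp(0, 4)
end
```

With these numbers, a system functional test scores 4, while a component-level smoke check scores 0 — which matches the tier ordering above.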
Scoring Example
PR contains:
- 2 component functional tests
- 1 component integration test
- 1 system functional test
- 1 acceptance regression test
- 2 smoke checks
Initial result: Decent baseline, but incomplete if no system integration test validates the email provider failure path.
Action: Add one integration contract test and re-score.
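The scoring pass over that mix can be sketched as follows. The per-level weights and the gap rule are made up for illustration; the point is that the summary falls out mechanically once tests are tagged:

```ruby
# The PR's test mix, tagged by level and type (illustrative tags).
tests = [
  { level: :component,             type: :functional },
  { level: :component,             type: :functional },
  { level: :component_integration, type: :functional },
  { level: :system,                type: :functional },
  { level: :acceptance,            type: :change_related },
  { level: :component,             type: :smoke },
  { level: :component,             type: :smoke },
]

# Assumed weights: deeper levels count more; smoke checks count nothing.
weight = { acceptance: 3, system: 3, system_integration: 3,
           component_integration: 2, component: 1 }
score = tests.sum { |t| t[:type] == :smoke ? 0 : weight.fetch(t[:level], 0) }

# Gap rule: flag the email-provider failure path when no
# system integration test is present.
gaps = if tests.none? { |t| t[:level] == :system_integration }
         ["system integration test for email provider failure path"]
       else
         []
       end

puts "score: #{score}" # 1 + 1 + 2 + 3 + 3 = 10
puts "gaps: #{gaps.inspect}"
```

Adding the integration contract test clears the gap list and raises the score on the next pass.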
AI Test Quality Checklist
Use this when reviewing a PR that includes AI-generated tests:
- Tests classified by level (Component, Component Integration, System, System Integration, Acceptance)
- Tests classified by type (Functional, Non-functional, Black-box, White-box, Change-related)
- Jira acceptance criteria mapped to explicit assertions
- Requirement doc edge cases and failure paths covered
- External impact covered (API contracts, downstream consumers, jobs/events, side effects)
- Regression protection added for changed behavior
- High-signal vs low-signal mix summarized
- Known gaps documented with a follow-up plan
AI helps me generate tests faster. This framework helps me trust what I ship.