Generate test cases from requirements
Turn product notes, user stories, and acceptance criteria into structured test ideas across functional, edge-case, accessibility, and risk areas.
Quality engineering AI project
A working quality-engineering lab for generating test cases, comparing model behavior, scoring output, and tracking prompt improvements over time.
Inputs: 3
Judges: 4
Models: Llama, Gemini, Mistral (generated coverage and judge feedback for each)
The project demonstrates a practical way to evaluate AI-assisted testing instead of treating every generated test case as equally useful.
Turn product notes, user stories, and acceptance criteria into structured test ideas across functional, edge-case, accessibility, and risk areas.
Run multiple LLMs against the same prompt and inspect coverage, specificity, tradeoffs, and missed risks in one review surface.
Capture human scores and LLM-as-judge feedback so prompt revisions can be compared with a repeatable rubric.
Attach UI images when requirements alone do not describe layout, flows, visual states, or interaction details clearly enough.
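To make the comparison and scoring steps above concrete, here is a minimal sketch of how rubric scores from human reviewers and LLM judges might be averaged per model. The criterion names, the 1 to 5 scale, and the `Score` and `summarizeByModel` names are assumptions made for illustration; the project's actual rubric and data model are not shown on this page.

```typescript
// Hypothetical rubric criteria; the real rubric may use different ones.
type Criterion = "coverage" | "specificity" | "riskAwareness";

interface Score {
  model: string;                      // e.g. "Llama", "Gemini", "Mistral"
  source: "human" | "llm-judge";      // who produced the rating
  ratings: Record<Criterion, number>; // assumed 1-5 scale
}

const CRITERIA: Criterion[] = ["coverage", "specificity", "riskAwareness"];

// Average every criterion per model across all scorers, so that models
// run against the same prompt can be compared on the same rubric.
function summarizeByModel(scores: Score[]): Map<string, Record<Criterion, number>> {
  const grouped = new Map<string, Score[]>();
  scores.forEach((score) => {
    const list = grouped.get(score.model) ?? [];
    list.push(score);
    grouped.set(score.model, list);
  });

  const summary = new Map<string, Record<Criterion, number>>();
  grouped.forEach((list, model) => {
    const averages = {} as Record<Criterion, number>;
    CRITERIA.forEach((criterion) => {
      const total = list.reduce((sum, s) => sum + s.ratings[criterion], 0);
      averages[criterion] = total / list.length;
    });
    summary.set(model, averages);
  });
  return summary;
}
```

Keeping human and judge ratings in the same shape means a reviewer could also filter by `source` to check how closely the LLM judges track human scores.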
Hosting approach
The full project is intentionally powerful, so public hosting should show the product clearly while protecting model keys, upload storage, Langfuse data, and API spend.
One demo model by default, not the full research matrix
Server-side API keys only; no browser-exposed secrets
Per-IP request limits before model calls
Durable storage for saved runs and uploaded screenshots
Langfuse tracing for quality review and cost visibility
Private research mode kept separate from the public demo
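As an illustration of the per-IP limit listed above, the following is a minimal fixed-window limiter that a server route could consult before making any model call. The class name, the limit, and the window length are hypothetical, and the in-memory map is an assumption chosen for brevity: a deployment with multiple server instances would likely need a shared store instead.

```typescript
interface WindowState {
  windowStart: number; // timestamp (ms) when the current window opened
  count: number;       // requests seen in the current window
}

// Fixed-window limiter: at most `limit` requests per IP per `windowMs`.
class FixedWindowLimiter {
  private readonly hits = new Map<string, WindowState>();
  private readonly limit: number;
  private readonly windowMs: number;

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Returns true if the request may proceed. `now` is injectable for testing.
  allow(ip: string, now: number = Date.now()): boolean {
    const state = this.hits.get(ip);
    if (!state || now - state.windowStart >= this.windowMs) {
      // First request from this IP, or the previous window has expired.
      this.hits.set(ip, { windowStart: now, count: 1 });
      return true;
    }
    if (state.count >= this.limit) {
      return false; // over the limit: reject before spending on a model call
    }
    state.count += 1;
    return true;
  }
}

// Example: allow 5 demo runs per IP per minute (illustrative numbers).
const demoLimiter = new FixedWindowLimiter(5, 60_000);
```

Running the check before the model adapter is invoked is what protects API spend, since a rejected request never reaches a paid endpoint.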
This is a portfolio project that connects AI experimentation with the work quality leaders actually do: evaluate risk, compare evidence, and improve process.
The tool makes model output comparable across the same product requirements and scoring rubric.
Revisions can be tracked over time so improvements are based on scoring trends, not gut feel.
Rate limits, durable storage, and server-only keys turn a local lab into something safe to share.
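The idea of judging prompt revisions by scoring trends can be sketched as a small aggregation over saved runs. The `ScoredRun` shape, the revision numbering, and the single overall score are assumptions for this example; the project may store richer per-criterion data.

```typescript
interface ScoredRun {
  revision: number; // hypothetical prompt revision number
  score: number;    // assumed overall rubric score for one run
}

interface RevisionTrend {
  revision: number;
  mean: number; // mean score across runs of this revision
  runs: number; // how many runs contributed
}

// Group runs by prompt revision and compute the mean score for each,
// ordered by revision so the result reads as a trend line.
function trendByRevision(runs: ScoredRun[]): RevisionTrend[] {
  const buckets = new Map<number, number[]>();
  runs.forEach((run) => {
    const scores = buckets.get(run.revision) ?? [];
    scores.push(run.score);
    buckets.set(run.revision, scores);
  });

  const trend: RevisionTrend[] = [];
  buckets.forEach((scores, revision) => {
    const total = scores.reduce((sum, s) => sum + s, 0);
    trend.push({ revision, mean: total / scores.length, runs: scores.length });
  });
  return trend.sort((a, b) => a.revision - b.revision);
}

// Change in mean score between each revision and the one before it.
function changeBetweenRevisions(trend: RevisionTrend[]): number[] {
  return trend.slice(1).map((point, i) => point.mean - trend[i].mean);
}
```

Reporting the run count alongside each mean helps a reviewer see when an apparent improvement rests on only one or two runs.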
This page explains the project and gives recruiters, clients, and conference contacts a clean place to understand the work.
A bounded public version can run one model, cap requests, persist demo runs, and keep the full evaluation lab private.
The multi-model research workflow should remain protected because a single run can fan out into many paid model calls.
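A rough way to see why the research workflow fans out is to count paid calls per run. The sketch below assumes one generation call per model and that every judge scores every model's output; that call structure, the `RunPlan` shape, and the example counts are assumptions, not a description of the project's actual pipeline.

```typescript
interface RunPlan {
  generatorModels: number; // models asked to generate test cases
  judgesPerOutput: number; // LLM judges scoring each model's output
}

// Paid calls for one run: one generation call per model,
// plus one judge call per (model output, judge) pair.
function paidCallsPerRun(plan: RunPlan): number {
  const generationCalls = plan.generatorModels;
  const judgeCalls = plan.generatorModels * plan.judgesPerOutput;
  return generationCalls + judgeCalls;
}

// Illustrative research run: 3 models, 4 judges each.
const researchCalls = paidCallsPerRun({ generatorModels: 3, judgesPerOutput: 4 });
// Illustrative bounded demo: 1 model, 1 judge.
const demoCalls = paidCallsPerRun({ generatorModels: 1, judgesPerOutput: 1 });

console.log(researchCalls, demoCalls); // prints "15 2"
```

Under these assumptions the research run costs several times more calls than the bounded demo, which is the gap the public version is meant to close.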
Project repository
The current codebase includes the Next.js app, model adapters, Langfuse tracing, image upload flow, and automated tests for the API and evaluation logic.