Quality engineering AI project

AI Test Case Generator + Eval

A working quality-engineering lab for generating test cases, comparing model behavior, scoring output, and tracking prompt improvements over time.


[Preview image: "AI test case evaluation" screen with Prompt, Inputs, and Judges labels. Three model cards each show generated coverage and judge feedback with a score: Llama 4.1, Gemini 3.8, Mistral 3.4.]

What it does

The project demonstrates a practical way to evaluate AI-assisted testing instead of treating every generated test case as equally useful.

Generate test cases from requirements

Turn product notes, user stories, and acceptance criteria into structured test ideas across functional, edge-case, accessibility, and risk areas.

Compare model output side by side

Run multiple LLMs against the same prompt and inspect coverage, specificity, tradeoffs, and missed risks in one review surface.
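One way to make "missed risks" visible in a side-by-side view is to count each model's test cases per category and list the categories it skipped. The sketch below is illustrative only: the category names come from the description above, while `GeneratedTestCase`, `CoverageRow`, and `coverageByModel` are hypothetical names, not the project's actual types.

```typescript
// Categories taken from the page text; everything else here is an assumption.
const CATEGORIES = ["functional", "edge-case", "accessibility", "risk"] as const;
type Category = (typeof CATEGORIES)[number];

interface GeneratedTestCase {
  title: string;
  category: Category;
}

interface CoverageRow {
  model: string;
  counts: Record<Category, number>;
  missing: Category[]; // categories with zero generated cases
}

// Builds one coverage row per model from outputs produced by the same prompt.
function coverageByModel(
  outputs: Record<string, GeneratedTestCase[]>
): CoverageRow[] {
  return Object.entries(outputs).map(([model, cases]) => {
    const counts: Record<Category, number> = {
      functional: 0,
      "edge-case": 0,
      accessibility: 0,
      risk: 0,
    };
    for (const tc of cases) {
      counts[tc.category] += 1;
    }
    const missing = CATEGORIES.filter((c) => counts[c] === 0);
    return { model, counts, missing };
  });
}
```

A reviewer could sort these rows by `missing.length` to see at a glance which model left the most categories untouched.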

Score quality with humans and judges

Capture human scores and LLM-as-judge feedback so prompt revisions can be compared with a repeatable rubric.
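A repeatable rubric could be reduced to a single number along these lines. This is a minimal sketch under stated assumptions: the criteria names, the 1 to 5 scale, and the equal weighting of human and judge scores are all hypothetical, and the project's real rubric may differ.

```typescript
// Hypothetical rubric criteria; not taken from the project.
type Criterion = "coverage" | "specificity" | "riskAwareness";

interface Score {
  source: "human" | "judge"; // who produced the score
  criterion: Criterion;
  value: number; // assumed 1-5 scale
}

// Averages scores within each criterion first, then averages across criteria,
// so a criterion with many raters does not dominate the result.
// Returns NaN when there are no scores. Rounded to one decimal place.
function rubricScore(scores: Score[]): number {
  const byCriterion = new Map<Criterion, number[]>();
  for (const s of scores) {
    if (s.value < 1 || s.value > 5) {
      throw new RangeError(`score out of range: ${s.value}`);
    }
    const list = byCriterion.get(s.criterion) ?? [];
    list.push(s.value);
    byCriterion.set(s.criterion, list);
  }
  if (byCriterion.size === 0) return NaN;

  let total = 0;
  byCriterion.forEach((values) => {
    total += values.reduce((a, b) => a + b, 0) / values.length;
  });
  return Math.round((total / byCriterion.size) * 10) / 10;
}
```

Keeping `source` on every score leaves room to report human and judge averages separately if the two tend to disagree.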

Use screenshots as context

Attach UI images when requirements alone do not describe layout, flows, visual states, or interaction details clearly enough.

Hosting approach

A controlled demo is the right production shape.

The full project is intentionally powerful: a single run can fan out into many paid model calls. Public hosting should show the product clearly while protecting model keys, upload storage, Langfuse data, and API spend.

One demo model by default, not the full research matrix

Server-side API keys only; no browser-exposed secrets

Per-IP request limits before model calls

Durable storage for saved runs and uploaded screenshots

Langfuse tracing for quality review and cost visibility

Private research mode kept separate from the public demo
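The per-IP limit in the list above could be enforced with a check that runs before any model call is made. The following is a rough sketch of a fixed-window limiter, not the project's implementation: `IpRateLimiter` is a hypothetical name, and the in-memory map is an assumption that only holds for a single long-lived server process. A serverless or multi-instance deployment would need a shared store instead.

```typescript
// Fixed-window request limiter keyed by client IP.
// In-memory only: counts reset on restart and are not shared across instances.
class IpRateLimiter {
  private readonly limit: number;
  private readonly windowMs: number;
  private readonly windows = new Map<string, { start: number; count: number }>();

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Returns true if this request may proceed to a paid model call.
  // `now` is injectable so the behavior can be tested without real time.
  allow(ip: string, now: number = Date.now()): boolean {
    const current = this.windows.get(ip);
    if (!current || now - current.start >= this.windowMs) {
      // First request from this IP, or the previous window has expired.
      this.windows.set(ip, { start: now, count: 1 });
      return true;
    }
    if (current.count >= this.limit) {
      return false;
    }
    current.count += 1;
    return true;
  }
}
```

In a request handler, the idea would be to call `allow(ip)` first and return an HTTP 429 response when it is false, so a rejected request never reaches a model provider.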

Why it matters

This is a portfolio project that connects AI experimentation with the work quality leaders actually do: evaluate risk, compare evidence, and improve process.

Model behavior is measurable

The tool makes model output comparable across the same product requirements and scoring rubric.

Prompt changes have evidence

Revisions can be tracked over time so improvements are based on scoring trends, not gut feel.
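A scoring trend can be as simple as the mean rubric score per prompt revision plus the change from the previous revision. The sketch below assumes saved runs carry a prompt version label and a numeric score. `RunRecord`, `VersionTrend`, and `scoreTrend` are hypothetical names, and revisions are ordered by first appearance in the input rather than by any real timestamp.

```typescript
interface RunRecord {
  promptVersion: string; // hypothetical label such as "v1"
  score: number; // rubric score for one saved run
}

interface VersionTrend {
  version: string;
  runs: number;
  mean: number;
  deltaFromPrevious: number | null; // null for the first revision
}

// Groups runs by prompt version (in first-seen order) and reports
// each version's mean score and its change from the version before it.
function scoreTrend(records: RunRecord[]): VersionTrend[] {
  const totals = new Map<string, { sum: number; runs: number }>();
  for (const r of records) {
    const t = totals.get(r.promptVersion) ?? { sum: 0, runs: 0 };
    t.sum += r.score;
    t.runs += 1;
    totals.set(r.promptVersion, t);
  }

  const trend: VersionTrend[] = [];
  let previousMean: number | null = null;
  // Map.forEach visits entries in insertion order.
  totals.forEach((t, version) => {
    const mean = t.sum / t.runs;
    trend.push({
      version,
      runs: t.runs,
      mean,
      deltaFromPrevious: previousMean === null ? null : mean - previousMean,
    });
    previousMean = mean;
  });
  return trend;
}
```

With only a handful of runs per revision, a small positive delta may be noise, so the run count is kept next to each mean.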

Production demos need guardrails

Rate limits, durable storage, and server-only keys turn a local lab into something safe to share.

Build path

Live now

Current portfolio page

This page explains the project and gives recruiters, clients, and conference contacts a clean place to understand the work.

Controlled hosted demo

A bounded public version can run one model, cap requests, persist demo runs, and keep the full evaluation lab private.

Private

Full evaluation lab

The multi-model research workflow should remain protected because a single run can fan out into many paid model calls.

Project repository

Review the source and test coverage.

The current codebase includes the Next.js app, model adapters, Langfuse tracing, image upload flow, and automated tests for the API and evaluation logic.
