How We Score Prompts
A transparent look at our evaluation methodology and what the numbers mean.
Understanding the Scores
Overall Score
The prompt's average score across all three models. This tells you how well the prompt works in general, regardless of which AI you use.
Winner Score
The score of the best-performing model for this prompt. This shows the prompt's peak potential with the right model.
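As a minimal illustration with made-up numbers, the two headline scores reduce to a mean and a max over per-model scores:

```python
# Hypothetical per-model scores for one prompt (0-5 scale; numbers are made up).
model_scores = {
    "GPT-5 Mini": 4.2,
    "Claude 3.5 Haiku": 3.8,
    "Gemini 2.5 Flash Lite": 4.6,
}

overall_score = sum(model_scores.values()) / len(model_scores)          # 4.2
winner, winner_score = max(model_scores.items(), key=lambda kv: kv[1])  # Gemini 2.5 Flash Lite, 4.6

print(f"Overall: {overall_score:.1f}, Winner: {winner} ({winner_score})")
```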
Models We Test
GPT-5 Mini (OpenAI)
Claude 3.5 Haiku (Anthropic)
Gemini 2.5 Flash Lite (Google)
Every prompt is tested against all three models using identical settings (temperature, max tokens, etc.) to ensure fair comparison.
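A sketch of what "identical settings" means in practice; the parameter names and values below are illustrative, not our exact configuration:

```python
# Illustrative shared settings; values are examples, not our production config.
GENERATION_SETTINGS = {
    "temperature": 0.0,  # as deterministic as the API allows
    "max_tokens": 1024,  # same output budget for every model
}

def generate(model_client, prompt: str) -> str:
    # `model_client` is a hypothetical wrapper around one provider's API.
    # Every model gets the same prompt and the same settings, so differences
    # in output reflect the model itself, not the sampling knobs.
    return model_client.complete(prompt, **GENERATION_SETTINGS)
```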
Scoring Criteria (0-5 Scale)
Adherence
Did the model follow the prompt's instructions exactly? Includes format requirements, constraints, and specific requests.
Clarity
Is the output easy to read and understand? Considers structure, organization, and how well it communicates the answer.
Completeness
Did the model cover everything asked for? Checks if all parts of the prompt were addressed without missing key information.
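A single judge's verdict can be pictured as a small record holding the three criterion scores; this structure is a sketch, not our internal schema:

```python
from dataclasses import dataclass

@dataclass
class JudgeScores:
    """One judge's verdict on one output (structure is illustrative)."""
    adherence: float     # 0-5: followed the instructions exactly
    clarity: float       # 0-5: readable, well-organized output
    completeness: float  # 0-5: addressed every part of the prompt

    def __post_init__(self) -> None:
        for name in ("adherence", "clarity", "completeness"):
            value = getattr(self, name)
            if not 0 <= value <= 5:
                raise ValueError(f"{name} must be on the 0-5 scale, got {value}")
```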
Multi-Judge Consensus
To avoid bias from any single AI's preferences, three independent judge models score each output.
Each judge scores without seeing the other judges' scores, and the final score is the average across all judges.
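In code, the consensus step is just an average of independently produced numbers; this sketch assumes each judge has already been reduced to one 0-5 score per output:

```python
from statistics import mean

def consensus_score(judge_scores: list[float]) -> float:
    # Each element is one judge's independent 0-5 score for the same output;
    # no judge saw another judge's number before scoring.
    return mean(judge_scores)

print(consensus_score([4.0, 4.5, 3.5]))  # 4.0
```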
Confidence Levels
When judges disagree significantly, we flag the result as lower confidence.
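One simple way this could be operationalized; the spread metric and threshold below are assumptions for illustration, not our exact rule:

```python
from statistics import pstdev

# Illustrative cutoff: flag the result when the judges' scores spread widely.
DISAGREEMENT_THRESHOLD = 1.0

def confidence_flag(judge_scores: list[float]) -> str:
    spread = pstdev(judge_scores)  # standard deviation of the judges' scores
    return "lower confidence" if spread > DISAGREEMENT_THRESHOLD else "normal confidence"

print(confidence_flag([4.0, 4.2, 3.9]))  # normal confidence
print(confidence_flag([1.0, 4.5, 3.0]))  # lower confidence
```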
The Evaluation Pipeline
Prompt + Test Cases
Each prompt is paired with 5 test cases specific to its task type (extraction, classification, or summarization).
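A test suite can be pictured as a prompt plus its inputs, tagged with a task type; the field names below are illustrative:

```python
# Illustrative shape of one prompt's test suite; field names are assumptions.
test_suite = {
    "task_type": "summarization",  # or "extraction" / "classification"
    "prompt": "Summarize the following article in three bullet points:\n{article}",
    "test_cases": [
        {"article": "First sample article text."},
        {"article": "Second sample article text."},
        # ...three more entries, for 5 test cases in total
    ],
}
```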
Generate Outputs
All 3 generator models (GPT, Claude, Gemini) run each of the 5 test cases, producing 15 outputs per prompt (3 models × 5 cases).
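The fan-out is a cross product of models and test cases; a minimal sketch:

```python
from itertools import product

models = ["GPT-5 Mini", "Claude 3.5 Haiku", "Gemini 2.5 Flash Lite"]
test_cases = [f"case {i}" for i in range(1, 6)]  # 5 test cases per prompt

# One output per (model, test case) pair: 3 x 5 = 15 outputs per prompt.
jobs = list(product(models, test_cases))
assert len(jobs) == 15
```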
Judge Each Output
3 independent judges score each output on Adherence, Clarity, and Completeness. That's 45 judging passes, yielding 135 individual criterion scores per prompt.
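The counts follow directly from the fan-out, as this small arithmetic check shows:

```python
outputs_per_prompt = 3 * 5             # 3 generator models x 5 test cases = 15
judge_passes = outputs_per_prompt * 3  # 3 judges per output = 45 passes
criterion_scores = judge_passes * 3    # 3 criteria per pass = 135 scores
print(outputs_per_prompt, judge_passes, criterion_scores)  # 15 45 135
```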
Aggregate Scores
Judge scores are averaged per output, then per model, giving each model a final score. The best model is declared the winner.
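Putting the aggregation together, a sketch that averages judge scores per output, then per model, and picks the winner; the data shapes here are assumptions:

```python
from collections import defaultdict
from statistics import mean

# Illustrative records: (model, test case, judge, score), where each score is
# already that judge's average over the three criteria.
records = [
    ("GPT-5 Mini", 1, "judge_a", 4.0), ("GPT-5 Mini", 1, "judge_b", 4.5),
    ("Claude 3.5 Haiku", 1, "judge_a", 3.5), ("Claude 3.5 Haiku", 1, "judge_b", 4.0),
]

# Step 1: average the judges for each output.
per_output = defaultdict(list)
for model, case, _judge, score in records:
    per_output[(model, case)].append(score)
output_scores = {key: mean(scores) for key, scores in per_output.items()}

# Step 2: average the outputs for each model.
per_model = defaultdict(list)
for (model, _case), score in output_scores.items():
    per_model[model].append(score)
model_scores = {model: mean(scores) for model, scores in per_model.items()}

# Step 3: the best model is the winner; the overall score averages all models.
winner = max(model_scores, key=model_scores.get)
overall = mean(model_scores.values())
print(winner, model_scores[winner], overall)  # GPT-5 Mini 4.25 4.0
```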
