How We Score Prompts

A transparent look at our evaluation methodology and what the numbers mean.

Understanding the Scores

Overall Score

The average performance across all models. This tells you how well the prompt works in general, regardless of which AI you use.

(GPT score + Claude score + Gemini score) / 3

Winner Score

The score of the best-performing model for this prompt. This shows the prompt's peak potential with the right model.

max(GPT avg, Claude avg, Gemini avg)
Example: If GPT scores 4.5, Claude scores 4.2, and Gemini scores 4.0, the Overall Score is 4.23 and the Winner Score is 4.5 (GPT).
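
The two formulas above can be sketched in Python. The model names and values come from the example; rounding the overall score to two decimals is an assumption based on how the example presents 4.23.

```python
def overall_score(scores):
    """Average performance across all models."""
    return sum(scores.values()) / len(scores)

def winner_score(scores):
    """Best-performing model and its score."""
    model = max(scores, key=scores.get)
    return model, scores[model]

# Values from the example above.
scores = {"GPT": 4.5, "Claude": 4.2, "Gemini": 4.0}
print(round(overall_score(scores), 2))  # 4.23
print(winner_score(scores))             # ('GPT', 4.5)
```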

Models We Test

GPT-5 Mini - OpenAI
Claude 3.5 Haiku - Anthropic
Gemini 2.5 Flash Lite - Google

Every prompt is tested against all three models using identical settings (temperature, max tokens, etc.) to ensure a fair comparison.

Scoring Criteria (0-5 Scale)

Adherence

Did the model follow the prompt's instructions exactly? Includes format requirements, constraints, and specific requests.

Clarity

Is the output easy to read and understand? Considers structure, organization, and how well it communicates the answer.

Completeness

Did the model cover everything asked for? Checks if all parts of the prompt were addressed without missing key information.

The final score for each model is the average of these three criteria across all test cases.
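
A model's final score, as described above, might be computed like this sketch. The lowercase criterion keys and the sample values are illustrative, not taken from real results.

```python
def model_score(evaluations):
    """Average of the three criteria per test case, then across test cases."""
    criteria = ("adherence", "clarity", "completeness")
    per_case = [sum(e[c] for c in criteria) / len(criteria) for e in evaluations]
    return sum(per_case) / len(per_case)

# Two hypothetical test-case evaluations for one model:
evals = [
    {"adherence": 5.0, "clarity": 4.0, "completeness": 4.5},
    {"adherence": 4.0, "clarity": 5.0, "completeness": 4.5},
]
print(model_score(evals))  # 4.5
```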

Multi-Judge Consensus

To avoid bias from any single AI's preferences, we use multiple independent judges to score each output:

GPT-5 Mini - OpenAI's judge
Claude 3.5 Haiku - Anthropic's judge
Gemini 2.5 Flash Lite - Google's judge

Each judge scores independently without seeing other judges' scores. The final score is the average across all judges.

Confidence Levels

When judges disagree significantly, we flag results as lower confidence:

High Confidence: Judges agree within 0.3 points
Medium Confidence: Judges differ by 0.3-0.5 points
Low Confidence: Judges differ by more than 0.5 points
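
The thresholds above can be turned into a small classifier. Treating "differ" as the spread between the highest and lowest judge score is an assumption; the text does not specify how disagreement is measured.

```python
def confidence(judge_scores):
    """Classify agreement by the spread between highest and lowest judge score."""
    spread = max(judge_scores) - min(judge_scores)
    if spread <= 0.3:
        return "high"
    if spread <= 0.5:
        return "medium"
    return "low"

print(confidence([4.5, 4.4, 4.3]))  # high   (spread 0.2)
print(confidence([4.5, 4.2, 4.1]))  # medium (spread 0.4)
print(confidence([4.8, 4.1, 4.0]))  # low    (spread 0.8)
```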

The Evaluation Pipeline

1. Prompt + Test Cases

Each prompt is paired with 5 test cases specific to its task type (extraction, classification, or summarization).

2. Generate Outputs

All 3 generator models (GPT, Claude, Gemini) run each test case, producing 15 outputs per prompt.

3. Judge Each Output

3 independent judges score each output on Adherence, Clarity, and Completeness. With 15 outputs, that's 45 judge evaluations (135 individual criterion scores) per prompt.

4. Aggregate Scores

Judge scores are averaged per output, then per model, giving each model a final score. The best model is declared the winner.
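
Putting the aggregation step together, the logic might look like this sketch. The data structure and names are assumptions, and the scores are illustrative, not real results.

```python
from statistics import mean

def aggregate(judge_scores):
    """judge_scores[model] is a list of outputs, one per test case;
    each output holds the 3 judges' scores for that output."""
    model_scores = {}
    for model, outputs in judge_scores.items():
        # Average the judges for each output...
        per_output = [mean(judges) for judges in outputs]
        # ...then average across the model's outputs for its final score.
        model_scores[model] = mean(per_output)
    winner = max(model_scores, key=model_scores.get)
    return model_scores, winner

# Illustrative data: 2 test cases per model, 3 judge scores per output.
data = {
    "GPT":    [[4.5, 4.5, 4.5], [4.5, 4.5, 4.5]],
    "Claude": [[4.0, 4.2, 4.4], [4.2, 4.2, 4.2]],
}
scores, winner = aggregate(data)
print(winner)  # GPT
```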