Our Methodology

How we evaluate and rank prompts across different language models.

Fair Evaluation

We evaluate prompts on three core dimensions: Adherence (following instructions), Clarity (readability and structure), and Completeness (covering all requirements).
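For illustration, a single judge's scores along these dimensions might be represented like the sketch below; the field names and the equal-weight overall score are assumptions, not a documented schema.

```python
from dataclasses import dataclass


@dataclass
class RubricScore:
    """One judge's scores for a single model output, on a 0-5 scale."""
    adherence: float      # how closely the output follows the prompt's instructions
    clarity: float        # readability and structure
    completeness: float   # coverage of all stated requirements

    def overall(self) -> float:
        # Assumed equal weighting across the three dimensions.
        return (self.adherence + self.clarity + self.completeness) / 3
```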

Multiple Judges

Multiple independent LLM judges (GPT-4o Mini, Claude 3 Haiku) score each model output, which reduces single-model bias.
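A minimal sketch of that fan-out, assuming a hypothetical `call_judge` helper that prompts a judge model and parses its score (the real client and parsing logic are not shown here):

```python
JUDGES = ["gpt-4o-mini", "claude-3-haiku"]


def call_judge(judge: str, prompt: str, output: str) -> float:
    """Hypothetical placeholder: send the output and a scoring rubric to the
    judge model and parse its 0-5 score. Stubbed for illustration."""
    raise NotImplementedError


def collect_scores(prompt: str, output: str) -> dict[str, float]:
    # Each judge scores the same output without seeing the other judges'
    # scores, which is what limits single-model bias.
    return {judge: call_judge(judge, prompt, output) for judge in JUDGES}
```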

Consensus Scoring

Final scores use the median of judge scores, weighted by judge agreement. Rankings are determined by consensus score, then by first-place votes, then by score variance.
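A sketch of the consensus and tiebreak logic, assuming the criteria apply in the order listed; the agreement weighting itself is not reproduced, and the example entries are hypothetical.

```python
from statistics import median, pvariance


def consensus_score(judge_scores: list[float]) -> float:
    # The median is robust to a single outlier judge.
    return median(judge_scores)


def rank_key(entry: dict) -> tuple:
    # Higher consensus score and more first-place votes rank higher;
    # lower variance across judges breaks any remaining ties.
    return (-entry["score"], -entry["first_place_votes"], pvariance(entry["judge_scores"]))


entries = [
    {"name": "prompt-a", "score": 4.4, "first_place_votes": 2, "judge_scores": [4.3, 4.5]},
    {"name": "prompt-b", "score": 4.4, "first_place_votes": 0, "judge_scores": [4.0, 4.8]},
]
ranking = sorted(entries, key=rank_key)  # prompt-a ranks first
```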

Confidence Levels

High confidence when judges agree within 0.3 points, medium when they agree within 0.5 points, and low confidence is flagged when the variance exceeds 0.5 points.
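Interpreting "agree within N points" as the gap between the highest and lowest judge score (an assumption; the exact statistic is not defined here), the mapping is a simple threshold check:

```python
def confidence_level(judge_scores: list[float]) -> str:
    # Spread is taken as the gap between the highest and lowest judge score.
    spread = max(judge_scores) - min(judge_scores)
    if spread <= 0.3:
        return "high"
    if spread <= 0.5:
        return "medium"
    return "low"
```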

Evaluation Process

1. Prompt Ingestion

Prompts are collected from public sources and deduplicated using content hashing.
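A minimal sketch of content-hash deduplication; the normalization rules (lowercasing, whitespace collapsing) and the choice of SHA-256 are assumptions:

```python
import hashlib


def content_hash(prompt_text: str) -> str:
    # Normalize whitespace and case so trivially different copies collide.
    normalized = " ".join(prompt_text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def deduplicate(prompts: list[str]) -> list[str]:
    seen: set[str] = set()
    unique: list[str] = []
    for prompt in prompts:
        digest = content_hash(prompt)
        if digest not in seen:
            seen.add(digest)
            unique.append(prompt)
    return unique
```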

2. Model Execution

Each prompt is run against multiple generator models (GPT-4o Mini, Claude 3.5 Sonnet, Gemini 1.5 Pro) with identical settings.
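The key point is that every generator model sees the same prompt with the same decoding settings. The settings below are placeholders (the actual values are not published), and `generate` stands in for whatever client calls each model:

```python
# Hypothetical settings; the real values are not documented.
GENERATION_SETTINGS = {"temperature": 0.0, "max_tokens": 1024, "top_p": 1.0}

GENERATOR_MODELS = ["gpt-4o-mini", "claude-3.5-sonnet", "gemini-1.5-pro"]


def run_prompt(prompt: str, generate) -> dict[str, str]:
    # `generate(model, prompt, **settings) -> text` is a hypothetical client call.
    return {
        model: generate(model, prompt, **GENERATION_SETTINGS)
        for model in GENERATOR_MODELS
    }
```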

3. Judge Evaluation

Independent judge models score each output on Adherence, Clarity, and Completeness (0-5 scale).
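One way to keep judge scores on the 0-5 scale is to have each judge return structured output and validate it on parse; the JSON shape below is an assumed convention, not a documented format:

```python
import json

DIMENSIONS = ("adherence", "clarity", "completeness")


def parse_judge_response(raw: str) -> dict[str, float]:
    # Expected shape (assumed): {"adherence": 4, "clarity": 5, "completeness": 3}
    data = json.loads(raw)
    # Clamp each dimension to the 0-5 scale to guard against out-of-range replies.
    return {dim: min(5.0, max(0.0, float(data[dim]))) for dim in DIMENSIONS}
```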

4. Score Aggregation

Scores are aggregated across judges to determine final rankings and confidence levels.
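For illustration only, with hypothetical numbers: if two judges score an output 4.3 and 4.5, the consensus score is the median, 4.4; the judges agree within 0.2 points, so the result would be reported with high confidence.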