How We Score Prompts
A transparent look at our evaluation methodology and what the numbers mean.
Understanding the Scores
Overall Score
The prompt's average score across all three models. This tells you how well the prompt works in general, regardless of which AI you use.
Winner Score
The score of the best-performing model for this prompt. This shows the prompt's peak potential with the right model.
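As a minimal illustration with made-up numbers, the two headline scores reduce to a mean and a max over per-model scores:

```python
# Hypothetical per-model scores for one prompt (0-5 scale; numbers are made up).
model_scores = {
    "GPT-5 Mini": 4.2,
    "Claude 3.5 Haiku": 3.8,
    "Gemini 2.5 Flash Lite": 4.6,
}

overall_score = sum(model_scores.values()) / len(model_scores)          # 4.2
winner, winner_score = max(model_scores.items(), key=lambda kv: kv[1])  # Gemini 2.5 Flash Lite, 4.6

print(f"Overall: {overall_score:.1f}, Winner: {winner} ({winner_score})")
```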
Models We Test
GPT-5 Mini (OpenAI)
Claude 3.5 Haiku (Anthropic)
Gemini 2.5 Flash Lite (Google)
Every prompt is tested against all three models using identical settings (temperature, max tokens, etc.) to ensure fair comparison.
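A sketch of what "identical settings" means in practice; the parameter names and values below are illustrative, not our exact configuration:

```python
# Illustrative shared settings; values are examples, not our production config.
GENERATION_SETTINGS = {
    "temperature": 0.0,  # as deterministic as the API allows
    "max_tokens": 1024,  # same output budget for every model
}

def generate(model_client, prompt: str) -> str:
    # `model_client` is a hypothetical wrapper around one provider's API.
    # Every model gets the same prompt and the same settings, so differences
    # in output reflect the model itself, not the sampling knobs.
    return model_client.complete(prompt, **GENERATION_SETTINGS)
```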
Scoring Criteria (0-5 Scale)
Adherence
Did the model follow the prompt's instructions exactly? Includes format requirements, constraints, and specific requests.
Clarity
Is the output easy to read and understand? Considers structure, organization, and how well it communicates the answer.
Completeness
Did the model cover everything asked for? Checks if all parts of the prompt were addressed without missing key information.
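A single judge's verdict can be pictured as a small record holding the three criterion scores; this structure is a sketch, not our internal schema:

```python
from dataclasses import dataclass

@dataclass
class JudgeScores:
    """One judge's verdict on one output (structure is illustrative)."""
    adherence: float     # 0-5: followed the instructions exactly
    clarity: float       # 0-5: readable, well-organized output
    completeness: float  # 0-5: addressed every part of the prompt

    def __post_init__(self) -> None:
        for name in ("adherence", "clarity", "completeness"):
            value = getattr(self, name)
            if not 0 <= value <= 5:
                raise ValueError(f"{name} must be on the 0-5 scale, got {value}")
```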
Multi-Judge Consensus
To avoid bias from any single AI's preferences, three independent judge models score each output.
Each judge scores without seeing the other judges' scores, and the final score is the average across all judges.
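In code, the consensus step is just an average of independently produced numbers; this sketch assumes each judge has already been reduced to one 0-5 score per output:

```python
from statistics import mean

def consensus_score(judge_scores: list[float]) -> float:
    # Each element is one judge's independent 0-5 score for the same output;
    # no judge saw another judge's number before scoring.
    return mean(judge_scores)

print(consensus_score([4.0, 4.5, 3.5]))  # 4.0
```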
Confidence Levels
When judges disagree significantly, we flag the result as lower confidence.
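One simple way this could be operationalized; the spread metric and threshold below are assumptions for illustration, not our exact rule:

```python
from statistics import pstdev

# Illustrative cutoff: flag the result when the judges' scores spread widely.
DISAGREEMENT_THRESHOLD = 1.0

def confidence_flag(judge_scores: list[float]) -> str:
    spread = pstdev(judge_scores)  # standard deviation of the judges' scores
    return "lower confidence" if spread > DISAGREEMENT_THRESHOLD else "normal confidence"

print(confidence_flag([4.0, 4.2, 3.9]))  # normal confidence
print(confidence_flag([1.0, 4.5, 3.0]))  # lower confidence
```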
The Evaluation Pipeline
Prompt + Test Cases
Each prompt is paired with 5 test cases specific to its task type (extraction, classification, or summarization).
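A test suite can be pictured as a prompt plus its inputs, tagged with a task type; the field names below are illustrative:

```python
# Illustrative shape of one prompt's test suite; field names are assumptions.
test_suite = {
    "task_type": "summarization",  # or "extraction" / "classification"
    "prompt": "Summarize the following article in three bullet points:\n{article}",
    "test_cases": [
        {"article": "First sample article text."},
        {"article": "Second sample article text."},
        # ...three more entries, for 5 test cases in total
    ],
}
```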
Generate Outputs
All 3 generator models (GPT, Claude, Gemini) run each of the 5 test cases, producing 15 outputs per prompt (3 models × 5 cases).
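The fan-out is a cross product of models and test cases; a minimal sketch:

```python
from itertools import product

models = ["GPT-5 Mini", "Claude 3.5 Haiku", "Gemini 2.5 Flash Lite"]
test_cases = [f"case {i}" for i in range(1, 6)]  # 5 test cases per prompt

# One output per (model, test case) pair: 3 x 5 = 15 outputs per prompt.
jobs = list(product(models, test_cases))
assert len(jobs) == 15
```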
Judge Each Output
3 independent judges score each output on Adherence, Clarity, and Completeness. That's 45 judging passes, yielding 135 individual criterion scores per prompt.
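The counts follow directly from the fan-out, as this small arithmetic check shows:

```python
outputs_per_prompt = 3 * 5             # 3 generator models x 5 test cases = 15
judge_passes = outputs_per_prompt * 3  # 3 judges per output = 45 passes
criterion_scores = judge_passes * 3    # 3 criteria per pass = 135 scores
print(outputs_per_prompt, judge_passes, criterion_scores)  # 15 45 135
```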
Aggregate Scores
Judge scores are averaged per output, then per model, giving each model a final score. The best model is declared the winner.
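Putting the aggregation together, a sketch that averages judge scores per output, then per model, and picks the winner; the data shapes here are assumptions:

```python
from collections import defaultdict
from statistics import mean

# Illustrative records: (model, test case, judge, score), where each score is
# already that judge's average over the three criteria.
records = [
    ("GPT-5 Mini", 1, "judge_a", 4.0), ("GPT-5 Mini", 1, "judge_b", 4.5),
    ("Claude 3.5 Haiku", 1, "judge_a", 3.5), ("Claude 3.5 Haiku", 1, "judge_b", 4.0),
]

# Step 1: average the judges for each output.
per_output = defaultdict(list)
for model, case, _judge, score in records:
    per_output[(model, case)].append(score)
output_scores = {key: mean(scores) for key, scores in per_output.items()}

# Step 2: average the outputs for each model.
per_model = defaultdict(list)
for (model, _case), score in output_scores.items():
    per_model[model].append(score)
model_scores = {model: mean(scores) for model, scores in per_model.items()}

# Step 3: the best model is the winner; the overall score averages all models.
winner = max(model_scores, key=model_scores.get)
overall = mean(model_scores.values())
print(winner, model_scores[winner], overall)  # GPT-5 Mini 4.25 4.0
```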
