Our Methodology
How we evaluate and rank prompts across different language models.
Fair Evaluation
We evaluate prompts on three core dimensions: Adherence (following instructions), Clarity (readability and structure), and Completeness (covering all requirements).
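A minimal sketch of how a single judge's scores on these dimensions might be represented. The dimension names come from the rubric above; the Python structure and the equal-weight overall score are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PromptScore:
    """One judge's scores for a single output, each on the 0-5 scale."""
    adherence: float     # how closely the output follows the prompt's instructions
    clarity: float       # readability and structure of the output
    completeness: float  # coverage of all stated requirements

    @property
    def overall(self) -> float:
        # Equal weighting of the three dimensions is an assumption of this sketch;
        # the methodology does not specify how dimensions are combined.
        return (self.adherence + self.clarity + self.completeness) / 3
```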
Multiple Judges
Multiple LLM judges (GPT-4o Mini, Claude 3 Haiku) score each output independently to reduce single-model bias.
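To illustrate the fan-out, here is a hedged sketch: call_judge is a hypothetical placeholder for the provider-specific API calls, and PromptScore is the structure sketched above.

```python
from concurrent.futures import ThreadPoolExecutor

JUDGE_MODELS = ["gpt-4o-mini", "claude-3-haiku"]  # the judge set named above

def call_judge(model: str, output: str) -> PromptScore:
    """Hypothetical placeholder for the provider-specific API call."""
    raise NotImplementedError

def score_with_all_judges(output: str) -> dict[str, PromptScore]:
    """Send the same output to every judge so each scores it independently."""
    with ThreadPoolExecutor(max_workers=len(JUDGE_MODELS)) as pool:
        futures = {m: pool.submit(call_judge, m, output) for m in JUDGE_MODELS}
        return {m: f.result() for m, f in futures.items()}
```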
Consensus Scoring
Final scores are the median of judge scores, weighted by judge agreement. Rankings are determined by score, then first-place votes, then variance.
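One plausible reading of that ranking rule, as a sketch. The agreement weighting is not spelled out here, so this example ranks on the plain median and uses first-place votes and variance as successive tiebreakers; all function and variable names are illustrative.

```python
from statistics import median, pvariance

def rank_candidates(scores_by_candidate: dict[str, list[float]]) -> list[str]:
    """Rank candidates by median judge score, then first-place votes, then lower variance.

    scores_by_candidate maps a candidate (e.g. a prompt) to its per-judge overall
    scores, listed in the same judge order for every candidate.
    """
    num_judges = len(next(iter(scores_by_candidate.values())))

    # Count how many judges placed each candidate first.
    first_place = {c: 0 for c in scores_by_candidate}
    for j in range(num_judges):
        winner = max(scores_by_candidate, key=lambda c: scores_by_candidate[c][j])
        first_place[winner] += 1

    def sort_key(candidate: str):
        judge_scores = scores_by_candidate[candidate]
        return (-median(judge_scores), -first_place[candidate], pvariance(judge_scores))

    return sorted(scores_by_candidate, key=sort_key)
```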
Confidence Levels
High confidence when judge scores agree within 0.3 points, medium when they agree within 0.5 points. Low confidence is flagged when the spread exceeds 0.5 points.
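Those thresholds translate directly into a small helper. Treating disagreement as the max-min spread of judge scores (rather than a statistical variance) is an assumption of this sketch.

```python
def confidence_level(judge_scores: list[float]) -> str:
    """Map judge disagreement to a confidence label using the thresholds above."""
    spread = max(judge_scores) - min(judge_scores)
    if spread <= 0.3:
        return "high"
    if spread <= 0.5:
        return "medium"
    return "low"
```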
Evaluation Process
Prompt Ingestion
Prompts are collected from public sources and deduplicated using content hashing.
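A sketch of content-hash deduplication. SHA-256 and the whitespace/lowercase normalization are assumptions; the methodology only states that prompts are deduplicated by content hashing.

```python
import hashlib

def content_hash(prompt_text: str) -> str:
    """Stable hash of the normalized prompt text, used as a dedup key."""
    normalized = " ".join(prompt_text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def deduplicate(prompts: list[str]) -> list[str]:
    """Keep only the first occurrence of each distinct prompt."""
    seen: set[str] = set()
    unique: list[str] = []
    for prompt in prompts:
        key = content_hash(prompt)
        if key not in seen:
            seen.add(key)
            unique.append(prompt)
    return unique
```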
Model Execution
Each prompt is run against multiple generator models (GPT-4o Mini, Claude 3.5 Sonnet, Gemini 1.5 Pro) with identical settings.
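The key point is that every generator receives the same prompt with the same settings. In this sketch, run_model is a hypothetical stand-in for the provider SDK calls, and the specific setting values are assumed.

```python
GENERATOR_MODELS = ["gpt-4o-mini", "claude-3-5-sonnet", "gemini-1.5-pro"]
GENERATION_SETTINGS = {"temperature": 0.0, "max_tokens": 1024}  # assumed values

def run_model(model: str, prompt: str, settings: dict) -> str:
    """Hypothetical placeholder for the provider-specific generation call."""
    raise NotImplementedError

def generate_outputs(prompt: str) -> dict[str, str]:
    """Run one prompt through every generator model with identical settings."""
    return {model: run_model(model, prompt, GENERATION_SETTINGS)
            for model in GENERATOR_MODELS}
```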
Judge Evaluation
Independent judge models score each output on Adherence, Clarity, and Completeness (0-5 scale).
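A sketch of turning a judge's reply into per-dimension scores. The JSON reply format is an assumption; the methodology only fixes the three dimensions and the 0-5 range, so values are clamped to that range.

```python
import json

DIMENSIONS = ("adherence", "clarity", "completeness")

def parse_judge_response(raw: str) -> dict[str, float]:
    """Parse a judge model's JSON reply into 0-5 scores for each dimension."""
    payload = json.loads(raw)
    return {dim: min(5.0, max(0.0, float(payload[dim]))) for dim in DIMENSIONS}
```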
Score Aggregation
Scores are aggregated across judges to determine final rankings and confidence levels.
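Tying the pieces together with illustrative numbers (not real evaluation data), using the rank_candidates and confidence_level sketches above:

```python
scores = {
    "prompt-a": [4.5, 4.3],  # per-judge overall scores
    "prompt-b": [3.9, 4.6],
}
ranking = rank_candidates(scores)        # ["prompt-a", "prompt-b"]
confidence = {c: confidence_level(s) for c, s in scores.items()}
# prompt-a's judges agree within 0.2 points -> "high";
# prompt-b's differ by 0.7 points -> "low"
```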
