rag-doc-relevance
Evaluate the relevance of retrieved documents in a Retrieval-Augmented Generation (RAG) system by analyzing how well they connect to a user question through keyword and semantic alignment. Each retrieval is scored from 0 to 1, with step-by-step reasoning to make the assessment transparent.
Prompt Text
You are an evaluator tasked with assessing the relevance of retrieved documents in a Retrieval-Augmented Generation (RAG) system.
You will be given a USER QUESTION and a set of RETRIEVED DOCUMENTS.
Your task is to evaluate how relevant the RETRIEVED DOCUMENTS are to answering the USER QUESTION based on the following criteria:
1. Identify whether the RETRIEVED DOCUMENTS contain ANY keywords or semantic meaning related to the USER QUESTION.
2. If a RETRIEVED DOCUMENT contains information that aligns with the general topic, keywords, or core meaning of the USER QUESTION, consider it relevant.
3. If a RETRIEVED DOCUMENT contains some unrelated information but still includes content that meets criterion (2), consider it relevant.
4. Only consider a document irrelevant if it is completely unrelated to the USER QUESTION and contains no meaningful connection to it.
Scoring (range: 0 to 1):
- A score of 1 means the RETRIEVED DOCUMENTS are fully relevant to the USER QUESTION, sharing its keywords, topic, or core meaning. This is the highest (best) score.
- A score of 0 means the RETRIEVED DOCUMENTS are completely unrelated to the USER QUESTION and contain no meaningful connection to it. This is the lowest possible score.
- You may assign intermediate scores (e.g., 0.5) if the RETRIEVED DOCUMENTS are only partially relevant or mix relevant content with unrelated material.
Please provide your reasoning and step-by-step explanation to ensure your conclusion is clear. Avoid simply restating the USER QUESTION or the RETRIEVED DOCUMENTS without analysis.
RETRIEVED DOCUMENTS: {documents}
USER QUESTION: {question}
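Below is a minimal sketch of how this template might be wired up programmatically, assuming the `openai` Python client; the `PROMPT_TEMPLATE` constant, the `evaluate_relevance` helper, and the judge model name are illustrative assumptions, not part of the original prompt.

```python
# Minimal sketch: fill the template, call a judge model, extract the 0-1 score.
# Assumes the `openai` Python client; names and model choice are illustrative.
import re
from openai import OpenAI

# The full evaluator prompt above, with the two placeholders kept verbatim.
PROMPT_TEMPLATE = (
    "You are an evaluator tasked with assessing the relevance of retrieved "
    "documents in a Retrieval-Augmented Generation (RAG) system. [...] \n"
    "RETRIEVED DOCUMENTS: {documents}\n"
    "USER QUESTION: {question}"
)

client = OpenAI()

def evaluate_relevance(documents: list[str], question: str) -> tuple[float, str]:
    """Fill the template, ask the judge model, and pull a 0-1 score from its reply."""
    prompt = PROMPT_TEMPLATE.format(
        documents="\n---\n".join(documents),
        question=question,
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # any capable judge model works here
        messages=[{"role": "user", "content": prompt}],
    )
    reasoning = response.choices[0].message.content
    # Naive extraction: take the last number between 0 and 1 in the reply.
    matches = re.findall(r"\b(?:0(?:\.\d+)?|1(?:\.0+)?)\b", reasoning)
    score = float(matches[-1]) if matches else 0.0
    return score, reasoning
```

In practice it is more robust to ask the evaluator to end its reply with a fixed marker (e.g. a final `SCORE: 0.5` line or a JSON object) and parse that, rather than regex-scraping free-form reasoning.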
Evaluation Results
1/28/2026

Overall Score: 2.49/5 (average across all 3 models)
Best Performing Model: google:gemini-2.5-flash-lite, 4.23/5 (Low Confidence)
Rank  Model                         Score      adh  cla  com  In tokens  Out tokens  Cost
#1    google:gemini-2.5-flash-lite  4.23/5.00  3.8  4.7  4.2  2,605      3,829       $0.0018
#2    anthropic:claude-3-5-haiku    2.13/5.00  1.1  4.5  0.8  2,025      1,020       $0.0057
#3    openai:gpt-5-mini             1.10/5.00  0.4  2.6  0.3  1,955      3,631       $0.0078
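As a quick sanity check on the summary above, the overall score is the unweighted mean of the three per-model scores (variable names here are ours, not from the dashboard):

```python
# Overall score = unweighted mean of the three per-model scores.
scores = [4.23, 2.13, 1.10]
print(round(sum(scores) / len(scores), 2))  # (4.23 + 2.13 + 1.10) / 3 -> 2.49
```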