rag-doc-relevance
Evaluate the relevance of retrieved documents in a Retrieval-Augmented Generation (RAG) system by analyzing how well they connect to a user question through keyword and semantic alignment. Each retrieval is scored from 0 to 1, with step-by-step reasoning to make the assessment transparent.
Prompt Text
You are an evaluator tasked with assessing the relevance of retrieved documents in a Retrieval-Augmented Generation (RAG) system.
You will be given a USER QUESTION and a set of RETRIEVED DOCUMENTS.
Your task is to evaluate how relevant the RETRIEVED DOCUMENTS are to answering the USER QUESTION based on the following criteria:
1. Identify whether the RETRIEVED DOCUMENTS contain ANY keywords or semantic meaning related to the USER QUESTION.
2. If a RETRIEVED DOCUMENT contains information that aligns with the general topic, keywords, or core meaning of the USER QUESTION, consider it relevant.
3. If a RETRIEVED DOCUMENT contains some unrelated information but still includes content that meets criterion (2), consider it relevant.
4. Only consider a document irrelevant if it is completely unrelated to the USER QUESTION and contains no meaningful connection to it.
Scoring (range: 0 to 1):
- A score of 1 means the RETRIEVED DOCUMENTS are fully relevant to the USER QUESTION, sharing its keywords, topic, or core meaning. This is the highest (best) score.
- A score of 0 means the RETRIEVED DOCUMENTS are completely unrelated to the USER QUESTION and contain no meaningful connection to it. This is the lowest possible score.
- You may assign intermediate scores (e.g., 0.5) if the RETRIEVED DOCUMENTS are only partially relevant or mix relevant content with unrelated material.
Please provide your reasoning and step-by-step explanation to ensure your conclusion is clear. Avoid simply restating the USER QUESTION or the RETRIEVED DOCUMENTS without analysis.
RETRIEVED DOCUMENTS: {documents}
USER QUESTION: {question}
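Below is a minimal sketch of how this template might be wired up programmatically, assuming the `openai` Python client; the `PROMPT_TEMPLATE` constant, the `evaluate_relevance` helper, and the judge model name are illustrative assumptions, not part of the original prompt.

```python
# Minimal sketch: fill the template, call a judge model, extract the 0-1 score.
# Assumes the `openai` Python client; names and model choice are illustrative.
import re
from openai import OpenAI

# The full evaluator prompt above, with the two placeholders kept verbatim.
PROMPT_TEMPLATE = (
    "You are an evaluator tasked with assessing the relevance of retrieved "
    "documents in a Retrieval-Augmented Generation (RAG) system. [...] \n"
    "RETRIEVED DOCUMENTS: {documents}\n"
    "USER QUESTION: {question}"
)

client = OpenAI()

def evaluate_relevance(documents: list[str], question: str) -> tuple[float, str]:
    """Fill the template, ask the judge model, and pull a 0-1 score from its reply."""
    prompt = PROMPT_TEMPLATE.format(
        documents="\n---\n".join(documents),
        question=question,
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # any capable judge model works here
        messages=[{"role": "user", "content": prompt}],
    )
    reasoning = response.choices[0].message.content
    # Naive extraction: take the last number between 0 and 1 in the reply.
    matches = re.findall(r"\b(?:0(?:\.\d+)?|1(?:\.0+)?)\b", reasoning)
    score = float(matches[-1]) if matches else 0.0
    return score, reasoning
```

In practice it is more robust to ask the evaluator to end its reply with a fixed marker (e.g. a final `SCORE: 0.5` line or a JSON object) and parse that, rather than regex-scraping free-form reasoning.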
Evaluation Results
1/28/2026

Overall Score: 2.49/5 (average across all 3 models)
Best Performing Model: google:gemini-2.5-flash-lite, 4.23/5 (Low Confidence)
Rank  Model                         Score      adh  cla  com  In tokens  Out tokens  Cost
#1    google:gemini-2.5-flash-lite  4.23/5.00  3.8  4.7  4.2  2,605      3,829       $0.0018
#2    anthropic:claude-3-5-haiku    2.13/5.00  1.1  4.5  0.8  2,025      1,020       $0.0057
#3    openai:gpt-5-mini             1.10/5.00  0.4  2.6  0.3  1,955      3,631       $0.0078
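As a quick sanity check on the summary above, the overall score is the unweighted mean of the three per-model scores (variable names here are ours, not from the dashboard):

```python
# Overall score = unweighted mean of the three per-model scores.
scores = [4.23, 2.13, 1.10]
print(round(sum(scores) / len(scores), 2))  # (4.23 + 2.13 + 1.10) / 3 -> 2.49
```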