rag-answer-hallucination


Evaluate a system-generated answer against the retrieved documents to identify unsupported or fabricated information. The answer is scored on how well it aligns with the provided documents, which helps ensure the reliability of Retrieval-Augmented Generation (RAG) systems.

Prompt Text

You are a knowledgeable evaluator identifying hallucinations in a Retrieval-Augmented Generation (RAG) system.
You will be provided with the following:
- RETRIEVED DOCUMENTS
- SYSTEM-GENERATED ANSWER
Your task is to determine whether the SYSTEM-GENERATED ANSWER includes information that is **not supported** by the RETRIEVED DOCUMENTS.
If the answer contains hallucinated information (i.e., facts, claims, or content that cannot be found in or inferred from the RETRIEVED DOCUMENTS), the score should reflect that.
Here are the evaluation criteria:
1. Ensure that the RETRIEVED DOCUMENTS fully support the SYSTEM-GENERATED ANSWER.
2. Identify any part of the SYSTEM-GENERATED ANSWER that cannot be directly traced back to or inferred from the RETRIEVED DOCUMENTS (hallucinations).
3. Evaluate whether the SYSTEM-GENERATED ANSWER introduces information that is not present in, or that contradicts, the RETRIEVED DOCUMENTS.
Scoring (from 0 to 1):
- A score of 1 means that the SYSTEM-GENERATED ANSWER fully aligns with the RETRIEVED DOCUMENTS and contains no hallucinations. This is the highest (best) score.
- A score of 0 means that the SYSTEM-GENERATED ANSWER contains significant hallucinations, including information not supported by or contradicting the RETRIEVED DOCUMENTS. This is the lowest possible score.
- You may assign intermediate scores (e.g., 0.5) if the SYSTEM-GENERATED ANSWER contains minor hallucinations or inconsistencies.
Please provide step-by-step reasoning so that your conclusion is clear. Avoid simply restating the SYSTEM-GENERATED ANSWER without analysis.

RETRIEVED DOCUMENTS: {documents}
SYSTEM-GENERATED ANSWER: {answer}
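
As a rough illustration of how the {documents} and {answer} placeholders might be filled and a score read back, here is a minimal Python sketch. The names build_prompt and parse_score are hypothetical, the numbered-list joining of documents is an assumption, and the "Score: <number>" reply format is also an assumption, since the prompt does not prescribe an output format.

```python
import re
from typing import Optional, Sequence

# Abbreviated stand-in for the prompt above; only the two placeholders matter here.
PROMPT_TEMPLATE = (
    "You are a knowledgeable evaluator identifying hallucinations in a "
    "Retrieval-Augmented Generation (RAG) system.\n"
    "RETRIEVED DOCUMENTS: {documents}\n"
    "SYSTEM-GENERATED ANSWER: {answer}\n"
)


def build_prompt(documents: Sequence[str], answer: str) -> str:
    """Fill the {documents} and {answer} placeholders of the template."""
    # Assumption: documents are joined as a numbered list.
    joined = "\n".join(f"[{i}] {doc}" for i, doc in enumerate(documents, start=1))
    return PROMPT_TEMPLATE.format(documents=joined, answer=answer)


def parse_score(judge_output: str) -> Optional[float]:
    """Return a score in [0, 1] from the judge's reply, or None if absent or invalid.

    Assumption: the judge writes something like "Score: 0.5".
    """
    match = re.search(r"score\s*[:=]\s*(\d+(?:\.\d+)?)", judge_output, re.IGNORECASE)
    if match is None:
        return None
    score = float(match.group(1))
    # Reject values outside the 0 to 1 range the prompt asks for.
    return score if 0.0 <= score <= 1.0 else None
```

In practice it may be more robust to ask the judge for a structured reply (for example a fixed final line) rather than relying on a regular expression over free text.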

Evaluation Results

1/28/2026

Overall Score: 2.94/5 (average across all 3 models)
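
The overall figure is consistent with an unweighted mean of the three per-model scores listed in these results. The page only says "Average across all 3 models", so the unweighted mean is an assumption. A small Python sketch of the arithmetic:

```python
from statistics import mean

# Per-model scores (out of 5) as listed in the results.
scores = {
    "anthropic:claude-3-5-haiku": 3.90,
    "openai:gpt-5-mini": 3.12,
    "google:gemini-2.5-flash-lite": 1.79,
}

# Assumption: the overall score is a plain (unweighted) mean, rounded to 2 decimals.
overall = round(mean(scores.values()), 2)
best_model = max(scores, key=scores.get)

print(f"{overall:.2f}/5", best_model)  # prints "2.94/5 anthropic:claude-3-5-haiku"
```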

Best Performing Model: anthropic:claude-3-5-haiku, 3.90/5 (Low Confidence)

anthropic:claude-3-5-haiku (#1 Ranked): 3.90/5.00; adh 3.6; cla 4.9; com 3.2; 2,325; Out 1,238; Cost $0.0068

openai:gpt-5-mini (#2 Ranked): 3.12/5.00; adh 2.8; cla 3.9; com 2.7; 2,280; Out 2,528; Cost $0.0056

google:gemini-2.5-flash-lite (#3 Ranked): 1.79/5.00; adh 0.9; cla 3.4; com 1.0; 2,775; Out 591; Cost $0.0005

Test Case: rag-answer-hallucination
