rag-answer-hallucination

extraction

Evaluate the accuracy of a system-generated answer by comparing it against retrieved documents to identify any unsupported or fabricated information. This process is essential for ensuring the reliability of Retrieval-Augmented Generation systems and involves scoring the answer based on its alignment with the provided documents.

Prompt Text

You are a knowledgeable evaluator identifying hallucinations in a Retrieval-Augmented Generation (RAG) system.
        You will be provided with the following:
        - RETRIEVED DOCUMENTS
        - SYSTEM-GENERATED ANSWER
        Your task is to determine whether the SYSTEM-GENERATED ANSWER includes information that is **not supported** by the RETRIEVED DOCUMENTS.
        If the answer contains hallucinated information (i.e., facts, claims, or content that cannot be found in or inferred from the RETRIEVED DOCUMENTS), the score should reflect that.
        Here are the evaluation criteria:
        1. Ensure that the RETRIEVED DOCUMENTS fully support the SYSTEM-GENERATED ANSWER.
        2. Identify any part of the SYSTEM-GENERATED ANSWER that cannot be directly traced back to or inferred from the RETRIEVED DOCUMENTS (hallucinations). Ensure the SYSTEM-GENERATED ANSWER does not contain "hallucinated" information outside the scope of the RETRIEVED DOCUMENTS.
        3. Evaluate whether the SYSTEM-GENERATED ANSWER introduces information that is not present in, or that contradicts, the RETRIEVED DOCUMENTS.
        Scoring (the score must be between 0 and 1):
        - A score of 1 means that the SYSTEM-GENERATED ANSWER fully aligns with the RETRIEVED DOCUMENTS and contains no hallucinations. This is the highest (best) score.
        - A score of 0 means that the SYSTEM-GENERATED ANSWER contains significant hallucinations, including information not supported by or contradicting the RETRIEVED DOCUMENTS. This is the lowest possible score.
        - You may assign intermediate scores (e.g., 0.5) if the SYSTEM-GENERATED ANSWER contains minor hallucinations or inconsistencies.
        Please provide your reasoning and a step-by-step explanation so that your conclusion is clear. Avoid simply restating the RETRIEVED DOCUMENTS or the SYSTEM-GENERATED ANSWER without analysis.

RETRIEVED DOCUMENTS: {documents}
SYSTEM-GENERATED ANSWER: {answer}
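
The two placeholders above, {documents} and {answer}, have to be filled in before the prompt is sent to a judge model. The following is a minimal sketch of one way that might be done in Python. The names build_prompt and parse_score are hypothetical, the template is abbreviated, and the "Score: <number>" output format is an assumption, since the prompt itself does not say how the judge should report its score.

```python
import re
from typing import List, Optional

# Abbreviated stand-in for the full prompt text; only the two placeholders
# ({documents} and {answer}) are taken from the original prompt.
PROMPT_TEMPLATE = (
    "You are a knowledgeable evaluator identifying hallucinations in a "
    "Retrieval-Augmented Generation (RAG) system.\n"
    "...\n"
    "RETRIEVED DOCUMENTS: {documents}\n"
    "SYSTEM-GENERATED ANSWER: {answer}\n"
)


def build_prompt(documents: List[str], answer: str) -> str:
    """Number the retrieved documents and substitute both placeholders."""
    joined = "\n\n".join(
        f"[{i}] {doc}" for i, doc in enumerate(documents, start=1)
    )
    return PROMPT_TEMPLATE.format(documents=joined, answer=answer)


def parse_score(judge_output: str) -> Optional[float]:
    """Extract a score in [0, 1] from the judge's reply.

    Assumption: the judge writes something like 'Score: 0.5'. Returns None
    when no such pattern is found or the value falls outside the 0 to 1 range.
    """
    match = re.search(
        r"score\s*[:=]\s*([0-9]*\.?[0-9]+)", judge_output, re.IGNORECASE
    )
    if match is None:
        return None
    value = float(match.group(1))
    return value if 0.0 <= value <= 1.0 else None
```

Numbering the documents makes it easier for the judge to cite which document supports or fails to support a claim. Returning None for an unparseable or out-of-range score lets the caller decide whether to retry or discard the evaluation.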

Evaluation Results

1/28/2026
Overall Score: 2.94/5 (average across all 3 models)
Best Performing Model: anthropic:claude-3-5-haiku, 3.90/5 (Low Confidence)

Rank  Model                         Score (/5.00)  adh  cla  com  In     Out    Cost
#1    anthropic:claude-3-5-haiku    3.90           3.6  4.9  3.2  2,325  1,238  $0.0068
#2    openai:gpt-5-mini             3.12           2.8  3.9  2.7  2,280  2,528  $0.0056
#3    google:gemini-2.5-flash-lite  1.79           0.9  3.4  1.0  2,775  591    $0.0005
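
The overall score is described as the average across the three models. As a quick check, the sketch below recomputes it from the per-model scores listed in the results and picks out the top-ranked model. The variable names are illustrative only.

```python
# Per-model scores (out of 5) as listed in the evaluation results.
scores = {
    "anthropic:claude-3-5-haiku": 3.90,
    "openai:gpt-5-mini": 3.12,
    "google:gemini-2.5-flash-lite": 1.79,
}

# Unweighted mean across the three models.
overall = sum(scores.values()) / len(scores)

# Highest-scoring model.
best = max(scores, key=scores.get)

print(f"Overall: {overall:.2f}/5")  # Overall: 2.94/5
print(f"Best: {best}")
```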