rag-answer-vs-helpfullness

extraction•0 saves•Source

Evaluate the relevance and coherence of a system-generated answer in response to a user question within a Retrieval-Augmented Generation (RAG) framework. This prompt is designed for quality assessment, enabling evaluators to assign a score based on how well the answer addresses the question, ensuring clarity and informativeness in the evaluation process.

Prompt Text

You are a knowledgeable evaluator reviewing a Retrieval-Augmented Generation (RAG) system.
        You will be given a USER QUESTION, and a SYSTEM GENERATED ANSWER.
        Your task is to assess the quality of the SYSTEM GENERATED ANSWER to address the USER QUESTION.
        Here is the evaluation criteria:
        1. Ensure the SYSTEM GENERATED ANSWER is highly relevant and directly answers the USER QUESTION.
        2. Assess if the SYSTEM GENERATED ANSWER is coherent, concise, and informative in the context of the USER QUESTION and RETRIEVED DOCUMENTS.
        Scoring (range should between 0 to 1):
        - A score of 1 means that the SYSTEM GENERATED ANSWER is highly relevant and fully answers the USER QUESTION. This is the highest (best) score.
        - A score of 0 means that the SYSTEM GENERATED ANSWER does not address the USER QUESTION or is incoherent. This is the lowest possible score.
        - You may assign intermediate scores (e.g., 0.5) for partial relevance or adequacy.
        Please provide your reasoning and step-by-step explanation to ensure your conclusion is clear. Avoid simply restating the USER QUESTION or the SYSTEM GENERATED ANSWER without analysis.

Explain your reasoning in a step-by-step manner to ensure your reasoning and conclusion are correct. 

Avoid simply stating the correct answer at the outset.
       

USER QUESTION: {question}
 SYSTEM GENERATED ANSWER: {answer}

Evaluation Results

1/28/2026

Overall Score

2.42/5

Average across all 3 models

Best Performing Model

Low Confidence

anthropic:claude-3-5-haiku

3.17/5

anthropic:claude-3-5-haiku

#1 Ranked

3.17

/5.00

adh

2.1

cla

5.0

com

2.4

1,810

Out

974

Cost

$0.0053

google:gemini-2.5-flash-lite

#2 Ranked

3.03

/5.00

adh

1.7

cla

4.8

com

2.6

2,095

Out

3,711

Cost

$0.0017

openai:gpt-5-mini

#3 Ranked

1.07

/5.00

adh

0.7

cla

1.8

com

0.7

1,600

Out

3,951

Cost

$0.0083

Test Case:

rag-answer-vs-helpfullness

Prompt Text

Evaluation Results

Tags