sciscigpt-tool-eval
Evaluate the effectiveness of the most recent tool call by assigning a quality score between 0.0 and 1.0, following specific interpretation guidelines, and providing a brief rationale when the score is low. This prompt is useful for assessing tool performance and guiding backtracking decisions during an agent run.
Prompt Text
<system>
  <task>
    Based on the above, your task is to evaluate the newest tool call using the following steps.
  </task>
  <instructions>
    Assign a quality score for the newest tool call between 0.0 and 1.0 using the <reward> tag:
    - 0.8+: Continue current approach
    - 0.5-0.7: Consider minor adjustments
    - Below 0.5: Seriously consider backtracking and trying a different approach
    Only if the reward score is low:
    - briefly explain your decision within <reflection> tags
  </instructions>
  <restrictions>
    You must strictly follow the above format. The response must only include <reward> and (if needed) <reflection> tags.
  </restrictions>
</system>
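The rubric above can be consumed programmatically by the calling agent. A minimal sketch, assuming the evaluator's reply contains one <reward> tag and optionally one <reflection> tag; the function names and regexes here are illustrative, not part of the prompt:

```python
import re


def parse_evaluation(response: str):
    """Extract the reward score and optional reflection from an evaluator reply."""
    reward_match = re.search(r"<reward>\s*([01](?:\.\d+)?)\s*</reward>", response)
    if reward_match is None:
        raise ValueError("response contains no <reward> tag")
    score = float(reward_match.group(1))

    reflection_match = re.search(r"<reflection>(.*?)</reflection>", response, re.DOTALL)
    reflection = reflection_match.group(1).strip() if reflection_match else None
    return score, reflection


def next_action(score: float) -> str:
    """Map the score to the rubric's guidance.

    The rubric leaves 0.7-0.8 unspecified; this sketch treats the whole
    0.5-0.8 band as "minor adjustments".
    """
    if score >= 0.8:
        return "continue"
    if score >= 0.5:
        return "minor adjustments"
    return "backtrack"
```

For example, a reply of `<reward>0.35</reward><reflection>Wrong tool chosen.</reflection>` parses to a score of 0.35, and `next_action(0.35)` returns `"backtrack"`.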
Evaluation Results
1/28/2026
Overall Score: 3.70/5 (average across all 3 models)
Best Performing Model: google:gemini-2.5-flash-lite, 4.91/5 (Low Confidence)
Rank  Model                          Score      adh  cla  com  In   Out    Cost
#1    google:gemini-2.5-flash-lite   4.91/5.00  4.9  5.0  4.9  855  45     $0.0001
#2    openai:gpt-5-mini              4.78/5.00  4.7  4.9  4.7  835  1,562  $0.0033
#3    anthropic:claude-3-5-haiku     1.42/5.00  0.4  3.5  0.4  920  303    $0.0019
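The overall score is the arithmetic mean of the three per-model scores. A quick check, with the values copied from the results above:

```python
# Per-model overall scores from the evaluation results
scores = {
    "google:gemini-2.5-flash-lite": 4.91,
    "openai:gpt-5-mini": 4.78,
    "anthropic:claude-3-5-haiku": 1.42,
}

# Mean across the 3 models, matching the reported overall score
overall = sum(scores.values()) / len(scores)
print(f"{overall:.2f}/5")  # 3.70/5
```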
Test Case:
