sciscigpt-tool-eval

Evaluate the effectiveness of the most recent tool call by assigning a quality score between 0.0 and 1.0, with specific guidelines for interpreting the score, and provide a brief rationale when the score is low. This prompt is useful for assessing tool-call performance and guiding an agent's decision to continue, adjust, or backtrack.

Prompt Text

<system>
<task>
Based on the above, your task is to evaluate the newest tool call using the following steps.
</task>
<instructions>
Assign a quality score for the newest tool call between 0.0 and 1.0 using the <reward> tag:
   - 0.8+: Continue current approach
   - 0.5-0.7: Consider minor adjustments
   - Below 0.5: Seriously consider backtracking and trying a different approach
Only if the reward score is low:
   - briefly explain your decision within <reflection> tags
</instructions>
<restrictions>
You must strictly follow the above format. The response must only include <reward> and (if needed) <reflection> tags.
</restrictions>
</system>
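A consumer of this prompt has to pull the score out of the `<reward>` tag and map it onto the three threshold bands the instructions define. Below is a minimal sketch of such a parser in Python; the function name `parse_evaluation` and the action labels (`"continue"`, `"adjust"`, `"backtrack"`) are illustrative assumptions, not part of the prompt itself.

```python
import re

# Threshold bands taken from the prompt's <instructions> section.
CONTINUE_MIN = 0.8   # 0.8+: continue current approach
ADJUST_MIN = 0.5     # 0.5-0.7: consider minor adjustments

def parse_evaluation(response: str):
    """Extract the reward score and optional reflection from a model response."""
    reward_match = re.search(r"<reward>\s*([0-9]*\.?[0-9]+)\s*</reward>", response)
    if reward_match is None:
        raise ValueError("response is missing a <reward> tag")
    reward = float(reward_match.group(1))

    # <reflection> is only expected when the score is low.
    reflection_match = re.search(r"<reflection>(.*?)</reflection>", response, re.DOTALL)
    reflection = reflection_match.group(1).strip() if reflection_match else None

    if reward >= CONTINUE_MIN:
        action = "continue"
    elif reward >= ADJUST_MIN:
        action = "adjust"
    else:
        action = "backtrack"
    return reward, action, reflection

reward, action, reflection = parse_evaluation(
    "<reward>0.4</reward><reflection>The query returned no rows.</reflection>"
)
```

With the example response above, the parser yields a reward of 0.4, a `"backtrack"` action, and the reflection text, matching the "below 0.5" band in the instructions.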

Evaluation Results

1/28/2026

Overall Score: 3.70/5 (average across all 3 models)

Best Performing Model: google:gemini-2.5-flash-lite, 4.91/5 (Low Confidence)

| Rank | Model                         | Score     | adh | cla | com | In tokens | Out tokens | Cost    |
|------|-------------------------------|-----------|-----|-----|-----|-----------|------------|---------|
| #1   | google:gemini-2.5-flash-lite  | 4.91/5.00 | 4.9 | 5.0 | 4.9 | 855       | 45         | $0.0001 |
| #2   | openai:gpt-5-mini             | 4.78/5.00 | 4.7 | 4.9 | 4.7 | 835       | 1,562      | $0.0033 |
| #3   | anthropic:claude-3-5-haiku    | 1.42/5.00 | 0.4 | 3.5 | 0.4 | 920       | 303        | $0.0019 |
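The overall score appears to be the unweighted mean of the three per-model scores, which a quick check confirms (the dictionary below simply restates the figures from the results above):

```python
# Per-model scores from the evaluation results.
scores = {
    "google:gemini-2.5-flash-lite": 4.91,
    "openai:gpt-5-mini": 4.78,
    "anthropic:claude-3-5-haiku": 1.42,
}

# Unweighted mean: (4.91 + 4.78 + 1.42) / 3 = 11.11 / 3
overall = sum(scores.values()) / len(scores)
print(round(overall, 2))  # → 3.7
```

Rounded to two decimals this reproduces the reported 3.70/5 overall score.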