sciscigpt-tool-eval
Evaluate the effectiveness of the most recent tool call by assigning a quality score between 0.0 and 1.0, following specific interpretation guidelines, and providing a brief rationale when the score is low. This prompt is useful for assessing tool performance and guiding backtracking decisions during an agent run.
Prompt Text
<system>
  <task>
    Based on the above, your task is to evaluate the newest tool call using the following steps.
  </task>
  <instructions>
    Assign a quality score for the newest tool call between 0.0 and 1.0 using the <reward> tag:
    - 0.8+: Continue current approach
    - 0.5-0.7: Consider minor adjustments
    - Below 0.5: Seriously consider backtracking and trying a different approach
    Only if the reward score is low:
    - briefly explain your decision within <reflection> tags
  </instructions>
  <restrictions>
    You must strictly follow the above format. The response must only include <reward> and (if needed) <reflection> tags.
  </restrictions>
</system>
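The rubric above can be consumed programmatically by the calling agent. A minimal sketch, assuming the evaluator's reply contains one <reward> tag and optionally one <reflection> tag; the function names and regexes here are illustrative, not part of the prompt:

```python
import re


def parse_evaluation(response: str):
    """Extract the reward score and optional reflection from an evaluator reply."""
    reward_match = re.search(r"<reward>\s*([01](?:\.\d+)?)\s*</reward>", response)
    if reward_match is None:
        raise ValueError("response contains no <reward> tag")
    score = float(reward_match.group(1))

    reflection_match = re.search(r"<reflection>(.*?)</reflection>", response, re.DOTALL)
    reflection = reflection_match.group(1).strip() if reflection_match else None
    return score, reflection


def next_action(score: float) -> str:
    """Map the score to the rubric's guidance.

    The rubric leaves 0.7-0.8 unspecified; this sketch treats the whole
    0.5-0.8 band as "minor adjustments".
    """
    if score >= 0.8:
        return "continue"
    if score >= 0.5:
        return "minor adjustments"
    return "backtrack"
```

For example, a reply of `<reward>0.35</reward><reflection>Wrong tool chosen.</reflection>` parses to a score of 0.35, and `next_action(0.35)` returns `"backtrack"`.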
Evaluation Results
1/28/2026
Overall Score: 3.70/5 (average across all 3 models)
Best Performing Model: google:gemini-2.5-flash-lite, 4.91/5 (Low Confidence)
Rank  Model                          Score      adh  cla  com  In   Out    Cost
#1    google:gemini-2.5-flash-lite   4.91/5.00  4.9  5.0  4.9  855  45     $0.0001
#2    openai:gpt-5-mini              4.78/5.00  4.7  4.9  4.7  835  1,562  $0.0033
#3    anthropic:claude-3-5-haiku     1.42/5.00  0.4  3.5  0.4  920  303    $0.0019
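The overall score is the arithmetic mean of the three per-model scores. A quick check, with the values copied from the results above:

```python
# Per-model overall scores from the evaluation results
scores = {
    "google:gemini-2.5-flash-lite": 4.91,
    "openai:gpt-5-mini": 4.78,
    "anthropic:claude-3-5-haiku": 1.42,
}

# Mean across the 3 models, matching the reported overall score
overall = sum(scores.values()) / len(scores)
print(f"{overall:.2f}/5")  # 3.70/5
```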
Test Case:
