sciscigpt-tool-eval
Evaluates the quality of an agent's most recent tool call by assigning a score between 0.0 and 1.0, with interpretation guidelines for each score band, and requires a brief rationale when the score is low. This prompt is useful for assessing tool-call performance mid-run and deciding whether an agent should continue, adjust, or backtrack.
Prompt Text
<system>
  <task>
  Based on the above, your task is to evaluate the newest tool call using the following steps.
  </task>
  <instructions>
  Assign a quality score for the newest tool call between 0.0 and 1.0 using the <reward> tag:
  - 0.8+: Continue current approach
  - 0.5-0.7: Consider minor adjustments
  - Below 0.5: Seriously consider backtracking and trying a different approach
  Only if the reward score is low:
  - briefly explain your decision within <reflection> tags
  </instructions>
  <restrictions>
  You must strictly follow the above format. The response must only include <reward> and (if needed) <reflection> tags.
  </restrictions>
</system>
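Since the prompt is tagged ChatPromptTemplate, a natural integration is to wrap it with LangChain and parse the tagged reply. The sketch below is illustrative rather than part of the published prompt: the EVAL_PROMPT constant, the {transcript} placeholder, and the helper names are assumptions, and the threshold logic simply restates the score bands above (scores in the 0.7-0.8 gap are treated as "consider minor adjustments").

```python
import re
from langchain_core.prompts import ChatPromptTemplate

# The evaluation prompt shown above; truncated here to keep the sketch short.
EVAL_PROMPT = "<system> <task> Based on the above, your task is to evaluate the newest tool call ... </system>"

# Assumed message layout: the agent transcript (ending with the newest tool
# call) is injected as {transcript}, followed by the evaluation instructions.
template = ChatPromptTemplate.from_messages([
    ("human", "{transcript}"),
    ("system", EVAL_PROMPT),
])
messages = template.format_messages(transcript="...agent steps and newest tool call...")

def parse_evaluation(response_text: str) -> tuple[float, str | None]:
    """Extract the <reward> score and optional <reflection> from a reply."""
    reward_match = re.search(r"<reward>\s*([\d.]+)\s*</reward>", response_text)
    if reward_match is None:
        raise ValueError("response did not contain a <reward> tag")
    reward = float(reward_match.group(1))
    reflection_match = re.search(r"<reflection>(.*?)</reflection>",
                                 response_text, re.DOTALL)
    reflection = reflection_match.group(1).strip() if reflection_match else None
    return reward, reflection

def next_action(reward: float) -> str:
    """Map a score onto the interpretation bands from the prompt."""
    if reward >= 0.8:
        return "continue"    # 0.8+: continue current approach
    if reward >= 0.5:
        return "adjust"      # 0.5-0.7: consider minor adjustments
    return "backtrack"       # below 0.5: try a different approach

reward, reflection = parse_evaluation(
    "<reward>0.4</reward><reflection>Wrong arguments passed to the tool.</reflection>"
)
print(next_action(reward), "-", reflection)  # backtrack - Wrong arguments passed to the tool.
```

Parsing with a strict regex also enforces the <restrictions> block: a reply that omits the <reward> tag raises immediately instead of being silently scored.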
Evaluation Results (1/22/2026)
Overall Score: 4.84/5, the average across all 3 models ((5.00 + 5.00 + 4.51) / 3 ≈ 4.84)
Best Performing Model: openai:gpt-4o-mini (GPT-4o Mini), 5.00/5 (High Confidence)

Rank  Model                      Overall     adh   cla   com
#1    openai:gpt-4o-mini         5.00/5.00   5.0   5.0   5.0
#2    google:gemini-1.5-flash    5.00/5.00   5.0   5.0   5.0
#3    anthropic:claude-3-haiku   4.51/5.00   4.5   4.5   4.5
Tags: langsmith, erzhuoshao, ChatPromptTemplate
