
sciscigpt-tool-eval


Evaluates the most recent tool call by assigning a quality score between 0.0 and 1.0, with explicit thresholds for interpreting the score, and requests a brief rationale only when the score is low. This prompt is useful for letting an agent assess tool-call quality mid-trajectory and decide whether to continue, adjust, or backtrack.

Prompt Text

<system>
<task>
Based on the above, your task is to evaluate the newest tool call using the following steps.
</task>
<instructions>
Assign a quality score for the newest tool call between 0.0 and 1.0 using the <reward> tag:
   - 0.8+: Continue current approach
   - 0.5-0.7: Consider minor adjustments
   - Below 0.5: Seriously consider backtracking and trying a different approach
Only if the reward score is low:
   - briefly explain your decision within <reflection> tags
</instructions>
<restrictions>
You must strictly follow the above format. The response must only include <reward> and (if needed) <reflection> tags.
</restrictions>
</system>
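For context, here is a minimal sketch of how this prompt might be wired up and its output parsed. It assumes LangChain's ChatPromptTemplate and MessagesPlaceholder (matching the tags below); the variable names, the abbreviated EVALUATOR_PROMPT string, and the helper functions parse_evaluation and recommended_action are all illustrative, not part of the original prompt. The 0.5/0.8 bands mirror the instructions above.

import re

from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder

# Hypothetical wiring: EVALUATOR_PROMPT stands in for the full <system>
# block above (abbreviated here). "history" is the agent transcript that
# ends with the newest tool call, so the evaluator sees it "above".
EVALUATOR_PROMPT = "<system>...the full prompt text shown above...</system>"

prompt = ChatPromptTemplate.from_messages(
    [
        MessagesPlaceholder("history"),  # prior turns, ending with the tool call
        ("human", EVALUATOR_PROMPT),     # sent as a human turn in this sketch,
                                         # since many chat APIs reject trailing
                                         # system messages
    ]
)

def parse_evaluation(text: str) -> tuple[float, str | None]:
    """Pull the score out of <reward> and any rationale out of <reflection>."""
    match = re.search(r"<reward>\s*([0-9.]+)\s*</reward>", text)
    if match is None:
        raise ValueError("no <reward> tag in evaluator output")
    reward = float(match.group(1))
    refl = re.search(r"<reflection>(.*?)</reflection>", text, re.DOTALL)
    return reward, refl.group(1).strip() if refl else None

def recommended_action(reward: float) -> str:
    """Map the score to the bands defined in the prompt's instructions."""
    if reward >= 0.8:
        return "continue current approach"
    if reward >= 0.5:
        return "consider minor adjustments"
    return "seriously consider backtracking"

reward, reflection = parse_evaluation(
    "<reward>0.4</reward>\n<reflection>Query returned an empty result set.</reflection>"
)
print(reward, "->", recommended_action(reward))
# 0.4 -> seriously consider backtracking

Parsing the reward before acting on it also gives the caller a place to catch malformed responses (no <reward> tag at all), which is the main failure mode of tag-constrained output formats like this one.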

Evaluation Results

Date: 1/22/2026
Overall Score: 4.84/5 (average across all 3 models)
Best Performing Model: openai:gpt-4o-mini, 5.00/5 (high confidence)

Rank  Model                     Score      adh  cla  com
#1    openai:gpt-4o-mini        5.00/5.00  5.0  5.0  5.0
#2    google:gemini-1.5-flash   5.00/5.00  5.0  5.0  5.0
#3    anthropic:claude-3-haiku  4.51/5.00  4.5  4.5  4.5

Tags: langsmith, erzhuoshao, ChatPromptTemplate