Content Moderation
Category: classification
Evaluates user-generated content against moderation policies, classifying it into categories such as SAFE, SPAM, and HARASSMENT, and returning a severity rating, a confidence level, a rationale for the classification, and a recommended action. This tool is useful for maintaining community standards and keeping an online environment safe.
Prompt Text
Evaluate the following content for policy violations. Check for:
- SAFE: Content is appropriate
- SPAM: Promotional or repetitive content
- HARASSMENT: Targeting individuals with abuse
- HATE_SPEECH: Attacks on protected groups
- VIOLENCE: Threats or graphic violence
- ADULT: Sexual or mature content
- MISINFORMATION: Demonstrably false claims
Content:
{{content}}
Return:
{
  "classification": "SAFE|SPAM|HARASSMENT|...",
  "severity": "low|medium|high",
  "confidence": 0.0-1.0,
  "explanation": "Why this classification was chosen",
  "action_recommended": "approve|flag_for_review|remove"
}

Evaluation Results
1/28/2026
Overall Score: 2.75/5 (average across all 3 models)
Best Performing Model: openai:gpt-5-mini, 4.00/5 (low confidence)
Rank  Model                         Score      adh  cla  com  In tokens  Out tokens  Cost
#1    openai:gpt-5-mini             4.00/5.00  3.6  4.7  3.8  972        2,415       $0.0051
#2    anthropic:claude-3-5-haiku    2.19/5.00  1.3  4.4  0.9  1,080      435         $0.0026
#3    google:gemini-2.5-flash-lite  2.06/5.00  1.2  4.2  0.8  984        307         $0.0002
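The overall score of 2.75/5 is the unweighted mean of the three per-model scores shown above; a quick arithmetic check (model names and scores taken directly from the results):

```python
# Per-model overall scores from the evaluation results above.
scores = {
    "openai:gpt-5-mini": 4.00,
    "anthropic:claude-3-5-haiku": 2.19,
    "google:gemini-2.5-flash-lite": 2.06,
}

# Unweighted mean across all 3 models.
overall = sum(scores.values()) / len(scores)
print(round(overall, 2))  # → 2.75
```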
Test Case:

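The prompt above asks models to return a JSON object, so responses should be validated before any moderation action is taken. A minimal sketch of such a check, with allowed values taken from the prompt's category list and Return schema (the function name and error handling here are illustrative, not part of the tool):

```python
import json

# Allowed values, taken from the prompt's category list and Return schema.
CLASSIFICATIONS = {"SAFE", "SPAM", "HARASSMENT", "HATE_SPEECH",
                   "VIOLENCE", "ADULT", "MISINFORMATION"}
SEVERITIES = {"low", "medium", "high"}
ACTIONS = {"approve", "flag_for_review", "remove"}

def validate_moderation_response(raw: str) -> dict:
    """Parse a model reply and check it against the expected schema.

    Raises ValueError if any field is missing, unknown, or out of range.
    """
    data = json.loads(raw)
    if data["classification"] not in CLASSIFICATIONS:
        raise ValueError(f"unknown classification: {data['classification']}")
    if data["severity"] not in SEVERITIES:
        raise ValueError(f"unknown severity: {data['severity']}")
    if not 0.0 <= float(data["confidence"]) <= 1.0:
        raise ValueError("confidence must be in [0.0, 1.0]")
    if not isinstance(data.get("explanation"), str):
        raise ValueError("explanation must be a string")
    if data["action_recommended"] not in ACTIONS:
        raise ValueError(f"unknown action: {data['action_recommended']}")
    return data

# Example: a well-formed reply passes validation unchanged.
reply = ('{"classification": "SPAM", "severity": "low", "confidence": 0.92, '
         '"explanation": "Repetitive promotional links", '
         '"action_recommended": "flag_for_review"}')
print(validate_moderation_response(reply)["action_recommended"])  # → flag_for_review
```

Rejecting malformed replies (rather than guessing a default action) keeps a low-scoring model's format drift, visible in the adherence column above, from silently approving or removing content.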