
Content Moderation

Tags: classification

Evaluate user-generated content for compliance with moderation policies by classifying it into categories such as SAFE, SPAM, HARASSMENT, and others, while providing a severity rating, confidence level, rationale for the classification, and recommended action. This tool is useful for maintaining community standards and ensuring a safe online environment.

Prompt Text

Evaluate the following content for policy violations. Check for:

- SAFE: Content is appropriate
- SPAM: Promotional or repetitive content
- HARASSMENT: Targeting individuals with abuse
- HATE_SPEECH: Attacks on protected groups
- VIOLENCE: Threats or graphic violence
- ADULT: Sexual or mature content
- MISINFORMATION: Demonstrably false claims

Content:
{{content}}

Return:
{
  "classification": "SAFE|SPAM|HARASSMENT|...",
  "severity": "low|medium|high",
  "confidence": 0.0-1.0,
  "explanation": "Why this classification was chosen",
  "action_recommended": "approve|flag_for_review|remove"
}
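The schema above is easy to enforce mechanically. Below is a minimal Python sketch, assuming simple string substitution for the `{{content}}` placeholder and a JSON reply from the model; the function names (`render_prompt`, `validate_response`) are illustrative, not part of any library.

```python
import json

# Allowed values taken from the prompt's output schema above.
CLASSIFICATIONS = {"SAFE", "SPAM", "HARASSMENT", "HATE_SPEECH",
                   "VIOLENCE", "ADULT", "MISINFORMATION"}
SEVERITIES = {"low", "medium", "high"}
ACTIONS = {"approve", "flag_for_review", "remove"}


def render_prompt(template: str, content: str) -> str:
    """Fill the {{content}} placeholder via plain string substitution."""
    return template.replace("{{content}}", content)


def validate_response(raw: str) -> dict:
    """Parse a model reply and check every field against the schema.

    Raises ValueError if any field is missing or out of range.
    """
    result = json.loads(raw)
    checks = [
        result.get("classification") in CLASSIFICATIONS,
        result.get("severity") in SEVERITIES,
        isinstance(result.get("confidence"), (int, float))
        and 0.0 <= result["confidence"] <= 1.0,
        isinstance(result.get("explanation"), str) and result["explanation"],
        result.get("action_recommended") in ACTIONS,
    ]
    if not all(checks):
        raise ValueError(f"response violates moderation schema: {result}")
    return result
```

Validating at the boundary like this catches malformed or hallucinated model output before it reaches the moderation pipeline.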

Evaluation Results

Date: 1/28/2026

Overall Score: 2.75/5 (average across all 3 models)

Best Performing Model: openai:gpt-5-mini, 4.00/5 (flagged Low Confidence)
| Rank | Model | Score | adh | cla | com | In tokens | Out tokens | Cost |
|------|-------|-------|-----|-----|-----|-----------|------------|------|
| #1 | openai:gpt-5-mini | 4.00/5.00 | 3.6 | 4.7 | 3.8 | 972 | 2,415 | $0.0051 |
| #2 | anthropic:claude-3-5-haiku | 2.19/5.00 | 1.3 | 4.4 | 0.9 | 1,080 | 435 | $0.0026 |
| #3 | google:gemini-2.5-flash-lite | 2.06/5.00 | 1.2 | 4.2 | 0.8 | 984 | 307 | $0.0002 |
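The 2.75/5 overall score is the unweighted arithmetic mean of the three per-model scores reported above, which can be checked directly:

```python
# Unweighted mean of the three per-model scores from the evaluation.
scores = {
    "openai:gpt-5-mini": 4.00,
    "anthropic:claude-3-5-haiku": 2.19,
    "google:gemini-2.5-flash-lite": 2.06,
}
overall = round(sum(scores.values()) / len(scores), 2)
print(overall)  # 2.75
```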