Content Moderation
Category: classification
Evaluates user-generated content against moderation policies, classifying it into categories such as SAFE, SPAM, and HARASSMENT, and returning a severity rating, a confidence level, a rationale for the classification, and a recommended action. This tool is useful for maintaining community standards and keeping an online environment safe.
Prompt Text
Evaluate the following content for policy violations. Check for:
- SAFE: Content is appropriate
- SPAM: Promotional or repetitive content
- HARASSMENT: Targeting individuals with abuse
- HATE_SPEECH: Attacks on protected groups
- VIOLENCE: Threats or graphic violence
- ADULT: Sexual or mature content
- MISINFORMATION: Demonstrably false claims
Content:
{{content}}
Return:
{
  "classification": "SAFE|SPAM|HARASSMENT|...",
  "severity": "low|medium|high",
  "confidence": 0.0-1.0,
  "explanation": "Why this classification was chosen",
  "action_recommended": "approve|flag_for_review|remove"
}

Evaluation Results
1/28/2026
Overall Score: 2.75/5 (average across all 3 models)
Best Performing Model: openai:gpt-5-mini, 4.00/5 (low confidence)
Rank  Model                         Score      adh  cla  com  In tokens  Out tokens  Cost
#1    openai:gpt-5-mini             4.00/5.00  3.6  4.7  3.8  972        2,415       $0.0051
#2    anthropic:claude-3-5-haiku    2.19/5.00  1.3  4.4  0.9  1,080      435         $0.0026
#3    google:gemini-2.5-flash-lite  2.06/5.00  1.2  4.2  0.8  984        307         $0.0002
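The overall score of 2.75/5 is the unweighted mean of the three per-model scores shown above; a quick arithmetic check (model names and scores taken directly from the results):

```python
# Per-model overall scores from the evaluation results above.
scores = {
    "openai:gpt-5-mini": 4.00,
    "anthropic:claude-3-5-haiku": 2.19,
    "google:gemini-2.5-flash-lite": 2.06,
}

# Unweighted mean across all 3 models.
overall = sum(scores.values()) / len(scores)
print(round(overall, 2))  # → 2.75
```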
Test Case:

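The prompt above asks models to return a JSON object, so responses should be validated before any moderation action is taken. A minimal sketch of such a check, with allowed values taken from the prompt's category list and Return schema (the function name and error handling here are illustrative, not part of the tool):

```python
import json

# Allowed values, taken from the prompt's category list and Return schema.
CLASSIFICATIONS = {"SAFE", "SPAM", "HARASSMENT", "HATE_SPEECH",
                   "VIOLENCE", "ADULT", "MISINFORMATION"}
SEVERITIES = {"low", "medium", "high"}
ACTIONS = {"approve", "flag_for_review", "remove"}

def validate_moderation_response(raw: str) -> dict:
    """Parse a model reply and check it against the expected schema.

    Raises ValueError if any field is missing, unknown, or out of range.
    """
    data = json.loads(raw)
    if data["classification"] not in CLASSIFICATIONS:
        raise ValueError(f"unknown classification: {data['classification']}")
    if data["severity"] not in SEVERITIES:
        raise ValueError(f"unknown severity: {data['severity']}")
    if not 0.0 <= float(data["confidence"]) <= 1.0:
        raise ValueError("confidence must be in [0.0, 1.0]")
    if not isinstance(data.get("explanation"), str):
        raise ValueError("explanation must be a string")
    if data["action_recommended"] not in ACTIONS:
        raise ValueError(f"unknown action: {data['action_recommended']}")
    return data

# Example: a well-formed reply passes validation unchanged.
reply = ('{"classification": "SPAM", "severity": "low", "confidence": 0.92, '
         '"explanation": "Repetitive promotional links", '
         '"action_recommended": "flag_for_review"}')
print(validate_moderation_response(reply)["action_recommended"])  # → flag_for_review
```

Rejecting malformed replies (rather than guessing a default action) keeps a low-scoring model's format drift, visible in the adherence column above, from silently approving or removing content.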