// Popular Articles

#llm-as-judge
#1682025-05-19

The LLM Judge Goes Soft: A Single Sentence Breaks 2 Years of AI Safety Evals

Researchers changed one sentence in the system prompt — telling the judge model its verdict could retrain or shut down the model being judged. Unsafe-content detection dropped 30%. The text being evaluated never changed. Every RLHF reward model, every leaderboard, every safety scorecard shipped since 2024 was built on this assumption.

llm-as-judgeai-safetyrlhf
6 phút đọc