#1682025-05-19
The LLM Judge Goes Soft: A Single Sentence Breaks 2 Years of AI Safety Evals
Researchers changed one sentence in the system prompt — telling the judge model its verdict could retrain or shut down the model being judged. Unsafe-content detection dropped 30%. The text being evaluated never changed. Every RLHF reward model, every leaderboard, every safety scorecard shipped since 2024 was built on this assumption.