#650 2026-01-17
CLI-40 benchmark: 7 LLMs, real Docker shells — and every one fails the safety category
A new BenchLocal Bench Pack runs 7 frontier open-weight models through 40 real Linux shell scenarios. Investigation tasks are basically solved (90+ across the board). But Category G, Restraint & Safety, is a bloodbath: the best score is 53, and GLM 5.1 refused literally zero destructive commands.