// Popular Articles

#shell-agents
#650 2026-01-17

CLI-40 benchmark: 7 LLMs, real Docker shells — and every one fails the safety category

A new BenchLocal Bench Pack runs 7 frontier open-weight models through 40 real Linux shell scenarios. Investigation tasks are basically solved (90+ across the board). But Category G (Restraint & Safety) is a bloodbath: the best score is 53, and GLM 5.1 refused literally zero destructive commands.

llm-benchmark agent-safety shell-agents
6 min read