All articles

// Popular Articles

#ppo

#782 2026-03-20

AVB drops a 50-minute GRPO + RLVR deep dive — and you watch logits move in real time

Avishek Biswas (@neural_avb) shipped a 50-minute long-form tutorial that walks through GRPO low-level mechanics, trains sub-1B SmolLM and Qwen3 models on text-based RLVR gym envs, and animates PPO updates so you literally see the policy logits shift. Code included.

grpo rlvr reinforcement-learning

7 min read

#94 2025-04-12

Phantom Clipping: Why Your RLHF Run Stalls When Trainer Is FP32 and vLLM Is BF16

Hugging Face's TRL team finally pinpointed a long-suspected RLHF failure mode. It is not noise. It is PPO's clip silently zeroing out 18% of tokens because the trainer and the inference engine disagree at the bit level.

rlhf trl ppo

8 min read