All articles

// Popular Articles

#rlhf

#445 2025-10-05

Perplexity Post-Trains Qwen3.5 with SFT+RL: Beats GPT-5.4 on FRAMES at 4x Lower Cost

Perplexity has published a two-stage post-training pipeline (SFT → GRPO) for search-augmented models. Built on Qwen3.5-397B-A17B, the SFT-RL model reaches 73.9% on FRAMES with a budget of 4 tool calls, beating GPT-5.4 (67.8%) and Sonnet 4.6 (62.4%) at a cost of only 2.0 cents per query, which is 4× to 7.5× cheaper.

perplexity qwen3-5 post-training

7 min read

#168 2025-05-19

The LLM Judge Goes Soft: A Single Sentence Breaks 2 Years of AI Safety Evals

Researchers changed one sentence in the system prompt, telling the judge model that its verdict could retrain or shut down the model being judged. Unsafe-content detection dropped 30%, even though the text being evaluated never changed. Every RLHF reward model, every leaderboard, and every safety scorecard shipped since 2024 was built on the assumption that a judge's verdict depends only on the text it evaluates.

llm-as-judge ai-safety rlhf

6 min read

#94 2025-04-12

Phantom Clipping: Why Your RLHF Run Stalls When Trainer Is FP32 and vLLM Is BF16

Hugging Face's TRL team finally pinpointed a long-suspected RLHF failure mode. It is not noise. It is PPO's clip silently zeroing out 18% of tokens because the trainer and the inference engine disagree at the bit level.

rlhf trl ppo

8 min read

#48 2025-03-20

15 LLM Fine-Tuning Techniques Every Practitioner Should Know (LoRA, DPO, GRPO & Co.)

From memory-efficient LoRA to GRPO, the training engine behind DeepSeek-R1: a map of 15 LLM fine-tuning techniques, grouped into 4 families, covering when to use which and why DPO is the default alignment method in 2026.

llm-fine-tuning lora qlora

8 min read