// Popular Articles

#fp8
#442 · 2025-10-04

200 tok/s, 49W: Qwen3.6-27B-FP8 Runs Flagship Coding on a Single DGX Spark

A day after Alibaba shipped Qwen3.6-27B, engineer Mitko Vasilev posted numbers that should make every indie AI builder look twice: 200 tok/s peak, 136 tok/s average, 256k context, 10 concurrent agents, all on one NVIDIA GB10 drawing just 49 watts. Here is what the stack is doing, and why the tok/s-per-watt curve just bent.
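The efficiency claim reduces to simple arithmetic. A minimal sketch using only the throughput and power figures quoted above (everything else is illustrative):

```python
# Tokens-per-watt from the figures quoted in the teaser:
# 200 tok/s peak and 136 tok/s average, at 49 W wall power.
PEAK_TOKS_PER_S = 200.0
AVG_TOKS_PER_S = 136.0
POWER_W = 49.0

print(f"peak: {PEAK_TOKS_PER_S / POWER_W:.2f} tok/s per watt")  # ~4.08
print(f"avg:  {AVG_TOKS_PER_S / POWER_W:.2f} tok/s per watt")   # ~2.78
```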

qwen3-6 · dgx-spark · gb10
6 min read
#40 · 2025-03-16

DeepSeek Mega MoE: Rewriting How Mixture-of-Experts Runs on the GPU

On April 16, 2026, DeepSeek released a new DeepGEMM build that folds the entire MoE forward path (dispatch, linear1, SwiGLU, linear2, combine) into a single mega-kernel, overlapping NVLink traffic with Tensor Core MMAs. The compute-wait-transfer chain is gone: the GPU idles less, and multi-GPU MoE scaling gets noticeably cleaner.
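To make the fusion concrete, here is a minimal Python sketch of the five stages the mega-kernel collapses. It is illustrative only, not DeepGEMM's actual API; every function and tensor name below is an assumption, and the launch-per-stage structure shown is exactly what the fused kernel removes.

```python
import torch
import torch.nn.functional as F

def moe_forward_unfused(x, router_w, experts_w1, experts_w3, experts_w2, top_k=2):
    """Classic (unfused) MoE forward pass: five distinct stages, each a
    kernel-launch boundary and, across GPUs, a synchronization point.
    Shapes and names are illustrative, not DeepGEMM's API.
      x:          [tokens, d_model]
      router_w:   [d_model, n_experts]
      experts_w1: [n_experts, d_model, d_ff]   (gate projection)
      experts_w3: [n_experts, d_model, d_ff]   (up projection)
      experts_w2: [n_experts, d_ff, d_model]   (down projection)
    """
    # Routing: score every token, keep the top-k experts per token.
    probs = F.softmax(x @ router_w, dim=-1)        # [tokens, n_experts]
    weights, idx = probs.topk(top_k, dim=-1)       # [tokens, top_k]

    out = torch.zeros_like(x)
    for e in range(experts_w1.shape[0]):
        mask = (idx == e)                          # tokens routed to expert e
        rows, slots = mask.nonzero(as_tuple=True)
        if rows.numel() == 0:
            continue
        xe = x[rows]                                            # 1) dispatch
        h = F.silu(xe @ experts_w1[e]) * (xe @ experts_w3[e])   # 2) linear1 + 3) SwiGLU
        ye = h @ experts_w2[e]                                  # 4) linear2
        out.index_add_(0, rows,                                 # 5) combine
                       ye * weights[rows, slots].unsqueeze(-1))
    return out
```

In this unfused form, every stage boundary serializes communication with compute: on multi-GPU setups, dispatch and combine wait on NVLink transfers while the Tensor Cores sit idle. The mega-kernel described above runs all five stages in a single launch, so that traffic overlaps the expert GEMMs instead of alternating with them.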

deepseek · deepgemm · mixture-of-experts
7 min read