All articles

// Popular Articles

#dflash

#754 2026-03-08

Qwen3.6 35B hits 164 tok/s on creative writing with DFlash: a new record for open-source MoE

Elliot Arledge published a single-stream benchmark: Qwen3.6-35B-A3B (3B active) with a DFlash drafter at c=1 reaches 164 tokens/sec decode on a creative-writing prompt. That is well above the 60-90 tok/s figures reported for DGX Spark, and it suggests that combining a sparse MoE with block-diffusion speculative decoding is opening a new speed ceiling for 35B LLMs running locally.

qwen3-6 dflash speculative-decoding

7 min read

#572 2025-12-08

DFlash now runs on llama.cpp: block-diffusion drafting, up to 8× speedup for Qwen3

spiritbuun has just pushed an implementation of DFlash, a block-diffusion form of speculative decoding, to the buun-llama-cpp fork. It takes a single flag, --spec-type dflash, and uses a 5-layer draft model that proposes a 16-token block per forward pass, running 6–8× faster than plain decoding and about 2.5× faster than EAGLE-3.

dflash llama-cpp speculative-decoding

6 min read

#453 2025-10-09

Kimi K2.6 + DFlash on 8x MI300X: 508 tok/s, 5.6× faster with no loss in quality

HotAisle has just published a production serving recipe for Kimi K2.6 (1T params) on a single node with 8x AMD Instinct MI300X. Switching from autoregressive decoding to DFlash speculative decoding raises throughput from 90 tok/s to 508 tok/s: same hardware, same model, bit-identical output.

kimi-k2-6 dflash mi300x

7 min read

#259 2025-07-04

DFlash for Qwen3.6-35B-A3B is officially GA: 2.9× faster speculative decoding with a drafter of only 0.5B parameters

Z Lab has just released the final DFlash drafter for Qwen3.6-35B-A3B: a 0.5B-parameter block-diffusion model that reaches a 2.9× speedup on Math500, beating EAGLE-3 by more than 2.5×. The community was already running the preview before training finished; the official weights are now finalized.

dflash qwen3-6 speculative-decoding

7 min read