Tất cả bài viết

// Popular Articles

#speculative-decoding

#7542026-03-08

Qwen3.6 35B chạy 164 tok/s trên creative writing với DFlash: kỷ lục mới của open-source MoE

Elliot Arledge công bố benchmark single-stream: Qwen3.6-35B-A3B (3B active) + DFlash drafter ở c=1 đạt 164 tokens/sec decode trên prompt creative writing — vượt xa con số 60-90 tok/s mà DGX Spark báo cáo, cho thấy combo MoE sparse + block-diffusion speculative decoding đang mở ra một trần tốc độ mới cho LLM 35B chạy local.

qwen3-6dflashspeculative-decoding

7 phút đọc

#5722025-12-08

DFlash đã chạy được trên llama.cpp: block-diffusion draft, speedup tới 8× cho Qwen3

spiritbuun vừa push bản triển khai DFlash — speculative decoding kiểu block-diffusion — vào fork buun-llama-cpp. Một dòng lệnh --spec-type dflash, draft model 5 layer, block 16 token mỗi forward pass, tốc độ gấp 6–8 lần so với decode thường và hơn EAGLE-3 khoảng 2.5×.

dflashllama-cppspeculative-decoding

6 phút đọc

#4532025-10-09

Kimi K2.6 + DFlash trên 8x MI300X: 508 tok/s, nhanh gấp 5.6 lần mà không mất chất lượng

HotAisle vừa công bố công thức serving production cho Kimi K2.6 (1T params) trên một node 8x AMD Instinct MI300X. Chuyển từ autoregressive sang DFlash speculative decoding đẩy throughput từ 90 tok/s lên 508 tok/s — cùng phần cứng, cùng model, output bit-identical.

kimi-k2-6dflashmi300x

7 phút đọc

#4422025-10-04

200 tok/s, 49W: Qwen3.6-27B-FP8 Runs Flagship Coding on a Single DGX Spark

A day after Alibaba shipped Qwen3.6-27B, engineer Mitko Vasilev posted a number that should make every indie AI builder look twice: 200 tokens/sec peak, 136 tok/s average, 256k context, 10 concurrent agents — on one NVIDIA GB10 drawing just 49 watts. Here is what the stack is doing and why the tok/s-per-watt curve just bent.

qwen3-6dgx-sparkgb10

6 phút đọc

#2592025-07-04

DFlash cho Qwen3.6-35B-A3B chính thức GA: speculative decoding 2.9× nhanh hơn, drafter chỉ 0.5B tham số

Z Lab vừa release bản final DFlash drafter cho Qwen3.6-35B-A3B — block diffusion 0.5B params đạt 2.9× speedup trên Math500, vượt EAGLE-3 hơn 2.5×. Cộng đồng đã chạy preview từ trước khi training xong, giờ weights chính thức finalized.

dflashqwen3-6speculative-decoding

7 phút đọc

#1222025-04-26

FlashDrive: Reasoning VLA cho xe tự lái chạy real-time — 716ms xuống 159ms, zero accuracy loss

Z Lab vừa công bố FlashDrive, framework co-design kéo latency Vision-Language-Action model từ 716ms xuống 159ms trên RTX PRO 6000 (tối đa 5.7× trên RTX 4090), giữ nguyên accuracy. Bốn kỹ thuật ghép lại: streaming inference, DFlash speculative reasoning, adaptive-step flow matching, ParoQuant W4A8.

flashdrivevision-language-actionautonomous-driving

7 phút đọc