
Yeah, if you look at the DeepSeek v3 paper more closely, each saving on each axis is understandable on its own. Combined, they reach a magic number people can talk about (10x!): FP8: ~1.6x to 2x faster than BF16 / FP16; MLA: cuts KV cache size by 4x (I think); MTP: converges 2x to 3x faster; DualPipe: maybe ~1.2x to 1.5x faster.
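
To put rough numbers on how those combine, here is a back-of-envelope sketch (the ranges are the ones quoted above; treating them as independent multipliers is a simplification, since the savings overlap):

  # Back-of-envelope combination of the per-axis training speedups above.
  # Multiplying them assumes the savings are independent, which they are
  # not exactly, so read this as an order-of-magnitude estimate only.
  low  = 1.6 * 2.0 * 1.2   # FP8 x MTP x DualPipe, conservative ends
  high = 2.0 * 3.0 * 1.5   # optimistic ends
  print(f"combined training speedup: ~{low:.1f}x to ~{high:.1f}x")
  # MLA's ~4x KV-cache reduction is a memory saving rather than a
  # throughput multiplier, so it is left out of the product.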

If you look deeper, many of these are only applicable to training (we already do FP8 for inference, MTP is there to improve training convergence, and DualPipe is for overlapping communication / compute, mostly for training purposes too). The efficiency improvement on inference is, IMHO, overblown.
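
Of the four, MLA's KV-cache cut is the piece that most directly carries over to serving, and even that is a memory win rather than a wholesale speedup. A rough sketch with made-up dimensions (not the paper's numbers):

  # Illustrative KV-cache sizing; layer count, heads, head dim and context
  # length are hypothetical, not DeepSeek v3's actual configuration.
  layers, heads, head_dim, ctx_len = 60, 32, 128, 32_768
  bytes_per_token = 2 * heads * head_dim * 2     # K and V, 2 bytes each (FP16)
  baseline_gb = layers * ctx_len * bytes_per_token / 1e9
  mla_reduction = 4                              # the ~4x figure quoted above
  print(f"KV cache per sequence: ~{baseline_gb:.0f} GB -> ~{baseline_gb / mla_reduction:.0f} GB")
  # A smaller cache means bigger batches and less memory traffic at serving
  # time, but that is not the same thing as a 10x cheaper forward pass.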



  we already do FP8 for inference
Yes, but for a given model size, DeepSeek claims that a model trained natively in FP8 will work better than a model quantized to FP8 after training. If that's true, then for a given quality a native FP8 model will be smaller and have cheaper inference.
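
If so, the serving-side arithmetic is straightforward: at the same parameter count, FP8 weights halve memory and bandwidth relative to BF16, and native FP8 training is what would let you keep that without the quality hit of post-training quantization. The parameter count below is illustrative:

  # Why native FP8 is attractive for serving; the parameter count is
  # illustrative, not any particular model's size.
  params = 70e9
  bf16_gb = params * 2 / 1e9   # 2 bytes per weight
  fp8_gb  = params * 1 / 1e9   # 1 byte per weight
  print(f"weights: {bf16_gb:.0f} GB in BF16 vs {fp8_gb:.0f} GB in FP8")
  # Post-training quantization gives the same footprint but may cost
  # accuracy; the claim is that training natively in FP8 avoids that trade.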



