
Yeah, if you look at the DeepSeek v3 paper more closely, each saving on each axis is understandable on its own. Combined, they reach a magic number people can talk about (10x!): FP8: ~1.6x to 2x faster than BF16 / FP16; MLA: cuts KV cache size by 4x (I think); MTP: converges 2x to 3x faster; DualPipe: maybe ~1.2x to 1.5x faster.
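
To put rough numbers on how those combine, here is a back-of-envelope sketch (the ranges are the ones quoted above; treating them as independent multipliers is a simplification, since the savings overlap):

  # Back-of-envelope combination of the per-axis training speedups above.
  # Multiplying them assumes the savings are independent, which they are
  # not exactly, so read this as an order-of-magnitude estimate only.
  low  = 1.6 * 2.0 * 1.2   # FP8 x MTP x DualPipe, conservative ends
  high = 2.0 * 3.0 * 1.5   # optimistic ends
  print(f"combined training speedup: ~{low:.1f}x to ~{high:.1f}x")
  # MLA's ~4x KV-cache reduction is a memory saving rather than a
  # throughput multiplier, so it is left out of the product.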

If you look deeper, many of these are only applicable to training (we already do FP8 for inference, MTP is there to improve training convergence, and DualPipe is for overlapping communication / compute, mostly for training purposes too). The efficiency improvement on inference is, IMHO, overblown.
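
Of the four, MLA's KV-cache cut is the piece that most directly carries over to serving, and even that is a memory win rather than a wholesale speedup. A rough sketch with made-up dimensions (not the paper's numbers):

  # Illustrative KV-cache sizing; layer count, heads, head dim and context
  # length are hypothetical, not DeepSeek v3's actual configuration.
  layers, heads, head_dim, ctx_len = 60, 32, 128, 32_768
  bytes_per_token = 2 * heads * head_dim * 2     # K and V, 2 bytes each (FP16)
  baseline_gb = layers * ctx_len * bytes_per_token / 1e9
  mla_reduction = 4                              # the ~4x figure quoted above
  print(f"KV cache per sequence: ~{baseline_gb:.0f} GB -> ~{baseline_gb / mla_reduction:.0f} GB")
  # A smaller cache means bigger batches and less memory traffic at serving
  # time, but that is not the same thing as a 10x cheaper forward pass.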



  we already do FP8 for inference
Yes, but for a given model size, DeepSeek claims that a model trained natively in FP8 will work better than a model quantized to FP8 after training. If that's true, then for a given quality a native FP8 model will be smaller and have cheaper inference.
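
If so, the serving-side arithmetic is straightforward: at the same parameter count, FP8 weights halve memory and bandwidth relative to BF16, and native FP8 training is what would let you keep that without the quality hit of post-training quantization. The parameter count below is illustrative:

  # Why native FP8 is attractive for serving; the parameter count is
  # illustrative, not any particular model's size.
  params = 70e9
  bf16_gb = params * 2 / 1e9   # 2 bytes per weight
  fp8_gb  = params * 1 / 1e9   # 1 byte per weight
  print(f"weights: {bf16_gb:.0f} GB in BF16 vs {fp8_gb:.0f} GB in FP8")
  # Post-training quantization gives the same footprint but may cost
  # accuracy; the claim is that training natively in FP8 avoids that trade.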



