You can't quantize a 1T model down to "flash" model speed/token price. Around 4 bits per weight (4bpw) is about the limit of reasonable quantization, so you only get a 2-4x weight size reduction (fp16/fp8 -> 4bpw). Easier to serve, sure, but maybe not cheap enough to offer as a free tier.
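To make that arithmetic concrete, here's a quick back-of-the-envelope sketch in Python; the 1T parameter count and bit-widths are just illustrative:

```python
# Rough weight footprint of a hypothetical 1T-parameter model at different bit-widths.
def weight_size_gb(n_params: float, bits_per_weight: float) -> float:
    return n_params * bits_per_weight / 8 / 1e9  # bits -> bytes -> GB

N = 1e12  # 1T parameters (assumed for illustration)
for name, bpw in [("fp16", 16), ("fp8", 8), ("4bpw", 4)]:
    print(f"{name}: {weight_size_gb(N, bpw):,.0f} GB")
# fp16: 2,000 GB, fp8: 1,000 GB, 4bpw: 500 GB -- only a 2-4x reduction,
# so the model is still far too big to serve at "flash" prices.
```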
With distillation you're training a new model, so its size is arbitrary: say a 1T -> 20B (50x) reduction, and the result can then be quantized on top of that. AFAIK distillation is also simply faster/cheaper than training from scratch.
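If it helps make the distillation point concrete, here's a minimal sketch of the standard soft-label distillation loss in PyTorch; the teacher/student setup and temperature value are illustrative assumptions, not any particular lab's recipe:

```python
# Minimal knowledge-distillation loss sketch (PyTorch), assuming a large
# "teacher" and a much smaller "student" that share the same vocabulary.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Soften both distributions and push the student toward the teacher via KL.
    t = temperature
    student_logp = F.log_softmax(student_logits / t, dim=-1)
    teacher_p = F.softmax(teacher_logits / t, dim=-1)
    # Scale by t^2 so gradient magnitudes stay comparable across temperatures.
    return F.kl_div(student_logp, teacher_p, reduction="batchmean") * (t * t)

# Usage: logits of shape (batch, seq_len, vocab); the teacher runs under no_grad,
# e.g. loss = distillation_loss(student(x), teacher(x).detach())
```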