A free lunch? Wouldn't that be nice! Sometimes the quantization process improves...

vlovich123 · on March 27, 2024

I didn’t suggest a free lunch, just that the 8x reduction in RAM (+ faster processing) does not result in an 8x growth in the error. Thus a quantized model will outperform a non-quantized one on an evaluation/RAM metric.
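
For a sense of where an 8x figure could come from, here is a rough sketch of the weight-memory arithmetic. It assumes the comparison is 32-bit weights versus 4-bit weights (the comment does not state the bit widths) and uses a hypothetical 7B-parameter model. It counts the weights only and ignores quantization scales, activations, and the KV cache.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

# Hypothetical 7B-parameter model; the bit widths are assumptions.
fp32_gb = weight_memory_gb(7e9, 32)
int4_gb = weight_memory_gb(7e9, 4)
reduction = fp32_gb / int4_gb

print(f"32-bit: {fp32_gb:.1f} GB, 4-bit: {int4_gb:.1f} GB, reduction: {reduction:.0f}x")
```

Real quantized files tend to be somewhat larger than this estimate, because most formats store extra per-block metadata alongside the weights.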

Y_Y · on March 27, 2024

That's not a good metric.

seattleeng · on March 27, 2024

Many applications don't want to host inference on the cloud and would ideally run things locally. Hardware constraints are clearly important.

I'd actually say it's the most important metric for most open models now, since the price per performance of closed cloud models is so competitive with open cloud models, so edge inference that is competitive is a clear value add.

Y_Y · on March 28, 2024

It's not that memory usage isn't important, it's that dividing error by memory gives you a useless number. The benefit from incremental error decrease is highly nonlinear, as with memory. Improving error by 1% matters a lot more starting from 10% error than 80%. Also a model that used no memory and got everything wrong would have the best score.
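
The degenerate case can be sketched in a few lines. This reads the ratio as "accuracy per GB", which is an assumption, since the thread never pins the formula down. All model names and numbers are hypothetical, and the trivial model is taken to guess at chance level on a four-way multiple-choice task.

```python
def accuracy_per_gb(accuracy: float, memory_gb: float) -> float:
    """Naive ratio metric: higher looks better."""
    return accuracy / memory_gb

# (accuracy, memory in GB); every entry here is made up for illustration.
models = {
    "full_precision": (0.70, 28.0),
    "quantized_4bit": (0.68, 3.5),
    "trivial_guesser": (0.25, 0.001),  # near-zero memory, chance-level accuracy
}

scores = {name: accuracy_per_gb(acc, mem) for name, (acc, mem) in models.items()}
best = max(scores, key=scores.get)

for name, score in scores.items():
    print(f"{name}: {score:.2f} accuracy/GB")
print("winner:", best)
```

Under these assumed numbers the useless model wins by a wide margin, because shrinking the denominator is rewarded without limit.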

seattleeng · on March 28, 2024

I see, I agree with you. But I would imagine the useful metric to be “error rate below X GB memory”. We really just need memory and/or compute reported when these evaluations are performed to compile that. People do it for training reports since compute and memory are implicit based on training time (since people saturate it and report what hardware they’re using). But for inference no such details :\
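
One way to read “error rate below X GB memory” is as a constrained selection: filter to the models that fit the budget, then pick the lowest error. A minimal sketch under that reading follows; the function name `best_under_budget` and all of the numbers are hypothetical.

```python
from typing import Dict, Optional, Tuple

def best_under_budget(
    models: Dict[str, Tuple[float, float]], budget_gb: float
) -> Optional[str]:
    """Return the lowest-error model whose memory fits the budget, or None."""
    eligible = {name: err for name, (err, mem) in models.items() if mem <= budget_gb}
    if not eligible:
        return None
    return min(eligible, key=eligible.get)

# (error rate, memory in GB); made-up values for illustration only.
candidates = {
    "full_precision": (0.30, 28.0),
    "quantized_4bit": (0.32, 3.5),
    "trivial_guesser": (0.75, 0.001),
}

pick_8gb = best_under_budget(candidates, 8.0)
pick_32gb = best_under_budget(candidates, 32.0)
print("8 GB budget:", pick_8gb)
print("32 GB budget:", pick_32gb)
```

Unlike a ratio, this never rewards a model for being small beyond what the budget requires, so the trivial model only wins when nothing else fits.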

rfoo · on March 28, 2024

But using an 8x smaller model does not result in an 8x growth in the error either.

K0balt · on March 27, 2024

I find that q6 and 5+ are subjectively as good as raw tensor files. The 4-bit quality reduction is very detectable though. Of course there must be a loss of information, but perhaps there is a noise floor or something like that.
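
To make the "loss of information" concrete, here is a toy sketch that measures how the reconstruction error of the weights grows as the bit width shrinks. It uses a naive per-tensor absmax round-to-nearest scheme on random Gaussian values, which is an assumption for illustration and not the format the commenter used. Weight error is also not the same thing as model quality, so this says nothing about where a perceptual noise floor might sit.

```python
import math
import random

def quantize_dequantize(weights, bits):
    """Symmetric round-to-nearest quantization with one scale per tensor."""
    qmax = 2 ** (bits - 1) - 1                       # e.g. 7 for 4 bits
    scale = max(abs(w) for w in weights) / qmax      # absmax scaling
    out = []
    for w in weights:
        q = max(-qmax, min(qmax, round(w / scale)))  # integer code, clipped
        out.append(q * scale)                        # back to float
    return out

def rms_error(a, b):
    """Root-mean-square difference between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

random.seed(0)
# Stand-in for a weight tensor: 10,000 samples from a unit Gaussian.
weights = [random.gauss(0.0, 1.0) for _ in range(10_000)]

rms = {bits: rms_error(weights, quantize_dequantize(weights, bits))
       for bits in (8, 6, 5, 4)}
for bits, err in rms.items():
    print(f"{bits}-bit RMS error: {err:.4f}")
```

In this toy setup each bit removed roughly doubles the step size, so the error rises steadily from 8 bits down to 4.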

Taek · on March 27, 2024

At what parameter count? It's been established that quantization has less of an effect on larger models. By the time you are at 70B, quantization to 4 bits is basically negligible.

2099miles · on March 28, 2024

Source? I’ve seen this anecdotally and heard it, but is there a paper you’re referencing?

K0balt · on March 28, 2024

I work mostly with Mixtral and Mistral 7B these days, but I did work with some 70B models before Mistral came out, and I was not impressed with the 4-bit Llama-2 70B.

underlines · on March 27, 2024

This paper finds partially disagreeing evidence: https://arxiv.org/abs/2403.17887

Y_Y · on March 28, 2024

Good reference. I actually work on this stuff day-to-day, which is why I feel qualified to comment on it, though mostly on images rather than natural language. I'll say in my defense that work like this is why I put a little disclaimer. It's well-known that plenty of popular models quantize/prune/sparsify well for some tasks. As the authors propose, "current pretraining methods are not properly leveraging the parameters in the deeper layers of the network", which is what I was referring to as the networks not being "at capacity".