A free lunch? Wouldn't that be nice! Sometimes the quantization process improves the accuracy a little (probably by implicit regularization) but a model that's at or near capacity (as it should be) is necessarily hurt by throwing away most of the information. Language models often quantize well to small fixed-point types like int4, but it's not a magic wand.
I didn’t suggest a free lunch, just that the 8x reduction in RAM (+ faster processing) does not result in an 8x growth in the error. Thus a quantized model will outperform a non-quantized one on an evaluation/RAM metric.
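For a rough sense of where an 8x figure can come from, here is a back-of-the-envelope sketch. It assumes the comparison is 32-bit floats versus 4-bit integers and a hypothetical 7B-parameter model, and it ignores the scale factors that real quantized formats store alongside the weights, so the real-world ratio would be somewhat lower.

```python
def weight_memory_gib(num_params: int, bits_per_weight: float) -> float:
    """Approximate weight storage in GiB, ignoring scales, activations, and KV cache."""
    return num_params * bits_per_weight / 8 / 2**30


NUM_PARAMS = 7_000_000_000  # hypothetical model size, chosen only for illustration

fp32_gib = weight_memory_gib(NUM_PARAMS, 32)  # assumed baseline: 32-bit floats
int4_gib = weight_memory_gib(NUM_PARAMS, 4)   # assumed target: 4-bit integers
reduction = fp32_gib / int4_gib

print(f"fp32: {fp32_gib:.1f} GiB, int4: {int4_gib:.1f} GiB, reduction: {reduction:.0f}x")
```

If the baseline were fp16 instead, the same arithmetic would give 4x rather than 8x.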
Many applications don't want to host inference in the cloud and would ideally run things locally. Hardware constraints are clearly important.
I'd actually say it's the most important metric for most open models now. The price per performance of closed cloud models is so competitive with open cloud models that competitive edge inference is a clear value add.
It's not that memory usage isn't important, it's that dividing error by memory gives you a useless number. The benefit from incremental error decrease is highly nonlinear, as with memory. Improving error by 1% matters a lot more starting from 10% error than 80%. Also a model that used no memory and got everything wrong would have the best score.
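The degenerate case is easy to see with made-up numbers. The thread does not pin down the exact form of the combined metric, so this sketch assumes one plausible reading where lower is better: error rate multiplied by memory. The function name and all figures are hypothetical.

```python
def error_times_memory(error_rate: float, memory_gb: float) -> float:
    """Hypothetical combined score (assumed form): lower is 'better'."""
    return error_rate * memory_gb


# (name, error rate, memory in GB); every number here is invented for illustration
models = [
    ("full precision", 0.10, 28.0),
    ("4-bit quantized", 0.12, 3.5),
    ("empty model", 1.00, 0.0),  # uses no memory and gets everything wrong
]

scores = {name: error_times_memory(err, mem) for name, err, mem in models}
best = min(scores, key=scores.get)

print(scores)
print("best by this metric:", best)
```

The quantized model beats the full-precision one on this score, but the empty model beats both, which is the sense in which the number is useless on its own.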
I see, I agree with you. But I would imagine the useful metric to be “error rate below X GB memory”. We really just need memory and/or compute reported when these evaluations are performed to compile that. People do it for training reports since compute and memory are implicit based on training time (since people saturate it and report what hardware they’re using). But for inference no such details :\
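A minimal sketch of what an “error rate below X GB” lookup could look like, assuming each evaluation reported its memory footprint. The `EvalResult` type, the `best_under_budget` helper, and all the numbers are hypothetical.

```python
from typing import NamedTuple, Optional, Sequence


class EvalResult(NamedTuple):
    name: str
    error_rate: float
    memory_gb: float


def best_under_budget(results: Sequence[EvalResult], budget_gb: float) -> Optional[EvalResult]:
    """Return the lowest-error result that fits in the memory budget, or None if nothing fits."""
    fitting = [r for r in results if r.memory_gb <= budget_gb]
    return min(fitting, key=lambda r: r.error_rate) if fitting else None


# Invented numbers, not measurements of any real model
results = [
    EvalResult("large fp16", 0.08, 140.0),
    EvalResult("large 4-bit", 0.09, 35.0),
    EvalResult("small fp16", 0.15, 14.0),
    EvalResult("small 4-bit", 0.17, 3.5),
]

for budget in (8.0, 40.0, 1.0):
    print(budget, best_under_budget(results, budget))
```

Unlike a ratio, this never rewards a model for shrinking below the budget at the cost of accuracy, and it returns nothing when no model fits.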
I find that q6 and 5+ are subjectively as good as raw tensor files. The 4-bit quality reduction is very detectable, though. Of course there must be a loss of information, but perhaps there is a noise floor or something like that.
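To make the "loss of information" part concrete, here is a toy sketch of symmetric round-to-nearest quantization on synthetic Gaussian weights. It uses a single scale for the whole tensor, which is an assumption for simplicity; it does not model the block-wise schemes behind labels like q6 or q5, and it says nothing about perceived output quality. It only shows how raw rounding error grows as bits are removed.

```python
import math
import random


def quantize_roundtrip(weights, bits):
    """Quantize to signed integers with one shared scale, then map back to floats."""
    levels = 2 ** (bits - 1) - 1  # e.g. 7 positive levels for 4-bit signed
    scale = max(abs(w) for w in weights) / levels
    return [round(w / scale) * scale for w in weights]


def rms_error(a, b):
    """Root-mean-square difference between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


random.seed(0)
# Synthetic stand-in for real weights (assumption: roughly Gaussian)
weights = [random.gauss(0.0, 1.0) for _ in range(10_000)]

errors = {bits: rms_error(weights, quantize_roundtrip(weights, bits)) for bits in (8, 6, 5, 4)}
for bits, err in errors.items():
    print(f"{bits}-bit RMS error: {err:.4f}")
```

Whether a given level of rounding error sits below some task-level "noise floor" is an empirical question this sketch cannot answer.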
At what parameter count? It's been established that quantization has less of an effect on larger models. By the time you are at 70B, the effect of quantizing to 4 bits is basically negligible.
I work mostly with Mixtral and Mistral 7B these days, but I did work with some 70B models before Mistral came out, and I was not impressed with the 4-bit Llama-2 70B.
Good reference. I actually work on this stuff day-to-day, which is why I feel qualified to comment on it, though mostly on images rather than natural language. I'll say in my defense that work like this is why I put a little disclaimer. It's well-known that plenty of popular models quantize/prune/sparsify well for some tasks. As the authors propose, "current pretraining methods are not properly leveraging the parameters in the deeper layers of the network". This is what I was referring to as the networks not being "at capacity".