Hahah, thanks! It was a marathon to develop this, and I'm glad it reached the front page.
The name was proposed by ChatGPT :) It claims it doesn't recognise this approach, so there's a chance it's really a new thing.
I want to reach out to llama.cpp and the others - I hope it gets implemented. I considered just writing a patch for llama myself, but C++ and the scale of that project were beyond me.
As for CPU inference - it should speed that up just as well. And since it can load just a fraction of the weights (e.g. 70%, skipping the least important ones), it should now be possible to run models on less VRAM than before (Q8 still needs to be implemented, though).
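To make the memory-saving part concrete, here's a minimal sketch of static magnitude pruning (Python/NumPy, names purely illustrative - this is not the actual Effort code, just the simplest demonstration of why skipping the smallest weights saves RAM):

    import numpy as np

    def prune_weights(W: np.ndarray, fraction: float = 0.70) -> np.ndarray:
        """Keep only the top `fraction` of entries by absolute magnitude.

        Illustrative only: entries zeroed here would simply never be
        loaded in practice, which is where the VRAM saving comes from.
        """
        k = int(W.size * fraction)                        # entries to keep
        cutoff = np.partition(np.abs(W).ravel(), W.size - k)[W.size - k]
        return np.where(np.abs(W) >= cutoff, W, 0.0)

    rng = np.random.default_rng(0)
    W = rng.standard_normal((4096, 4096)).astype(np.float32)  # one toy layer
    x = rng.standard_normal(4096).astype(np.float32)

    approx = prune_weights(W, 0.70) @ x   # mat-vec with 70% of the weights
    exact = W @ x
    print("relative error:", np.linalg.norm(exact - approx) / np.linalg.norm(exact))

In the real thing you'd store the kept weights compactly instead of as zeros (and pick how much to skip at inference time rather than once at load), but the basic arithmetic is the same.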
Funnily enough - when I tried comparing benchmarks with llama.cpp, I couldn't find speeds for 7B/FP16 on a 16GB MacBook Air, because it's impossible to run with the regular methods. It is possible with Effort.
Ditto - I was running a full-resolution, but cropped, Mixtral on my 96GB M2, even though it usually takes 114GB of RAM. I just loaded 75% of the weights, and it was working smoothly. (Before I messed something up in the implementation and now it produces crap output - needs a fix.)
Implementing this approach could significantly boost the adoption of LLMs on mobile phones and other compact devices. I highly recommend opening an improvement issue for llama.cpp.