
Hahah, thanks! It was a marathon to develop this, and I'm glad it reached the front page.

The name was proposed by ChatGPT :) It claims it doesn't recognise this approach - so there is a chance it's really a new thing.

I want to reach out to llama.cpp and the others - I hope it gets implemented. I considered just writing a patch to llama.cpp myself, but C++ and the scale of that project were beyond me.

As for CPU inference - it should speed that up just as well. And since it can load just a fraction of the weights (e.g. 70%, skipping the least important ones), it should now be possible to run models on less VRAM than before (still, Q8 needs to be implemented first).

Funnily enough - when I tried comparing benchmarks against llama.cpp, I couldn't find speeds for 7B/FP16 on a MacBook Air 16GB, because it's impossible to run with regular methods. It is possible with Effort.

Similarly, I was running a full-resolution but cropped Mixtral on my 96GB M2, even though it usually takes 114GB of RAM. I just loaded 75% of the weights, and it worked smoothly. (Before I messed something up in the implementation, that is - it now produces garbage output and needs a fix.)
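To make the "load only 75% of the weights" idea concrete, here's a rough numpy sketch of keeping just the largest-magnitude fraction of a weight matrix and dropping the rest. This is only an illustration of the general idea, not Effort's actual loading code; `truncate_weights` and the `keep` parameter are hypothetical names.

```python
import numpy as np

def truncate_weights(W, keep=0.75):
    """Keep only the largest-magnitude `keep` fraction of entries,
    zeroing the rest (in a real loader they'd simply never be read
    into memory)."""
    flat = np.abs(W).ravel()
    k = int(len(flat) * keep)
    # Magnitude threshold: the k-th largest absolute value.
    thresh = np.partition(flat, len(flat) - k)[len(flat) - k]
    return np.where(np.abs(W) >= thresh, W, 0.0)
```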



I would imagine the importance of weights depends on the prompt. How do you decide which weights are important?


Yeah, that is more or less the point - it dynamically chooses the weights layer by layer depending on the internal state.

A somewhat technical explanation here: https://kolinko.github.io/effort/equations.html
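The dynamic part can be sketched like this: for each matrix multiplication, look at the current activation vector and compute only the contributions from its largest-magnitude entries, skipping the corresponding weight rows for the rest. A minimal numpy illustration (not the actual Metal implementation; `approx_matvec` and the `effort` parameter are hypothetical names):

```python
import numpy as np

def approx_matvec(x, W, effort=0.3):
    """Approximate x @ W using only the top `effort` fraction of x's
    entries by magnitude; rows of W for the skipped entries are never
    touched, which is where the speedup comes from."""
    k = max(1, int(len(x) * effort))
    idx = np.argpartition(np.abs(x), -k)[-k:]  # indices of largest |x_i|
    return x[idx] @ W[idx, :]
```

With `effort=1.0` this reduces to the exact product; lower values trade accuracy for fewer multiplications, and the choice adapts to each layer's state because `idx` is recomputed per activation.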


> It is possible with Effort.

"All things are possible with enough effort." -- Dad.


Hahaha :)


Implementing this approach could significantly boost the adoption of LLMs on mobile phones and other compact devices. I highly recommend opening an improvement issue for llama.cpp.



