A friend who first heard about the method immediately suggested it might work, perhaps even better, in diffusion models.
It really is a drop-in replacement for regular matrix multiplication. The data structure is a bit more painful to work with (a weight matrix is represented by three datasets, not just one), but it shouldn't be too difficult for developers of existing inference engines to implement for a test.
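To make that concrete, here's a rough sketch of what a three-part representation could look like. This is my guess at the shape of it, not the engine's actual layout - the names (`values`, `indices`, `bucket_stats`), the bucket size, and the sort-within-bucket scheme are all assumptions:

```python
import numpy as np

BUCKET_SIZE = 16  # assumed bucket width, not necessarily what bucketMul uses

def bucketize(W: np.ndarray):
    """Split each row of W into buckets and sort each bucket by magnitude.

    Returns three arrays standing in for the 'three datasets':
      values       - weights, reordered so the largest-magnitude entry
                     of every bucket comes first
      indices      - original column index of each reordered weight
      bucket_stats - top absolute magnitude per bucket, usable to decide
                     which buckets are worth multiplying at a given effort
    """
    rows, cols = W.shape
    assert cols % BUCKET_SIZE == 0
    n_buckets = cols // BUCKET_SIZE

    values = np.empty_like(W)
    indices = np.empty((rows, cols), dtype=np.int32)
    bucket_stats = np.empty((rows, n_buckets), dtype=W.dtype)

    for r in range(rows):
        for b in range(n_buckets):
            lo, hi = b * BUCKET_SIZE, (b + 1) * BUCKET_SIZE
            order = np.argsort(-np.abs(W[r, lo:hi]))  # biggest magnitude first
            values[r, lo:hi] = W[r, lo:hi][order]
            indices[r, lo:hi] = lo + order
            bucket_stats[r, b] = np.abs(values[r, lo])

    return values, indices, bucket_stats
```

The point is just that a single dense matrix turns into three aligned arrays, which is what makes the format awkward to slot into engines built around one contiguous weight blob.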
Half of my challenge was that I wasn't knowledgeable enough to just patch llama.cpp or MLX and use their engines with bucketMul. That's why I opted to build my own - I'm still not sure it was a good choice to build everything from the ground up, although I'm proud of the name :)
Finally - the basic math behind the approximation suggests that this should work with all the models: the cosine similarity score stays at 0.99 up to the magical 25% mark for most of the matrices I tried. It can vary within a model though - e.g. in Llama, the first layer's wq/wv/wk could be easily approximated with 5% effort, whereas some deeper layers scored just 0.90 at 25% - still seemingly enough for the model to stay coherent.
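If you want to sanity-check this on your own weights, here's roughly the measurement. The `approx_matvec` below is a hypothetical stand-in that keeps the top fraction of |W[i,j]·x[j]| products - not the real bucketMul kernel - and random Gaussian weights won't reproduce the exact 0.99 figure, since that depends on the distribution of real trained matrices:

```python
import numpy as np

def cos_sim(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def approx_matvec(W, x, effort):
    """Crude stand-in for bucketMul: keep only the top `effort`
    fraction of |W[i,j] * x[j]| products, skip everything else."""
    contrib = np.abs(W * x)                       # magnitude of every product
    k = max(1, int(effort * contrib.size))
    cutoff = np.partition(contrib.ravel(), -k)[-k]
    return (W * (contrib >= cutoff)) @ x

rng = np.random.default_rng(0)
W = rng.standard_normal((1024, 1024)).astype(np.float32)
x = rng.standard_normal(1024).astype(np.float32)

exact = W @ x
for effort in (0.05, 0.25):
    score = cos_sim(exact, approx_matvec(W, x, effort))
    print(f"effort {effort:.0%}: cos score {score:.3f}")
```

Swap in a real wq/wk/wv slice instead of the random matrix and you can see per-layer scores for yourself.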