Having used CSR, it's not surprising. Some newer formats like block ELL might have more mechanical sympathy, since they avoid uncoalesced reads / gathers, though the code is trickier.
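For anyone who hasn't used CSR, here is a minimal pure-Python sketch of a CSR matrix-vector product, just to show where the gather comes from. The names `csr_matvec`, `data`, `indices` and `indptr` are my own choices for illustration (they mirror the usual CSR array names), and the tiny matrix is made up.

```python
def csr_matvec(data, indices, indptr, x):
    """Compute y = A @ x for a matrix A stored in CSR form."""
    n_rows = len(indptr) - 1
    y = [0.0] * n_rows
    for row in range(n_rows):
        acc = 0.0
        # Non-zeros of this row live in data[indptr[row]:indptr[row + 1]].
        for k in range(indptr[row], indptr[row + 1]):
            # x[indices[k]] is the gather: the address depends on stored
            # column indices, so neighbouring reads of x need not be adjacent.
            acc += data[k] * x[indices[k]]
        y[row] = acc
    return y


# A = [[1, 0, 2],
#      [0, 0, 3],
#      [4, 5, 0]]
data = [1.0, 2.0, 3.0, 4.0, 5.0]
indices = [0, 2, 2, 0, 1]
indptr = [0, 2, 3, 5]
x = [1.0, 2.0, 3.0]
y = csr_matvec(data, indices, indptr, x)
```

The reads of `data` and `indices` are sequential, but the reads of `x` jump around according to `indices`, which is the part that tends to be uncoalesced on a GPU.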
Oh, nice to finally bump into someone who has experience with CSR!
bucketMul has few uncoalesced reads, and it uses a different data structure than regular CSR. It's described here: https://kolinko.github.io/effort/bucketmul.html It splits each matrix row into 16 parts and chooses which ones need to be read. The writes are fully linear.
Not sure if I'm making sense though; it's getting a bit late, and it's been a long day ;)
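To make the row-splitting idea a bit more concrete, here is a rough Python sketch. It is not the actual bucketMul code or data layout (the linked page has that). It assumes contiguous parts, a precomputed per-part maximum, and a simple cutoff rule for deciding what to read; the names `part_maxima`, `approx_vecmat` and `cutoff` are made up for this example.

```python
PARTS = 16  # each matrix row is split into 16 parts


def part_maxima(W, parts=PARTS):
    """Precompute max |w| for every part of every row (done once, offline)."""
    size = len(W[0]) // parts  # assumes the row length is divisible by parts
    return [[max(abs(w) for w in row[p * size:(p + 1) * size])
             for p in range(parts)] for row in W]


def approx_vecmat(x, W, maxima, cutoff, parts=PARTS):
    """Approximate y = x @ W, skipping parts whose contribution looks small."""
    n_cols = len(W[0])
    size = n_cols // parts
    y = [0.0] * n_cols
    parts_read = 0
    for xi, row, row_max in zip(x, W, maxima):
        for p in range(parts):
            # Assumed selection rule: skip the part if even its largest
            # weight, scaled by this input, falls below the cutoff.
            if abs(xi) * row_max[p] < cutoff:
                continue
            parts_read += 1
            base = p * size
            # Reads from the row and writes into y are both contiguous here.
            for j in range(size):
                y[base + j] += xi * row[base + j]
    return y, parts_read


# Toy data: 2 rows x 32 columns, so each part holds 2 weights.
W = [[1.0] * 32,
     [1.0] * 16 + [0.001] * 16]
x = [1.0, 0.5]
y, parts_read = approx_vecmat(x, W, part_maxima(W), cutoff=0.1)
```

In this toy case the second half of the second row is never read, so only 24 of the 32 parts are touched, at the cost of a small error in the affected outputs.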
I looked into it at the beginning, but as far as I understand, modern models like Mistral are difficult to do LoRA on: you can use it to fine-tune, but the model itself doesn't lend itself to such an operation.
I'm still quite new to the field, so I'd appreciate more insight into this, and corrections if I'm wrong.