Yeah, more tests are needed. I got some feedback suggesting KL divergence instead of token similarity as the metric. Initial tests seem to show that it is workable (compared to Q8), but not awesomely amazing. I'll be working on that next week and publishing the results.
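For anyone wondering what the KL comparison would look like, here's a rough sketch, not the actual benchmark code. It assumes you have per-position logits from a reference run and from the approximated run, and that "token similarity" means the share of positions where both runs pick the same top token. The function names and the toy logits are made up for illustration.

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q): how much the approximated distribution q diverges from the reference p.
    # eps guards against log(0); it is an illustrative choice, not a tuned value.
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def mean_kl(ref_logits, test_logits):
    # Average KL over all token positions.
    kls = [kl_divergence(softmax(r), softmax(t)) for r, t in zip(ref_logits, test_logits)]
    return sum(kls) / len(kls)

def top1_agreement(ref_logits, test_logits):
    # Assumed meaning of "token similarity": fraction of positions with the same argmax token.
    same = sum(
        1 for r, t in zip(ref_logits, test_logits)
        if r.index(max(r)) == t.index(max(t))
    )
    return same / len(ref_logits)

# Toy logits for two positions over a 3-token vocabulary (made-up values).
reference = [[2.0, 1.0, 0.1], [0.5, 2.5, 0.3]]
approximated = [[1.9, 1.1, 0.1], [0.4, 1.0, 1.2]]

print("mean KL:", mean_kl(reference, approximated))
print("top-1 agreement:", top1_agreement(reference, approximated))
```

The appeal of KL here is that it looks at the whole output distribution, so it can pick up drift even at positions where the top token still matches.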
As for treating effort+Mistral as a separate model: I wouldn't do that comparison. The model stays the same. All of its weights are still being used, just not all of the time, so we don't really lose information from the source model.