It's not even close to a 45B model. They trained 8 different fine-tunes on the same base model. This means the 8 models differ only by a couple of layers and share the rest of their layers.
Which also means you can fit the 8 models in much less memory than a 45B model would need. Latency will also be much lower than a 45B model's, since each next token is only ever produced by 2 of the 8 models (which 2 are run is chosen by a separate, even smaller and faster, gating model).
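To make the routing claim concrete, here is a minimal sketch of top-2-of-8 expert routing with a linear gate. This is an illustrative toy (the expert count, `top_k`, dimensions, and the single-matrix "experts" are assumptions for the example), not the actual implementation being discussed:

```python
# Toy sketch of sparse Mixture-of-Experts top-2 routing.
# Assumptions for illustration: 8 experts, top_k=2, each "expert" is a
# single weight matrix, and the router is one linear layer.
import numpy as np

rng = np.random.default_rng(0)

n_experts, top_k, d_model = 8, 2, 16

# Each toy expert is just a d_model x d_model matrix.
expert_weights = [rng.standard_normal((d_model, d_model)) for _ in range(n_experts)]
gate_weight = rng.standard_normal((d_model, n_experts))  # the small router

def moe_layer(x):
    """Route one token vector x through the top-2 of the 8 experts."""
    logits = x @ gate_weight                 # router scores, shape (n_experts,)
    top = np.argsort(logits)[-top_k:]        # indices of the 2 highest-scoring experts
    weights = np.exp(logits[top])
    weights /= weights.sum()                 # softmax over the chosen 2 only
    # Only the selected experts run, which is why per-token compute
    # (and latency) is far below running all 8.
    return sum(w * (x @ expert_weights[i]) for w, i in zip(weights, top))

token = rng.standard_normal(d_model)
out = moe_layer(token)
print(out.shape)  # (16,)
```

Note that every token still loads the router and 2 expert FFNs, so memory holds all experts while compute scales with `top_k`.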
> It's not even close to a 45B model. They trained 8 different fine-tunes on the same base model. This means the 8 models differ only by a couple of layers and share the rest of their layers.
No, Mixture-of-Experts is not stacking finetunes of the same base model.
The original paper by Shazeer et al. suffices. What you are describing is possible in theory, and may even have been done in practice here, but in the general case an MoE model is trained from scratch, and the specializations that develop across the expert layers are emergent rather than the product of a design choice.