On CPU, assuming inference is compute bound rather than bandwidth bound, compute time will scale roughly quadratically with the hidden size, since the FC layers (whose weight matrices grow quadratically in that dimension) account for almost all of the compute time in these networks. So if the hidden size is 768 in BERT-Base and 4096 in ALBERT, inference will be approximately 28.4x slower... yikes.
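As a quick back-of-the-envelope sketch (assuming the FC layers dominate and their cost grows quadratically with the hidden size), the 28.4x figure falls straight out of the ratio of hidden sizes:

```python
# Rough estimate of the relative FC compute cost between the two models.
# Assumption: compute is dominated by the fully connected layers, whose
# weight matrices grow roughly quadratically with the hidden size.

bert_base_hidden = 768   # hidden size of BERT-Base
albert_hidden = 4096     # hidden size quoted for ALBERT above

slowdown = (albert_hidden / bert_base_hidden) ** 2
print(f"Estimated slowdown: ~{slowdown:.1f}x")  # ~28.4x
```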