Since I can't JIT on iOS, I was wondering about other ways to increase performance (beyond what the article mentions, I think). One issue is that the virtual registers don't map to CPU registers, but rather to locations in memory, so each VM instruction results in loads and stores. I was thinking instead of generating a large set of instructions with the registers baked in. For example, with 8 VM registers and a 2-operand operation, I would generate 64 variants of that operation, and those baked registers could then be mapped to CPU registers. The downside, of course, is more code paths in the VM. Is that something worth exploring?
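To make the idea concrete, here's a minimal sketch in C of what I mean, assuming just 2 VM registers for brevity (with 8 registers you'd have 64 add variants instead of 4). The point is that each baked variant touches only named C locals, which the compiler can keep in CPU registers, rather than indexing a `vm->regs[]` array in memory on every instruction:

```c
#include <stdint.h>

/* Hypothetical "register-baked" opcodes: the operand registers are
 * part of the opcode itself, so the handler body needs no operand
 * decoding and no memory traffic for register access. */
enum {
    OP_ADD_R0_R0, OP_ADD_R0_R1,   /* dst += src, registers baked in */
    OP_ADD_R1_R0, OP_ADD_R1_R1,
    OP_HALT
};

int64_t run(const uint8_t *code, int64_t a, int64_t b)
{
    int64_t r0 = a, r1 = b;       /* VM registers live in C locals */
    for (;;) {
        switch (*code++) {
        case OP_ADD_R0_R0: r0 += r0; break;
        case OP_ADD_R0_R1: r0 += r1; break;
        case OP_ADD_R1_R0: r1 += r0; break;
        case OP_ADD_R1_R1: r1 += r1; break;
        case OP_HALT:      return r0;
        }
    }
}
```

The names and opcode layout here are made up for illustration; a real interpreter would likely use computed-goto dispatch, but the register-baking idea is the same.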
May all your code be found in the I-cache, your branch predictions be correct, and your speculative executions succeed. And may your performance be blessed with bottlenecks, and your optimization effort devoted there. :)
"[A] large set of instructions" conflicts with the first and last of those. So perhaps start with a hot instruction or two, and try just 2 baked variants? A random weird thought: if the stack top were kept in a wide SIMD register, one might use SIMD masks to select operands without baking or branching? Another: perhaps instructions could have internal branching to choose registers, and the not-a-JIT compiler could optimize for temporally-locally consistent "monomorphic" register choices to keep the branch predictor mostly happy? It might juggle multiple copies of hot instructions with different predictor state?
Can anyone suggest sources of vm-ish optimization advice for cutting-edge CPUs? For instance, the years-ago FORTH advice, IIRC, that putting 1 top-of-stack slot in a register is worth it, 2 is borderline, and 3 isn't... CPUs have changed a lot since then. Especially with register renaming?
Hmm. Yes, that sounds like it's worth exploring, though N*N "physical" instructions for every "logical" instruction is a lot. I would maybe investigate a 1-operand register-based instruction set? That way, you would only have N "physical" instructions per "logical" instruction, which feels like it would scale better. The cost of course would come in the form of limited expressiveness.
A 1-operand instruction set would have N registers and 1 accumulator, with instructions of the form "add register R to accumulator", "load from register R to accumulator", "store from accumulator to register R", etc.
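A rough sketch of that shape in C, assuming 4 baked registers for brevity: each logical op ("load r", "add r", "store r") expands to N opcodes, one per register, instead of N*N or more. (The macro trickery here is just to show how the variants could be stamped out mechanically.)

```c
#include <stdint.h>

/* Hypothetical accumulator machine: every instruction names exactly
 * one register, so each logical op needs only N baked variants. */
#define FOR_EACH_REG(X) X(0) X(1) X(2) X(3)

enum {
#define OPS(n) OP_LOAD_R##n, OP_ADD_R##n, OP_STORE_R##n,
    FOR_EACH_REG(OPS)
#undef OPS
    OP_HALT
};

int64_t run_acc(const uint8_t *code, int64_t *regs)
{
    int64_t acc = 0;              /* the accumulator stays in a C local */
    for (;;) {
        switch (*code++) {
#define CASES(n) \
        case OP_LOAD_R##n:  acc = regs[n];  break; \
        case OP_ADD_R##n:   acc += regs[n]; break; \
        case OP_STORE_R##n: regs[n] = acc;  break;
        FOR_EACH_REG(CASES)
#undef CASES
        case OP_HALT: return acc;
        }
    }
}
```

Note the other registers still live in memory here; only the accumulator is pinned in a CPU register, which is exactly the expressiveness trade-off mentioned above.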
Maybe there's a balance here, a sort of mix where you allocate more registers to the source than the destination? Maybe have registers r0 and r1 be possible output registers, while r0 through r7 are possible input registers? That way you have 16 rather than 64 "physical" instructions per "logical" instruction.
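The count works out as a simple product, which a tiny helper makes explicit (names here are just for illustration):

```c
/* Number of "physical" instruction variants per 2-operand "logical"
 * instruction: every destination register paired with every source. */
int variant_count(int dst_regs, int src_regs)
{
    return dst_regs * src_regs;
}
```

So 2 destinations x 8 sources gives 16 variants, versus 64 for the fully symmetric 8x8 case.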
The only way to figure out if these ideas are worthwhile is to give it a try and measure. You're balancing instruction set expressiveness against branch predictability and instruction cache pressure and I have no idea how that works out.
I did the math on this design a number of years ago. For 8 registers you're looking at a VM that requires a few MB just for the bare instructions, given all the permutations required.
8 is a good number, actually: register-machine performance improves with more registers. The problem is that VM code bloat also increases with the register count.
Let me see if I can reproduce the basic breakdown:
1. 8 registers for a machine word type, 8 registers for a floating point type.
2. Arithmetic opcodes for these types should be 3-operand, because the program bytecode will be much more compact than with 2-operand opcodes, since the latter require programs to add more move/load/store instructions to preserve register contents. Better to have more interpreter instructions that can be shared between VM instances than to bloat the program bytecode, which can't be shared. Each 3-operand instruction thus requires 8^3 = 512 permutations per instruction type.
3. Basic opcodes: unsigned word add, sub, mul, div, signed word add, sub, mul, div, floating point add, sub, mul, div, lshift, rshift, and, or, xor, load word, store word, load float, store float, load byte, store byte. So that's 512 permutations * 23 instructions = 11,776 machine instructions.
4. Then you have some 2-operand opcodes: beq, bne, bge, blt (branching instructions), so that's 8^2 permutations * 4 instructions = 256 machine instructions.
5. Machine instructions are ~8 bytes on 64-bit machines, so that's (11,776 + 256) * 8 = ~96kB for the raw machine code implementing the opcode instructions.
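The arithmetic in the breakdown can be checked mechanically; this helper just encodes the formula above (8 registers, 23 three-operand opcodes, 4 two-operand branches, and the assumed ~8 bytes of machine code per generated handler):

```c
/* Estimated bytes of machine code for all baked opcode handlers,
 * using the figures assumed in the breakdown above. */
long vm_code_bytes(int regs, int n_3op, int n_2op, int bytes_per)
{
    long three_op = (long)regs * regs * regs * n_3op; /* 8^3 * 23 = 11,776 */
    long two_op   = (long)regs * regs * n_2op;        /* 8^2 * 4  = 256    */
    return (three_op + two_op) * bytes_per;
}
```

With the numbers above this comes to 96,256 bytes, i.e. the ~96kB figure, and the cubic term makes it obvious why the register count dominates.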
Much better than I remember actually. I think at the time I was looking at more registers and more primitive types, like supporting byte, signed and unsigned integers, etc. so that leads to more permutations, but those aren't strictly necessary. Since the permutations scale to the third power with the number of registers, the register count needs to be kept small.