Would use XOR 0x30303030lu, then OR with a value shifted left 12 bits, take bits...

RaisingSpear · on Nov 29, 2023

How are you packing the number into 12 bits? Multiple shift+and+or operations? If so, I'd expect that to generally be slower than a multiply+shift.

wholesomepotato · on Nov 30, 2023

it's one XOR, SHL, OR, SHR, AND, on some archs the shifts might come free with the other instruction. I'd expect it to be faster.

a = v ^ 0x30303030lu // normalize to digits 0xXX0a0b0c

b = (a << 12) | a // combine into XXXXXbacb0c

idx = (b >> 12) & 0xfff // get bac

res = lookup[idx]

RaisingSpear · on Nov 30, 2023

That's kinda neat. Actually, if you're doing that, you may as well reduce it to a single shift+or:

a = ((a >> 12) | a) & 0xfff

You could also skip the xor with 0x303030 by adjusting the lookup table accordingly.

Unfortunately, you'd still need to factor in the length argument somehow. That is, if given "23" with length=1, it should parse to 2, not 23. You could address this with a variable shift, but at that point, I can't see it being any better than a multiply+shift, even assuming the lookup table is fully cached.

The other major issue is validation, which the lookup table doesn't help much with.

wholesomepotato · on Nov 30, 2023

Might be wrong but this shortcut corrupts the lower bits with garbage from the higher byte.

The lookup table can detect some, but not all errors, so yeah, it relies on valid input.

RaisingSpear · on Nov 30, 2023

I can't see how it's any different. If you have 0x0y0z, a shift+or gives 0x0yxz, so the result is same as what you have, just with fewer operations.

wholesomepotato · on Dec 2, 2023

It's unclear what is in the highest byte, so I assume not 0x000x0y0z, but 0xab0x0y0z where ab is unknown (in the past comment I used XX for this). If highest byte is known, then sure, even better.

RaisingSpear · on Dec 2, 2023

The '& 0xfff' eliminates the highest byte, so it doesn't matter what it is.

Your code doesn't handle the 'length' parameter, so the problem isn't the highest byte, it's bytes beyond 'length'.

wholesomepotato · on Dec 3, 2023

I think you're right. Even better. Did you have time to bench it, etc.?

I see what you mean by length. I just skimmed over the text originally as I don't have time for rather lame problems like this. I'd just add 3 bits of length to be part of the index, job done. 12KB lookup table instead of 4KB, assuming 0 is not a valid value (negate to avoid needing 0b11).

RaisingSpear · on Dec 3, 2023

Adding the length means another shift+or operation at minimum. I already think this is slower than the technique presented in the article, and this would make it worse.

It's an interesting idea, but I don't see it being practical, even if the size of the table wasn't an issue.

wholesomepotato · on Dec 3, 2023

Instruction level parallelism will make extra shift free. Other than this it needs to be benched and might depend on cpu/arch. I don't care enough to bench and optimize further.

RaisingSpear · on Dec 4, 2023

It's most definitely not free. It'd consume fetch bandwidth, decode/rename/scheduler slots, an execution port etc.

The comparison here is:

((v ^ 0x303030) * 0x640a0100) >> (len << 3)

against:

table[(((v >> 12) | v) & 0xfff) | (len << 12)]

The former is 4 ops, the latter is 6 ops, so throughput wise, the former wins. Latency wise, it also wins, considering that L1 cache lookups are generally 3-5 cycles, whilst integer multiply is typically 3-4.