At one point I was attempting to simulate multiple processors running in paralle...

userbinator · on May 2, 2014

Very interesting. If you're going for size instead of speed, then you can do that thread switch in 5 instructions:

    pusha
    mov [edx], esp
    mov esp, [ecx]
    popa
    ret

(This might be actually be faster too on recent CPUs.)

near · on May 2, 2014

That will definitely work, and be more portable to esoteric ABIs and calling conventions. But it was more than 3x as slow on the Pentium 4, Athlon 64 and Core 2 Duo E6600. I haven't benchmarked since then. But you're pushing and restoring a whole bunch of volatile registers in vain.

Another fun detail, I tried using xchg r32,m32 to swap the stack pointer out in one instruction. Turns out that on the Pentium 4 (and probably others), the instruction is implemented in microcode now. Plus it's a guaranteed atomic operation. The benchmark I wrote ran at least 30x slower than with two mov instructions. I was absolutely blown away by that. People used that all the time in the 8086/80286 days to save a bit on code size (a much bigger deal back then.) Yet that same code, run today, can end up being substantially slower. Not knowing what opcodes will become slower in the future becomes a fairly compelling argument against writing inline assembly for speed.

ARM has nice register lists that you can use to mask out the volatile regs. So an optimal implementation is something like:

    push {r4-r11}
    stmia r1, {sp,lr}
    ldmia r0, {sp,lr}
    pop {r4-r11}
    bx lr

Moving on to amd64 ... Microsoft ignored the SystemV ABI (rbp,rbx,r12-r15 are non-volatile) and instead made xmm6-xmm15 non-volatile as well. This makes it more than twice as slow to perform a safe thread switch there. Even their own fibers implementation ignores xmm6-xmm15, unless a special flag is used.

Probably the strangest was the SPARC CPU. It has register windows for fast register saves/restores on leaf functions. Pretend your 16 regs were a block of memory. It gave you 16 blocks of that memory, and you could change one value to move to a new block of memory. When attempting threading, you couldn't know if you would recurse enough to exhaust this window. So you had no choice but to save and restore every single register window. Context switching became immensely demanding. So much so that GCC offered a compilation flag to not use register windows in binaries it produced.

The choice of volatile vs non-volatile is really fascinating. The less non-volatile you have, the faster both cooperative and preemptive task switching is. But it also means you have less registers that remain safe to use after function calls.

There's also caller vs callee non-volatility: either the caller has to back up the regs it thinks the callee will trample (or all of them); or the callee has to back up the regs it knows it will trample (but may end up backing up regs the caller doesn't actually care about.)

stuaxo · on May 1, 2014

Very nice !

Is it worth having the code on github where people can do pull requests, track issues etc ?

It seems like this could use a bit more exposure.

near · on May 2, 2014

I've never had any luck getting exposure to my code. I'm not good at marketing. If anyone wanted to contribute, we could set one up. I'm always happy for more ASM backends. The setjmp implementation abuses the internal state of jmpbuf, but tends to work everywhere. Yet native ASM backends tend to be twice as fast on average.

Cooperative threading is also really niche. I personally think it's an incredibly transformative model for a lot of tasks. But when preemptive threading came around, everyone abandoned it as old hat.

I also wrote a toolkit abstraction layer based around C++11 (full support for lambda callbacks.) 100% encapsulated using private implementation, so it doesn't leak platform headers into your global namespace at all. Amazingly consistent API, even moreso than Qt. It has backends for Win32, Cocoa, GTK+ and Qt. Full support for layouts and auto-resizing windows. Goes to the trouble of doing things that are insanely hard on specific toolkits sometimes (try hiding a menu item on Windows: it's not supported. So you have to destroy the entire menu and rebuild it without adding that one item. This wrapper does that for you transparently. Or try working with frame geometry on Linux. The toolkits and WMs will fight you to the death, yet it's a common idiom on Windows.) By targeting all APIs, I had to target the least common denominator, so don't expect a web browser widget or a Ribbon or a floating dock. But the end result is that with around 100KB of wrapper code, you can build the exact same app on Windows, OS X, Linux and BSD; and it will be 100% native. Unlike Qt or GTK+, you won't need 40MB of run-time DLLs to distribute it on Windows. And since it's so much smaller, it's far less buggy than Qt tends to be. It also insulates you from having to learn/write Objective-C for Cocoa, C for GTK+, moc and qmake for Qt, and message loops for Windows. And the best part is, if a new killer API emerges, it's small enough that in two weeks you could write a new wrapper and run your apps on a new platform. Try porting Qt to a new target.

I also wrote a lighter-weight version of SDL. It gives you a raw video buffer, audio sample writer, and input manager for keyboards, mice and gamepads+rumble. There's about 30 API drivers in there (OpenGL, Direct3D, X-video, XAudio2, DirectSound, ALSA, Pulse, OSS, XInput, RawInput, Carbon, amusingly SDL itself, etc.) Meant to integrate into existing applications rather than manage a window, so it binds a child window inside your own app for rendering, and does hardware scaling + multi-pass pixel shaders where available. Doesn't try and implement drawing/scaling functions, image conversion, window management, music file playback, etc. Extremely low level, so that adding a driver takes around 10 minutes if you know the API for it.

My current ambition is to design a low-level, minimalist, object-oriented programming language. Statically typed, strongly typed, with absolutely zero undefined behavior, as close to LL(1) and context-free as possible (must be fully parseable with recursive descent), etc. The goal would be to allow compilation to native binaries (initially via conversion to C, later via LLVM backend if it gains any traction), or to run inside of an existing application as a scripting language. Low-level enough for C-level performance, yet safe enough to be truly portable, yet simple enough to be easily embeddable in other languages. A very lofty goal, but it'll be a fun learning experience if nothing else.