At one point I was attempting to simulate multiple processors running in parallel. It turns out this kind of simulation gets really fine-grained.
So a processor executes instructions, and an instruction/opcode is made up of several cycles. Say for an "INCrement" instruction, there would be a cycle to fetch the opcode prefix, another to fetch the location to increment, another to read the value at that address, another to increment the value, another to write the value back to that address.
A case can arise where one CPU writes to memory right as another CPU is reading that same address. Depending on whether you synchronize the two CPUs between opcodes or between cycles, the reader can end up seeing a different value, and in certain cases that will break emulated software if not handled correctly.
It gets even hairier when you factor in that each cycle consists of several clock ticks. There is time required to hold a value on a bus before a read or write is acknowledged.
So as I kept trying to increase the precision, I found myself nesting state machines within state machines. Each CPU core had one for the instruction table to select the active instruction; each instruction had one for the active cycle; each cycle had one for the active clock tick. And then there was an interrupt-driven DMA unit that could trigger mid-cycle, which also had to be accounted for. It was simply too complex to collapse into one giant, 100,000-case state machine inside a single function. Imagine trying to implement an x86 CPU as one switch table in one function.
So you had to step down through three or four state machines to do something as trivial as incrementing an integer, then return all the way back up the call stack to switch to the other CPU. The code ended up being around 90% state-machine maintenance to 10% actual simulation, and it was painfully slow.
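To make the overhead concrete, here is a minimal sketch (hypothetical names, one flat byte-addressed memory) of just the innermost layer: a cycle-stepped "increment memory" instruction, where every call to the tick function advances exactly one cycle and all progress has to live in the state struct rather than in local variables.

```c
#include <assert.h>
#include <stdint.h>

uint8_t memory[256];   /* toy shared memory */

/* All instruction progress must be stored explicitly, because the
   function returns to the scheduler after every single cycle. */
typedef struct {
  int cycle;           /* which cycle of the instruction is active */
  uint8_t addr;        /* operand address, latched in cycle 1 */
  uint8_t value;       /* working value, latched in cycle 2 */
} Cpu;

void cpu_tick(Cpu *cpu) {
  switch (cpu->cycle) {
    case 0: /* fetch opcode (elided here) */      cpu->cycle = 1; break;
    case 1: cpu->addr  = memory[0x10];            cpu->cycle = 2; break;
    case 2: cpu->value = memory[cpu->addr];       cpu->cycle = 3; break;
    case 3: cpu->value++;                         cpu->cycle = 4; break;
    case 4: memory[cpu->addr] = cpu->value;       cpu->cycle = 0; break;
  }
}
```

Five calls to `cpu_tick` perform one increment, and this is only one layer: the real thing nests another machine above this for instruction selection, and another below it for clock ticks.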
This led me to the idea of giving each emulated CPU its own cooperative thread (a coroutine plus a stack frame). Whenever one CPU was about to read from or write to something the other CPU could see, it would first check that it was 'behind in time' relative to the other; if it wasn't, it would swap stack frames and resume where the other thread had left off. When that CPU in turn stopped, execution would resume right at the first CPU's pause point. Highly reciprocal.
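A minimal sketch of that timing rule, using POSIX ucontext in place of a hand-written context switch (all names here are hypothetical): CPU A only reads shared state once it is no longer ahead of CPU B in emulated time, and B runs until it catches up.

```c
#define _XOPEN_SOURCE 600
#include <assert.h>
#include <stdint.h>
#include <ucontext.h>

static ucontext_t ctx_main, ctx_a, ctx_b;
static char stack_a[65536], stack_b[65536];

static int64_t delta;          /* CPU A ticks minus CPU B ticks */
static uint8_t shared;         /* one byte both CPUs can see */
static int reads_seen[4];

static void cpu_b(void) {      /* writer: one write per 10 ticks */
  for (;;) {
    delta -= 10;
    shared++;
    if (delta <= 0)            /* A is now ahead or even: let it run */
      swapcontext(&ctx_b, &ctx_a);
  }
}

static void cpu_a(void) {      /* reader: one read per 10 ticks */
  for (int n = 0; n < 4; n++) {
    delta += 10;
    if (delta > 0)             /* ahead of B: must let B catch up first */
      swapcontext(&ctx_a, &ctx_b);
    reads_seen[n] = shared;    /* safe: A is not ahead of B here */
  }
  swapcontext(&ctx_a, &ctx_main);
}

int run_demo(void) {
  getcontext(&ctx_a);
  ctx_a.uc_stack.ss_sp = stack_a;
  ctx_a.uc_stack.ss_size = sizeof stack_a;
  makecontext(&ctx_a, cpu_a, 0);

  getcontext(&ctx_b);
  ctx_b.uc_stack.ss_sp = stack_b;
  ctx_b.uc_stack.ss_size = sizeof stack_b;
  makecontext(&ctx_b, cpu_b, 0);

  swapcontext(&ctx_main, &ctx_a);
  return reads_seen[3];        /* last value A observed */
}
```

Each swap resumes the other "CPU" exactly where it left off, mid-loop, with no state machine bookkeeping at all; the loop variables simply live on that thread's stack.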
The end result was a code reduction from around 350KB for a CPU core to around 35KB for the exact same code. And thanks to what I'll call "just in time switching", it was possible to run one CPU through hundreds of instructions when it wasn't communicating with the other, greatly reducing host CPU cache thrashing. I ended up with a huge speed bonus to boot.
The one problematic area ended up being serialization. Basically, this model moves the state machine into the host CPU's stack frames. But you can't write that out portably, so it becomes a real problem to capture an exact point in your program, write it out to disk, and then resume at that exact point in a future run. That took a long time to solve, and required some serious trickery involving "checkpoints" for serialization. So it's something to keep in mind if you want to use coroutines/cooperative threads and also want to serialize/unserialize things.
What was refreshing was that the core logic for switching between two threads is surprisingly simple. For x86, it is simply:
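(What follows is a reconstruction in the spirit of libco's x86 backend, with an assumed fastcall-style convention: ecx points at the context to resume, edx at the context to save, and a context is just a small array of register slots beginning with the stack pointer.)

```asm
co_swap:                    ; ecx = to-context, edx = from-context
  mov [edx],    esp         ; save the old stack pointer
  mov esp,      [ecx]       ; swap to the new stack pointer
  pop eax                   ; pull the resume address off the new stack
  mov [edx+ 4], ebp         ; save the outgoing non-volatile registers...
  mov [edx+ 8], esi
  mov [edx+12], edi
  mov [edx+16], ebx
  mov ebp,      [ecx+ 4]    ; ...and restore the incoming thread's set
  mov esi,      [ecx+ 8]
  mov edi,      [ecx+12]
  mov ebx,      [ecx+16]
  jmp eax                   ; resume via eax rather than ret
```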
Basically: save the old stack pointer and the non-volatile registers, swap to the new stack pointer, restore the new thread's non-volatile registers, and return. The function's design is the program-equivalent of a palindrome.
(push/pop tends to be slower on Athlon chips than indirect memory accesses; and loading the return address into eax instead of relying on ret lets x86 CPUs start fetching the new instruction stream sooner.)
So, like everyone else who has discovered this concept, I of course wrote my own library for it in C89, which can be downloaded here: http://byuu.org/programming/libco/
Instead of building in complicated schedulers and hundreds of functions like other libraries do, mine is just four functions, each taking zero to two arguments. No data structures, no #defines. The idea is for users to write the scheduling system that best fits their use case, rather than trying to be one-size-fits-all.
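As a sketch of how small that surface is, here is the same four-call shape implemented over POSIX ucontext (libco itself uses hand-written assembly per platform; only the function names here follow its public API, and the rest is an assumption-laden stand-in).

```c
#define _XOPEN_SOURCE 600
#include <assert.h>
#include <stdlib.h>
#include <ucontext.h>

typedef void* cothread_t;

typedef struct { ucontext_t ctx; void *stack; } co_t;

static co_t co_primary;                /* context for the program's own stack */
static co_t *co_running = &co_primary;

cothread_t co_active(void) { return co_running; }

cothread_t co_create(unsigned int stacksize, void (*entry)(void)) {
  co_t *t = malloc(sizeof *t);
  t->stack = malloc(stacksize);
  getcontext(&t->ctx);
  t->ctx.uc_stack.ss_sp = t->stack;
  t->ctx.uc_stack.ss_size = stacksize;
  t->ctx.uc_link = NULL;               /* a real port would trap falling off the end */
  makecontext(&t->ctx, entry, 0);
  return t;
}

void co_switch(cothread_t handle) {
  co_t *from = co_running;
  co_running = handle;
  swapcontext(&from->ctx, &co_running->ctx);
}

void co_delete(cothread_t handle) {
  co_t *t = handle;
  free(t->stack);
  free(t);
}

/* demo: create a thread, bounce into it, and come back */
static cothread_t back_to;
static int entered;
static void entry(void) { entered = 1; co_switch(back_to); }

int co_demo(void) {
  back_to = co_active();
  cothread_t t = co_create(65536, entry);
  co_switch(t);
  co_delete(t);
  return entered;
}
```

Everything else, including deciding which thread to switch to and when, is left to the caller.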
A portable implementation (setjmp/longjmp, or POSIX ucontext) will definitely work, and will cope better with esoteric ABIs and calling conventions. But it was more than 3x as slow on the Pentium 4, Athlon 64 and Core 2 Duo E6600 (I haven't benchmarked since then): you end up saving and restoring a whole bunch of registers in vain.
Another fun detail: I tried using xchg r32, m32 to swap the stack pointer out in one instruction. It turns out that on the Pentium 4 (and probably others), the instruction is now implemented in microcode; worse, xchg with a memory operand carries an implicit lock prefix, making it a guaranteed atomic operation whether you want one or not. The benchmark I wrote ran at least 30x slower than with two mov instructions. I was absolutely blown away by that. People used xchg all the time in the 8086/80286 days to save a bit on code size (a much bigger deal back then), yet that same code, run today, can end up substantially slower. Not knowing which opcodes will become slow in the future is a fairly compelling argument against writing inline assembly for speed.
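For illustration, the two approaches side by side (a sketch, not benchmarked code; separate to/from save slots in ecx/edx are an assumption):

```asm
; one-instruction version: xchg with a memory operand carries an
; implicit lock prefix, so it executes as an atomic read-modify-write
; (microcoded and dramatically slower on modern cores)
xchg esp, [eax]

; two-instruction version: plain moves through separate save slots
mov [edx], esp        ; store the outgoing stack pointer
mov esp,   [ecx]      ; load the incoming stack pointer
```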
ARM has nice register lists that let you save exactly the non-volatile registers in a single instruction. An optimal implementation looks something like:
push {r4-r11}
stmia r1, {sp,lr}
ldmia r0, {sp,lr}
pop {r4-r11}
bx lr
Moving on to amd64 ... Microsoft ignored the System V ABI (where rbp, rbx and r12-r15 are the only non-volatile general-purpose registers) and instead also made rdi, rsi and xmm6-xmm15 non-volatile. This makes a safe thread switch there more than twice as slow. Even their own fibers implementation ignores xmm6-xmm15, unless a special flag is used.
Probably the strangest was the SPARC. It has register windows for fast register saves/restores in leaf functions: pretend your 16 registers are a block of memory; the CPU gives you 16 such blocks, and changing one value moves you to a new block. When implementing threading, you can't know how deeply a thread has recursed into this window set, so you have no choice but to save and restore every single register window. Context switching becomes immensely expensive; so much so that GCC offered a compilation flag to not use register windows in the binaries it produced.
The choice of volatile vs non-volatile registers is really fascinating. The fewer non-volatile registers an ABI has, the faster both cooperative and preemptive task switching become. But it also means fewer registers remain safe to use across function calls.
There's also the question of caller- versus callee-saved registers: either the caller backs up the registers it thinks the callee will trample (or all of them), or the callee backs up the registers it knows it will trample (but may end up saving registers the caller doesn't actually care about).
I've never had any luck getting exposure for my code; I'm not good at marketing. Contributions are welcome, and I'm always happy to take more ASM backends. The setjmp implementation abuses the internal state of jmp_buf but tends to work everywhere; native ASM backends tend to be about twice as fast on average.
Cooperative threading is also really niche. I personally think it's an incredibly transformative model for a lot of tasks. But when preemptive threading came around, everyone abandoned it as old hat.
I also wrote a toolkit abstraction layer built around C++11 (full support for lambda callbacks). It's 100% encapsulated via the private-implementation idiom, so it doesn't leak platform headers into your global namespace at all, and the API is amazingly consistent, even more so than Qt. It has backends for Win32, Cocoa, GTK+ and Qt, with full support for layouts and auto-resizing windows.

It goes to the trouble of doing things that are insanely hard on specific toolkits. Try hiding a menu item on Windows: it's not supported, so you have to destroy the entire menu and rebuild it without that one item; this wrapper does that for you transparently. Or try working with frame geometry on Linux: the toolkits and WMs will fight you to the death, yet it's a common idiom on Windows.

By targeting all of these APIs, I had to target the least common denominator, so don't expect a web browser widget, a Ribbon, or a floating dock. But the end result is that with around 100KB of wrapper code, you can build the exact same app on Windows, OS X, Linux and BSD, and it will be 100% native. Unlike Qt or GTK+, you won't need 40MB of run-time DLLs to distribute it on Windows; and since it's so much smaller, it's far less buggy than Qt tends to be. It also insulates you from having to learn Objective-C for Cocoa, C for GTK+, moc and qmake for Qt, and message loops for Windows. Best of all, if a new killer API emerges, it's small enough that in two weeks you could write a new backend and run your apps on the new platform. Try porting Qt to a new target.
I also wrote a lighter-weight version of SDL. It gives you a raw video buffer, an audio sample writer, and an input manager for keyboards, mice and gamepads (with rumble). There are about 30 API drivers in there: OpenGL, Direct3D, X-video, XAudio2, DirectSound, ALSA, PulseAudio, OSS, XInput, RawInput, Carbon, and (amusingly) SDL itself, among others. It's meant to integrate into existing applications rather than manage a window, so it binds a child window inside your own app for rendering, and does hardware scaling plus multi-pass pixel shaders where available. It doesn't try to implement drawing/scaling functions, image conversion, window management, music file playback, and so on. It's extremely low-level, so adding a driver takes around ten minutes if you know the API for it.
My current ambition is to design a low-level, minimalist, object-oriented programming language: statically and strongly typed, with absolutely zero undefined behavior, and as close to LL(1) and context-free as possible (it must be fully parseable with recursive descent). The goal is to allow compilation to native binaries (initially by converting to C, later via an LLVM backend if it gains any traction), or to run inside an existing application as a scripting language: low-level enough for C-level performance, safe enough to be truly portable, and simple enough to be easily embeddable in other programs. A very lofty goal, but it'll be a fun learning experience if nothing else.