I am a bit confused here. Threads already have separate stacks, don't they?
As for calling a function that does the yielding for you, people tend to call those "stackful" coroutines. The biggest obstacle with stackful coroutines is that you need a separate stack for each coroutine. In particular, your coroutines cannot use the main stack. This imposes some restrictions on how the runtime is implemented.
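For the curious, a stackful coroutine with its own heap-allocated stack can be sketched with POSIX ucontext (deprecated in newer POSIX but still widely available on Linux/glibc). The names `run_demo`/`co_body` are mine, purely for illustration:

```c
#include <stdlib.h>
#include <ucontext.h>

static ucontext_t main_ctx, co_ctx;
int trace[4];   /* records the interleaving: 1,3 run in the coroutine, 2,4 in main */
int n;

static void co_body(void) {
    trace[n++] = 1;                    /* first run, on the coroutine's own stack */
    swapcontext(&co_ctx, &main_ctx);   /* yield back to the caller */
    trace[n++] = 3;                    /* resumed exactly where it left off */
}                                      /* returning follows uc_link back to main */

int run_demo(void) {
    size_t sz = 64 * 1024;
    char *stack = malloc(sz);          /* a separate stack, not the main one */
    getcontext(&co_ctx);
    co_ctx.uc_stack.ss_sp = stack;
    co_ctx.uc_stack.ss_size = sz;
    co_ctx.uc_link = &main_ctx;        /* where to go when co_body returns */
    makecontext(&co_ctx, co_body, 0);

    swapcontext(&main_ctx, &co_ctx);   /* resume the coroutine */
    trace[n++] = 2;                    /* coroutine yielded back here */
    swapcontext(&main_ctx, &co_ctx);   /* resume again; coroutine then finishes */
    trace[n++] = 4;
    free(stack);
    return n;
}
```

Note that the yield can happen arbitrarily deep inside functions called from `co_body` — that's exactly what makes it "stackful".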
You might want to check out the "Revisiting Coroutines" paper from the people who implemented coroutines in Lua. It gives a good overview of the different kinds of coroutines: http://www.inf.puc-rio.br/~roberto/cvpub.html
Actually, if you are mindful of saving and restoring the right state at the right points, and restrict the "yield usage level", you don't need anything more than one stack that all the coroutines share at different times. I've written code (parser-related) that involves a few coroutines manipulating some shared data on the stack ("shared local variables", if that makes any sense). This sort of thing is almost impossible to implement (efficiently) in an HLL, but pretty easy in Asm.
If you wanted two separate threads to share data on the stack of one thread, why not just pass the second thread a pointer to the first thread's actual stack data, instead of relying on stack-relative addressing?
For your ASM example, instead of relying on RSP, keep the first thread's stack address in RBP during execution of the second thread. Then the second thread can have its own entirely separate stack via its own RSP value.
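The pointer-passing suggestion can be sketched in plain C. Here pthreads stand in for the two threads (the names are illustrative, not from the discussion), and the pointer into the first thread's frame is valid only because that frame outlives the second thread's use of it:

```c
#include <pthread.h>

/* Thread B receives a pointer straight into thread A's stack frame. This is
   safe only because A's frame stays alive while B uses it (A joins B below). */
static void *increment(void *p) {
    *(int *)p += 1;            /* writes a variable on the *other* thread's stack */
    return NULL;
}

int share_stack_var(void) {
    int local = 41;            /* lives on this thread's stack */
    pthread_t b;
    pthread_create(&b, NULL, increment, &local);
    pthread_join(b, NULL);     /* ensure B is done before the frame dies */
    return local;              /* now 42 */
}
```

The key constraint is lifetime, not addressing mode: once the frame is gone, the pointer is dangling no matter which register it was derived from.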
> why not just pass the second thread a pointer to the first thread's actual stack data, instead of relying on stack-relative addressing?
That would probably be the only way to do it in a high-level language, but otherwise there really isn't a need to, just like I didn't see a need to set up another stack. I also was not thinking in terms of threads - it was like "swap between these blocks of code", and there were more than 2 of them. Being able to think about and do things like this is why I'm using Asm in the first place - I could restrict myself to writing code using the more limiting conventions of HLLs, almost like a compiler would generate, but I feel that rather defeats the point of using Asm (why not just use a compiler?). Not to say that I don't appreciate using a thread abstraction when it's appropriate, but at this level there are no threads, no functions, no strict relationships between calls and returns, nothing other than a series of instructions.
The "fast-stack-switch" using registers is also something I remember doing before, though it wasn't on x86.
> Threads should already have separate stacks don't they?
There are two types of threads: cooperative and preemptive/parallel. Coroutines with stacks are cooperative threading: there's only ever one thread of execution actually running, so there are never any race conditions or locks. Preemptive threading covers both a single core with an interrupt-driven kernel switching contexts for you, and a multi-core system that's actually running two threads at the same time; the latter model protects you against a full application/system stall if one thread stops responding (hi, OS9), and it also allows scalability for parallel tasks.
> As for the calling a function that does the yielding for you, people tend to call that "stackful" coroutines.
People have a million names for it, unfortunately. It's a simple yet immensely powerful concept, although it's not popular. So people keep reinventing it (due to not knowing it exists already) and coming up with their own names for it. You'll also hear them called cothreads, fibers, green threads, etc.
But at the end of the day, they're all variant names for cooperative threads.
> This imposes some restrictions in how the runtime is implemented.
Getting access to memory for a new stack isn't a problem in practice. Outside of tiny DSPs with dedicated call stacks (eg NEC 7725), I've yet to encounter a processor where you couldn't just allocate a new block of heap memory and use it for another thread's stack space. x86, amd64, ppc32, ppc64, arm, mips, sparc, etc.
Now you can claim that the new memory won't automatically grow at exhaustion. Well, you can mmap things and catch exceptions to expand it. But really, the default these days is 512K - 1M for your main thread's stack anyway in C, and they generally only reserve a max of 8M short of special compiler flags. Just allocate a large block for your extra threads and you'll be fine.
I've also heard of some tricks with setjmp/longjmp that involve subdividing the main stack, but there's really no need for that at all.
A runtime can't safely use purely stack-relative addressing anyway, since there's no telling how deep non-runtime function recursion has already pushed the stack by the time the runtime functions are invoked. Unless it's a very high-level language, in which case there's a myriad of better ways to handle this anyway.