TCO isn't about fast performance, it's about stack depth.
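To illustrate the stack-depth point: a tail call written in Java still pushes a frame per call, because HotSpot does not perform TCO. A minimal sketch (names here are just for demonstration):

```java
public class TcoDemo {
    // A tail call in source form; the JVM still pushes a frame per call,
    // so deep inputs throw StackOverflowError no matter how fast each call is.
    static long sumRec(long n, long acc) {
        if (n == 0) return acc;
        return sumRec(n - 1, acc + n);
    }

    // What TCO would effectively turn the tail call into: constant stack
    // depth, same result.
    static long sumLoop(long n) {
        long acc = 0;
        while (n > 0) acc += n--;
        return acc;
    }

    public static void main(String[] args) {
        System.out.println(sumLoop(1_000_000)); // fine: O(1) stack depth
        try {
            System.out.println(sumRec(1_000_000, 0));
        } catch (StackOverflowError e) {
            System.out.println("sumRec blew the stack"); // ~1M frames vs a ~1MB default stack
        }
    }
}
```

Both functions compute the same sum; only the stack behavior differs, which is exactly the point.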
I like Quasar, but I suspect you removed reduction scheduling because function-call-site reduction counting, as erlang does natively, is not great when implemented in-language: it blows up a number of optimizations and muddies the register picture.
That said, I think MCRed is doing a slight disservice even though he's got good intentions; erlang is amazing for what it does, quasar is pretty nifty and doubtless has many superior use cases; even go has a lot of utility in the concurrency space. Let a billion processes bloom.
> TCO isn't about fast performance, it's about stack depth.
Sure, but again, it's certainly doable (and has been done).
> I suspect you removed reduction scheduling because function-call-site reductions, as erlang does natively, is not great when implemented in-language, as it blows up a number of optimizations and muddies the register picture.
We removed it because it's unhelpful, and let me explain why. Any preemption that's based on time-slices/reductions -- basically, preemption at any point other than the thread blocking on something -- means that the thread still wants more CPU, but you can't allow it to have it because there are others that need it. This is fine when you have roughly the same number of such CPU-hungry threads as CPU cores, or even a bit more -- not when you have 100,000 of those, or 1M. So the number of CPU-hungry threads must be very low, or your system is grossly under-provisioned.
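For concreteness, Erlang-style reduction scheduling can be sketched like this (a simplified, hypothetical model -- not Erlang's or Quasar's actual implementation): each task gets a budget of "reductions", and when the budget runs out the scheduler swaps it to the back of the run queue. Note that a task descheduled this way still wants CPU -- it was stopped mid-work, not because it blocked -- which is the crux of the argument above.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical sketch of reduction-based scheduling: run each task until
// its reduction budget is spent, then requeue it. A task requeued this way
// is by definition still CPU-hungry.
public class ReductionSchedulerSketch {
    interface Task { boolean step(); } // one "reduction"; returns false when done

    static final int BUDGET = 2000;    // Erlang's default budget is roughly 2000 reductions

    static void run(Queue<Task> runQueue) {
        while (!runQueue.isEmpty()) {
            Task t = runQueue.poll();
            int reductions = BUDGET;
            boolean alive = true;
            while (alive && reductions-- > 0) alive = t.step();
            if (alive) runQueue.add(t); // preempted while still wanting CPU
        }
    }

    public static void main(String[] args) {
        Queue<Task> q = new ArrayDeque<>();
        int[] counter = {10_000};
        q.add(() -> --counter[0] > 0); // a CPU-bound task needing 10k reductions
        run(q);                        // task is preempted and requeued 4 times
        System.out.println("counter = " + counter[0]);
    }
}
```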
Now, Erlang has to do it because all Erlang processes are scheduled in user mode, but even in Erlang this doesn't help if you have more than a few such processes. OTOH, on the JVM, where you can pick if you want user-mode or kernel scheduling, you can simply choose the kernel to schedule those few CPU-hungry threads. The kernel does it much better than anything in userspace can.
If you don't have CPU-hungry threads, but well-behaved threads that occasionally become CPU-hungry, it still doesn't help, because the work-stealing scheduler can very easily deal with such occasional behavior.
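On the JVM you can see this behavior with the standard work-stealing pool (this uses `java.util.concurrent.ForkJoinPool`, not Quasar's actual scheduler): a task that bursts CPU just pins one worker for a while, and the other workers pick up the rest of the queue.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class WorkStealingSketch {
    // Runs one CPU-bursty task plus `shortTasks` well-behaved ones on a
    // small work-stealing pool; returns how many tasks completed.
    static int runDemo(int shortTasks) throws InterruptedException {
        ForkJoinPool pool = new ForkJoinPool(4); // 4 worker threads
        AtomicInteger done = new AtomicInteger();

        // One bursty task: spins for ~50ms of CPU, pinning one worker.
        pool.execute(() -> {
            long end = System.nanoTime() + 50_000_000L;
            while (System.nanoTime() < end) { /* busy-spin */ }
            done.incrementAndGet();
        });

        // Short, well-behaved tasks; the other workers run these while the
        // bursty task occupies its worker -- no time-slice preemption needed.
        for (int i = 0; i < shortTasks; i++) {
            pool.execute(done::incrementAndGet);
        }

        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
        return done.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("completed = " + runDemo(100)); // all 101 tasks finish
    }
}
```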
In short, time-slice/reduction based preemption in the user-mode scheduler simply doesn't help you in any way. Erlang does it because it has no other scheduler and it must support (a few) processes that behave that way.
1. Preemptive multitasking is strictly better in almost all cases than cooperative multitasking, when you have more processes, not fewer; especially when these processes are heterogenous. Throughput decreases over the best-case cooperative scenario, but jitter significantly decreases, and most importantly, you -- the user, the operator, or the programmer -- have a strong guarantee that the schedulers will never lock up because someone wrote code wrong. That guarantee alone is why every operating system abandoned a cooperative model and pushed schedulers as close as possible to preemptive multitasking.
2. Kernel scheduling is radically worse than user-space scheduling in nearly every scenario. The kernel schedulers are necessarily very generic and workload-agnostic; kernel threads and the thread management infrastructure are incredibly heavyweight and slow relative to user space; the context switching penalty is gigantic; processes relying on kernel scheduling are forced to compete with other processes on the system; and the list goes on. This is why unikernels, exokernels, and user-space tcp implementations like onload are growing prevalent, especially in the low-latency space.
> Preemptive multitasking is strictly better in almost all cases than cooperative multitasking, when you have more processes
Sorry I didn't make myself clear. Quasar most certainly employs preemptive multitasking, just not time-slice- (or reduction-) based.
> Kernel scheduling is radically worse than user-space scheduling
Except when descheduling on time-slices. Remember two facts: 1/ your hardware can only support very few threads that require time-slice preemption, so those task-switching costs are negligible, and 2/ no other thread is dependent on low-latency access to data produced by a thread that is descheduled on a time-slice, so the latency doesn't matter much either.
None of that is my intuition. I've been profiling and measuring this stuff for a few years now, and I'm a huge proponent of userspace scheduling. Adding time-slice preemption didn't make things any worse -- it just didn't make them any better either, so we took it out, and we tell users to use kernel scheduling for those threads. If the user makes a mistake, we recognize it at runtime and tell her that kernel scheduling might be better for that thread.
"Runaway Fibers
A fiber that is stuck in a loop without blocking, or is blocking the thread it's running on (by directly or indirectly performing a thread-blocking operation) is called a runaway fiber. It is perfectly OK for fibers to do that sporadically (as the work stealing scheduler will deal with that), but doing so frequently may severely impact system performance (as most of the scheduler’s threads might be tied up by runaway fibers). Quasar detects runaway fibers, and notifies you about which fibers are problematic, whether they’re blocking the thread or hogging the CPU, and gives you their stack trace, by printing this information to the console as well as reporting it to the runtime fiber monitor (exposed through a JMX MBean; see the previous section)."
That describes a classic symptom of cooperative multitasking, and that kind of broken behavior doesn't happen in a preemptively multitasked system. It certainly violates the programmer's intuition that she can just write programs, if she has to occasionally check on a JMX MBean to find out whether the framework has pooped on the bed.
Additionally, when you say you are doing preemption without timeslicing, that doesn't really make sense: all preemption requires slicing time; that's literally what it means.
On the kernel scheduling piece, your argument doesn't make any sense to me. Hardware support has only a tangential bearing on high thread context-switching costs, and it is frequently the case that many actors can fit into cache where many threads cannot.
Fundamentally I'm sure Quasar has some great use cases, and probably shows some interesting performance gains over erlang in some of them, but this discussion has dimmed my interest significantly. Quite unfortunate.
Let me repeat: Quasar has the ability to do time-slice scheduling just like Erlang. However, that has proven useless in practice because, unlike Erlang, we have access to kernel scheduling, so we turned that off (I think the relevant code is still there -- it's about ten lines). Switching off that feature was completely backed by evidence, and caused zero harm.
> Certainly it violates the programmer intuition invariant that she can just write programs, if she has to occasionally check on a JMX MBean to find out if the framework has pooped on the bed.
That is easily solvable by having Quasar automatically migrate a thread from being scheduled in user-space to being scheduled by the kernel. However, even that cool feature doesn't justify itself, because things work great as they are. The programmer doesn't need to check JMX occasionally. The behavior is immediately detected in testing and reported as a warning. Simple and effective. Note that all this works even if this behavior occurs in native code, which is more than Erlang does. Threads that are less suitable for user-mode scheduling must be few, and they are very easy to recognize.
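The detection described could, in spirit, be sketched as a watchdog: workers record a heartbeat at every scheduling point, and a monitor flags any worker whose heartbeat has gone stale, dumping its stack trace. (All names here are hypothetical -- this is not Quasar's actual mechanism, just an illustration of the idea.)

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical runaway-detection sketch: threads call heartbeat() at each
// scheduling point; check() flags any registered thread whose heartbeat is
// older than a threshold and appends its stack trace to the report --
// roughly the kind of warning described above.
public class RunawayWatchdogSketch {
    static final Map<Thread, Long> lastYield = new ConcurrentHashMap<>();
    static final long THRESHOLD_NANOS = 100_000_000L; // 100ms without yielding

    // Called by a worker at every park/yield point.
    static void heartbeat() {
        lastYield.put(Thread.currentThread(), System.nanoTime());
    }

    // Called periodically by a monitor thread.
    static void check(StringBuilder report) {
        long now = System.nanoTime();
        lastYield.forEach((thread, t) -> {
            if (now - t > THRESHOLD_NANOS) {
                report.append("runaway: ").append(thread.getName()).append('\n');
                for (StackTraceElement e : thread.getStackTrace())
                    report.append("  at ").append(e).append('\n');
            }
        });
    }

    public static void main(String[] args) throws InterruptedException {
        Thread worker = new Thread(() -> {
            heartbeat();
            long end = System.nanoTime() + 300_000_000L;
            while (System.nanoTime() < end) { /* spinning: never yields */ }
        }, "worker-1");
        worker.start();
        Thread.sleep(200); // let worker-1's heartbeat go stale
        StringBuilder report = new StringBuilder();
        check(report);
        System.out.println(report); // flags worker-1 with its stack trace
        worker.join();
    }
}
```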
Doing any of that automatically may be cool, but has no real value. We might do that at some point, but it's not a high priority.
> as all preemption requires slicing time: that's literally what it means.
No, it means that the scheduler has the ability to deschedule the thread when it wishes[1]. We have this ability, we just no longer use it on time-slices (we easily can, it's just not helpful in the least).
> On the kernel scheduling piece, your argument doesn't make any sense to me. Hardware support is only tangentially to do with high thread context switching costs; and it is frequently the case that many actors can fit into cache, where many threads cannot.
All you say is true, yet none of it matters in practice when it comes to time-slice preemption. Only a handful of threads can enjoy that form of scheduling anyway, whether they take 10MB or 10 bytes of RAM.
"The programmer doesn't need to check JMX occasionally. The behavior is immediately detected in testing and reported as a warning."
Those two sentences are directly contradictory. Either the behavior doesn't happen, or it needs to be checked (via a test suite, or programmatically, or manually). Test suites will go a certain way, but behaviors in production change with new or different data, and the last thing I want to deal with as a developer is trying to remember all the corner cases in my framework when the system is not responsive. Even erlang is difficult; I can't imagine trying to deal with an erlang-like framework with even fewer friendly invariants.
Anyway, it sounds like we inhabit completely different universes, so really there's no point in continuing. Good luck with Quasar. Again, I'm sure it has some great use cases; just not, apparently, common erlang-like ones.
Erlang, AFAIK, doesn't even report runaway processes (that are stuck in native code). Doing all you want is a matter of putting back -- and I'm not exaggerating -- 10 lines of code. Alright, maybe 20. I'm just telling you that it doesn't help. Everything you do in Erlang you can do in Quasar. I can even make you a promise -- if you find that explicitly marking the CPU-bound threads is a burden (and you only have to do that for fibers that are constantly CPU bound -- not bursty ones), I will turn on that feature again, with a big thank-you and a personal mention on our blog. I can assure you that we've spent a long time testing it, and what you imagine to be an issue is not one in practice; not even a slight one (if it were we wouldn't have disabled that code).
Trust me, there are real issues with Quasar, but the one that you fear will bother you isn't one of them.