
> At the risk of repeating myself: I don't have any conclusions.

For someone who doesn't have any conclusions, you're making a lot of assertions that don't jibe with reality.

> And yet there is cost. It is unclear if that cost is a factor.

It's a factor... just not the factor you think it is.

> Because they are not useful.

I think you grokked it.

> That the memory-central approach wins out so heavily (and the fact that we can map-reduce across cores or machines as our problem gets bigger) is a huge advantage of the KDB-powered solution. It's also the obvious implementation for a KDB-powered solution.

KDB is a great tool, but you are sadly mistaken if you think the trick to its success is the size of its runtime. That its runtime is so small is impressive, and a reflection of its craftsmanship, but it isn't why it is efficient. For most data problems, the data dwarfs the runtime, so the efficiency with which the runtime organizes and manipulates the data dominates other factors, like the size of the runtime. This should be obvious, as organizing data is a central purpose of a database.
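To make that concrete, here's a minimal, hypothetical sketch in plain Python (no KDB involved; the field names and sizes are made up). It compares summing a field stored row-by-row, as boxed values behind pointers, against the same field stored as one contiguous column, the way a columnar engine would lay it out. How the data is organized, not how big the interpreter is, drives the difference:

```python
import timeit
from array import array

N = 1_000_000  # illustrative size

# Row-oriented: a list of per-record dicts (pointer-chasing, boxed ints).
rows = [{"price": i, "qty": 2} for i in range(N)]

# Column-oriented: one contiguous buffer for the field we care about.
price_col = array("q", range(N))  # 'q' = signed 64-bit ints

def row_sum():
    return sum(r["price"] for r in rows)

def col_sum():
    return sum(price_col)

assert row_sum() == col_sum()  # same answer, very different memory traffic

t_row = timeit.timeit(row_sum, number=5)
t_col = timeit.timeit(col_sum, number=5)
print(f"row-oriented: {t_row:.3f}s  column-oriented: {t_col:.3f}s")
```

On a typical machine the columnar scan wins by a wide margin, and nothing about that gap depends on the size of the runtime binary.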

> There are a lot of questions here that require more experiments to answer, but one thing stands out to me: Why bother?

Yes, you almost certainly shouldn't bother.

Spark/Hadoop/etc. are intended for massively distributed compute jobs, where the runtime overhead on an individual machine is trivial compared to the inefficiencies you'd encounter from failing to orchestrate the work well. They're designed to tolerate cheap, heterogeneous hardware that fails regularly, so they make a lot of trade-offs that hamper getting anywhere near peak hardware efficiency. You're talking about a runtime fitting in L1, but these are distributed systems that orchestrate work over a network... Your compute might run in L1, but the orchestration sure as heck doesn't. Consequently, they're not terribly efficient for smaller jobs. There is a tendency for people to use them for tasks that are better addressed in other ways. It is unfortunate and frustrating.
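A back-of-envelope model shows why. Suppose every task pays a fixed orchestration cost (scheduling, serialization, shuffle setup) before doing any useful work. The numbers below are invented for illustration, not measurements of any real cluster:

```python
def job_time(total_work_s, tasks, per_task_overhead_s, workers):
    """Crude model: tasks are spread evenly across workers, and each
    task pays a fixed orchestration overhead before its share of work."""
    tasks_per_worker = tasks / workers
    return tasks_per_worker * (per_task_overhead_s + total_work_s / tasks)

# Small job: 1 CPU-second of real work, split into 100 tasks.
small = job_time(total_work_s=1.0, tasks=100, per_task_overhead_s=0.5, workers=100)

# Big job: 100,000 CPU-seconds of real work, same task structure.
big = job_time(total_work_s=100_000.0, tasks=100, per_task_overhead_s=0.5, workers=100)

print(f"small job: {small:.2f}s (~98% overhead)")
print(f"big job:   {big:.1f}s  (overhead is noise)")
```

For the small job the cluster spends almost all its wall-clock time on orchestration; for the big one the same overhead disappears into the noise. That crossover, not the JVM, is what decides whether these systems are the right tool.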

Unless you are dealing with such a problem, they're actually quite inefficient for the job... but that inefficiency is not a function of the JVM.

Measuring the JVM's efficiency with Spark is like measuring C++'s efficiency with Firefox.

> If I've got a faster tool, that encourages the correct approach, why should I bother trying to figure these things out? Or put perhaps more clearly: What do I gain with that 10mb?

If you read the documentation, the gains should be clear. If you are asking the question, likely the gains are irrelevant to your problem. I would, however, caution you to worry less about the runtime size and more about the runtime efficiency. The two are often at best tenuously related.


