Spark is sort of dead though. Dask looks to be the way of the future, in part because it doesn't take a zillion parameters to tune or consume a bucket of resources just for overhead. Good luck.
Thanks for the wishes! Spark is heavily used and its adoption keeps growing, but there are indeed new frameworks like Dask that look promising and are on our radar. Our goal is to foster good practices in the distributed data engineering/science world, whatever the technologies involved, so we'd love to add support for new frameworks in the future.
I've been in the industry for 10+ years. I've worked with everything from telco metrics firehoses to bank customer event streams to deep learning.
The Venn intersection of conditions where Spark makes sense is really rather narrow. A single high-spec instance running leaner tooling will generally meet one's requirements while blowing Spark out of the water in terms of perf and cost.
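To make the "single high-spec instance" point concrete, here's a minimal sketch (my own illustration, not from the comment above): a parallel aggregation using only the Python standard library. The data, chunking, and `parallel_sum` helper are all hypothetical; the point is that a job like this needs no cluster scheduler at all on one big box.

```python
# Sketch: single-machine parallel aggregation with stdlib tooling only.
# No cluster manager, no serialization framework, no JVM overhead.
from concurrent.futures import ProcessPoolExecutor

def partial_sum(chunk):
    # Each worker process aggregates its own slice independently.
    return sum(chunk)

def parallel_sum(data, n_workers=4):
    # Split the input into one contiguous chunk per worker.
    step = max(1, len(data) // n_workers)
    chunks = [data[i:i + step] for i in range(0, len(data), step)]
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        # Map chunks across processes, then combine the partial results.
        return sum(pool.map(partial_sum, chunks))

if __name__ == "__main__":
    print(parallel_sum(list(range(1_000_000))))  # 499999500000
```

Obviously real workloads are messier than a sum, but the shape is the same: if the data fits on one machine, process-level parallelism plus a columnar library is often all the "distributed" you need.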
Operationally, Spark is a huge PITA, hence Databricks and a host of other offerings (I guess including this one) that exist to manage the pain. Meanwhile, something like dask-kubernetes caters to the same use cases with significantly lower operational complexity and, again, much better perf and cost efficiency.
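For context on what "lower operational complexity" looks like, here's roughly the shape of spinning up a Dask cluster on Kubernetes with the dask-kubernetes operator. This is a hedged sketch, not a tested deployment: the cluster name, image tag, and worker count are placeholder values, and the exact `KubeCluster` arguments depend on your dask-kubernetes version and a running Kubernetes cluster with the operator installed.

```python
# Sketch (untested, assumes a k8s cluster with the Dask operator installed):
# a few lines stand in for what would be a pile of Spark/YARN configuration.
from dask_kubernetes.operator import KubeCluster
from dask.distributed import Client

# Placeholder name/image/worker count -- substitute your own.
cluster = KubeCluster(
    name="example-cluster",
    image="ghcr.io/dask/dask:latest",
    n_workers=3,
)
cluster.scale(10)          # resize on demand
client = Client(cluster)   # regular Dask client; submit work as usual
```

The appeal is that scaling, scheduling, and teardown are delegated to Kubernetes primitives rather than a separate cluster manager you have to tune and babysit.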
I can't really think of a scenario where I'd choose to use spark on a greenfield project today.