
Doing sequential reading into a queue for workers to read is a lot more complicated than having a file format that supports parallel reading.

And the fix to allow parallel reading is pretty trivial: escape newlines inside fields, so that a reader dropped at any offset can just keep reading until the first unescaped newline and start at the record that follows it.

It is particularly helpful if you are distributing work across machines, but even in the single machine case, it's simpler to tell a bunch of workers their offset/limit in a file.
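To make the offset/limit idea concrete, here is a rough sketch in Python. It assumes newlines inside fields are already escaped, so every raw newline byte ends a record. The names `byte_ranges` and `read_records` are made up for illustration, not from any library.

```python
import os

def byte_ranges(path, n_workers):
    """Split the file into roughly equal (start, end) byte ranges."""
    size = os.path.getsize(path)
    step = max(1, -(-size // n_workers))  # ceiling division
    return [(s, min(s + step, size)) for s in range(0, size, step)]

def read_records(path, start, end):
    """Yield the records that begin inside the byte range [start, end).

    Assumption: no raw newline appears inside a field, so each raw
    newline is a record boundary.
    """
    with open(path, "rb") as f:
        if start > 0:
            # Back up one byte and discard through the next newline.
            # This skips a record that straddles `start` (the previous
            # worker owns it) and skips nothing if a record begins
            # exactly at `start`.
            f.seek(start - 1)
            f.readline()
        while f.tell() < end:
            line = f.readline()
            if not line:
                break
            yield line.rstrip(b"\n")
```

Each worker gets one `(start, end)` pair and calls `read_records` on it. A record that crosses a range boundary is read in full by the worker whose range contains its first byte, so nothing is duplicated or dropped and no coordination between workers is needed.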



The practical solution is to generate several CSV files and distribute work at the granularity of files.


Sure, now you need to do this statically ahead of time.

It's not unsolvable, but now you have a more complicated system.

A better file format would not have this problem.

The fix is also trivial (escape newlines into \n or similar) and would also make the files easier to view in a text editor.
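For what it's worth, the escaping itself might look something like this in Python. This is only one possible convention (backslash escapes for newline, carriage return, and backslash itself), and `escape_field` / `unescape_field` are hypothetical helper names.

```python
def escape_field(value):
    """Escape a field so it contains no raw newline or carriage return."""
    # Backslash must be escaped first so the mapping stays reversible.
    return (value.replace("\\", "\\\\")
                 .replace("\n", "\\n")
                 .replace("\r", "\\r"))

def unescape_field(value):
    """Reverse escape_field."""
    mapping = {"n": "\n", "r": "\r", "\\": "\\"}
    out = []
    i = 0
    while i < len(value):
        ch = value[i]
        if ch == "\\" and i + 1 < len(value):
            nxt = value[i + 1]
            # Unknown escapes are passed through unchanged.
            out.append(mapping.get(nxt, ch + nxt))
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)
```

Escaping the backslash too is what keeps a field that literally contains the two characters `\n` distinguishable from one that contained a real line break.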


But in practice, you’ll receive a bag of similar-format CSVs.



