> CSV is more of a rumor than a standard This reminds me of something my boss at...

earthboundkid · on June 3, 2021

Lol, so instead of having an actually working solution (escaping), you had a still broken solution that just didn’t blow up as often so you could ignore it until it caused a crash.

th0ma5 · on June 3, 2021

Escaping breaks often as well. The general problem is that the data is inline with the format. Parquet files or something that has clear demarcation between data and file format are more ideal but probably nothing is perfect or future proof. Or accepting of past mistakes either.

quickthrower2 · on June 3, 2021

You can write a perfectly isomorphic escaping printer/parser

th0ma5 · on June 3, 2021

Obligatory https://xkcd.com/927/ but also all systems would have to be proven correct or else it wouldn't work. "Forgiving" parsers are the norm and you can't rely on what they do deterministically.

fellowniusmonk · on June 3, 2021

I always try to use ascii char 31 (unit seperator) if I'm computationally generating csvs for shunting around data internally.

gfody · on June 3, 2021

it's crazy how our progenitors had the wisdom and foresight to reserve FOUR distinct delimiters for us 28-31 file/group/record/unit but webdevs are just nope we'll go with commas and crlfs and when that doesn't work we'll JSON

elcritch · on June 3, 2021

Not sure I've seen a classic Unix tool utilize \x28-\x31 by default. Ignoring standards in ascii was I'm vogue long before webdevs.

quickthrower2 · on June 3, 2021

JSON is probably popular because JS was popular and it’s a subset. And being a programming language that’s is typed by humans, so those special characters would not have naturally featured.

For data, if we are going to use reserved characters, maybe we just use protobuf and let the serialisation code take the strain.

cerved · on June 3, 2021

ah, yes, CSVs. The format famously preferred and lauded by the web development community

hnick · on June 3, 2021

PSV (Pipe) is also good, tabs can rarely show up in some sets of data like mail addresses if humans key them in. I usually go with one or the other if I have a choice.

hnlmorg · on June 3, 2021

CSV (and it's derivatives) made some sense 20 years ago but these days if you want to make your lives easier with tables you're better off using jsonlines.

  ["Name", "Session", "Score", "Completed"]
  ["Gilbert", "2013", 24, true]
  ["Alexa", "2013", 29, true]
  ["May", "2012B", 14, false]
  ["Deloise", "2012A", 19, true]

Plenty of tools, including some I've written myself, support it.

https://jsonlines.org/examples/

https://murex.rocks/docs/types/jsonl.html

867-5309 · on June 3, 2021

that's just CSV with unnecessary square brackets and whitespace

hnlmorg · on June 3, 2021

As someone who's written parsers for both CSV and jsonlines, I can assure you that you could not be further from the truth:

1. The whitespace is optional. It's just put there for illustrative purposes.

2. Whitespace in CSV can actually corrupt the data where some parsers make incompatible assumptions vs other CSV parsers. Eg space characters used before or after commas -- do you trim them or include them? Some will delimit on tabs as well as commas rather than either/or. Some handle new lines differently

3. Continuing off the previous point: new lines in CSVs, if you're following IBM's spec, should be literal new lines in the file. This breaks readability and it makes streaming CSVs more awkward (because you then break the assumption that you can read a file one line at a time). jsonlines is much cleaner (see next point).

4. Escaping is properly defined. eg how do you escape quotation marks in CSV? IBM's CSV spec states double quotes should be doubled up (eg "Hello ""world""") where as some CSV parsers prefer C-style escaping (eg "Hello \"world\"") and some CSV parsers don't handle that edge case at all. jsonlines already has those edge cases solved.

5. CSV is data type less. This causes issues when importing numbers in different parsers (eg "012345" might need to be a string but might get parsed as an integer with the leading space removed. Also should `true` be a string or a boolean? jsonlines is typed like JSON.

The entire reason I recommend jsonlines over CSV is because jsonlines has the readability of CSV while having the edge cases covered that otherwise leads to data corruption in CSV files (and believe me, I've head to deal with a lot of that over my extensive career!)

hnick · on June 4, 2021

CSV is terrible, and I'd never write my own parser, but there are datasets where tab or pipe can never appear and just using @line = split(/|/, $data) or similar in another language is so convenient for quick and dirty scripting.

hnlmorg · on June 4, 2021

Not just datasets where tab or pipe can never appear, but also that quotation marks aren't used and new lines can never appear (in CSV a row of data can legally span multiple lines because you're not supposed to escape char 12 (or '\n' as it appears in C-like documents).

I do get the convenience of CSV and I've used it loads in the past myself. But if ever you're dealing with data of which the contents of it you cannot be 100% sure of, it's safer to use a standard that has strict rules about how to parse control characters.

hnick · on June 4, 2021

TSV/PSV generally don't allow newlines and commas/quotes are not special so are fine. Though Excel doesn't always play nice, but if you care about data integrity, you won't open it in Excel anyway.

hnlmorg · on June 4, 2021

AFAIK TSV and PSV aren't specs, they're just an alternative delimiters for CSV. To that end most TSV and PSV parsers will be CSV parsers which match on a different byte ('\t' or '|' as opposed to ','). Which means if the parser follows spec (which not all do) then they will allow newlines and quotes too.

I'm not saying your use case is isn't appropriate though. eg if you're exporting from a DB who's records have already been sanitised and wanting to do some quick analysis then TSV/PSV is probably fine. But if you aren't dealing with sanitised data that doesn't contain \n, \" or others, then there is a good chance that your parser will handle them differently to your expectations -- and even a slim chance that your parser might just go ahead and slightly corrupt your data rather than warn you about differing column lengths et al. So it's definitely worth being aware that TSV and PSV suffer from all the same weaknesses as CSV.

867-5309 · on June 4, 2021

thanks for your reply, it was really informative. I'm not a fan of CSV so will delve deeper into jsonlines as an alternative next time it crops up

th0ma5 · on June 3, 2021

https://xkcd.com/927/

hnlmorg · on June 4, 2021

That's the nice thing about jsonlines, it's not creating a new competing standard. It's just making better use of an existing standard (JSON).