> It's super annoying if a tool works on a 1K line CSV file, but breaks down if I have a 3 line file because it can't infer the type.
How can that be? Is it possible to have such an ambiguous file? I mean, if a file contains a single number on a single line it could be anything, but the interpretation is the same either way. Can you create a file that has different contents depending on whether it is interpreted as csv or tsv?
>Can you create a file that has different contents depending on whether it is interpreted as csv or tsv?
Easily!
a\tb,c
That's "a\tb" and "c" as a csv, but "a" and "b,c" as a tsv!
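The ambiguity is easy to demonstrate with Python's standard `csv` module, parsing the same line under both delimiters:

```python
import csv
import io

line = "a\tb,c"  # the ambiguous line from above

# Same bytes, two readings: comma-delimited vs tab-delimited
as_csv = next(csv.reader(io.StringIO(line)))                   # default delimiter ','
as_tsv = next(csv.reader(io.StringIO(line), delimiter="\t"))   # tab-delimited

print(as_csv)  # ['a\tb', 'c']
print(as_tsv)  # ['a', 'b,c']
```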
In practice, if the first line contains only a tab or a comma it might be enough to infer that as the separator, but:
1. that would fail on single-column files (by misinterpreting them as multi-column if the unused separator appears)
2. that couldn't infer anything on files where both separators appear
So it would only be a 90% (or maybe 99%) solution.
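A minimal sketch of such a first-line sniffer (hypothetical, not any particular tool's code) makes both failure modes concrete:

```python
def guess_separator(first_line):
    """Naive sniffer: pick whichever of tab/comma appears in the first line.
    Returns None when no unambiguous guess is possible."""
    has_tab = "\t" in first_line
    has_comma = "," in first_line
    if has_tab and not has_comma:
        return "\t"
    if has_comma and not has_tab:
        return ","
    return None  # both or neither present: can't decide

# Failure mode 1: a single-column file whose data happens to contain a comma
print(guess_separator("note: hello, world"))  # ',' -- wrongly split into two columns

# Failure mode 2: both separators present, so no inference is possible
print(guess_separator("a\tb,c"))              # None
```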
If it's a single-column file, you - the user - should know that and act accordingly. Yes, for scripted usage on arbitrary input, I know, but in that case you would specify the input type explicitly. For interactive use, though, a heuristic that defaults to the most common format when in doubt (which would be CSV in the CSV/TSV case) is the way to go. Or at least, it's what I would personally expect as a user.
Heuristics are annoying and data-dependent algo changes are dangerous.
A completely different example: limits in Splunk aggregations. You can run your report on small data and all is well, but when you scale it up (to real production data sizes, maybe), you suddenly get wrong numbers - maybe results like "0 errors of type X" when the real answer is that there are >=1 errors - because one of the aggregations has a window size limit that it silently applies. This stuff is dangerous.
What Splunk was doing for me would be the equivalent of an SQL join giving approximate answers when the data is too big.
The issue with the heuristic is that it can fail depending on the input, and the input can easily change in a way that kills the heuristic.
Say you run the tool on a file, and it detects csv input and all is well. Then you update the file, and now it includes a tab character in the first line and the heuristic detects it as a tsv and now it fails - or the heuristic now gives up, or whatever.
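That flip can be sketched with a toy sniffer (hypothetical; real tools use more elaborate rules, but the same thing can happen): one stray tab in the first line silently changes how the whole file is parsed.

```python
def sniff(first_line):
    # Toy sniffer: prefer tab when one is present, otherwise assume comma
    return "\t" if "\t" in first_line else ","

print(sniff("name,score"))        # ',' -- today the file parses fine as CSV
print(sniff("name,notes\tmisc"))  # '\t' -- one added tab flips detection to TSV
```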
Sure, you can "improve the heuristic", but the data can still, always, change in a way that defeats it. You then need to either be careful with the data or _know_ that you should specify the format, without the tool telling you. Everything seems to work immediately, and then later it blows up. That's a problem akin to e.g. bash and filenames with spaces: everything works, until someone has a space in a filename, and then you get told that you should have known to quote everything all along (the solution there would be to abolish word splitting).
To coin a pithy phrase: When a tool is easy to misuse and a user misuses it, blame the tool, not the user.
Now, if I were writing this thing I would make the logic much simpler: Make it default to csv (or whichever format is more common). Now the way to break the "heuristic" is to give data in the wrong format. But if you use csv, you don't have to explicitly give the format and your data can't break the heuristic (unless it switches format, which you would know about).
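That simpler logic amounts to an explicit-opt-in interface rather than content sniffing. A sketch (hypothetical function name, for illustration):

```python
import csv
import io

def read_table(text, delimiter=","):
    """Default to CSV; the caller opts into TSV explicitly
    instead of relying on content sniffing."""
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))

data = "a\tb,c\n"
print(read_table(data))                  # [['a\tb', 'c']] -- deterministic CSV reading
print(read_table(data, delimiter="\t"))  # [['a', 'b,c']] -- explicit opt-in to TSV
```

The behavior is now a function of the arguments alone, so no change in the data can flip the parse.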
Not only is it possible, as a couple of commenters have already shown, but due to the many variants of CSV out there, it's also possible to construct a CSV file that has different contents depending on which dialect you tell your CSV reader to expect. I'll leave the actual construction as an exercise to the reader, but it would work along the same lines as the ambiguous TSV/CSV files you've seen here already.
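One concrete instance of such a dialect-dependent file (one of many possible constructions) exploits quoting rules: in a quoting dialect the quotes group fields, while in a quoting-disabled dialect they are literal data.

```python
import csv
import io

line = '"a,b",c'

# Default dialect: quotes group fields, so the inner comma is literal text
quoted = next(csv.reader(io.StringIO(line)))
# Quoting disabled: every comma separates, and the quote characters are data
raw = next(csv.reader(io.StringIO(line), quoting=csv.QUOTE_NONE))

print(quoted)  # ['a,b', 'c']
print(raw)     # ['"a', 'b"', 'c']
```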