This is the one thing that the JSON-against-XML holy warriors need to understand properly. Yes, JSON's less verbose; yes, it's just "plain text" (in as much as there is such a thing); yes, XML makes you put closing tags in - but if you need reliable parsing and rock-solid specifications (and it's reasonably likely that you do, even if you think you don't...), then XML, for all its faults, is very likely the better way.
XML sort of died (in many domains) because of its insane complexity and its redundant ways of specifying relations (child relationships vs. explicit relationships; tag-name data vs. attribute data vs. body data), its lack of legibility, parser performance (probably an inherent problem of hierarchical representation?), and other issues, like the very meaning of whitespace. More than half of the JSON or XML I've seen would actually be much easier to read, and have much clearer semantics, if it were just written as a relational database. Some time ago, I tried to improve on CSV by specifying it better and making it more powerful. The result was not too bad: http://jstimpfle.de/projects/wsl/main.html . But I think it should be trimmed down even more, and have only built-in datatypes like JSON, to be able to replace it. (More ambitious standardization efforts would lead to problems similar to XML's, I think.) That's why so far I use the approach only in ad-hoc implementations, in the different flavours needed for various tasks.
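To make the redundancy concrete, here is a made-up record that XML can encode in at least three idiomatic ways (attributes, child elements, or tag name plus body):

```xml
<!-- the same fact, "person 42 is named Alice", three ways: -->
<person id="42" name="Alice"/>                 <!-- attribute data -->
<person id="42"><name>Alice</name></person>    <!-- child-element data -->
<person-42>Alice</person-42>                   <!-- tag-name data plus body data -->
```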
XML parsers are necessarily over-complicated for structured data, because it is a text markup language, not a nested data structure language.
<address>123 Hello World Road, <postcode>12345</postcode>, CA</address>
is perfectly sensible XML. The address is not a tree structure or a key-value dictionary - it is free text with optional markup for some words.
You can use XML to represent nested data structures with lists and dictionaries, but the parsers and their public APIs must still handle the freeform text case as well.
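A quick sketch of what that means for an API, using Python's xml.etree on the address example above (note `tail`, which exists precisely because text can follow an element's end tag):

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    '<address>123 Hello World Road, <postcode>12345</postcode>, CA</address>'
)
print(repr(root.text))       # '123 Hello World Road, ' -- text before the first child
postcode = root.find('postcode')
print(repr(postcode.text))   # '12345'
print(repr(postcode.tail))   # ', CA' -- text between the end tag and the next node
```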
Yep, the application to text documents is valid in my eyes as well. Although there are lighter-weight and/or more extensible approaches, like TeX. (Update, clarification: I mean just the markup syntax, not the computational model.)
Dear god no. I use and love (La)TeX daily to write documents. But as a markup format for data that's supposed to be processed in any way, other than being fed to a TeX engine, it's absolutely terrible. You can't even really parse a TeX document; with all the macros it really is more a program than a document. XML is far from perfect, but it works well as a markup and data exchange format that is well-specified.
I like TeX for producing documents. But I'd take XML over TeX if I had to parse the markup myself, outside of the TeX toolchain. Any nontrivial TeX document is built out of a pile of macros, so you need to implement a TeX-compatible macro expander to parse it. And at least with XML there are solid libraries, while the state of TeX-parsing libraries outside of TeX itself is pretty poor. I think Haskell is the only language with a reasonably good implementation, thanks to the efforts of the pandoc folks.
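For instance, even a trivial made-up macro defeats a purely syntactic parser (the macro name here is hypothetical):

```tex
% a parser that does not expand macros can never discover that
% \mysection produces both a \section and a \label:
\newcommand{\mysection}[1]{\section{#1}\label{sec:#1}}
\mysection{Introduction}
```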
RTF is essentially the same syntax, sans the option to define your own markup. It's only barely human-readable when produced by a word processor, though; then again, generated TeX is awful as well.
Lighter-weight? TeX is Turing-complete. You can't even know whether interpreting it will ever finish, and writing a parser that produces good error messages on invalid input is difficult.
From someone who has written TeX macros before: you are probably mistaking the 'clean' environment of LaTeX for the core TeX language. The former is reasonable, if very limited; the latter is die-hard "you thought you knew how to program, but this proves you wrong" material.
XML over TeX any time and LISP-like over XML (with structural macros)
Ah, yes, agreed! That was my first thought as well when I went into the deep flaming pit of TeX macros. Nevertheless, I guess he didn't have much reference material back then or did he?
FWIW, I remember a quote about Knuth and TeX, "He tried really not to make it a programming language, but ultimately he failed". I don't remember who said that.
You're probably thinking of Knuth himself. He's mentioned several times how he never intended to make a programming language, and how puzzled he is that people write programs in TeX macros.
E.g.:
> In some sense I put in many of the programming features kicking and screaming [...] Every system I looked at had its own universal Turing machine built into it somehow, and everybody’s was a little different from everybody else’s. So I thought, “Well, I’m not going to design a programming language; I wanted to have just a typesetting language.” Little by little, I needed more features and so the programming constructs grew.[...] as a programmer, I was tired of having to learn ten different almost-the-same programming languages for every system I looked at; I was going to try to avoid that.
> I was really thinking of TeX as something that the more programming it had in it, the less it was doing its real mission of typesetting. When I put in the calculation of prime numbers into the TeX manual I was not thinking of this as the way to use TeX. I was thinking, "Oh, by the way, look at this: dogs can stand on their hind legs and TeX can calculate prime numbers."
(Coders at Work interview)
In fact, if you use TeX the way Knuth intended and uses it himself, the use of macros or programming is really quite minimal. It's only LaTeX that, in pursuing a better document interface for the user, ends up with horrifically complex macros -- Mittelbach mentions that nine out of ten "dirty tricks" mentioned by Knuth in the TeXbook are actually used in the source code of LaTeX!
"Insane complexity"? You mean that it actually has a spec instead of the back of a business card with no implementations that agree on what is actually valid?
Yes. No. What it does is too complicated. And I hear that consistent implementations are not a reality for XML, either (at least regarding which features get implemented).
I've written JSON parsers to replace platform specific JSON parsers with bug-for-bug (or at the very least misfeature-for-misfeature) parity to port code without breaking it, without too much going terribly wrong. I wouldn't even try to attempt the same for XML.
Generating a useful conservative subset of JSON that most/all JSON serializers will accept hasn't been that hard in practice IME (no trailing commas, escape all unicode, don't assume >double precision/range scalars, etc.), but I still haven't figured out how to do the same for some XML serializers (failing to serialize because it lacks 'extra' annotation tags in some cases, failing to serialize because it doesn't ignore 'extra' annotation tags in other cases...)
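A minimal sketch of that conservative subset in Python (function names are mine; the cutoff is the largest integer an IEEE-754 double represents exactly):

```python
import json

MAX_SAFE_INT = 2**53 - 1  # beyond this, double-based parsers silently round

def check_scalars(obj):
    # bool is a subclass of int in Python, so test it first
    if isinstance(obj, bool):
        return
    if isinstance(obj, int) and abs(obj) > MAX_SAFE_INT:
        raise ValueError("integer %d exceeds double precision" % obj)
    if isinstance(obj, dict):
        for v in obj.values():
            check_scalars(v)
    elif isinstance(obj, (list, tuple)):
        for v in obj:
            check_scalars(v)

def conservative_dumps(obj):
    check_scalars(obj)
    # ensure_ascii=True escapes all non-ASCII as \uXXXX;
    # allow_nan=False rejects NaN/Infinity, which are invalid JSON anyway
    return json.dumps(obj, ensure_ascii=True, allow_nan=False)
```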
Vogons destroyed XML and they would love to destroy JSON. Back away from the JSON, vogons; go make another 'simple' format to put all your edge cases and complexity in. Just try to make a format simpler than JSON: it is based on objects, lists, and the basic types string, number, bool, and null. Where data doesn't fit into those, you make it fit, or you move to another format like YAML, BSON, XML, or a binary standard like Protobuf -- or custom binary where needed, say for real-time messaging when you control both endpoints; otherwise you have to constantly update the consuming clients as well.
JSON is a data and messaging format meant to simplify. If you can't serialize/deserialize to/from JSON then your format might be too complex, and if something doesn't exactly fit in JSON, just put the value in a key and add a 'type' or 'meta' key that lets you translate to and from it. If it's binary, store it in base64; if it's a massive number, put it in a string with a type next to it to convert to and from. JSON is merely the messenger; don't shoot it. JSON is so simple it can roll from front-end to back-end, where parsing XML/binary in some areas is more of a pain, especially for third-party consumers.
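A sketch of that convention (the 'type'/'value' key names are arbitrary, something producer and consumer agree on):

```json
{
  "amount":  { "type": "bigint", "value": "340282366920938463463374607431768211456" },
  "payload": { "type": "base64", "value": "3q2+7w==" }
}
```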
JSON being simple actually simplifies the systems built with it, which is a good thing for engineers who like to take complexity and make it simple, rather than take simplicity and make it complex like a vogon.
If other people suggesting that, hey, maybe we should actually be able to express a number correctly makes you splutter about "vogons" or whatever, perhaps it is not they who should take a step back. (For this isn't just "massive" numbers, but anything that isn't a float--themselves ranking just after `null` as the worst disaster in current use in general-purpose programming.)
Telling people to "just" take actions that decrease the reliability and the rigor of their data because of...vogons?...is one of those weird middlebrow things that HN tends to try to steer clear of, last I checked.
(edit: To be clear, I get the reference, I think it's a silly one both for the childish regard the poster to whom I am replying has for other people and textually because it doesn't even hang.)
> If other people suggesting that, hey, maybe we should actually be able to express a number correctly makes you splutter about "vogons" or whatever, perhaps it is not they who should take a step back.
I guess what I am saying is JSON was created for simplicity and needs no updates.
XML has already been created, as have other formats like BSON and YAML; or create a new one that suits more detailed needs.
The sole reason JSON is so successful is that it fought against the 'vogon' complication and bureaucracy that riddled XML and many binary formats of the past. JSON is for dynamic, simple needs, and there are plenty of other, more verbose formats for other needs. JSON works from the front-end to the back-end, and there are domain-specific ways to store more complex data without changing the standard -- or, if that doesn't work, move to another format. The goal of many seems to be to make JSON more complex rather than to understand that it was created solely for simplicity. If it is already hard to parse, it will be worse once you add in many versions of it and more complexity.
I also find it interesting that we seem to be circling back to binary and complex formats. HTTP/2 might be part of the reason this is happening, as big tech turns away from open standards.
Binary formats lead to bigger minefields if they need to change often. Even when it comes to file formats like Microsoft Excel's xls, for example: those are convoluted and were made more complex than needed, leading Microsoft themselves to create xlsx, which is XML-based -- and even that is more complicated than needed. Microsoft has spent lots of money on version converters and on issues caused by their own binary choices and lock-in [1].
> As Joel states, a normal programmer would conclude that Office’s binary file formats:
> - are deliberately obfuscated
> - are the product of a demented Borg mind or vogon mind
> - were created by insanely bad programmers
> - and are impossible to read or create correctly.
A binary data/storage format that has to change often will eventually be convoluted, because it is easier to just tack something randomly onto the end of the bin than to think about structure and version updates. Eventually it is a big ball of obfuscated data. JSON and XML are at least keyed, JSON being more flexible than XML and binary when it comes to changes and versioning.
Much of the move to binary is reminiscent of what led to lock-in and ownership before, and of engineers adding complexity toward those same ends.
There are good and bad reasons to use every format; if JSON doesn't suit your need for numeric precision or length, and you can't store, say, a bigint as a string with a type key describing it as a bigint, maybe JSON isn't the format for the task.
Though SOAP was probably created by vogons straight up, primarily as lock-in: WSDL and schemas/DTDs never really aimed to be interoperable; the point was to own the standard by piling on complexity, with embrace-extend-extinguish in mind. That overcomplication is why web services were won by JSON/REST/HTTP/RPC.
JSON is JavaScript Object Notation, and it was created for that reason; because it is so simple, its usage spread to APIs, frontends, backends, and more. People trying to add complexities break it for the initial goal of the format.
JSON won due to simplicity and many want to take away that killer feature. Keeping things simple is what the best programmers/engineers do and it is many times harder than just adding in more complexity.
Forget XML. You can argue that JSON is better because it’s simpler, and while I’m conflicted, I know I enjoy working with JSON more.
The real question is: with the benefit of hindsight, could you define a better but similarly simple format?
Would an alternative to JSON that specified the supported numeric ranges be less simple? Not really. Would it be better? Yes. The current fact that you can try to represent integers bigger than 2^53, but that they silently lose data, makes no sense except in light of the fact that JSON was defined to work with the quirks of JavaScript.
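The loss is easy to demonstrate; any consumer that parses JSON numbers into doubles does this silently (shown in Python, but the arithmetic is identical in JavaScript):

```python
n = 2**53 + 1
print(n)              # 9007199254740993
print(int(float(n)))  # 9007199254740992 -- the nearest representable double
```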
It's true that different tools are adapted for different uses. But sometimes one tool could have been better without giving up any of what made it useful for its niche.
> The real question is: with the benefit of hindsight, could you define a better but similarly simple format?
I think the only answer to that question is to build it separately from JSON if you think it can be better; if it is truly better, it will win in the market. There is no reason to break JSON and add complexity to its parsing and handling. It is 10x harder to implement simplicity than to implement a format that meets all your needs but ultimately adds complexity.
The problem is when people want to add complexities to JSON. There is nothing stopping anyone from creating a new standard that does do that. But I will argue till the end of time that JSON is successful due to simplicity, not edge cases.
Everything you mention can be implemented in JSON just as a string with type info; wanting the actual type in the format itself might be the problem, since it doesn't fit the use case of simplicity over edge cases. Your use case is one of the hundreds of thousands that people want in JSON.
> But sometimes one tool could have been better without giving up any of what made it useful for its niche.
Famous last words of a standards implementer. JSON wasn't meant to be this broad; it reached broad acceptance largely because for most cases it is sufficient and it simplifies the exchange of data and messages. There are plenty of other standards if you want complexity, or build your own. You use JSON and like it because it is simple.
The hardest thing as an engineer/developer is simplifying complex things. JSON is a superstar in that aspect, and I'd like to thank Crockford for holding back on demands like yours. Not because your reasons don't hold value -- they do -- but because it moves beyond simplicity, and soon JSON would be a thing of the past because it would have been XML'd.
In my opinion JSON is one of the best simplifications ever invented in programming and led to tons of innovation as well as simplification of the systems that use it.
If people make JSON more complex, we need a SON, Simple Object Notation, that is locked to Crockford's JSON; any dev who wants to add complexity to it will forever be put in the bike-shedding shed and live a life of yak shaving.
Correct XML parsing carries at least one DoS attack and one server-side reflection attack, and those are just the two obvious ones. Hence, any secure XML endpoint must not be fully conformant. That's a pretty nasty situation.
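Presumably the DoS is the classic 'billion laughs' entity expansion, and the reflection attack is external-entity (XXE) resolution; both are conformant XML, which is exactly the problem:

```xml
<?xml version="1.0"?>
<!DOCTYPE bomb [
  <!ENTITY a "ha">
  <!ENTITY b "&a;&a;&a;&a;&a;&a;&a;&a;&a;&a;">
  <!ENTITY c "&b;&b;&b;&b;&b;&b;&b;&b;&b;&b;">
  <!-- ...a few more levels of this and expansion reaches gigabytes... -->
  <!ENTITY xxe SYSTEM "file:///etc/passwd">
]>
<bomb>&c;&xxe;</bomb>
```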
And I'm still traumatized from a university project in which we tried to compile XSD into serializer/deserializer pairs in C and java. The compiler structure was easy, code generation was easy, end2end tests cross-language with on-the-fly compilation was a little tricky because we had to hook up gcc / javac in a junit runner. But XSD simple types are hell, and XSD complex types are worse.
I can always make my JSON act like XML if I want to. When I'm following something like JSON API v1.1 I get a lot of the advantages that I'd get from XML with 99% less bloat. You want types? Go for it! There are even official typed JSON options out there. The security / parsing issues with XML alone are enough for me to rule it out.
How many critical security issues are the result of libxml? Nokogiri / libxml accounts for 50% of my emergency patches to my servers. ONE RUBY GEM is the result of half of my security headaches. That's insane. I only put up with it because other people choose to use XML and I want their data.
How many issues are the result of browsers having to deal with broken HTML (a form of XML)?
JSON isn't perfect, and I wouldn't use it absolutely everywhere, but it's dead simple to parse[0], readable without the whitespace issues of YAML, and I can't think of one place I'd use XML over it.
> There are even official typed JSON options out there.
What are the "official" ones?
Everything I've seen involves validation and explicit formatting for a couple specific types (ex: ISO-8601 dates) but it requires the target to specify what it expects.
There's no way to tell, staring at a JSON string, whether "2018-04-22" is meant to be a date rather than a text string.
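The usual workaround is to tag the value explicitly; the field names here are just one possible convention:

```json
{ "created": { "type": "date", "value": "2018-04-22" } }
```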
Now there's no ambiguity and the serialization is still json compliant. You have to let go of the notion that you can just put a date formatted string in there and things will magically work.
Just for the record: XML and HTML are both subsets of SGML, somewhat overlapping, but by no means coterminous with each other (at least until HTML 5 -- I'm honestly not sure what its relationship to SGML is).
And, speaking from experience, the XML nay-sayers should largely be glad if they never had to deal with SGML :)
HTML pretended to be a subset of SGML, but never really was, and the illusion quickly dispersed as time went on, since HTML was strictly pragmatic and ran in resource-constrained environments (the desktop), while SGML was academic, largely theoretical, and ran on servers, analyzing text.
XML, on the other hand, was more of a back-formation – a generalization of HTML; it was not, as I understand it, directly related to SGML in any way. The existence of XML was a reaction to SGML being impractical, so it would be strange if XML directly derived from SGML.
> XML [...] was not [...] directly related to SGML in any way
That's incorrect. XML is by definition a proper subset of WebSGML, the SGML revision specified in ISO 8879:1986 Annex K. These two specifications were published around the same time and authored by the same people.
In a nutshell, XML added DTD-less SGML (e.g. such that every document can be parsed without markup declarations, unlike HTML, which has `img` and other empty elements the parser needs to know about) and XML-style empty elements. The features removed from SGML to become XML were tag inference/omission (as used in HTML), short references (for things such as Wiki syntax, CSV, and even JSON parsing), uses of marked sections other than `CDATA`, the more complex use cases for notations, and link process declarations ("stylesheets"), plus a couple of others.
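To make the empty-element point concrete, an illustrative fragment (HTML-as-SGML first, then its XML spelling):

```html
<!-- HTML (SGML): the parser must learn from the DTD that img is EMPTY
     and that </p> may be omitted -->
<p>An image: <img src="x.png"> and some trailing text.

<!-- XML: self-closing syntax and explicit end tags -- parseable with no DTD -->
<p>An image: <img src="x.png"/> and some trailing text.</p>
```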
XML was intended as a subset of SGML that can be meaningfully parsed without knowing the DTD of the document in question, which involved removing a lot of weird SGML features and constraining others. Formally, XML is not an SGML subset, as there are some unimportant and some quite critical incompatible details.
The main point of HTML5 is that it is not defined in terms of SGML but by its own grammar, which is in fact described by an imperative parsing algorithm (one that also unambiguously specifies what should happen for notionally invalid inputs -- AFAIK to the extent that for every byte stream there is exactly one resulting DOM tree).
HTML5 is almost a subset of SGML, barring some ambiguities in its table spec, HTML comments in script tags, and the spellcheck and contenteditable attributes.
If you think XML doesn't suffer from all the same issues, you haven't used it enough. I'd use protobuf for something that needs strict serialization and parsing.
It's worth pointing out that libxml2 also contains an HTML parser, implementations of XPath, XPointer, various other half-forgotten things beginning with the letter X, a Relax-NG implementation, and clients for both HTTP and FTP. The actual XML parser doesn't need any of that, and almost certainly takes up a lot less than 1.4 MB.
But that's the point, isn't it? S-expressions are light because they define very little, it's only a tree of undefined blobs of data (atoms). It's even more limited than JSON.
I'm not saying it can't be mapped; I'm saying it loses semantics in the translation. For example, how do you represent a boolean in an s-exp, such that anyone with "the s-exp spec" can unambiguously know it's a boolean?
The problem is that any realistic application imposes more semantics than just the JSON semantics on its data. A particular string isn't just a string: it's really an address, or a name, or whatever; a particular number isn't just a number: it's an integer, or an ed25519 key, or a percentage. JSON is as incapable of encoding those semantics as S-expressions are (it's also incapable of encoding an ed25519 key as a number anyway -- at least portably so!).
Except … canonical S-expressions do have a way to encode type information, if you want to take advantage of it: display hints. [address]"1234 Maple Grove Ln.\nEast Podunk, Idaho" is an atom with a display hint which encodes the type of the string. You could do the same for lists, too: (address "1234 Maple Grove Ln." "East Podunk, Idaho") could be the same.
Does this require application-specific knowledge? Yes. Because it's application-specific. JSON offers you just enough rope to hang yourself: you think, 'oh, these are all the basic types anyone needs' — but people need more than basic types: they need rich types. And the only way to do that is … an application-specific layer which unmarshals data into an application-specific data structure.
It's very appealing to think that an application-agnostic system can usefully process data. That was part of the allure of XML; it's part of the allure of JSON. And while it is true that application-agnostic systems can be a great 85% solution, they tend to break, badly, when they're pushed to the limit. Our time as engineers & scientists would be better spent, IMHO, building systems which enable us to rapidly create application-specific code rather than systems which enable us to slowly create partially-universal code.
> It's very appealing to think that an application-agnostic system can usefully process data. That was part of the allure of XML; it's part of the allure of JSON.
If you don't think an application-agnostic structure is useful, then you wouldn't use S-expressions, you'd use a sequence of arbitrary bytes, and leave all the processing to the application. Using anything else is a compromise between generic and application-specific.
Won't you inherit all the encoding trouble this article goes on about? As well as the problems with integer precision? And get a bunch of new ones with special characters in keys...
I'm sure you _could_ specify a nice sexpr format. I'm not sure the specification would be simple though. And just saying "use sexprs" leaves you with all the problems you have when you say "use json".
To put it another way, the biggest problems with JSON aren't with the representation but with the semantics. S-expressions have no inherent semantics beyond those provided by the structure. Take this S-expression:
((:foo ("bar" 1)))
and this JSON:
{ "foo": [ "bar", 1 ] }
and you'd think they have the same meaning, but they don't. The JSON decodes as "an object with one key, foo, which maps to an array containing the string 'bar' and the number 1." The S-expression decodes as "a list containing one list, whose head is the atom :foo and whose tail holds the list of the atoms 'bar' and 1" -- built, ultimately, out of nested pairs terminated by the nil atom. How to interpret the atoms :foo, "bar", and 1, as well as the meaning of their relative positions in the S-expression structure, is the actual hard part.
ETA: I just realized I pretty much repeated what an earlier comment said re: semantics. Sorry for the redundancy.
> How to interpret the atoms :foo, "bar", and 1, as well as the meaning of their relative positions in the S-expression structure, is the actual hard part.
Indeed, which is surely a downside of S-expressions? Anyone reading the JSON will agree that "bar" and 1 are in a linear sequence, and that sequence is somehow labelled "foo". Whereas different readers of the S-expression might not even agree on that much.
No. This is not how you should approach this problem.
The main problem with XML in this regard is the lack of a proper data model. In JSON you have a single, mostly consistent data model: `value ::= atom | [value...] | {key:value...}`. As a data format JSON is half-baked and inefficient, but as a data model it is very clear. On the other hand, XML shines when you actually need semi-structured markup (which would be quite a mouthful to represent in JSON).
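For example, compare a fragment of mixed-content markup with a JsonML-style encoding of it (one common convention; the fragment is made up):

```xml
<p>See <a href="https://example.com">this link</a> for details.</p>
```

```json
["p", "See ", ["a", {"href": "https://example.com"}, "this link"], " for details."]
```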
I’m very confused. What’s hard about nesting XML documents? That’s one of XML’s strengths, and JSON’s weaknesses, because JSON doesn’t have namespaces.