Parsing XML at the Speed of Light (2012) (aosabook.org)
53 points by Tomte on July 19, 2022 | 52 comments


Part of me wishes XML was still in vogue, it and related technologies solved a lot of problems that alternatives like JSON and YAML are still trying to solve.

I mean the last thing I heard from XML was EXI, a new way to transfer XML documents; instead of encoding it into an ASCII representation and gzipping it, it instead (iirc) turns an XML document into a binary representation or stream containing events similar to SAX as mentioned in this article. That could then be reshuffled and compressed for further efficiency. It was a way to improve transfer performance considerably at the cost of human readability on the wire, but nobody should be reading XML documents on the wire anyway, and they would probably use a client that understands EXI.
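For anyone who has not seen SAX-style events, here is a minimal sketch using Python's standard xml.sax module. It only shows what "a stream of events" looks like for a small document; it is not EXI and says nothing about EXI's binary encoding. The EventRecorder class and the sample document are made up for illustration.

```python
import xml.sax


class EventRecorder(xml.sax.ContentHandler):
    """Collects SAX callbacks as a flat list of (event, value) tuples."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def characters(self, content):
        # Skip whitespace-only runs so the event list stays readable.
        if content.strip():
            self.events.append(("text", content.strip()))

    def endElement(self, name):
        self.events.append(("end", name))


def sax_events(document: bytes):
    """Parse the document and return the recorded event stream."""
    handler = EventRecorder()
    xml.sax.parseString(document, handler)
    return handler.events


events = sax_events(b"<order><item>widget</item></order>")
for event in events:
    print(event)
```

The tree never exists in memory here; the consumer only sees start, text and end events in document order, which is the part of the model the comment says EXI borrows.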


Never understood the typical all-or-nothing hivemind developer view around XML and JSON/YAML. It is, and should always have been, a case of horses for courses. XML absolutely has its uses (structured markup, mixed human-readable/data islands, schema validation and multi-schema markup). Yes, it was bad to try to use it for everything, and it is poor and long-winded for short, repeated key/value item markup, but it's equally bad today to treat it as passé and useless: it excels in several areas where YAML manifestly fails. Outside key/value markup of fairly shallow tree structures where schemas are singular, absent or in flux (NoSQL style), or data for machine consumption only, JSON/YAML is also the wrong tool to reach for.


It was an overcomplicated design-by-committee mess.

The only cases where XML tech has survived are those where it got in early, it was "good enough", and inertia has kept it there.

I genuinely don't know of anybody building new technology on a foundation of XML. There's always something simpler, with a lower bug surface, that will handle the job better, even in its originally intended domain of document markup.


There's always something simpler than XML and JSON, and YAML is anything but simple. They didn't try to reduce the bug surface: JSON tried to be parseable with eval, and YAML tried to be succinct to the point of deliberate ambiguity.


I don't know, xml represents multiline values so much better than json/yaml. Then when it comes to escaping reserved characters, wrapping it in cdata is a lot cleaner than json/yaml escaping.

I am currently building something new that uses xml for a properties file. My use case requires sql inside a properties file (without the use of additional .sql files). Really wanted to use json/yaml because I felt like I was in the 90s, but xml was just a lot better for this case.
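A rough sketch of what that can look like, using Python's standard xml.etree.ElementTree. The file layout, the element names (properties, query) and the SQL are all hypothetical, not the actual project's format.

```python
import xml.etree.ElementTree as ET

# Hypothetical properties file; element and attribute names are made up.
PROPERTIES_XML = """\
<properties>
  <query name="recent_orders"><![CDATA[
    SELECT id, total
    FROM orders
    WHERE total > 100 AND status <> 'void'
  ]]></query>
</properties>
"""


def load_queries(xml_text):
    """Return {name: sql} for every <query> element in the document."""
    root = ET.fromstring(xml_text)
    # The parser hands back CDATA content as ordinary element text.
    return {q.get("name"): q.text.strip() for q in root.iter("query")}


queries = load_queries(PROPERTIES_XML)
print(queries["recent_orders"])
```

Inside the CDATA section the > and <> operators need no escaping, and the line breaks of the query survive as written.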


  This: |
    Always "seemed" a lot cleaner to me.
I don't see why it doesn't accommodate SQL just fine.

JSON was never intended primarily for user readability. It's for sending data over a wire. Its focus was on parsing simplicity, ease of serialization and deserialization and to be incidentally readable.


Only works well if your text is a leaf and has no other contained entities/structure.

<entity attr="whatever">Try to make this <appearance>look good</appearance> in <markup>YAML</markup></entity>

As I said, horses for courses. FWIW I actually like YAML for certain limited use cases, just not where XML is better. And vice versa.
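To make the mixed-content point concrete, here is a small sketch with Python's standard xml.etree.ElementTree that walks the snippet above and lists the text runs and inline elements in order. The flatten helper is made up for illustration.

```python
import xml.etree.ElementTree as ET

SNIPPET = (
    '<entity attr="whatever">Try to make this '
    "<appearance>look good</appearance> in <markup>YAML</markup></entity>"
)


def flatten(element):
    """Yield (tag, text) pairs in document order; tag is None for bare text."""
    if element.text:
        yield (None, element.text)
    for child in element:
        yield (child.tag, "".join(child.itertext()))
        # Text that follows a child but still belongs to the parent.
        if child.tail:
            yield (None, child.tail)


entity = ET.fromstring(SNIPPET)
runs = list(flatten(entity))
for run in runs:
    print(run)
```

Text and child elements interleave freely inside one parent, which is the structure that a plain key/value tree has no native slot for.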


  entity:
    attr: whatever
    text: |-
      Try to make this *look good* in **YAML**.
^ I have quite a lot of YAML hanging around these days like this.

Markdown is surprisingly effective as an XML document markup replacement.


Consider the case of embedding (composition is better than inheritance) data defined by a second schema inside a first schema. To quote The Zen of Python, "Namespaces are one honking great idea -- let's do more of those!". But no other format does embedding in any reasonable way (it's all ad hoc extensions on top of your serialization format).


>Consider the case of embedding (composition is better than inheritance) data defined by a second schema inside a first schema.

What do you mean by this "data defined by a second schema inside a first schema"? And why is it a good idea?


You have a text document (eg, LibreOffice) which is XML, and want to embed a SVG in it (which is also XML).

XML has tools to deal with this situation, where it knows that what's found inside <picture></picture> can be evaluated according to the SVG spec.

You also have namespaces, to deal with the fact that a <title> might exist both in a text document and in a song, and you may need to use both inside the same document without them conflicting.
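A minimal sketch of the two-<title> case with Python's standard xml.etree.ElementTree. The document vocabulary and its urn:example:textdoc namespace are invented for the example (this is not LibreOffice's real format); only the SVG namespace URI is the real one.

```python
import xml.etree.ElementTree as ET

# Hypothetical document vocabulary; only the SVG namespace URI is real.
DOC_NS = "urn:example:textdoc"
SVG_NS = "http://www.w3.org/2000/svg"

DOCUMENT = f"""\
<doc:document xmlns:doc="{DOC_NS}" xmlns:svg="{SVG_NS}">
  <doc:title>Quarterly report</doc:title>
  <doc:picture>
    <svg:svg width="10" height="10">
      <svg:title>Company logo</svg:title>
    </svg:svg>
  </doc:picture>
</doc:document>
"""

ns = {"doc": DOC_NS, "svg": SVG_NS}
root = ET.fromstring(DOCUMENT)

# Same local name, two vocabularies, no clash.
doc_title = root.find("doc:title", ns).text
svg_title = root.find(".//svg:title", ns).text
print(doc_title, "/", svg_title)
```

Each title is addressed through its own namespace, so a consumer that only understands one vocabulary can skip the other without guessing.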


> Consider the case of embedding (composition is better than inheritance) data defined by a second schema inside a first schema.

How, exactly, is JSON with JSON Schema inadequate for that?


The only thing I really liked about XML was schemas (XSD I think?).

The ability to design a schema that could be passed around to let people validate the data and structure of an XML file before sending it was really slick. The schema designer tools were easy to use.

I actually enjoyed using WSDL based web services specifically because of this. You could take the file, which was just a schema and generate all the code and data structures you needed to interface with the API from any language…and you got the ability to validate your data client side before even bothering to send it.

It definitely had some pain points, but overall these were the good points that I remember.

GraphQL is probably the closest next evolution.


JSON Schema is a thing


I've just started looking into switching our integrations from XML to JSON.

Tooling around JSON Schema seems to be a bit lacking though. For example, Swagger UI and the Swagger editor don't fully support the latest JSON Schema version, especially w.r.t. choice-like oneOf elements.

And if any of you know of any good tools to visualize a large JSON schema, similar to XMLSpy, that's compatible with the latest JSON schema version then I'd love to hear about it.


> Part of me wishes XML was still in vogue, it and related technologies solved a lot of problems that alternatives like JSON and YAML are still trying to solve.

If you have a problem solved by XML and not other technologies, you can still use XML.


> you can still use XML

Make sure you have thick skin. Your peers will laugh at you. And your colleagues may veto your choice.


Until you ask them to propose a robust solution for document structural validation.


> EXI, a new way to transfer XML documents; instead of encoding it into an ASCII representation and gzipping it, it instead (iirc) turns an XML document into a binary representation or stream containing events similar to SAX as mentioned in this article. That could then be reshuffled and compressed for further efficiency

How would this compare to protobuf, in terms of efficiency?


I don't think anyone's tried benchmarking protobuf against EXI yet, but I've played around with a .NET implementation (Nagasena: http://openexi.sourceforge.net/) and it does outperform Gzip both for compression efficiency and decompression/runtime performance, so long as you use a schema-accelerated parser. It can infer a schema (turning it into a 'grammar' to use the EXI-specific term), but a properly constructed XML Schema will always be fastest/most efficient, especially in low memory systems.


What problem did XML solve that JSON is still trying to?


Comments. And before you mention JSON5, do you know anyone who uses it?



XML's implementation for this is surprisingly intuitive. But to be fair, who tries to use JSON for modelling something like this? You're better off just embedding the XML in a JSON string:

{"markup": "<div>The <a href=\"https://www.json.org/\">JSON format</a> was invented by <em>Douglas Crockford</em>.</div>"}

Most JSON is consumed in the browser, which also has an XML/HTML parser engine sitting alongside the JSON one.
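A sketch of that round trip, done in Python with the standard json and xml.etree.ElementTree modules rather than in a browser (that swap is an assumption made only to keep the example self-contained). It works here because the embedded markup happens to be well-formed XML.

```python
import json
import xml.etree.ElementTree as ET

# The JSON document from above, kept in a raw string so the \" escapes
# reach the JSON parser untouched.
PAYLOAD = r'{"markup": "<div>The <a href=\"https://www.json.org/\">JSON format</a> was invented by <em>Douglas Crockford</em>.</div>"}'

# Step 1: the JSON parser only sees an opaque string.
markup = json.loads(PAYLOAD)["markup"]

# Step 2: a second parser recovers the structure inside that string.
div = ET.fromstring(markup)
link = div.find("a").get("href")
author = div.find("em").text
print(link, author)
```

The cost of this approach is visible in the payload itself: every double quote in the markup has to be escaped, and nothing validates the inner document until the second parse.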


> to be fair, who tries to use JSON for modelling something like this?

There are people who hold the opinion that "Everything XML wanted to be, JSON did better." <https://news.ycombinator.com/item?id=28667089>


AFAIK some matrix client wanted json markup for messages.


Come to the Java and .NET world, still used a lot.


Ah memories: parsing XML fast is what led me to work for Apple. I gave a talk about it (based on https://webhome.cs.uvic.ca/~nigelh/Publications/IDEAL02.pdf: Expeditious XML processing, Yeow, Horspool, Levy) to the Vancouver XML group. That led me to be approached by the chair of the Vancouver XML group, who happened also to be the director of a Vancouver software company. They hired me and about a year later the company was acquired by Apple. And so that part of my adventure began.

(The fundamental idea of the parser was to use ideas taken from Warren's Abstract Machine for Prolog)


XML was and remains amazing - I write XSLT for bulk data transforms every now and then and it is super performant. Even querying JSON is fast via XPath - https://www.defiantjs.com/

Its downfall was really how broad it ended up being, with way too many ways to represent data structures. JSON was refreshing with its limited vocabulary.


What do you use to evaluate XSLT to perceive it as "super performant"? I had the misfortune to deal with XSLT and I failed to perceive it as fast. In fact we had a lot of performance issues with XSLT processing. I wouldn't describe XPath as fast either.


The last time I used XSLT was in a Microsoft SSIS package to decode XML payloads into database rows that could be merged into SQL server tables.

Before I did this, the previous engineer had written a bunch of stored procedures to do it. They were both brittle and slow (processing time was 11 hours), though the SQL was fairly decent.

Once I used XSLT to do this the task ran in under 90 minutes and it was far easier to maintain (new fields, changing columns, etc).


Rather, it's your computer that is performant.


I remember reading this when it came out, it really informed and confirmed some of my data intuitions. Absolutely fascinating material.

The allocator section is particularly clever and shows just how precise and utilitarian the design is.

Ironically, I’ve only ever ended up parsing a handful of XML in my life, but it’s reassuring to me that people will look into these algorithms and see how beautifully crafted they had to be in order to go so fast.


I wrote an XML parser from scratch one time. It was fun! I just needed to read in an XML file into a C++ program, and pulling in a third party library wasn't a viable option.


> pulling in a third party library wasn't a viable option

How? I would violently object to implementing an in-house XML parser as opposed to just finding a way to include libxml2. It's under the MIT license, so I doubt that was an issue.


libxml2 can be massive overkill if you just want to parse XML data. Unless you need schema validation and namespaces and all that jazz, you can parse XML in C with a simple state machine of ~200 lines. Add another 200 to build a DOM-like structure.

https://github.com/ccxvii/snippets/blob/master/xml.c#L326

Or in lua with ~200 lines altogether:

https://github.com/ccxvii/snippets/blob/master/xml.lua
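To give a feel for the approach, here is a much smaller sketch in Python (not the linked C or Lua code, and far less capable): a two-state scanner plus a stack that builds a DOM-like tree of dicts. It ignores comments, CDATA, entities, namespaces and attribute values, so treat it as an illustration of the state-machine idea only.

```python
def tokenize(xml_text):
    """Tiny two-state scanner: yields ("open"|"close"|"text", value) tuples.

    Deliberately ignores comments, CDATA, entities, namespaces and any
    attribute value containing '>'; a sketch, not a conforming parser.
    """
    state = "text"
    buf = []
    for ch in xml_text:
        if state == "text":
            if ch == "<":
                text = "".join(buf).strip()
                if text:
                    yield ("text", text)
                buf = []
                state = "tag"
            else:
                buf.append(ch)
        else:  # state == "tag"
            if ch == ">":
                tag = "".join(buf).strip()
                buf = []
                state = "text"
                if tag.startswith("/"):
                    yield ("close", tag[1:].strip())
                elif tag.endswith("/"):
                    # Self-closing element: emit both events.
                    name = tag[:-1].split()[0]
                    yield ("open", name)
                    yield ("close", name)
                else:
                    # Keep the element name, drop any attributes.
                    yield ("open", tag.split()[0])
            else:
                buf.append(ch)


def build_tree(tokens):
    """Fold the token stream into nested {"name", "children"} dicts."""
    root = {"name": None, "children": []}
    stack = [root]
    for kind, value in tokens:
        if kind == "open":
            node = {"name": value, "children": []}
            stack[-1]["children"].append(node)
            stack.append(node)
        elif kind == "close":
            if stack[-1]["name"] != value:
                raise ValueError(f"unexpected closing tag: {value}")
            stack.pop()
        else:
            stack[-1]["children"].append(value)
    if len(stack) != 1:
        raise ValueError("unclosed element: " + stack[-1]["name"])
    return root["children"][0]


tree = build_tree(tokenize('<a><b id="1">hi</b><c/></a>'))
print(tree)
```

Everything the sketch leaves out (entities, CDATA, encodings, error recovery) is where the remaining couple of hundred lines go.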


Yes, libxml2 is a widespread, thus mostly unproblematic choice. However, pulling in a third-party library is not only about the license.

With a third-party library you have to follow security updates (including for parts you don't need), you may have build issues on some platforms you want to support (something embedded?), and so on.

If you don't need XML, but just a few XML-like tags, a custom parser can be done quickly. But yes, then you get the NIH problems, and what if your assumptions about the XML subset were wrong ...

Personally I would lean on an XML lib (maybe libxml2, maybe qxml, maybe the Windows API, ... depending on context) as XML is complex, but generally adding a library introduces its own costs.


I didn't need to parse any XML file, but rather a specific XML file. It was inside an ITAR restricted codebase, and lawyers are conservative about open source in that context. In terms of security, it wasn't internet facing, and was written to the same standard as safety critical embedded software.


Do you have other objections beyond the wasted effort of reinventing the wheel?


I think a lot of people trivialize XML parsing here. Given how many dedicated libraries screwed up JSON parsing, which should be much easier, I have very little faith in people getting XML parsing right in-house.


There are many security vulnerability vectors in XML parsing. Many (but not all) have to do with DTD entity replacement, calls to remote URLs for schemas, and the like; others involve buffer overflows from oversized values.
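One common mitigation for the DTD-related class is to refuse any document that carries a DOCTYPE at all. Below is a minimal sketch of that idea using Python's standard xml.parsers.expat; the DtdForbidden exception and parse_without_dtd helper are made-up names, and this covers only that one class of attack, not oversized values or anything else.

```python
import xml.parsers.expat


class DtdForbidden(ValueError):
    """Raised when a document tries to declare a DTD."""


def parse_without_dtd(data: bytes):
    """Return element names in document order, rejecting any DOCTYPE."""
    names = []
    parser = xml.parsers.expat.ParserCreate()

    def refuse_doctype(name, system_id, public_id, has_internal_subset):
        # Bail out before any entity declaration can be processed.
        raise DtdForbidden(f"DOCTYPE not allowed: {name}")

    parser.StartDoctypeDeclHandler = refuse_doctype
    parser.StartElementHandler = lambda name, attrs: names.append(name)
    parser.Parse(data, True)
    return names


SAFE = b"<a><b/></a>"
SUSPICIOUS = b'<!DOCTYPE lolz [<!ENTITY lol "lol">]><lolz>&lol;</lolz>'

print(parse_without_dtd(SAFE))
try:
    parse_without_dtd(SUSPICIOUS)
except DtdForbidden as exc:
    print("rejected:", exc)
```

Without a DTD there are no custom entities to expand and no external subset to fetch, which removes the entity-replacement and remote-URL vectors for documents that never needed a DTD in the first place.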


I recently made a very stripped down XML to JSON parser, specifically for parsing RSS feeds, for a react native project:

https://xnacly.vercel.app/blog/2022_07_18_rss_parser


I worked on a hardware XML accelerator. It would deposit the parsed and validated XML tree into CPU memory, ready for use by software.

https://patents.google.com/patent/US9110875B2

The hardware was an FPGA-based PCIe board, but the design also ended up in a custom CPU.

Here's one thing I remember: namespace prefix declarations are applied retroactively within XML elements. This is not easy to deal with when trying to design streaming hardware..


Yes, this complicates the parsing quite a bit: you have to read the element header with its attributes into a buffer and keep them there until you're sure there are no changes to the default namespace or namespace prefixes. And when you start reading a name character by character, you have no idea whether what you're reading is a namespace prefix or the local name, which affects the construction of the lookup table.

The added complexity is in stark contrast with the namespaceless XML core, where all parsing decisions require a single character of lookahead and you can report items as soon as their data is ready. It's interesting to think that a slightly different syntax for namespaces would make these problems non-existent. E.g. if namespace prefixes were at the end of a name (template:xsl) and namespace declarations were standalone elements enclosing the rest of the XML, it could be much simpler.
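A small sketch of why the buffering is needed, in Python. The resolve_start_tag function is hypothetical: it takes the raw element name and the raw attribute list exactly as a scanner would see them, and it can only resolve the name after a full pass over the attributes, because the declaration may come last.

```python
def resolve_start_tag(raw_name, raw_attrs, inherited=None):
    """Resolve a raw element name to (namespace_uri, local_name).

    raw_attrs is the ordered list of (name, value) pairs from the start tag;
    inherited maps prefixes already in scope ("" is the default namespace).
    """
    scope = dict(inherited or {})
    # First pass: every attribute must be seen before the element's own
    # name can be resolved, since xmlns declarations can appear anywhere.
    for key, value in raw_attrs:
        if key == "xmlns":
            scope[""] = value
        elif key.startswith("xmlns:"):
            scope[key[len("xmlns:"):]] = value
    prefix, _, local = raw_name.rpartition(":")
    if prefix and prefix not in scope:
        raise ValueError(f"undeclared namespace prefix: {prefix}")
    return scope.get(prefix), local


# <x:item kind="demo" xmlns:x="urn:example:a"/>: the prefix is used
# before the attribute that declares it has been read.
resolved = resolve_start_tag(
    "x:item", [("kind", "demo"), ("xmlns:x", "urn:example:a")]
)
print(resolved)
```

A streaming design cannot emit anything for x:item until the closing > of the start tag, which is the buffering cost described above.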


It is known that "speed of light" PugiXML can be replaced by fast processing in a higher-level language like Haskell: https://arxiv.org/pdf/2011.03536.pdf


Abandoning XML would help the climate immensely.


If I had to parse XMLs in my job today, I’d quit. It is equivalent to writing code in VB6 for me.


var dom := XMLparse(<some xml doc>);

What's the problem?

Edit: if you mean being in the extremely rare position of having to write an XML parser, it's very straightforward.


I have a similar sentiment, but generalised to any file format. No one should be hand writing parsers or parsing code for XML, JSON or any other format for that matter. It’s 2022; data needs to come in a structured format with a schema. And preferably with tooling to generate code or parsers for said schemas.


> data needs to come in a structured format with a schema. And preferably with tooling to generate code or parsers for said schemas

Where is your engineering, inventor spirit? This seems snowflakish to me. You want everything handed to you on a silver platter. But the reality is there’s lots of data out there that has no schema and never will. And if you don’t have proper tooling, you’d quit? Why not MAKE your own tooling?


>Where is your engineering, inventor spirit? This seems snowflakish to me. You want everything handed to you on a silver platter. But the reality is there’s lots of data out there that has no schema and never will. And if you don’t have proper tooling, you’d quit? Why not MAKE your own tooling?

Oh, I meant to say I feel the same way as the OP ('similar sentiments'), but I wouldn't literally quit my job. You're right, there's always going to be data out there with no schema; CSVs are one example. But if in your day-to-day work you're constantly encountering data with no schema, or the structure/format/schema is constantly changing, then chances are you're dealing with a crappy system or working in a crappy job.

And making your own tooling is fun and all for the first time, but there comes a point when after you accrue enough programming experience, you realise how redundant it is, especially if the data is in a format as old as XML for example. There's a massive amount of tooling around XML, from XSLT, XQuery, XPath and all of them have facilities for dealing with schema-less XML data, but if you were to use a schema (or the vendor/3rd party provides one), then 90% of issues (and sometimes even code) dealing with schema-less XML go away. And the same is true for JSON as well.

I find it incredibly funny that people heap scorn onto XML as being some horrible legacy thing to work with, but then will go on happily working with JSON/YAML/or whatever and then have to deal with ALL THE EXACT SAME ISSUES, all the while not realising XML-related tooling has already solved them.


My POV:

Until 2020 I worked in data exchange between an online retail company (think Amazon in the good old days) and its suppliers.

I wish people used XML more. Not because it's cool or practical or any of those reasons.

It's because 90% of non-tech-savvy people still do their IT work by having someone manually fill in cells in an Excel worksheet.

I do understand your attitude - but I think the world would be served a lot better if we accepted formats that were viable, just not en vogue, if it at least got rid of all the crap we are using instead.



