Stefan Tilkov's Random Stuff

Comments

On November 22, 2005 11:21 AM, Mike Glendinning said:

Whilst I also have doubts that this W3C initiative will deliver anything useful, it’s probably about time we cleared up this ridiculous “text versus binary” debate.

As Sam Ruby points out, and you reference in another part of your blog, there remain significant difficulties dealing with a supposedly “simple” text encoding.

The plain fact is that there is really no such thing as what is commonly called “text”. As far as I’m aware, all of our [digital] computers these days operate in “binary”.

We have a number of techniques for representing the characters used in natural language as binary codes, for example ASCII, ISO-8859-1 and UTF-8.
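
(A concrete illustration, sketched in Python purely for the sake of argument - the "same" character is nothing more than different byte sequences under different encodings:)

    # One character, three encodings: what we call "text" is always some
    # particular binary representation chosen by convention.
    text = "é"

    print(text.encode("iso-8859-1"))  # b'\xe9'      - one byte in ISO-8859-1
    print(text.encode("utf-8"))       # b'\xc3\xa9'  - two bytes in UTF-8
    print("A".encode("ascii"))        # b'A' (0x41)  - ASCII cannot encode "é" at all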

We also build tools in order to interpret these binary codes and display them on our computer screens, for example SimpleText/TeachText, vi and Notepad.

But all of these tools are simply helping us decode and understand binary information.

I’d hazard a guess that most people (developers) dealing with XML these days don’t use such simple “text” tools at all - it’s far too hard figuring out what’s going on amongst all the namespaces and so on! Instead, they use a browser such as Internet Explorer or more likely something like XML Spy in order to parse and interpret the binary code in an easier-to-understand fashion.

So if everybody is using a tool of some sort (whether simple or sophisticated) in order to display and process the “text” information, who should care what the underlying “binary” encoding is?

Presently, the XML family of standards (XML 1.0/1.1, namespaces, schema, WS-* and so on) is becoming rather complex and unwieldy, with many poorly-designed and “leaky” abstractions.

Back in 2000 and early 2001, I received much praise for deploying a production system for T-Mobile using web services technology such as XML, SOAP and WSDL. Looking at the current state of these standards and their successors, I would have a hard time convincing myself of the same decision!

At some point, the mess and complexity of the XML and web services standards is going to hold our industry back and the technology will need to be replaced.

Wouldn’t it be good if somebody were thinking about its replacement with a cleaner and better-designed solution? And who should care if such a replacement were based on “text” or “binary”?

But as you say, I don’t believe current W3C initiatives are likely to achieve any of this!

-Mike Glendinning.

P.S. Yours is a nicely entertaining blog, by the way…

On November 22, 2005 9:45 PM, Stefan Tilkov said:

Mike, good to hear from you, and thanks for the positive comment :-)

I don’t think I agree with you in this case, though. Obviously, not everything is perfect in the XML world; things like XSD and WSDL are not that great, except for showing what premature standardization can lead to. Still, I think that it’s the fact that XML is text that has led to its widespread adoption. Granted, had namespaces or some equivalent been part of the original design, it would have led to a much better result, but the benefit a “clean” successor would bring is very likely not worth the cost.

On November 23, 2005 4:56 PM, Mike Glendinning said:

Stefan,

I don’t disagree with you that its basis in a “text” encoding has led to the success of XML in the past. It’s a question of being able to make good use of the available tools, and XML chose wisely.

The ASN.1 community failed to win broad adoption because the standard was defined first, with [limited] tool support only appearing much later. That, together with the related problem of too much flexibility and too many options in the encoding definitions, meant that ASN.1 was never used by anyone except dedicated specialists.

But my point is that moving forward, XML must let go of some of its heritage, not least its “text” encoding and its SGML beginnings (PIs, DTDs, entities and so on). Cleaning up namespaces and schema and many other things would be on my “to do” list as well! The work on the infoset is a start, but much more is needed (just don’t get me started on the “benefits” of basing everything on a tree-structured data model!).

I’d be rather surprised (and disappointed) if in 10 or 20 years’ time, developers were still debugging XML machine communication using tools like “SimpleText” and running up “telnet” sessions to check out the behaviour of HTTP servers. After all, XML itself hasn’t even been around that long!

I often liken this kind of technology evolution (i.e. of layered software stacks) to a big game of “Tetris”. Once the lower layers are completely filled up and solidified, the game moves on and the lower layers can be eliminated.

In software terms, this means that developers are continually working only with the new upper layers, usually at a higher level of abstraction than before. Nobody stops to care about what’s happening beneath. For example, when debugging distributed systems, say 5 or 10 years ago, I often had to delve into what was happening at the TCP and IP layers (it helped that I had designed and built a full TCP/IP stack for embedded computers in the past :-). Whilst this kind of thing may still be required very occasionally, for the most part people don’t care about these lower layers any more and have moved on to solving different and more interesting problems.

Make no mistake, this will happen with the XML and web services stack. We can argue about the means and timescales over a beer sometime…

-Mike Glendinning.

On November 23, 2005 10:08 PM, Stefan Tilkov said:

I don’t think it’s a coincidence that the most significant application protocols on the Internet - those that form the backbone of our infrastructure - rely on text messages. I don’t think it’s a matter of tools; that doesn’t mean there can’t or shouldn’t be any, just that the wide availability of great tools to work with a format does not make that format equivalent to text. Nothing will ever beat text in terms of tool availability and programming language support :-)
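
(To illustrate what I mean: because HTTP is text, you can speak it with nothing more than a telnet session or a raw socket. A rough sketch in Python, with example.org standing in for any server you like:)

    import socket

    # Speak HTTP "by hand": both the request and the response are plain text,
    # so a telnet session or a few lines of script are all the tooling you need.
    host = "example.org"
    request = (
        "GET / HTTP/1.1\r\n"
        "Host: " + host + "\r\n"
        "Connection: close\r\n"
        "\r\n"
    )

    with socket.create_connection((host, 80)) as conn:
        conn.sendall(request.encode("ascii"))
        response = b""
        while True:
            chunk = conn.recv(4096)
            if not chunk:
                break
            response += chunk

    print(response.decode("latin-1").split("\r\n\r\n", 1)[0])  # headers only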

We should definitely meet for that beer sometime.

On November 24, 2005 3:43 PM, Mike Glendinning said:

I don’t want to continue this indefinitely - after all it’s your blog and you’re bound to have the last word - but I am in an argumentative mood at the moment :-)

There is a difference between using a “text” encoding for control and protocol messages (things that are usually short) and using it as a generalised representation format for the storage, transmission and manipulation of all information. As far as I can see, XML aspires to be the latter.

Why is it that “binary” picture formats such as GIF, JPG and the various MPEG formats are much more successful than “text” ones such as XBM, HPGL and SVG? Surely it would be great if you could process your pictures using “SimpleText” as well as XML tools such as XQuery, XSLT and so on? Why isn’t the world stampeding to redefine JPG using XML and reaping the benefits? I wonder…

By the way, I also had to double-check whether you were one of the authors of the wonderful “BLOAT” specification (http://www.ietf.org/rfc/rfc3252.txt), which I’m sure meets with your approval! Let’s get that implemented straightaway!

Despite advances in computer and network processing capacity, efficiency still counts and has general economic benefits. Rigorous but flexible definition of data formats is also important in ensuring the availability of conforming, interoperable, efficient and secure implementations.

Presently, XML has neither of these two qualities. Aside from efficiency, consider all those browser hacks involving embedding HTML tags in form data, unchecked and indiscriminate references to external entities, changes in XML behaviour due to the processing of DTDs and entities, and so on.
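
(To make the entity point concrete, a minimal sketch - Python chosen purely for illustration, and behaviour genuinely differs between parsers and versions, which is rather the point:)

    import xml.etree.ElementTree as ET

    # A document whose meaning depends on whether the parser processes the
    # internal DTD subset: an entity-expanding parser sees "hello" as the
    # element text, a parser that ignores or rejects the DTD does not.
    doc = '<!DOCTYPE d [<!ENTITY greet "hello">]><d>&greet;</d>'

    root = ET.fromstring(doc)
    print(root.text)  # the expat-based parser expands the internal entity: "hello"

    # External entities (<!ENTITY x SYSTEM "http://...">) raise the stakes
    # further: whether a parser fetches them by default varies between
    # implementations, which is exactly the kind of loosely specified
    # behaviour complained about above.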

The world needs, and should be demanding, a much better XML!

On November 24, 2005 5:31 PM, Stefan Tilkov said:

Come on, BLOAT is not to be taken seriously … it doesn’t even use proper namespaces.

:-)