There is No Invalid HTML

Thursday, January 15th, 2009

There are many different kinds of user agents for HTML markup. The most popular and most important user agents today are display-based web browsers. There are other types of HTML consumers, such as search engines, aggregators, text-based browsers, and speech browsers. There are also many different types of producers of HTML.

One of the most important features of the HTML5 specification from the perspective of user agents is that it specifies how to parse and consume all markup, not just correct markup. The vast majority of markup on the web is not “valid”. This will undoubtedly continue to be true, and it’s not a bad thing: imagine a dystopian world where only complex tools or skilled technicians could create web content!

The HTML5 parsing specification contains rules to transform any possible sequence of characters or bytes into a standard document object model. From conversations with Ian, I believe this was one of his primary goals for the initial HTML5 specification. I’m a little surprised that this is not called out more clearly in the parsing section of the specification.

Set aside the unanswerable questions of whether generic metadata can solve problems at web scale, or whether RDFa can solve the metadata problem. Most of the discussion on the WHATWG mailing list about whether RDF (RDFa) should be integrated into the HTML specification has focused on whether RDFa would make the markup invalid HTML.

If everyone actually implements the HTML5 parsing specification, who cares whether it’s valid markup? You get the same document structure in every case. User agents which are aware of RDFa and wish to use it to solve problems may do so. This seems like the ultimate extensibility mechanism you could possibly want. The only “problem” is that a validator (conformance checker) will warn you that you’ve produced invalid markup.
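As a rough illustration of a user agent that is aware of RDFa and reads it without caring about validity, here is a minimal sketch. It uses Python's standard `html.parser`, which is a lenient tokenizer and only a stand-in here, not an implementation of the HTML5 parsing algorithm. The names `RdfaCollector`, `collect_rdfa`, and `RDFA_ATTRS` are hypothetical, and the attribute set is a partial selection of RDFa attributes chosen for the example.

```python
from html.parser import HTMLParser

# Assumption: a small, incomplete subset of RDFa attribute names.
RDFA_ATTRS = {"about", "property", "typeof", "resource", "content"}


class RdfaCollector(HTMLParser):
    """Records every start tag that carries at least one RDFa attribute."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; names arrive lowercased.
        rdfa = {name: value for name, value in attrs if name in RDFA_ATTRS}
        if rdfa:
            self.found.append((tag, rdfa))


def collect_rdfa(markup):
    collector = RdfaCollector()
    collector.feed(markup)
    collector.close()
    return collector.found


# The markup never closes its <p> or <div>; the collector does not care.
sample = (
    '<div about="#me" typeof="foaf:Person">'
    '<span property="foaf:name">Ann</span><p>unclosed'
)
for tag, rdfa in collect_rdfa(sample):
    print(tag, rdfa)
```

A consumer that does not know about RDFa would simply skip these attributes, which is the "ignore or process" behavior the argument relies on.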

Perhaps, rather than ranting about invalid markup, the specification should be altered. Remove all references to parse errors, and instead require parsers to interoperably transform any possible sequence of characters into the same document model. Then those who wish to use RDFa markup may do so, and user agents can ignore or process this metadata as appropriate.

One possible objection to this error-less regime is that it doesn’t give useful guidance to content authors or authoring tools. Even if everything is valid HTML markup, there should still be best practices for authoring HTML. If you want your content to work with existing browsers, reverse-engineering existing practice is a tricky exercise. Perhaps there should be a section of the HTML5 specification indicating how authoring tools should interoperably serialize a given document model. Then it would be relatively simple to write a fuzz tester to take a document, serialize it per the serialization spec, re-parse it per the parsing spec, and then compare the results for equality.
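The round-trip fuzz test described above could look something like the following sketch. Everything in it is an assumption made for illustration: the document model is a toy (strings for text, `(tag, attrs, children)` tuples for elements), `serialize` is an invented serializer rather than anything from the specification, and Python's `html.parser` stands in for a real HTML5 parser. The generator sticks to a few tags with no special parsing rules and never emits two adjacent text nodes, since those would merge on re-parse.

```python
import random
from html import escape
from html.parser import HTMLParser

TAGS = ["div", "span", "em", "strong"]
TEXTS = ["plain", "a < b", "Tom & Jerry", 'say "hi"']


def serialize(node):
    """Toy serializer: text is escaped, every element gets an end tag."""
    if isinstance(node, str):
        return escape(node, quote=False)
    tag, attrs, children = node
    attr_text = "".join(f' {k}="{escape(v)}"' for k, v in attrs.items())
    inner = "".join(serialize(child) for child in children)
    return f"<{tag}{attr_text}>{inner}</{tag}>"


class TreeBuilder(HTMLParser):
    """Rebuilds the toy model from markup using a simple open-element stack."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = ("root", {}, [])
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = (tag, dict(attrs), [])
        self.stack[-1][2].append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        if len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        children = self.stack[-1][2]
        # Merge with a preceding text node so chunked text compares equal.
        if children and isinstance(children[-1], str):
            children[-1] += data
        else:
            children.append(data)


def parse(markup):
    builder = TreeBuilder()
    builder.feed(markup)
    builder.close()
    return builder.root[2]


def random_tree(rng, depth=3):
    attrs = {"property": rng.choice(TEXTS)} if rng.random() < 0.5 else {}
    children = []
    if depth > 0:
        for _ in range(rng.randint(0, 3)):
            last_is_text = bool(children) and isinstance(children[-1], str)
            if rng.random() < 0.5 and not last_is_text:
                children.append(rng.choice(TEXTS))
            else:
                children.append(random_tree(rng, depth - 1))
    return (rng.choice(TAGS), attrs, children)


def round_trips(tree):
    # Serialize, re-parse, and compare against the original model.
    return parse(serialize(tree)) == [tree]


rng = random.Random(0)
failures = [t for t in (random_tree(rng) for _ in range(200)) if not round_trips(t)]
print("round-trip failures:", len(failures))
```

Against a real HTML5 parser the interesting cases are the ones this toy avoids, such as elements whose nesting the parser rearranges; a serialization section in the specification would have to say what authoring tools should emit for those.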

The discussions about HTML and RDF have drifted from practical interoperability into theoretical “validity”, purity, and architectural grandiosity. Let’s get back to the specific technical question that needs to be answered: if you embed RDFa in HTML5, does it parse into a usable DOM? If not, are there specific changes to the parsing specification that will allow it to parse to a usable DOM?