There is No Invalid HTML
There are many different kinds of user agents for HTML markup. The most popular and most important user agents today are display-based web browsers. There are other types of HTML consumers, such as search engines, aggregators, text-based browsers, and speech browsers. There are also many different types of producers of HTML.
One of the most important features of the HTML5 specification from the perspective of user agents is that it specifies how to parse and consume all markup, not just correct markup. The vast majority of markup on the web is not “valid”. This will undoubtedly continue to be true, and it’s not a bad thing: imagine a dystopian world where only complex tools or skilled technicians could create web content!
The HTML5 parsing specification contains rules to transform any possible sequence of characters or bytes into a standard document object model. From conversations with Ian, I believe this was one of his primary goals for the initial HTML5 specification. I’m a little surprised that this is not called out more clearly in the parsing section of the specification.
Setting aside the unanswerable questions of whether generic metadata can be used to solve problems at web-scale, or whether RDFa can solve the metadata problem, most of the discussion on the WHATWG mailing list about whether RDF (RDFa) should be integrated into the HTML specification has focused on whether RDFa would make the markup invalid HTML.
If everyone actually implements the HTML5 parsing specification, who cares whether it’s valid markup? You get the same document structure in every case. User agents which are aware of RDFa and wish to use it to solve problems may do so. This seems like the ultimate extensibility mechanism you could possibly want. The only “problem” is that a validator (conformance checker) will warn you that you’ve produced invalid markup.
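To make the "aware versus unaware user agent" point concrete, here is a minimal sketch in Python. It uses the standard library's `html.parser`, which is a tolerant tokenizer and not an implementation of the HTML5 parsing algorithm, so treat it only as a stand-in. The `RDFA_ATTRS` set and the helper names are assumptions made for this illustration, and the set is a partial list of RDFa attribute names.

```python
from html.parser import HTMLParser

# Illustrative subset of RDFa attribute names (an assumption, not the full list).
RDFA_ATTRS = {"about", "property", "typeof", "resource"}


class AttributeCollector(HTMLParser):
    """Records every (tag, attribute, value) triple seen in start tags."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            self.seen.append((tag, name, value))


def collect(markup):
    """Parse markup and return all attribute triples, known or unknown."""
    parser = AttributeCollector()
    parser.feed(markup)
    parser.close()
    return parser.seen


def rdfa_only(triples):
    """What an RDFa-aware consumer would pick out of the same parse."""
    return [t for t in triples if t[1] in RDFA_ATTRS]


markup = '<p class="bio" about="#me" property="foaf:name">Alice</p>'
everything = collect(markup)     # every consumer sees the same attributes
metadata = rdfa_only(everything)  # an RDFa-aware consumer filters them
```

Both consumers start from the same parse result. The unaware one simply never looks at `about` or `property`, while the aware one filters for them.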
Perhaps, rather than ranting about invalid markup, the specification should be altered. Remove all references to parse errors, and instead require parsers to interoperably transform any possible sequence of characters into the same document model. Then those who wish to use RDFa markup may do so, and user agents can ignore or process this metadata as appropriate.
One possible objection to this error-less regime is that it doesn’t give useful guidance to content authors or authoring tools. If everything is valid HTML markup, there still should be best practices for authoring HTML. If you want your content to work with existing browsers, reverse-engineering existing practice is a tricky exercise. Perhaps there should be a section of the HTML5 specification indicating how authoring tools should interoperably serialize a given document model. Then it would be relatively simple to write a fuzz tester to take a document, serialize it per the serialization spec, re-parse it per the parsing spec, and then compare the results for equality.
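The round-trip check described above might look roughly like the following sketch. It is written in Python against the standard library's `html.parser`, which is not an HTML5-conformant parser, and the tree shape, the `VOID` set (a partial list), and every function name are assumptions made for illustration. A real test would plug in a conforming parser and the specification's serialization rules, and would need extra care for elements such as `script` and `style` whose text must not be escaped.

```python
from html import escape
from html.parser import HTMLParser

# Partial list of void elements (no end tag, no children).
VOID = {"br", "hr", "img", "input", "link", "meta"}


class TreeBuilder(HTMLParser):
    """Builds a naive tree of (tag, attrs, children) tuples; text is str."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = ("#root", (), [])
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = (tag, tuple(attrs), [])
        self.stack[-1][2].append(node)
        if tag not in VOID:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest open element with this name; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        # Merge adjacent text so chunking by the tokenizer does not matter.
        children = self.stack[-1][2]
        if children and isinstance(children[-1], str):
            children[-1] += data
        else:
            children.append(data)


def parse(markup):
    builder = TreeBuilder()
    builder.feed(markup)
    builder.close()
    return builder.root


def serialize(node):
    if isinstance(node, str):
        return escape(node, quote=False)
    tag, attrs, children = node
    inner = "".join(serialize(child) for child in children)
    if tag == "#root":
        return inner
    attr_text = "".join(
        f" {name}" if value is None else f' {name}="{escape(value)}"'
        for name, value in attrs
    )
    if tag in VOID:
        return f"<{tag}{attr_text}>"
    return f"<{tag}{attr_text}>{inner}</{tag}>"


def round_trips(markup):
    """True if parse -> serialize -> parse yields an identical tree."""
    tree = parse(markup)
    return parse(serialize(tree)) == tree


sloppy = '<div id=a><p about="#me" property=foaf:name>Alice<br>Bob</div>'
normalized = serialize(parse(sloppy))
```

A fuzz tester would feed `round_trips` a stream of generated or mutated documents and report any input for which it returns `False`.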
The discussions about HTML and RDF have drifted from practical interoperability into theoretical “validity”, purity, and architectural grandiosity. Let’s get back to the specific technical question that needs to be answered: if you embed RDFa in HTML5, does it parse into a usable DOM? If not, are there specific changes to the parsing specification that will allow it to parse to a usable DOM?
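One way to probe that question is to feed the same markup to an HTML-style parser and to an XML parser and compare what each exposes. The sketch below does this in Python with the standard library. `html.parser` stands in for an HTML parser (it is not the HTML5 algorithm), `xml.etree.ElementTree` stands in for an XML DOM, and the sample markup and helper names are assumptions for illustration. It uses an `xmlns:`-prefixed attribute because XML treats that as a namespace declaration, while an HTML parser sees an ordinary attribute whose name happens to contain a colon.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Sample markup (hypothetical) that is well-formed XML and also parseable as HTML.
MARKUP = (
    '<p xmlns:foaf="http://xmlns.com/foaf/0.1/" '
    'property="foaf:name">Alice</p>'
)


class FirstTagAttrs(HTMLParser):
    """Captures the attributes of the first start tag only."""

    def __init__(self):
        super().__init__()
        self.attrs = None

    def handle_starttag(self, tag, attrs):
        if self.attrs is None:
            self.attrs = dict(attrs)


def html_attrs(markup):
    parser = FirstTagAttrs()
    parser.feed(markup)
    parser.close()
    return parser.attrs


def xml_attrs(markup):
    # ElementTree consumes namespace declarations; they are not in .attrib.
    return dict(ET.fromstring(markup).attrib)


html_view = html_attrs(MARKUP)
xml_view = xml_attrs(MARKUP)
only_in_html = sorted(set(html_view) - set(xml_view))
```

Here `only_in_html` lists the attribute names that the HTML-style parse exposes and the XML parse does not, which is the kind of concrete difference a parsing-level discussion can act on.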
January 15th, 2009 at 2:51 pm
We need invalid HTML because otherwise making HTML6 will be even more difficult. It is important that authors do not use attribute and element names not minted by the HTML WG, so that the HTML WG can use them in the future without having to worry about compatibility with existing sites.
Of course, the HTML WG will have to care and will have to do research, but if we just make this some free-for-all game it will be a lot worse.
January 15th, 2009 at 2:51 pm
(The other reason for having invalid HTML is to aid authors by capturing authoring mistakes, et cetera. Forgot to mention that.)
January 15th, 2009 at 2:57 pm
Anne, that won’t imply any changes to the parsing, right? Only to the meaning/behavior of elements/attributes?
January 15th, 2009 at 3:03 pm
Right, but since that can have an impact on display or behavior, it might be that we cannot use those element or attribute names at all. (Also, because of various legacy issues with the HTML syntax, it is likely that future versions of HTML will make slight modifications to the parsing rules.)
January 15th, 2009 at 10:50 pm
Anne, the thing is that we already had “valid” vs “invalid” in HTML4… and when doing HTML5 we still ended up having to reverse-engineer the web in terms of what element and attribute names we could use. So I’m not sure having this concept of “invalid HTML” will help much with HTML6.
January 16th, 2009 at 3:07 am
“if you embed RDFa in HTML5, does it parse into a usable DOM? If not, are there specific changes to the parsing specification that will allow it to parse to a usable DOM?”
My main objection to RDFa is that it parses to different DOMs in text/html and application/xhtml+xml. The group that defined RDFa could have easily avoided this but didn’t. I’m not sure why they did it that way, but it’s well known that they don’t exactly like HTML5. It isn’t good to allow groups to inject features that violate the HTML Design Principles and the software reuse mechanisms that have been developed for HTML/XHTML. That keeps us on the treadmill of adjusting the parsing algorithm in ways that contradict existing browser behavior, instead of the other groups making HTML-friendly specs to begin with.
January 16th, 2009 at 6:08 am
I think the QA issue is more that defining conforming markup in an appropriate way means that conforming byte streams will be parsed in unsurprising ways whereas non-conforming byte streams may be parsed in surprising ways or have surprising semantics. That makes conformance a useful concept. I don’t think that the HTML 5 specification ascribes it any particular value beyond that.
The problem with adding arbitrary attributes to HTML documents (e.g. to implement RDFa) is that they may have surprising semantics (either at present or, for example, in a future version of HTML). One of the purported reasons for supporting RDFa is that it allows unambiguous addition of metadata to documents (arguably the way this is achieved is worse than the problem it tries to solve, but I digress). However, if anyone is free to add their own attributes that conflict with the RDFa attribute names but have a different interpretation, that benefit is lost.
January 16th, 2009 at 7:54 am
There’s invalid markup, and then there’s invalid document trees.
The HTML5 spec defines what valid markup is, in the section about serializing HTML5, so the guidelines for serialization already exist. The spec defines how parsers must parse and how they have to treat invalid markup, so that a document tree will be created no matter what the markup is, and it will be the same document tree for all compliant user agents.
This doesn’t mean that it’s a valid document tree. A valid document tree contains only elements defined by the HTML spec, and elements from other namespaces (remember that there’s an XML serialization of the tree, too) where the spec says that they may appear. Further, the elements have only attributes defined for the element by the spec.
Now, I’m not sure if the spec defines how exactly an invalid document tree is to be handled.