The expression-oriented thinking practised in XML technology stops
abruptly at the border provided by XML syntax. Differences of encoding,
quote character, use of entities, etc. are abstracted away and defined
to be irrelevant to the information content - as long as the text in
question is XML. But HTML is "something else", not XML. The standards
do not allow an HTML document to be parsed into a node tree. The
prevalent thinking seems to be that text resources defined to encode
node trees must be XML text. Is there a good reason for this, apart
from inertia of habit?

I'm not sure that border is particularly sharp, even allowing for
inertia of habit. There was XHTML, and even with the (not my favorite)
parsing algorithm specified by HTML5, there's definitely a clear path
from HTML text to a node tree.
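
To make that path concrete, here is a minimal sketch in Python using only
the standard library. It is an illustration, not the HTML5 tree
construction algorithm: html.parser is a lenient tokenizer, and the
stack handling, the VOID set, and the names TreeBuilder, html_to_tree
and shape are my own assumptions for this example. It only shows that
differences of case, quote character and character references can be
abstracted away for HTML text in the same way as for XML text.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Elements that never have content (assumed list, taken from HTML's void elements).
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class TreeBuilder(HTMLParser):
    """Hypothetical helper: builds an ElementTree from lenient HTML tokens."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = ET.Element("document")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        # Valueless attributes such as <input disabled> arrive with value None.
        node = ET.SubElement(self.stack[-1], tag,
                             {k: (v or "") for k, v in attrs})
        if tag not in VOID:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest matching open element; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        parent = self.stack[-1]
        if len(parent):
            # Text after a closed child belongs to that child's tail.
            last = parent[-1]
            last.tail = (last.tail or "") + data
        else:
            parent.text = (parent.text or "") + data


def html_to_tree(text):
    builder = TreeBuilder()
    builder.feed(text)
    builder.close()
    return builder.root


def shape(node):
    """Comparable nested-tuple view of a tree, independent of source spelling."""
    return (node.tag, sorted(node.attrib.items()), node.text, node.tail,
            [shape(child) for child in node])


# Two spellings of the same content: case, quotes and character references differ.
a = html_to_tree('<p class="x">caf&eacute; <br>ok</p>')
b = html_to_tree("<P CLASS='x'>caf&#233; <br>ok</P>")
print(shape(a) == shape(b))
```

A conforming HTML5 parser would differ in many details (implied
elements, error recovery, foreign content), but the resulting tree
could be compared node by node in the same way.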