Re: [xml-dev] Minimal set of rules for making HTML well-formed?

On Fri, 11 Oct 2019 at 17:22, Costello, Roger L. <costello@mitre.org> wrote:

Hi Folks,

Last week there was a suggestion to use SgmlReader to convert HTML to XHTML. After some experimenting, I discovered that SgmlReader has some problems. See below for one such problem.

That looks like user error, with sgml (or xml) you would need to use a catalogue to supply a dtd that defines & nbsp; to be & #160;

I’ve decided to implement my own tool to convert HTML to XHTML. I want the tool to make the minimal amount of changes to the HTML -- just make the changes necessary to make the HTML well-formed. As I see it, there are only 4 things that need to be done to make the HTML well-formed:

Ensure that attribute values are delimited with either double or single quotes.
Ensure that every start tag has a matching end tag.
Ensure that elements are properly nested.
Ensure that the XML reserved characters ( <, >, &, ', ") are escaped when used in data.

Am I missing anything? If my tool does those 4 things am I guaranteed that the resulting document will be well-formed? /Roger

This is a black hole that you might not want to approach see any of the 10000s of online discussions surrounding

https://www.w3.org/TR/html-polyglot/

:-)

The main problem that killed the polyglot idea is that making an html document that parses as well formed xhtml isn't enough in practice (you also want it to mean/render the same) but the requirement used in that document that the DOM trees resulting from an html or xml parse are the same, is very hard to achieve.

1 and 2 are rather hard to achieve without using an html parser on the front end (and if you have one of those, just dumping a serialisation of its dom tree will give xml (er except when it doesn't:-)

the html parse _always_ gives a well defined result so just knowing what are the tags isn't easy with an arbitrary tool not using the html parser, and the tag names may not be valid xml names

I just typed in this at random

https://www.software.hixie.ch/utilities/js/live-dom-viewer/?%3C!DOCTYPE%20html%3E%0A%3Caaa%20a%20b%3Dk%20%3Cbb%3D!!%3Caaaa%3Ca%3E%3Cccc%20href%3D%22ss%22%3Ezzz%3C%2Fccc%3E%0A

and it shows for example an attribute with name <kk that can't be expressed in xml.

You might be tempted to say that's not valid input but again defining (and automatically checking ) what is valid really needs an html parser.

3) again yes so long as you know what the elements are and all the special rules around <script> etc.

4) of the characters that you mention only < and & need to be escaped in character data, ' and " do not, but you also need to escape any use of ]]> in character data, and use of control characters which are not allowed in xml 1.0 (actually you can't just escape the control characters, you need to remove them or encode them in some other way)

David