Re: [xml-dev] Never mind the browser, let's do MicroXML

Interesting (and thanks for the civil reply - I've rather been making a stink of myself on this lately).

What I sense that you're saying is that while the parser will attempt to parse anything thrown at it, there is still a core set of parse rules that are independent of the underlying semantics of the language. Put another way, there is a set of well-formedness rules, but the role of the parser is to provide a guess, based upon its internal heuristics, as to which particular rules apply when it encounters non-well-formed content in order to turn it into well-formed content prior to rendering it. Or, to state it yet another way, if a creator knows the heuristics they could encode any content ... just that there are specific use cases in XML that would create a different parse tree in HTML5. Would you say this is correct?

Kurt Cagle
XML Architect
Lockheed / US National Archives ERA Project



On Fri, Dec 17, 2010 at 7:10 PM, David Carlisle <davidc@nag.co.uk> wrote:
On 17/12/2010 23:31, Kurt Cagle wrote:
   HTML5 has some problems but ambiguity isn't really one of them, the
   html5 spec specifies in excruciating detail how to construct a parse
   tree from any stream of unicode characters. Unlike XML there are no
   states equivalent to "not well formed", every input has a defined parse.
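As a rough illustration of the point quoted above, the Python sketch below feeds an unclosed list to a strict XML parser and to a lenient tag scanner. The assumptions matter here: `html.parser` from the Python standard library is not an implementation of the HTML5 parsing algorithm, only a tolerant tokenizer that likewise never rejects its input, and the names `SAMPLE`, `TagLogger`, `xml_accepts` and `lenient_start_tags` are invented for this example.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Unclosed list of the kind discussed in this thread.
SAMPLE = "<html>\n<body>\nText\n<ul>\n<li>Line 1\n<li>Line 2\n<li>Line 3\n"


class TagLogger(HTMLParser):
    """Records every start tag it sees; it has no 'reject' state."""

    def __init__(self):
        super().__init__()
        self.starts = []

    def handle_starttag(self, tag, attrs):
        self.starts.append(tag)


def xml_accepts(text):
    """Return True only if text is well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


def lenient_start_tags(text):
    """Scan text with the tolerant tokenizer (an assumption: this is
    NOT the HTML5 algorithm) and return the start tags in order."""
    parser = TagLogger()
    parser.feed(text)
    parser.close()
    return parser.starts
```

Here `xml_accepts(SAMPLE)` returns `False`, because the XML parser stops at the first well-formedness error, while `lenient_start_tags(SAMPLE)` still reports all six start tags.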

David,

Hmm .. I guess what I'm saying is this - suppose that you have an input
sequence that looks like this:

<html>
<body>
Text
<ul>
<li>Line 1
<li>Line 2
<li>Line 3

which you're implying could conceivably be valid input.

Well, actually it's invalid; the smallest changes I could make to make it valid would result in

<!DOCTYPE html>
<html>
<title></title>

<body>
Text
<ul>
<li>Line 1
<li>Line 2
<li>Line 3
</ul>





Because we know the underlying semantics, the processor would be able to
parse that as:

I'm not sure that semantics are required. The html5 spec says how to parse any input string; it's a purely mechanical process with hardly any optional or customisable behaviour. (Bit scary describing the html5 parser on a thread in which Henri is likely to pop up :-)



<html>
<body>Text
<ul>
<li>Line 1</li>
<li>Line 2</li>
<li>Line 3</li>
</ul>
</body>
</html>
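To make the "purely mechanical" claim concrete, here is a toy tree builder in Python that reproduces the tree above from the unclosed input using two fixed rules: a new `li` closes a still-open `li` in the same list, and end of input closes whatever is still open. It is only a sketch under stated assumptions. It covers a tiny fraction of what the html5 spec defines (for instance, it does not insert an implied `head`), it uses the standard library's `html.parser` tokenizer rather than an HTML5 one, and `ToyTreeBuilder` and `toy_parse` are hypothetical names.

```python
from html.parser import HTMLParser


class ToyTreeBuilder(HTMLParser):
    """Toy rules only: implied </li> and close-everything at end of input."""

    def __init__(self):
        super().__init__()
        self.open = []  # stack of currently open element names
        self.out = []   # serialized pieces of the resulting tree

    def _close_through(self, tag):
        # Pop and close elements until `tag` itself has been closed.
        while self.open:
            name = self.open.pop()
            self.out.append("</%s>" % name)
            if name == tag:
                break

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            # Look up the stack: an open <li> is closed first, but
            # stop at the enclosing list so nested lists keep working.
            for name in reversed(self.open):
                if name == "li":
                    self._close_through("li")
                    break
                if name in ("ul", "ol"):
                    break
        self.open.append(tag)
        self.out.append("<%s>" % tag)

    def handle_endtag(self, tag):
        # Unmatched end tags are ignored rather than treated as fatal.
        if tag in self.open:
            self._close_through(tag)

    def handle_data(self, data):
        text = data.strip()  # whitespace-only text is dropped in this toy
        if text:
            self.out.append(text)


def toy_parse(text):
    builder = ToyTreeBuilder()
    builder.feed(text)
    builder.close()
    while builder.open:  # end of input: close everything still open
        builder.out.append("</%s>" % builder.open.pop())
    return "".join(builder.out)
```

No knowledge of what a list item means is used anywhere, only the element names and the stack of open elements, which is the sense in which the process is mechanical.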

However, without those known semantics, there are ambiguities in the
input - it could be interpreted as

Well, any input, whether xml or html or fortran, might be incorrect; not much you can do about that.


<garfle>
<fleeblock>Text</fleeblock>
<agbar/>
<lukvi>Line 1 <lukvi> Line 2 <lukvi> Line 3</lukvi></lukvi></lukvi>
</garfle>

According to html5 that is non-conforming (undefined element names) but has a defined parse tree of

<html><head></head><body><garfle>

<fleeblock>Text</fleeblock>
<agbar>
<lukvi>Line 1 <lukvi> Line 2 <lukvi> Line 3</lukvi></lukvi></lukvi>
</agbar></garfle>
</body></html>
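One reason `agbar` ends up wrapping the `lukvi` elements in that tree is that, in HTML syntax, the trailing slash in `<agbar/>` does not close an element unless it is a void element such as `br`. The element therefore stays open until something else closes it. The Python sketch below imitates just that one rule. It is an illustration under assumptions, not the html5 algorithm: it builds on the standard library's `html.parser`, it omits the implied `html`, `head` and `body` wrappers, and `SlashIgnoringBuilder` and `parse_ignoring_slash` are hypothetical names.

```python
from html.parser import HTMLParser

# Void elements, which never have content or an end tag.
VOID = {"area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"}


class SlashIgnoringBuilder(HTMLParser):
    """Toy builder that treats <x/> as <x> for non-void elements."""

    def __init__(self):
        super().__init__()
        self.open = []  # stack of currently open element names
        self.out = []   # serialized pieces of the resulting tree

    def handle_starttag(self, tag, attrs):
        self.out.append("<%s>" % tag)
        if tag not in VOID:
            self.open.append(tag)

    def handle_startendtag(self, tag, attrs):
        # html.parser reports <agbar/> here; the slash is ignored and
        # the tag is handled as an ordinary start tag.
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        # Close everything down to the matching open element, if any.
        if tag in self.open:
            while True:
                name = self.open.pop()
                self.out.append("</%s>" % name)
                if name == tag:
                    break

    def handle_data(self, data):
        text = data.strip()  # whitespace is trimmed in this toy
        if text:
            self.out.append(text)


def parse_ignoring_slash(text):
    builder = SlashIgnoringBuilder()
    builder.feed(text)
    builder.close()
    while builder.open:  # end of input closes what is left
        builder.out.append("</%s>" % builder.open.pop())
    return "".join(builder.out)
```

Fed the `garfle` input above, this produces a tree in which `agbar` contains the three nested `lukvi` elements and is closed only when `</garfle>` arrives, matching the shape David gives.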


or

<garfle>
<fleeblock>Text
<agbar>
<lukvi>Line 1</lukvi>
<lukvi>Line 2</lukvi>
<lukvi>Line 3</lukvi>
</agbar>
</bleeblock>
</garfle>

which again is non-conforming but has a defined parse tree equivalent to parsing

<html><head></head><body><garfle>

<fleeblock>Text
<agbar>
<lukvi>Line 1</lukvi>
<lukvi>Line 2</lukvi>
<lukvi>Line 3</lukvi>
</agbar>

</fleeblock></garfle>
</body></html>


which may have very different interpretations based upon structure (I've
deliberately scrambled the words to highlight the issue). If that was a
known schema instance, it's that which I'm referring to in terms of
ambiguity. There may be specific parsing rules in HTML5, but I daresay
that anyone writing the initial instance I gave above probably wouldn't
be well versed on the specification.

If you write in any language without knowing the rules of that language, then confusion may result, but I don't think that can be called ambiguity in the language.


I think the difference in interpretation here is that the HTML5 focus is
on tolerating ambiguity (which is what supporting multiple rules for
parsing is)

I'm not sure what you mean by multiple rules. As you may have noticed, when James Clark and I suggested they could have some variation in the rules for newer documents the suggestion got a resounding no.


 and treating precision as a fault, while the XML focus is on
being willing to deal with the extra precision if it reduces ambiguity.
That's one of the reasons I get antsy when I hear people make statements
like the idea that HTML can replace XML. HTML+ARIA might have that
additional precision, but it comes at the cost of requiring two
languages plus coding to accomplish what can be done in one with XML.


David


