RE: Syntax Sugar and XML information models

> > Conceptually, perhaps we have:
> >
> > The "Syntax Sugar InfoSet" (SSIS) that exposes everything worth
> > round-tripping in the XML syntax... [even different quote characters
> > and whitespace???]
> That list could be endless - you did not even mention attribute order.

Well, that's the nub of the issue here:  The W3C InfoSet is widely
interpreted as decreeing that everything not in the InfoSet is "mere syntax
sugar". Some of these distinctions are clearly rooted in the XML spec and
existing practice, such as the fact that the order of attributes is
insignificant, the type of quotation marks around attribute values is
insignificant, etc.  Others are more controversial, such as CDATA sections.
[For example, would you really want your XML database to take in XML
documents with scripts escaped with CDATA sections and return them escaped
with &lt; etc.?]
Others really MUST be interpreted differently by authoring tools than the
InfoSet specifies -- for example, the whole POINT of parsed entities is lost
if an editor doesn't round-trip them; likewise a database should either let
its client resolve external entities, or resolve them at retrieval time
rather than storage time.  (Entities are the only thing supported in a
Recommendation that enables control of redundant information ...).
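
To make the CDATA and entity cases concrete, here is a minimal sketch using
Python's standard xml.etree.ElementTree module, chosen only as a convenient
example of a tool that keeps roughly InfoSet-level information.  The sample
documents, the entity name "co", and the variable names are all made up for
illustration; other parsers and databases may behave differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical input: a script protected by a CDATA section.
cdata_source = "<script><![CDATA[if (a < b && b < c) run();]]></script>"
script = ET.fromstring(cdata_source)
# The parser hands back plain character data; the CDATA markers are gone.
cdata_output = ET.tostring(script, encoding="unicode")

# Hypothetical input: an internal parsed entity referenced twice.
entity_source = (
    '<!DOCTYPE memo [<!ENTITY co "Example Corp">]>'
    "<memo>&co; thanks &co; staff</memo>"
)
memo = ET.fromstring(entity_source)
# The references are expanded during parsing, so the serialized form
# repeats the replacement text and carries no entity declaration.
entity_output = ET.tostring(memo, encoding="unicode")

print(cdata_output)
print(entity_output)
```

In both cases the character content survives, but the script comes back
escaped with &lt; and &amp; rather than wrapped in CDATA, and the memo comes
back with the entity's text duplicated in place of the references.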

So, there seem to be two classes of things that the InfoSet doesn't cover:
the "mere syntax" that no reasonable application (except maybe a "diff")
would care about, and the gray area stuff that some XML tools must care
about but that the InfoSet says nothing about.  My suggestion is to make
this distinction more formally, based on input from the folks "in the
trenches" about which details of XML syntax are "significant" and which
aren't.  Maybe there is an endless list of things that some people care
about and some don't, but I'd at least like to see some discussion before
giving up.

So, does ANYBODY care about round-tripping a) the specific quote characters
around attribute values; b) the order of attributes; c) character entity
references for characters that are in the specified character set; d) the two
different syntaxes for empty elements, .... ?  Are there other bits that the
InfoSet doesn't represent but that have some practical significance to real
applications? (Let's not discuss whitespace ... the complexities there are
well-known and too painful to think about).
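
For anyone who wants to test items a) through d) against their own tools,
the following is a rough sketch of what "the same document apart from syntax
sugar" could mean in practice.  It again assumes Python's
xml.etree.ElementTree; the helper name same_content and the two sample
documents are hypothetical, and the comparison deliberately ignores
comments, processing instructions, and the DOCTYPE.

```python
import xml.etree.ElementTree as ET

def same_content(a, b):
    """Compare two parsed elements by tag, attributes (as an unordered
    mapping), character data, and children, ignoring surface syntax."""
    return (
        a.tag == b.tag
        and a.attrib == b.attrib
        and (a.text or "") == (b.text or "")
        and (a.tail or "") == (b.tail or "")
        and len(a) == len(b)
        and all(same_content(x, y) for x, y in zip(a, b))
    )

# The two variants differ in: a) quote characters, b) attribute order,
# c) a character reference for a plain ASCII letter, and
# d) the syntax used for the empty element.
variant_1 = '<doc id="x1" lang="en"><title>ABC</title><br/></doc>'
variant_2 = "<doc lang='en' id='x1'><title>&#65;BC</title><br></br></doc>"

equivalent = same_content(ET.fromstring(variant_1), ET.fromstring(variant_2))
print(equivalent)
```

The two strings differ byte for byte, yet the comparison treats them as the
same document, which is exactly the information an application would lose
if it cared about any of a) through d).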