- From: Rick JELLIFFE <firstname.lastname@example.org>
- To: email@example.com
- Date: Wed, 02 Aug 2000 00:25:29 +0800
Sean McGrath wrote:
> >John Cowan wrote:
> >> Character references are lost, it is true.
> >> If you want them back, shout now.
> At 21:56 01/08/00 +0800, Rick JELLIFFE wrote:
> >Can I shout the opposite: "the fact that a character was entered
> >directly or by reference should not be information available for any
> >other specification or general-purpose application: it should not be
> >part of the infoset."
> This is a good case in point where the in/not-in dualism of the
> OTI (One True Infoset) approach falls down. If character references
> are not in the infoset then it is impossible to
> write an XML parser based app that processes them.
Yes. This is a great thing.
> The only way to process them would be to do so *lexically*.
> In shifting to a lexical based algorithm you would need to
> basically *re-write* an XML parser in order to be sure
> that you were identifying character entity references correctly
> every time.
You couldn't do it reliably: you could only guess based on some other
out-of-band information. (Such as a "character collection"
> Oh, sure you can write a regexp that will work "most of the time" but
> try telling that to the client of the m-commerce/healthcare/rocket launching
> XML application you are building.
I don't understand this point at all. If the infoset contained only
resolved characters, then any regexp on the XML-parsed string
(normalization issues aside) will always work the same every time. If
you say that a character reference is a part of the infoset, that will
suggest that you want the default behaviour of applications to be to
preserve them: that is not robust, because no application has been built
with this in mind. And if it means that you want the presence of a
character reference to signify a processing instruction or semantic, it
is tag abuse: use a PI or entityref or element. Furthermore, it suggests
that you think that preservation of character references should be the
default behaviour for round-tripping; however, I expect that the default
behaviour of XML generating routines will be to generate something
closer to c14nized XML as well as to perform Unicode early
normalization. Finally, it would introduce incompatibilities into
something that all systems agree on currently (as
In SGML days, the first thing we did on data coming in (after making sure
it validated somehow) was to normalize it, so that all tags were
explicit and all characters represented in the same way, either as
direct characters or references.
XML has reduced the need for data normalization because it is fully
tagged. But if you have problems with data coming from different sources
with different referencing conventions, the last thing you would want
would be for references to be preserved in the infoset or for it not to
be easy to write a data normalizer.
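
To make the normalizer point concrete, here is a minimal sketch in Python using the standard library's xml.etree.ElementTree. The names resolved_text, normalize and SOURCES are made up for illustration, and the sketch assumes small, well-formed documents held as strings. It shows two things: after parsing, a directly entered character and a character reference are indistinguishable, so one regexp behaves the same on all inputs; and a normalizer only has to pick one output convention when serializing.

```python
import re
import xml.etree.ElementTree as ET


def resolved_text(xml_source: str) -> str:
    """Return the character data of a document as the parser reports it."""
    return "".join(ET.fromstring(xml_source).itertext())


def normalize(xml_source: str, as_references: bool = False) -> str:
    """Re-serialize a document with a single character convention.

    as_references=False: non-ASCII characters are written directly.
    as_references=True:  non-ASCII characters are written as decimal
                         character references (ElementTree does this when
                         asked for an ASCII-only encoding).
    """
    root = ET.fromstring(xml_source)
    if as_references:
        return ET.tostring(root, encoding="us-ascii").decode("ascii")
    return ET.tostring(root, encoding="unicode")


# Hypothetical inputs from three sources with different conventions:
# a direct character, a decimal reference, and a hexadecimal reference.
SOURCES = ["<p>caf\u00e9</p>", "<p>caf&#233;</p>", "<p>caf&#xE9;</p>"]

if __name__ == "__main__":
    pattern = re.compile("caf\u00e9")
    for source in SOURCES:
        # The same regexp matches every variant once the text is parsed.
        print(bool(pattern.search(resolved_text(source))),
              normalize(source, as_references=True))
```

Because the parser never reports how a character was entered, neither function needs any lexical scanning of the raw markup. This sketch does not perform Unicode normalization; that would be a separate step (for example with unicodedata.normalize) applied to the resolved text.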