- From: Sean McGrath <email@example.com>
- To: firstname.lastname@example.org
- Date: Tue, 01 Aug 2000 15:44:20 +0100
>John Cowan wrote:
>> Character references are lost, it is true.
>> If you want them back, shout now.
At 21:56 01/08/00 +0800, Rick JELLIFFE wrote:
>Can I shout the opposite: "the fact that a character was entered
>directly or by reference should not be information available for any
>other specification or general-purpose application: it should not be
>part of the infoset."
>This is because the use of character references should be determined by
>its availability in the encoding used (and any user-supplied "kernel"
>encoding within that). XML should be defined using Unicode characters,
>not the markup that achieved the character.
Can I shout the opposite to this opposite!
This is a good case in point where the in/not-in dualism of the
OTI (One True Infoset) approach falls down. If character references
are not in the infoset then it is impossible to
write an XML-parser-based app that processes them.
The only way to process them would be to do so *lexically*.
In shifting to a lexically based algorithm you would basically
need to *re-write* an XML parser in order to be sure
that you were identifying character entity references correctly.
Oh, sure, you can write a regexp that will work "most of the time", but
try telling that to the client of the m-commerce/healthcare/rocket-launching
XML application you are building.
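
A minimal sketch of the problem, assuming Python's standard
xml.etree.ElementTree parser and a made-up sample document. The names
DOC, CHAR_REF, naive_char_refs and parsed_text are hypothetical and
only there for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample document (an assumption for illustration):
#   <a> holds a real character reference,
#   <b> holds the same characters inside CDATA, where they are literal text,
#   the comment holds them too, and
#   <c> holds the character entered directly.
DOC = (
    '<doc>'
    '<a>caf&#233;</a>'
    '<b><![CDATA[caf&#233;]]></b>'
    '<!-- caf&#233; -->'
    '<c>caf\u00e9</c>'
    '</doc>'
)

# The naive lexical approach: match anything shaped like a character reference.
CHAR_REF = re.compile(r'&#(?:[0-9]+|x[0-9A-Fa-f]+);')


def naive_char_refs(text):
    """Return every substring that merely looks like a character reference."""
    return CHAR_REF.findall(text)


def parsed_text(text):
    """Return the text of <a>, <b> and <c> as an XML parser reports it."""
    root = ET.fromstring(text)
    return root.find('a').text, root.find('b').text, root.find('c').text


if __name__ == '__main__':
    print(naive_char_refs(DOC))
    print(parsed_text(DOC))
```

The regexp reports three matches, although only the one in <a> is a
character reference; the ones in the CDATA section and the comment are
plain text. Going the other way, the parser hands back the same string
for <a> and <c>, so an application sitting on top of it cannot tell the
reference from the directly entered character.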
http://www.pyxie.org - an Open Source XML Processing library for Python