Fair enough. I wasn't thinking at that level of round-tripping, which I
agree is problematic. What worried me about ERH's example was the
potential for not even being able to round-trip text -- an issue that
hasn't come up before (modulo entity references).
The problem is not limited just to values, such as would occur with
binary representations of real numbers. It also applies to formats.
Dates and numbers have multiple formats, some of which may inadvertently
For example, French genealogical data might represent dates from the
Napoleonic period using the Napoleonic calendar; since this is how the
data is originally recorded, it should probably continue to be
represented that way, even though these dates can be converted to modern
Similarly, a transcription of notes written by a criminal suspect might
include dates in a particular format. Since this format might be a clue
to the suspect's nationality or background, changing the format would
mean losing information.
Obviously, this additional information could be represented by
additional metadata. But it is naive to think that all document
designers will add such metadata.
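
To make the format problem concrete, here is a minimal Python sketch. The
function names (normalize, normalize_with_metadata) and the sample dates are
made up for illustration; they are not from any particular binary XML
proposal. It shows two differently written dates collapsing to one
normalized value, and what a designer would have to store alongside the
value to keep the original form.

```python
from datetime import datetime

def normalize(text, fmt):
    """Parse a date written in the given format and return it as ISO 8601."""
    return datetime.strptime(text, fmt).date().isoformat()

def normalize_with_metadata(text, fmt):
    """Hypothetical alternative: keep the original lexical form next to the value."""
    return {"value": normalize(text, fmt), "original": text, "format": fmt}

# Two different ways of writing the same day (illustrative data).
european = normalize("04.03.1999", "%d.%m.%Y")
american = normalize("03/04/1999", "%m/%d/%Y")

# After normalization the two are indistinguishable, so the original
# format cannot be recovered from the value alone.
same_value = european == american

# Only the version that carries extra metadata can be written back unchanged.
kept = normalize_with_metadata("04.03.1999", "%d.%m.%Y")
```

The second function only helps if whoever designs the document remembers to
use something like it, which is the point made above.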
Bob Foster wrote:
> Ronald Bourret wrote:
> > This points out something that should be a requirement for binary XML:
> > lossless roundtripping. In other words, you should be able to go from
> > the text serialization to the binary serialization and back losslessly
> > (within the confines of canonical XML). Same is true for binary <=>
> > text, binary <=> binary, and (of course) text <=> text.
> Of course text <=> text? This doesn't work today. I don't keep a list,
> but off the top of my head: information in the text such as character
> references and internal general entity references in attribute values
> are removed by parsers (e.g., SAX) and are not available to write back
> out again. This is a perennial source of XSLT questions. Until SAX2
> Extensions 1.1, SAX didn't report the XML declaration, so the
> application didn't know the original encoding. The application couldn't
> tell which attribute values were specified in the document and which
> came from the DTD as defaults. As ERH points out, canonicalization loses
> the DOCTYPE declaration. And so on.
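
The character-reference loss described above is easy to reproduce. The
following is a small sketch using Python's standard xml.etree.ElementTree
rather than SAX, and the sample document is invented for illustration. The
parser hands the application the resolved characters, so the serializer has
no way of knowing they were originally written as numeric references.

```python
import xml.etree.ElementTree as ET

# Sample input (made up): the same character is written as a numeric
# character reference in both an attribute value and element content.
source = '<note author="Ren&#233;">caf&#233; &amp; bar</note>'

root = ET.fromstring(source)

# The application sees only the resolved characters.
text_seen = root.text
author_seen = root.get("author")

# Serializing again cannot restore the references, because the parser
# never reported that they were there.
round_tripped = ET.tostring(root, encoding="unicode")

byte_identical = round_tripped == source

# The information content survives: re-parsing gives the same text.
same_content = ET.fromstring(round_tripped).text == text_seen
```

The two serializations carry the same content but are not the same text,
which is the distinction between lossless and canonical round-tripping
drawn in the quoted message.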