OASIS Mailing List Archives

   Re: [xml-dev] Using entities for em dash problem


JCS wrote:


> Once again, you're arguing computer logic with me. Okay, yes, it's
> ASCII, fine. But I wanted to *preserve* the NCR and keep the
> declaration as UTF-8. That's a perfectly acceptable thing to ask for.
> Unfortunately, XSL does not allow me to preserve the NCR. It's
> "dumb".

Not really.  The NCR has disappeared before the xslt processor ever sees
it - the parser takes care of that.  Wherever there was an NCR, the
parser puts the actual character into the data.  It makes no record of
the fact that an NCR had been used.

Now, certain parsers may let you intercept an NCR and do something when
one appears.  You could write your own handler that inserts the NCR text
instead of the actual character that it represents.
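To make that concrete, here is a minimal sketch in Python (my illustration, not part of the thread - ElementTree stands in for whatever parser the xslt processor uses): by the time the application sees the parsed data, the NCR spelling is gone.

```python
# Sketch: parse a document containing the NCR &#8212; (em dash) and
# inspect what actually reaches the application after parsing.
import xml.etree.ElementTree as ET

doc = '<?xml version="1.0" encoding="UTF-8"?><p>A&#8212;B</p>'
root = ET.fromstring(doc)

# The parser substituted the real character; the "&#8212;" spelling is
# gone, and nothing records that an NCR was ever used.
print(root.text == 'A\u2014B')  # True
```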

As to output, an xslt processor may choose to output an NCR depending on
the character and the encoding, but it would never know that the input
had originally contained an NCR.

> If anything, your above statement *proves* that the output method
> shouldn't be linked to the result declaration, because then the
> computer is assuming what the declaration should be based on how it
> was transformed. If the transformed result does not necessarily
> represent the declaration, I should be able to change the
> declaration. In other words, if I've preserved the NCR for the sake
> of making the result UTF-8, then it shouldn't say US-ASCII just
> because I *had* to transform it due to the way the computer is
> programmed to encode these documents.

Not at all.  When the parser has done its parsing, knowledge of the
original encoding is not captured and sent to the xslt processor - it is
not part of the xpath model.  Remember, xml data is always unicode, even
when it was originally serialized in some other encoding.

The transformation is designed to act on the actual characters involved,
not on their encoded representation (because that is how xml processing
works).  Upon output, any implemented encoding may be selected.  So the
output encoding is independent of the input encoding.  The output
encoding will match the _output_ encoding declaration.

> To make it simpler, if I want to preserve NCR, there should be an
> option without using ASCII encoding, or rather, I should be able to
> declare whatever encoding I wish the result to be, regardless of how
> the transformation was encoded.

You can specify the output encoding, as long as it has been implemented 
by the processor.  UTF-8 and UTF-16 are universal by specification.  You 
can usually get iso-8859-1 and often, us-ascii.  Otherwise, the 
available encodings are processor-dependent.

> 
> I think I've come to grips with the fact that it's illogical and
> output encoding should NOT be linked to the result declaration as
> they can be two different things.

Your perspective needs to be enlarged here.  An xml document can be
assembled from pieces, each of which can have a different encoding.  The
xslt stylesheet can be in a different encoding from the source document.
The stylesheet can import other stylesheets, and load other documents,
all of which may be in arbitrary encodings.  The processor has to be
able to handle them all.

So what should a processor count as "THE" encoding?  It is an impossible 
question to answer. Instead, all the encoded data gets decoded into a 
standard working format, which may be utf-8, utf-16, or whatever the 
processor uses.  All character references are taken by specification to 
mean their unicode characters, not characters in the encoding that was used.
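A small Python sketch of that last point (again an illustration, not the thread's setup): an NCR in an iso-8859-1 document still denotes the Unicode character, even when iso-8859-1 itself cannot represent that character at all.

```python
import xml.etree.ElementTree as ET

# A latin-1 encoded document containing &#8212;.  The em dash is not
# representable in iso-8859-1, yet by specification the NCR still
# means the Unicode character U+2014, not anything in latin-1.
data = ('<?xml version="1.0" encoding="iso-8859-1"?>'
        '<p>&#8212;</p>').encode('iso-8859-1')
root = ET.fromstring(data)
print(root.text == '\u2014')  # True
```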

Cheers,

Tom P





 


Copyright 2001 XML.org. This site is hosted by OASIS