Hi guys,
Okay, actually, the root of my misunderstanding is about valid XML--is it
okay, within a UTF-8 encoded XML document, to just type the em dash?
What I mean is, I thought that I had to use the NCR (numeric character
reference) for the XML to be valid. Or does that depend on the encoding?
UTF-8 can represent the em dash directly--of course, this is the easy
answer, and now I understand the encoding part of XSL. Because the em dash
can be written *as is* in UTF-8, if I specify that encoding for the output,
that's what I'll get. And anything that knows how to read UTF-8 will show
it the same way. So I don't have to put the NCR in. That, I now understand.
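
A minimal sketch of that point, assuming Python's standard
xml.etree.ElementTree as the parser (an illustration choice, not a tool
mentioned in this thread). It parses the same small document twice, once
with the em dash typed directly as UTF-8 bytes and once with the NCR, and
compares what the parser hands back. The sample <p> element is made up.

```python
import xml.etree.ElementTree as ET

EM_DASH = "\u2014"  # U+2014, decimal 8212

# Same hypothetical document twice: once with the em dash typed directly
# (stored as UTF-8 bytes), once with the numeric character reference.
literal_doc = (
    '<?xml version="1.0" encoding="UTF-8"?><p>a' + EM_DASH + "b</p>"
).encode("utf-8")
ncr_doc = b'<?xml version="1.0" encoding="UTF-8"?><p>a&#8212;b</p>'

literal_text = ET.fromstring(literal_doc).text
ncr_text = ET.fromstring(ncr_doc).text

# After parsing, both forms are the same single character.
print(literal_text == ncr_text)
print(ord(ncr_text[1]))
```

This only shows that one parser accepts both spellings and treats them
identically; it does not check the document against a DTD or schema.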
But is it still valid XML?
/johnny :)
On 9/12/03 7:07 AM, "Richard Tobin" <richard@cogsci.ed.ac.uk> wrote:
> In article <BBFB12B2.10A0%subscriber@pezagency.com>,
> JCS <subscriber@pezagency.com> wrote:
>
>> Thanks, I'm aware of this, however, what I *don't* understand because it
>> makes no logical sense, is that if I'm transforming XML UTF-8 to UTF-8 my
>> declaration should be UTF-8, not something else. What I mean is, if I use
>> US-ASCII as an output method to preserve my NCR the declaration in the
>> output file should still be UTF-8, not ASCII, because it's NOT ascii but it
>> was *translated* using ASCII to preserve the NCR. Does this make sense? What
>> is it that *I'm* not getting, because it's very confusing.
>
> If an XML file contains &#8212; this means Unicode character 8212,
> which as you know is em-dash. The encoding of the file is irrelevant
> to this: the numbers in character references are always Unicode code
> points regardless of the file encoding. The encoding of the file
> determines how the characters "&", "#", "8", "2", "1", "2", and ";"
> are represented, not what they mean.
>
> Once the XML document is parsed, there will just be the em-dash
> character, stored in the program's own internal encoding. The fact
> that it was once represented by a numeric character reference is
> forgotten.
>
> When the program comes to output the document, it will have to decide
> how to represent it. If the output encoding can represent the
> character (as UTF-8 can), it will probably just output it directly.
> If it can't (as ASCII can't) it will have to use a character
> reference. So you can force character references for non-ASCII
> characters by specifying ASCII as the output encoding.
>
> Remember that this is just a trick. If you really want to output the
> document as UTF-8, but with non-ASCII characters represented by
> character references, then you probably have to write your own code to
> do it. But since ASCII is a subset of UTF-8, any ASCII XML document
> can be converted to a UTF-8 version just by manually editing the
> encoding declaration.
>
> -- Richard
>
> -----------------------------------------------------------------
> The xml-dev list is sponsored by XML.org <http://www.xml.org>, an
> initiative of OASIS <http://www.oasis-open.org>
>
> The list archives are at http://lists.xml.org/archives/xml-dev/
>
> To subscribe or unsubscribe from this list use the subscription
> manager: <http://lists.xml.org/ob/adm.pl>
>
>
>
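
The output-encoding behaviour Richard describes above can be sketched as
follows. This assumes Python's standard xml.etree.ElementTree as the
serializer (my choice for illustration; other XSLT or XML tools may behave
differently), and the sample <p> element is made up.

```python
import xml.etree.ElementTree as ET

# Once parsed, the character reference is just the em dash character.
root = ET.fromstring("<p>a&#8212;b</p>")

# ASCII cannot represent U+2014, so the serializer falls back to a
# character reference; UTF-8 can, so the character is written directly.
as_ascii = ET.tostring(root, encoding="us-ascii")
as_utf8 = ET.tostring(root, encoding="utf-8")

print(as_ascii)
print(as_utf8)

# ASCII is a subset of UTF-8, so the ASCII output is also valid UTF-8,
# and both outputs parse back to the same text.
same_bytes_as_utf8 = as_ascii.decode("utf-8") == as_ascii.decode("ascii")
round_trip_equal = ET.fromstring(as_ascii).text == ET.fromstring(as_utf8).text
print(same_bytes_as_utf8, round_trip_equal)
```

So the "trick" is visible here: asking for ASCII output is what forces the
reference, and the resulting bytes can still be labelled and read as UTF-8.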
--
"Religion is for people who are afraid they'll go to hell.
Spirituality is for people who have been there."