   Re: [xml-dev] Using entities for me dash problem

Hi guys,

Okay, actually, the root of my misunderstanding is about what counts as
valid XML. Is it okay, within a UTF-8 encoded XML document, to just type the
em dash? I thought that I had to use the NCR (numeric character reference)
for the XML to be valid. Or does that depend on the encoding?

UTF-8 can represent the em dash directly. Of course, this is the easy
answer, and now I understand the encoding part of XSL. Because the em dash
can be written *as is* in UTF-8, if I specify that encoding, that's what
I'll get, and anything that knows how to read UTF-8 will handle it the same
way. So I don't have to put the NCR in. That, I now understand.

But is it still valid XML?

/johnny :)


On 9/12/03 7:07 AM, "Richard Tobin" <richard@cogsci.ed.ac.uk> wrote:

> In article <BBFB12B2.10A0%subscriber@pezagency.com>,
> JCS <subscriber@pezagency.com> wrote:
> 
>> Thanks, I'm aware of this, however, what I *don't* understand because it
>> makes no logical sense, is that if I'm transforming XML UTF-8 to UTF-8 my
>> declaration should be UTF-8, not something else. What I mean is, if I use
>> US-ASCII as an output method to preserve my NCR the declaration in the
>> output file should still be UTF-8, not ASCII, because it's NOT ASCII but it
>> was *translated* using ASCII to preserve the NCR. Does this make sense? What
>> is it that *I'm* not getting, because it's very confusing.
> 
> If an XML file contains &#8212; this means Unicode character 8212,
> which as you know is em-dash.  The encoding of the file is irrelevant
> to this: the numbers in character references are always Unicode code
> points regardless of the file encoding.  The encoding of the file
> determines how the characters "&", "#", "8", "2", "1", "2", and ";"
> are represented, not what they mean.
> 
> Once the XML document is parsed, there will just be the em-dash
> character, stored in the program's own internal encoding.  The fact
> that it was once represented by a numeric character reference is
> forgotten.
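
A minimal sketch of the two points above, assuming Python's standard
xml.etree.ElementTree parser. The sample documents and variable names are
made up for illustration. The same content is stored three ways (as a
reference in an ASCII file, as a literal character in a UTF-8 file, and as a
reference in a Latin-1 file), and the parser should hand back the same em
dash each time:

```python
import xml.etree.ElementTree as ET

EM_DASH = "\u2014"  # Unicode code point 8212 (hex 2014)

# The same logical document in three byte-level representations.
ncr_ascii = '<?xml version="1.0" encoding="US-ASCII"?><p>a&#8212;b</p>'.encode("ascii")
literal_utf8 = f'<?xml version="1.0" encoding="UTF-8"?><p>a{EM_DASH}b</p>'.encode("utf-8")
ncr_latin1 = '<?xml version="1.0" encoding="ISO-8859-1"?><p>a&#8212;b</p>'.encode("latin-1")

# After parsing, only the character remains; how it was written is not kept.
texts = [ET.fromstring(doc).text for doc in (ncr_ascii, literal_utf8, ncr_latin1)]

for text in texts:
    print(ascii(text))
```

The reference always names code point 8212, whatever the file encoding is,
and the literal em dash in the UTF-8 file parses to the same character.
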
> 
> When the program comes to output the document, it will have to decide
> how to represent it.  If the output encoding can represent the
> character (as UTF-8 can), it will probably just output it directly.
> If it can't (as ASCII can't) it will have to use a character
> reference.  So you can force character references for non-ASCII
> characters by specifying ASCII as the output encoding.
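
A minimal sketch of this trick, again assuming Python's
xml.etree.ElementTree as the serializer (an XSLT processor is a different
piece of software, so treat this only as an illustration of the principle).
The element name and text are made up:

```python
import xml.etree.ElementTree as ET

# After parsing, the em dash is just a character in memory.
root = ET.fromstring("<p>a\u2014b</p>")

# UTF-8 can represent the character, so it is written out directly.
utf8_bytes = ET.tostring(root, encoding="utf-8")

# ASCII cannot represent it, so the serializer falls back to a
# numeric character reference.
ascii_bytes = ET.tostring(root, encoding="us-ascii")

print(utf8_bytes)
print(ascii_bytes)

# Both outputs parse back to the same text.
same_text = ET.fromstring(utf8_bytes).text == ET.fromstring(ascii_bytes).text
```

Both byte strings describe the same document; only the representation of
the em dash differs.
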
> 
> Remember that this is just a trick.  If you really want to output the
> document as UTF-8, but with non-ASCII characters represented by
> character references, then you probably have to write your own code to
> do it.  But since ASCII is a subset of UTF-8, any ASCII XML document
> can be converted to a UTF-8 version just by manually editing the
> encoding declaration.
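
A minimal sketch of that last step, assuming Python. The helper name
relabel_as_utf8 and the sample document are hypothetical. Because every
ASCII byte sequence is also valid UTF-8, only the declaration needs to
change and the character references stay in place:

```python
import xml.etree.ElementTree as ET

# Hypothetical ASCII output that kept the em dash as a character reference.
ascii_doc = b'<?xml version="1.0" encoding="US-ASCII"?>\n<p>a&#8212;b</p>'

def relabel_as_utf8(doc: bytes) -> bytes:
    """Swap the declared encoding; only safe if every byte is ASCII."""
    doc.decode("ascii")  # raises UnicodeDecodeError if the premise is false
    return doc.replace(b'encoding="US-ASCII"', b'encoding="UTF-8"', 1)

utf8_doc = relabel_as_utf8(ascii_doc)

print(utf8_doc.decode("utf-8"))
# The relabelled document still parses to the real em dash.
parsed_text = ET.fromstring(utf8_doc).text
```

The result is declared as UTF-8 but still contains &#8212; rather than the
raw character.
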
> 
> -- Richard

-- 
"Religion is for people who are afraid they'll go to hell.
Spirituality is for people who have been there." 





 
