   RE: control characters

  • From: Eldar Musayev <eldarm@microsoft.com>
  • To: xml-dev@xml.org
  • Date: Wed, 21 Jun 2000 16:47:13 -0700

Taken out of context, you are right. However, that still uses something
already allocated, because 0x0080-0x009F are control characters distinct
from those at 0x0000-0x001F. And, by the way, in some applications
characters like "break-permitted-here" or "no-break-here" may be even more
common than national encodings.
On the other hand, in this sense my note was also out of context. Within the
context of the original question, the original proposal to use the private
use range sounds like the correct one.

 > -----Original Message-----
 > From: John Cowan [mailto:jcowan@reutershealth.com]
 > Sent: Wednesday, June 21, 2000 12:53 PM
 > To: Eldar Musayev; xml-dev@xml.org
 > Subject: Re: control characters
 >
 > Eldar Musayev wrote:
 > > In a case you may be interested: there is a lot of charsets/encodings
 > > using this range as well.
 >
 > Encodings using the *bytes* 0x7F to 0x9F aren't the issue. What counts
 > here is the Unicode *characters* U+007F to U+009F, which are solely the
 > control characters.
 >
 > E.g. Win1252 uses 0x80 to encode EURO SIGN, but the corresponding
 > Unicode character is U+20AC, which is what counts for XML.
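
To make the byte/character distinction above concrete, here is a minimal
Python sketch using the standard `cp1252` and `latin-1` codecs. The same
byte 0x80 decodes to EURO SIGN under Windows-1252 but to the control
character U+0080 under ISO-8859-1; only the resulting code point matters
to an XML processor.

```python
# The byte 0x80 names different characters under different encodings.
raw = b"\x80"

euro = raw.decode("cp1252")         # Windows-1252: EURO SIGN
c1_control = raw.decode("latin-1")  # ISO-8859-1: C1 control character

print(f"U+{ord(euro):04X}")         # U+20AC
print(f"U+{ord(c1_control):04X}")   # U+0080
```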

>The workaround I usually suggest is to represent control characters
>with (references to) characters from the Unicode private use range.
>This makes the necessary transformation a simple character
>substitution (which can even be just a subtraction - no need for a
>  -- Richard

Actually, as someone has already pointed out, 0x007F - 0x009F are fair game 
for XML documents, and Unicode has these defined as control character 

Mapping 0x0000 - 0x001F to the private use area sounds like the "correct" 
Unicode thing to do, but for US-ASCII/UTF-8 documents I would map to 0x0080 
- 0x009F instead.
This way you preserve the deprecated, Anglo-centric, English-only, bigoted 
assumption of 1 character == 1 byte.
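
A rough Python sketch of the two substitutions discussed above. The
offsets are assumptions for illustration (the thread does not fix exact
values): `PUA_BASE` is the start of the BMP Private Use Area and `C1_BASE`
is the start of the 0x0080 - 0x009F range. Tab, LF and CR are left alone
because XML 1.0 already permits them. The names `shift_controls` and
`unshift_controls` are hypothetical. The last three lines show the encoded
size of one shifted character; note that a character in 0x0080 - 0x009F
takes two bytes in UTF-8, so the one-byte property holds only in a
single-byte encoding such as ISO-8859-1.

```python
# Hypothetical offsets; the thread does not specify exact values.
PUA_BASE = 0xE000  # start of the BMP Private Use Area
C1_BASE = 0x0080   # start of the C1 control range

XML_OK = {0x09, 0x0A, 0x0D}  # C0 controls that XML 1.0 allows literally


def shift_controls(text: str, base: int) -> str:
    """Move disallowed C0 controls (U+0000-U+001F) up by `base`."""
    return "".join(
        chr(ord(ch) + base) if ord(ch) < 0x20 and ord(ch) not in XML_OK else ch
        for ch in text
    )


def unshift_controls(text: str, base: int) -> str:
    """Inverse of shift_controls: a plain subtraction."""
    return "".join(
        chr(ord(ch) - base)
        if base <= ord(ch) < base + 0x20 and (ord(ch) - base) not in XML_OK
        else ch
        for ch in text
    )


sample = "bell\x07 and escape\x1b, tab\tkept"
as_pua = shift_controls(sample, PUA_BASE)
as_c1 = shift_controls(sample, C1_BASE)

# Both mappings round-trip as long as the target range is otherwise unused.
assert unshift_controls(as_pua, PUA_BASE) == sample
assert unshift_controls(as_c1, C1_BASE) == sample

# Encoded size of one shifted character (BEL, U+0007):
print(len(chr(PUA_BASE + 7).encode("utf-8")))    # 3
print(len(chr(C1_BASE + 7).encode("utf-8")))     # 2
print(len(chr(C1_BASE + 7).encode("latin-1")))   # 1
```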

The only downside is that someone might actually have data in this range. I 
think this is about as likely as someone having data in the private use 

XSLT will not _ALWAYS_ give you a perfect output format.
XML --> XSLT --> simple_text_filter seems like a win to me.
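
One possible shape for the simple_text_filter stage, sketched in Python.
It assumes the XSLT output carries control characters as private-use
characters starting at a hypothetical offset `PUA_BASE` (U+E000), and
simply maps them back. `restore_controls` and `filter_stream` are
illustrative names, not part of any existing tool.

```python
import io

PUA_BASE = 0xE000  # hypothetical offset used when the XML was produced

# Translation table: private-use stand-in -> original C0 control.
_RESTORE = {PUA_BASE + c: c for c in range(0x20)}


def restore_controls(text: str) -> str:
    """Replace private-use stand-ins with the control characters they represent."""
    return text.translate(_RESTORE)


def filter_stream(src, dst) -> None:
    """Copy text from src to dst, restoring control characters line by line."""
    for line in src:
        dst.write(restore_controls(line))


# Stand-in for XSLT text output containing an encoded form feed (U+000C).
src = io.StringIO("page one\ue00cpage two\n")
dst = io.StringIO()
filter_stream(src, dst)
assert dst.getvalue() == "page one\x0cpage two\n"
```

In a real pipeline the same function could be wired up as
`filter_stream(sys.stdin, sys.stdout)` and placed after the XSLT
processor.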


