OASIS Mailing List Archives

Storing illegal XML 1.0 characters in the Unicode Private Use Area

Hi Folks,

Here are the hex values for the Unicode characters that are permitted in XML 1.0 documents:

Char ::=  #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

Notice that the hex values from E000 to F8FF are legal XML characters.

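Interestingly, the hex values from E000 to F8FF have no characters assigned to them by the Unicode Standard. That region is called the Private Use Area (PUA), and the meaning of its code points is left to private agreement between applications.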

Also notice that the hex values 0-8, B-C, and E-1F are not legal XML 1.0 characters.
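
The Char production can be checked mechanically. Here is a minimal Python sketch (the function name is_xml10_char is made up for illustration) that tests a code point against the production and lists the code points below hex 20 that fall outside it:

```python
def is_xml10_char(cp: int) -> bool:
    """Return True if code point cp matches the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )

# Collect the code points below hex 20 that are not legal XML 1.0 characters.
illegal_low = [cp for cp in range(0x20) if not is_xml10_char(cp)]
print([format(cp, "X") for cp in illegal_low])
```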

Suppose you are dealing with an application that emits text and some of the text contains characters that are illegal in XML 1.0. If you were to blindly wrap that text in markup and hand it to an XML parser, the parser would give an error saying that the document contains illegal characters.
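
To see the failure concretely, here is a small sketch using Python's standard xml.etree.ElementTree module. The sample text is my own; any text containing a control character such as hex 2 should behave the same way:

```python
import xml.etree.ElementTree as ET

raw = "\x02Hello World\x03"        # text containing hex 2 and hex 3
doc = "<text>" + raw + "</text>"   # blindly wrapped in markup

try:
    ET.fromstring(doc)
    parsed = True
except ET.ParseError as err:
    # The parser refuses the document because of the illegal characters.
    parsed = False
    print("parser rejected the document:", err)
```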

So what do you do?

One approach is to move any illegal characters into the Private Use Area: add hex E000 to the code point of each illegal character. Thus,

    map hex 0 to E000
    map hex 1 to E001
    map hex 2 to E002
    map hex 3 to E003
    ...
    map hex 1F to E01F

So this text (where 2 denotes the character with hex value 2, and 3 denotes the character with hex value 3):

    2Hello World3

is converted to this XML:

   <text>&#xE002;Hello World&#xE003;</text>
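
The conversion can be sketched in Python as follows. This is only an illustration under some assumptions: the names PUA_OFFSET and to_xml_text are made up, only control characters below hex 20 are remapped (other illegal code points such as FFFE are not handled), and the input is assumed not to already contain characters in the range E000 to E01F, since those would be indistinguishable from remapped ones.

```python
from xml.sax.saxutils import escape

PUA_OFFSET = 0xE000  # start of the Private Use Area

def to_xml_text(s: str) -> str:
    """Escape s for use as element content, moving illegal
    control characters into the Private Use Area."""
    out = []
    for ch in s:
        cp = ord(ch)
        if cp < 0x20 and cp not in (0x9, 0xA, 0xD):
            # Illegal in XML 1.0: emit a character reference offset by E000.
            out.append("&#x%X;" % (cp + PUA_OFFSET))
        else:
            # Legal character: only escape markup characters (&, <, >).
            out.append(escape(ch))
    return "".join(out)

xml = "<text>" + to_xml_text("\x02Hello World\x03") + "</text>"
print(xml)  # <text>&#xE002;Hello World&#xE003;</text>
```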

Applications that process the XML document must be smart enough to subtract hex E000 from every character reference that falls in the mapped part of the Private Use Area (E000 to E01F).
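
A receiving application could undo the mapping along these lines. Again this is a sketch: from_pua and PUA_OFFSET are hypothetical names, and it assumes that every character in the range E000 to E01F came from the mapping above. Note that the parser has already resolved the character references by the time the application sees the text, so the code works on characters rather than on the references themselves.

```python
import xml.etree.ElementTree as ET

PUA_OFFSET = 0xE000  # start of the Private Use Area

def from_pua(s: str) -> str:
    """Map characters E000..E01F back to the control characters they stand for."""
    out = []
    for ch in s:
        low = ord(ch) - PUA_OFFSET
        if 0 <= low < 0x20 and low not in (0x9, 0xA, 0xD):
            out.append(chr(low))   # subtract E000 to recover the original
        else:
            out.append(ch)         # everything else passes through unchanged
    return "".join(out)

elem = ET.fromstring("<text>&#xE002;Hello World&#xE003;</text>")
restored = from_pua(elem.text)
print(repr(restored))  # '\x02Hello World\x03'
```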

Interestingly, the Microsoft Visio application uses the approach described above [1].

    Any other ASCII control character between 
    ASCII 0 and ASCII 31 (excluding ASCII 9, 10, 
    and 13) is considered an illegal Unicode 
    character by some XML parsers. As a result, 
    these characters are translated into special 
    character values in the Unicode Private Use 
    Area. The Private Use Area begins at 0xE000. 
    ASCII control characters are offset by the 
    value 0xE000 when emitted to XML for Visio. 
    Therefore, if a Visio shape's text contained 
    the character ASCII 11 (Hex 0x0B), it is 
    emitted as 0xE00B.


[1] http://msdn.microsoft.com/en-us/library/office/aa218415%28v=office.10%29.aspx
