Re: [xml-dev] Storing illegal XML 1.0 characters in the Unicode PrivateUse Area
- From: Julian Reschke <julian.reschke@gmx.de>
- To: "Costello, Roger L." <costello@mitre.org>
- Date: Fri, 02 Nov 2012 15:39:18 +0100
On 2012-10-31 19:04, Costello, Roger L. wrote:
> Hi Folks,
>
> Here are the hex values for the Unicode characters that are permitted in XML 1.0 documents:
>
> Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
>
> Notice that the hex values from E000 to F8FF are legal XML characters.
>
> Interestingly, the hex values from E000 to F8FF have no characters assigned to them. That region is called the Private Use Area (PUA).
>
> Also notice that the hex values 0-8, B-C, and E-1F are not legal XML 1.0 characters.
>
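The Char production quoted above translates almost directly into a range check. A minimal sketch in Python, where the function name `is_xml10_char` and the list `illegal_c0` are illustrative names and not part of any library:

```python
def is_xml10_char(cp: int) -> bool:
    """Return True if code point cp matches the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


# Code points below 0x20 that the production excludes.
illegal_c0 = [cp for cp in range(0x20) if not is_xml10_char(cp)]
```

Tab, line feed, and carriage return pass the check; every other code point below hex 20 ends up in `illegal_c0`.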
> Suppose you are dealing with an application that emits text and some of the text contains characters that are illegal in XML 1.0. If you were to blindly wrap that text in markup and hand it to an XML parser, the parser would give an error saying that the document contains illegal characters.
>
> So what do you do?
>
> One approach is to move any illegal characters into the Private Use Area: for each illegal character add hex E000. Thus,
>
> map hex 0 to E000
> map hex 1 to E001
> map hex 2 to E002
> map hex 3 to E003
> ...
> map hex 1F to E01F
>
> So this text (2 denotes hex two, 3 denotes hex three):
>
> 2Hello World3
>
> is converted to this XML:
>
> <text>Hello World</text>
>
> Applications that process the XML document must be smart enough to subtract E000 from every character reference that falls in the mapped range (E000 to E01F) of the Private Use Area.
>
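The mapping and its inverse could be sketched roughly as follows. This is only an illustration of the offset scheme described above, not the code Visio or any other product uses; `PUA_OFFSET`, `to_pua`, and `from_pua` are made-up names, and the round trip goes through Python's standard `xml.etree.ElementTree`:

```python
import xml.etree.ElementTree as ET

PUA_OFFSET = 0xE000  # start of the Private Use Area


def _is_illegal_c0(cp: int) -> bool:
    """True for control code points below 0x20 that XML 1.0 rejects."""
    return cp < 0x20 and cp not in (0x9, 0xA, 0xD)


def to_pua(text: str) -> str:
    """Shift each illegal control character up by PUA_OFFSET."""
    return "".join(
        chr(ord(c) + PUA_OFFSET) if _is_illegal_c0(ord(c)) else c
        for c in text
    )


def from_pua(text: str) -> str:
    """Undo to_pua: shift mapped PUA characters back down by PUA_OFFSET."""
    return "".join(
        chr(ord(c) - PUA_OFFSET)
        if ord(c) >= PUA_OFFSET and _is_illegal_c0(ord(c) - PUA_OFFSET)
        else c
        for c in text
    )


# Round trip: wrap the mapped text in markup, serialize, parse, map back.
original = "\x02Hello World\x03"
elem = ET.Element("text")
elem.text = to_pua(original)
serialized = ET.tostring(elem)      # bytes of a well-formed XML element
parsed = ET.fromstring(serialized)
restored = from_pua(parsed.text)
```

One caveat with this sketch: if the source text already contains characters in the range E000 to E01F, `from_pua` cannot tell them apart from mapped control characters, so the scheme is only lossless when that range is otherwise unused.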
> Interestingly, the Microsoft Visio application uses the approach described above [1].
>
> Any other ASCII control character between
> ASCII 0 and ASCII 31 (excluding ASCII 9, 10,
> and 13) is considered an illegal Unicode
> character by some XML parsers. As a result,
> these characters are translated into special
> character values in the Unicode Private Use
> Area. The Private Use Area begins at 0xE000.
> ASCII control characters are offset by the
> value 0xE000 when emitted to XML for Visio.
> Therefore, if a Visio shape's text contained
> the character ASCII 11 (Hex 0x0B), it is
> emitted as 0xE00B.
>
> /Roger
>
> [1] http://msdn.microsoft.com/en-us/library/office/aa218415%28v=office.10%29.aspx
> ...
Also used in
<http://www.day.com/specs/jcr/2.0/3_Repository_Model.html#3.2.5.4%20Exposing%20Non-JCR%20Names>.
Best regards, Julian