Storing illegal XML 1.0 characters in the Unicode Private Use Area
- From: "Costello, Roger L." <costello@mitre.org>
- To: "xml-dev@lists.xml.org" <xml-dev@lists.xml.org>
- Date: Wed, 31 Oct 2012 18:04:33 +0000
Hi Folks,
Here are the hex values for the Unicode characters that are permitted in XML 1.0 documents:
Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
Notice that the hex values from E000 to F8FF are legal XML characters.
Interestingly, Unicode assigns no characters to the hex values from E000 to F8FF. That region is called the Private Use Area (PUA), and what its code points mean is left to private agreement between applications.
Also notice that the hex values 0-8, B-C, and E-1F are not legal XML 1.0 characters.
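The Char production can be checked mechanically. Here is a minimal Python sketch; the function name is_xml10_char is my own and does not come from any library:

```python
def is_xml10_char(cp: int) -> bool:
    """Return True if code point cp matches the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )

# List the code points below hex 20 that the production excludes.
illegal_below_space = [hex(cp) for cp in range(0x20) if not is_xml10_char(cp)]
print(illegal_below_space)
```

The list that is printed contains every control character below hex 20 except tab, line feed, and carriage return.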
Suppose you are dealing with an application that emits text and some of the text contains characters that are illegal in XML 1.0. If you were to blindly wrap that text in markup and hand it to an XML parser, the parser would give an error saying that the document contains illegal characters.
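To see the failure for yourself, here is a small sketch using Python's standard xml.etree.ElementTree module. The sample text is made up for illustration, and the exact error message depends on the parser:

```python
import xml.etree.ElementTree as ET

# Text emitted by some application; it contains hex 2 and hex 3.
raw = "\x02Hello World\x03"

# Blindly wrap the text in markup.
doc = "<text>" + raw + "</text>"

try:
    ET.fromstring(doc)
    parsed = True
except ET.ParseError as err:
    parsed = False
    print("parser rejected the document:", err)
```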
So what do you do?
One approach is to move any illegal characters into the Private Use Area: for each illegal character add hex E000. Thus,
map hex 0 to E000
map hex 1 to E001
map hex 2 to E002
map hex 3 to E003
...
map hex 1F to E01F
So this text (2 denotes hex two, 3 denotes hex three):
2Hello World3
is converted to this XML:
<text>Hello World</text>
Applications that process the XML document must be smart enough to subtract E000 from every character in the range E000 to E01F in order to recover the original character.
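Here is a sketch of the receiving side, again in Python with the standard xml.etree.ElementTree module. The name restore_controls is hypothetical. The sketch assumes the sender only ever shifted control characters that are illegal in XML 1.0.

```python
import xml.etree.ElementTree as ET

PUA_OFFSET = 0xE000  # start of the Private Use Area


def restore_controls(text: str) -> str:
    """Map characters in E000-E01F back to the control characters they stand for."""
    out = []
    for ch in text:
        shifted = ord(ch) - PUA_OFFSET
        # Only undo the shift for values that are illegal in XML 1.0.
        if 0 <= shifted < 0x20 and shifted not in (0x9, 0xA, 0xD):
            out.append(chr(shifted))
        else:
            out.append(ch)
    return "".join(out)


elem = ET.fromstring("<text>&#xE002;Hello World&#xE003;</text>")
restored = restore_controls(elem.text)
print(repr(restored))
```

One caveat: if the original text already contained genuine Private Use Area characters in that range, the receiver cannot tell them apart from shifted control characters, so the round trip is not lossless in that case.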
Interestingly, the Microsoft Visio application uses the approach described above [1]:
Any other ASCII control character between
ASCII 0 and ASCII 31 (excluding ASCII 9, 10,
and 13) is considered an illegal Unicode
character by some XML parsers. As a result,
these characters are translated into special
character values in the Unicode Private Use
Area. The Private Use Area begins at 0xE000.
ASCII control characters are offset by the
value 0xE000 when emitted to XML for Visio.
Therefore, if a Visio shape's text contained
the character ASCII 11 (Hex 0x0B), it is
emitted as 0xE00B.
/Roger
[1] http://msdn.microsoft.com/en-us/library/office/aa218415%28v=office.10%29.aspx