Couldn't illegal XML characters be used simply by escaping them?
- From: "Costello, Roger L." <costello@mitre.org>
- To: "xml-dev@lists.xml.org" <xml-dev@lists.xml.org>
- Date: Sat, 10 Nov 2012 13:08:29 +0000
Hi Folks,
This week I was in a discussion where the topic of illegal XML characters came up, and someone asked: "Couldn't illegal XML characters simply be escaped?"
Here is my response. Is it correct? Complete? Easy to understand?
We need to distinguish between a reserved XML character and an illegal character.
The '<' symbol is a reserved XML character. If data contains that symbol it will confuse an XML Parser because the Parser will think, "Oh, a new element is being started."
For example, consider this:
<Equation>if A < B then ...</Equation>
That '<' symbol needs to be escaped. We can escape it using the built-in &lt; entity or the decimal or the hexadecimal value of the symbol. Let's do the latter:
<Equation>if A &#x3C; B then ...</Equation>
Now the XML Parser is not confused into thinking that the XML is trying to start a new element. Note that the XML Parser does resolve the character entity reference and the output of the Parser is this:
<Equation>if A < B then ...</Equation>
We've made it past the Parser, so that '<' symbol is no longer a problem.
An important thing to note is that the '<' symbol is (obviously) a legal character.
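The round trip for the reserved '<' symbol can be checked with a short script. This is only a sketch: it assumes Python's standard xml.etree.ElementTree as the Parser (my choice for illustration, not something from the discussion), and the variable names are made up.

```python
import xml.etree.ElementTree as ET

# An unescaped '<' inside character data: the Parser rejects the document.
try:
    ET.fromstring("<Equation>if A < B then ...</Equation>")
    raw_accepted = True
except ET.ParseError:
    raw_accepted = False

# Three equivalent ways to escape '<': the built-in entity, the decimal
# character reference, and the hexadecimal character reference.
escaped_forms = ["&lt;", "&#60;", "&#x3C;"]
resolved = [
    ET.fromstring(f"<Equation>if A {form} B then ...</Equation>").text
    for form in escaped_forms
]

print(raw_accepted)
print(resolved)
```

All three escaped forms should come out of the Parser as the same text, with the '<' symbol restored.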
The XML 1.0 specification lists those characters that may be used in an XML document (see below for a partial list). So some characters cannot be used in XML documents. For example, hex 0 (null) is not a legal XML character.
[Person I was talking to] your suggestion is to escape illegal characters like so:
<Test> Here is a null character: &#x0;</Test>
What will an XML Parser do with that character entity reference? It will resolve it (let (null) represent the null character):
<Test> Here is a null character: (null)</Test>
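But that result would be an XML document containing an illegal character, which is not allowed: the XML 1.0 specification requires that a character reference refer to a legal XML character. Thus the Parser throws an error.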
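Here is a quick way to try this out. It is a sketch that assumes Python's standard xml.etree.ElementTree as the Parser (other Parsers may report the error differently), and parses() is a hypothetical helper name of my own.

```python
import xml.etree.ElementTree as ET


def parses(document: str) -> bool:
    """Return True if the Parser accepts the document, False if it reports an error."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


# A character reference to the null character (illegal in XML 1.0).
null_reference = parses("<Test>Here is a null character: &#x0;</Test>")

# A character reference to the tab character (legal in XML 1.0), for comparison.
tab_reference = parses("<Test>Here is a tab character: &#x9;</Test>")

print(null_reference, tab_reference)
```

The document with the escaped null character is rejected, while the one with the escaped tab character is accepted.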
Recap: reserved characters may be used where they ordinarily would cause confusion by escaping them. But illegal characters may never be used and escaping them does not help.
/Roger
Decimal value of
US-ASCII character | Is an XML character?
------------------------------------------
1 | No
2 | No
3 | No
4 | No
5 | No
6 | No
7 | No
8 | No
9 | Yes
10 | Yes
11 | No
12 | No
13 | Yes
14 | No
15 | No
16 | No
17 | No
18 | No
19 | No
20 | No
21 | No
22 | No
23 | No
24 | No
25 | No
26 | No
27 | No
28 | No
29 | No
30 | No
31 | No
32-127 | Yes
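The table above can be reproduced from the Char production in the XML 1.0 specification. A minimal sketch in Python follows; is_xml_char() is a hypothetical helper name, not part of any library.

```python
def is_xml_char(code_point: int) -> bool:
    """Check a code point against the XML 1.0 Char production:
    #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    """
    return (
        code_point in (0x9, 0xA, 0xD)
        or 0x20 <= code_point <= 0xD7FF
        or 0xE000 <= code_point <= 0xFFFD
        or 0x10000 <= code_point <= 0x10FFFF
    )


# The only legal characters below decimal 32 are tab, line feed, and carriage return.
legal_below_32 = [n for n in range(32) if is_xml_char(n)]

# Every US-ASCII character from decimal 32 through 127 is legal.
all_32_to_127_legal = all(is_xml_char(n) for n in range(32, 128))

print(legal_below_32, all_32_to_127_legal)
```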