Dangers of Copying Text into an XML Document
- From: "Costello, Roger L." <costello@mitre.org>
- To: <xml-dev@lists.xml.org>
- Date: Wed, 5 Sep 2007 11:10:39 -0400
Hi Folks,
I am compiling a list of well-formedness problems that may arise from
copying text from one document and pasting it into an XML document.
For example, consider this XML document:
<?xml version="1.0" encoding="UTF-8"?>
<Document>
<Para id="...">...</Para>
</Document>
Suppose that text is copied from a document and pasted into the XML
document, either as the content of the <Para> element or as the value
of the id attribute.
Here is my current list of problems:
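1. The text may contain these reserved characters: {<, >, ', ", &}.
Left unescaped, < and & introduce syntax errors in both element
content and attribute values, and a quote character that matches the
attribute's delimiter ends the attribute value early. These characters
may need to be escaped using the predefined entities (&lt;, &gt;,
&apos;, &quot;, &amp;).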
2. The editor that was used to create the text may use a different
encoding than the XML document's encoding. A byte sequence that
represents a character in one encoding may represent a different
character, or no character at all, in another encoding. Consequently,
if the bytes of the pasted text are not converted to the XML document's
encoding, the characters that end up in the XML document may not be
the ones that were copied. Example: Word uses Windows-1252 encoding,
in which the left curly (a.k.a. smart) quote is the single byte x93.
The Unicode code point for the left curly quote is U+201C, which UTF-8
encodes as the three bytes xE2 x80 x9C. A lone byte x93 is not valid
UTF-8, and the code point U+0093 is a control character. Copying a
left curly quote from a Word document and pasting it into a UTF-8 XML
document without conversion may therefore result in the XML document
receiving an invalid byte sequence or a control character rather than
a left curly quote.
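To make the two problems above concrete, here is a minimal Python
sketch using only the standard library. The helper name build_document
and the sample strings are my own, made up for illustration; the
<Document>/<Para> layout follows the example document above. It
escapes pasted text before placing it in element content and in the id
attribute, then shows what happens to the Windows-1252 byte for a left
curly quote when it is read under other encodings.

```python
import unicodedata
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr


def build_document(para_text: str, para_id: str) -> str:
    """Hypothetical helper: place pasted text into the example document."""
    # escape() replaces &, < and > in element content.
    # quoteattr() escapes the value and chooses a safe quote delimiter.
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        "<Document>\n"
        f"<Para id={quoteattr(para_id)}>{escape(para_text)}</Para>\n"
        "</Document>\n"
    )


# Problem 1: reserved characters in pasted text (sample strings are invented).
pasted_text = 'if a < b && c > "d" then it\'s fine'
pasted_id = 'he said "hi" & left'
xml_doc = build_document(pasted_text, pasted_id)

# The escaped document parses, and the original text round-trips.
root = ET.fromstring(xml_doc.encode("utf-8"))
para = root.find("Para")

# Problem 2: the same character has different bytes in different encodings.
left_quote = "\u201c"                        # LEFT DOUBLE QUOTATION MARK
cp1252_bytes = left_quote.encode("cp1252")   # b'\x93'
utf8_bytes = left_quote.encode("utf-8")      # b'\xe2\x80\x9c'

# A lone 0x93 byte is not valid UTF-8, so a strict decoder rejects it.
try:
    cp1252_bytes.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

# Misreading the byte as ISO-8859-1 yields U+0093, a control character.
misread = cp1252_bytes.decode("latin-1")

# Decoding with the encoding the text was really written in recovers the quote.
repaired = cp1252_bytes.decode("cp1252")

if __name__ == "__main__":
    print(xml_doc)
    print(para.text, para.get("id"))
    print(cp1252_bytes, utf8_bytes, decoded_ok)
    print(unicodedata.category(misread), repaired == left_quote)
```

The design point is that escaping and transcoding happen at the
boundary where the pasted text enters the document: decode with the
source's real encoding first, then escape, then serialize in the XML
document's declared encoding.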
Can you think of other problems that may result from copying text from
one document and pasting it into an XML document?
/Roger