Re: [xml-dev] Dangers of Copying Text into an XML Document
- From: "G. Ken Holman" <gkholman@CraneSoftwrights.com>
- To: <xml-dev@lists.xml.org>
- Date: Wed, 05 Sep 2007 11:43:34 -0400
At 2007-09-05 11:10 -0400, Costello, Roger L. wrote:
>I am compiling a list of well-formedness problems that may arise from
>copying text from one document and pasting it into an XML document.
>
>For example, consider this XML document:
>
><?xml version="1.0" encoding="UTF-8"?>
><Document>
> <Para id="...">...</Para>
></Document>
>
>Suppose that text is copied from a document and pasted into the XML
>document, either as the content of the <Para> element
Use
<![CDATA[<?xml version="1.0" encoding="UTF-8"?>
<Document>
<Para id="...">...</Para>
</Document>
]]>
>or as the value of the id attribute.
Ouch! You've got some character editing to do then ... you'll have
to individually mark up your sensitive markup characters.
>Here is my current list of problems:
>
>1. The text may contain these reserved characters: {<, >, ', ", &}.
>These characters may introduce syntax errors into the XML document and
>may need to be escaped.
Not a problem with element content ... laborious with attribute values.
>2. The editor that was used to create the text may use a different
>encoding than the XML document's encoding. A binary string that
>represents a character in one encoding may represent a different
>character in another encoding. Consequently, if the text was created
>in an editor that uses a different encoding than the XML document then
>the characters that result from pasting the text into the XML document
>may not be the same.
Usually the answer isn't related to either application's character
encoding of the files ... if the application has appropriately
created internally a set of Unicode characters when translating from
the external document encoding, then the copy/paste functions between
Unicode-aware applications will be working with the abstract Unicode
character, only realizing a particular encoding when the application
writes a file.
>Example: Word uses Windows-1252 encoding. The hex
>value for the left curly (a.k.a. smart) quote is x93. In UTF-8 encoding
>the hex value for the left curly quote is x201C. In UTF-8 the hex value
>x93 corresponds to a control character. Copying a left curly quote
>from a Word document and pasting it into a UTF-8 XML document may
>result in the XML document receiving a control character rather than a
>left curly quote.
This discussion came up in just the last few days. Copying from Word
to Notepad appeared to use the abstract characters and not the
encoding sequences.
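To illustrate the difference between the abstract character and its encodings, here is a minimal Python sketch. It only assumes the standard codecs named "cp1252", "utf-8" and "latin-1"; the variable names are made up for the illustration. It shows the same character, U+201C, realized as different byte sequences, and what happens only if the raw Windows-1252 byte is misread instead of the character being passed along.

```python
import unicodedata

# The abstract character: LEFT DOUBLE QUOTATION MARK, code point U+201C.
left_quote = "\u201c"

# The same character realized in two different encodings.
cp1252_bytes = left_quote.encode("cp1252")   # a single byte, 0x93
utf8_bytes = left_quote.encode("utf-8")      # three bytes, 0xE2 0x80 0x9C

# Misreading the raw Windows-1252 byte as ISO-8859-1 yields U+0093,
# which is a control character, not a quotation mark.
misread = cp1252_bytes.decode("latin-1")
misread_category = unicodedata.category(misread)  # "Cc" means control

# The lone byte 0x93 is not even valid UTF-8 on its own.
try:
    cp1252_bytes.decode("utf-8")
    lone_byte_is_valid_utf8 = True
except UnicodeDecodeError:
    lone_byte_is_valid_utf8 = False
```

A clipboard that carries Unicode characters never takes the "misread" path; that only arises when raw bytes are moved from one file to another under the wrong encoding label.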
>Can you think of other problems that may result from copying text from
>one document and pasting it into an XML document?
"problems"? I suppose it is just a matter of how XML-aware your
application doing the pasting is. If you are just using a simple
text editor then you can't expect it to do much and the onus is on you.
If you work on the clipboard with Unicode characters then you should
be insulated from encoding problems.
Pasting into element content and pasting into attribute values have
different rules, so just be sensitive to the requirements. CDATA is
a handy way of doing it in element content. The only characters you
need to escape in attribute content are "<", "&", and whichever of
the single or double quotes you use for your attribute literal
delimiter ... the ">" and other quote do not have to be escaped.
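As a rough sketch of that attribute rule in Python: the helper name escape_attr and its delimiter parameter are hypothetical, invented for this illustration, and the round trip uses the standard xml.etree.ElementTree parser only to check the result.

```python
import xml.etree.ElementTree as ET


def escape_attr(value: str, delimiter: str = '"') -> str:
    """Escape only "&", "<" and the quote used as the attribute delimiter."""
    if delimiter not in ('"', "'"):
        raise ValueError("delimiter must be a single or double quote")
    # The ampersand goes first so the entities added below are not re-escaped.
    escaped = value.replace("&", "&amp;").replace("<", "&lt;")
    entity = "&quot;" if delimiter == '"' else "&apos;"
    return escaped.replace(delimiter, entity)


pasted = 'a < b & "c" > \'d\''
escaped = escape_attr(pasted)

# Round trip: a parser should hand back exactly the pasted text.
doc = ET.fromstring('<Para id="' + escaped + '"/>')
round_trip = doc.get("id")
```

Note that the ">" and the single quote pass through untouched when the double quote is the delimiter.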
Now if the XML content you have contains a CDATA section and you are
pasting that into element content, you have to create two CDATA
sections. This is the challenge in a hands-on exercise in the XML
class I deliver.
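One common way to do the two-section split, sketched in Python: the function name wrap_in_cdata is hypothetical, and this is not necessarily the solution used in the class exercise. The idea is to end the first section between the "]]" and the ">" of any embedded terminator and start a second section immediately.

```python
import xml.etree.ElementTree as ET


def wrap_in_cdata(text: str) -> str:
    """Wrap text in CDATA, splitting wherever the text itself contains "]]>"."""
    # "]]>" cannot appear inside a CDATA section, so close the section
    # after "]]" and reopen a new one before ">".
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"


pasted = "x <![CDATA[ inner & < ]]> y"
wrapped = wrap_in_cdata(pasted)

# Round trip: the parser joins the adjacent sections back into one string.
round_trip = ET.fromstring("<Para>" + wrapped + "</Para>").text
```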
I hope this helps.
. . . . . . . . . . . Ken
--
Upcoming public training: XSLT/XSL-FO Sep 10, UBL/code lists Oct 1
World-wide corporate, govt. & user group XML, XSL and UBL training
RSS feeds: publicly-available developer resources and training
G. Ken Holman mailto:gkholman@CraneSoftwrights.com
Crane Softwrights Ltd. http://www.CraneSoftwrights.com/m/
Box 266, Kars, Ontario CANADA K0A-2E0 +1(613)489-0999 (F:-0995)
Male Cancer Awareness Jul'07 http://www.CraneSoftwrights.com/m/bc
Legal business disclaimers: http://www.CraneSoftwrights.com/legal