XML.org: OASIS Mailing List Archives

 


Re: [xml-dev] Dangers of Copying Text into an XML Document

At 2007-09-05 11:10 -0400, Costello, Roger L. wrote:
>I am compiling a list of well-formedness problems that may arise from
>copying text from one document and pasting it into an XML document.
>
>For example, consider this XML document:
>
><?xml version="1.0" encoding="UTF-8"?>
><Document>
>       <Para id="...">...</Para>
></Document>
>
>Suppose that text is copied from a document and pasted into the XML
>document, either as the content of the <Para> element

Use
<![CDATA[<?xml version="1.0" encoding="UTF-8"?>
<Document>
       <Para id="...">...</Para>
</Document>
]]>

>or as the value of the id attribute.

Ouch!  You've got some character editing to do then ... you'll have 
to individually mark up your sensitive markup characters.

>Here is my current list of problems:
>
>1. The text may contain these reserved characters: {<, >, ', ", &}.
>These characters may introduce syntax errors into the XML document and
>may need to be escaped.

Not a problem with element content ... laborious with attribute values.
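
For element content, escaping can also be left to a library routine rather than done by hand. A minimal sketch in Python, assuming the standard library's xml.sax.saxutils.escape is acceptable for the job; the pasted string is made up for illustration:

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Illustrative pasted text containing several of the reserved characters.
pasted = 'if a < b && c > d then say "hi"'

# escape() replaces &, < and > with entity references. Quotes are left
# alone, which is fine inside element content.
para = "<Para>" + escape(pasted) + "</Para>"

# Round-trip through a parser to confirm the text survives unchanged.
assert ET.fromstring(para).text == pasted
```

Escaping ">" is not strictly required in element content (except in the sequence "]]>"), but it is harmless.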

>2. The editor that was used to create the text may use a different
>encoding than the XML document's encoding. A binary string that
>represents a character in one encoding may represent a different
>character in another encoding.  Consequently, if the text was created
>in an editor that uses a different encoding than the XML document then
>the characters that result from pasting the text into the XML document
>may not be the same.

Usually the problem isn't related to either application's file 
encoding. If each application has properly converted its external 
document encoding into Unicode characters internally, then copy/paste 
between Unicode-aware applications works with abstract Unicode 
characters; a particular encoding is realized only when an 
application writes a file.

>Example: Word uses Windows-1252 encoding. The hex
>value for the left curly (a.k.a. smart) quote is x93. In UTF-8 encoding
>the hex value for the left curly quote is x201C. In UTF-8 the hex value
>x93 corresponds to a control character.  Copying a left curly quote
>from a Word document and pasting it into a UTF-8 XML document may
>result in the XML document receiving a control character rather than a
>left curly quote.

This discussion came up in just the last few days.  Copying from Word 
to Notepad appeared to use the abstract characters and not the 
encoding sequences.
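
The distinction between the abstract character and its encoded bytes can be sketched in Python. Note that 201C is the Unicode code point of the left curly quote, not its UTF-8 byte sequence; the variable names below are illustrative only:

```python
import unicodedata

# The abstract character: LEFT DOUBLE QUOTATION MARK, code point U+201C.
left_quote = "\u201c"

# The same character realized in two different file encodings.
cp1252_bytes = left_quote.encode("cp1252")  # one byte: 0x93
utf8_bytes = left_quote.encode("utf-8")     # three bytes: 0xE2 0x80 0x9C

# A lone 0x93 byte is not valid UTF-8, so raw bytes copied between
# encodings would be rejected by a strict UTF-8 decoder.
try:
    cp1252_bytes.decode("utf-8")
    lone_byte_is_valid_utf8 = True
except UnicodeDecodeError:
    lone_byte_is_valid_utf8 = False

# Read as a code point instead (as Latin-1 does), 0x93 is a C1 control
# character, which is the corruption described in the quoted example.
misread = cp1252_bytes.decode("latin-1")
misread_category = unicodedata.category(misread)  # "Cc" means control
```

A clipboard that carries Unicode characters hands over U+201C itself, so neither byte sequence matters until the receiving application saves the file.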

>Can you think of other problems that may result from copying text from
>one document and pasting it into an XML document?

"problems"?  I suppose it is just a matter of how XML-aware your 
application doing the pasting is.  If you are just using a simple 
text editor then you can't expect it to do much and the onus is on you.

If you work on the clipboard with Unicode characters then you should 
be insulated from encoding problems.

Pasting into element content and pasting into attribute values have 
different rules, so just be sensitive to the requirements.  CDATA is 
a handy way of doing it in element content.  The only characters you 
need to escape in attribute content are "<", "&", and whichever of 
the single or double quotes you use for your attribute literal 
delimiter ... the ">" and other quote do not have to be escaped.

Now if the XML content you have contains a CDATA section and you are 
pasting that into element content, you have to create two CDATA 
sections.  This is the challenge in a hands-on exercise in the XML 
class I deliver.
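
One way to see why two sections are needed: the "]]>" inside the pasted content would end the outer CDATA section early, so it has to be split across a section boundary. A minimal sketch in Python; wrap_in_cdata is a hypothetical helper and the pasted fragment is made up for illustration:

```python
import xml.etree.ElementTree as ET

def wrap_in_cdata(text: str) -> str:
    """Wrap text in CDATA, splitting any embedded "]]>" across two sections."""
    # The first section ends just after "]]" and the next one starts with
    # ">", so the closing sequence never appears inside a single section.
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"

# Pasted content that itself contains a CDATA section.
pasted = "<Note><![CDATA[1 < 2]]></Note>"
doc = "<Para>" + wrap_in_cdata(pasted) + "</Para>"

# The parser joins adjacent sections back into the original text.
assert ET.fromstring(doc).text == pasted
```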

I hope this helps.

. . . . . . . . . . . Ken

--
Upcoming public training: XSLT/XSL-FO Sep 10, UBL/code lists Oct 1
World-wide corporate, govt. & user group XML, XSL and UBL training
RSS feeds:     publicly-available developer resources and training
G. Ken Holman                 mailto:gkholman@CraneSoftwrights.com
Crane Softwrights Ltd.          http://www.CraneSoftwrights.com/m/
Box 266, Kars, Ontario CANADA K0A-2E0    +1(613)489-0999 (F:-0995)
Male Cancer Awareness Jul'07  http://www.CraneSoftwrights.com/m/bc
Legal business disclaimers:  http://www.CraneSoftwrights.com/legal




Copyright 1993-2007 XML.org. This site is hosted by OASIS