XML.org: OASIS Mailing List Archives
RE: [xml-dev] [Summary] Dangers of Copying Text into an XML Document


Excellent!  Thanks David.  

I have re-worded:

Example: Microsoft Word uses Windows-1252 encoding, in which the left
curly (a.k.a. smart) quote is the single byte x93. In UTF-8 encoding the
left curly quote is a three-byte sequence of hex codes xE2 x80 x9C, and
a lone byte x93 does not represent any character. Copying a left curly
quote from a Word document and pasting it into a UTF-8 XML document can
result in the XML document receiving a byte sequence that cannot be
decoded as UTF-8.
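
A minimal Python sketch of the example above, offered as an
illustration rather than as part of the summary. The names are
hypothetical, and it assumes the pasted text arrives as raw
Windows-1252 bytes with no conversion by the clipboard or the editor:

```python
# U+201C is the Unicode code point of the left curly quote.
LEFT_CURLY_QUOTE = "\u201c"

# The same character as bytes in the two encodings discussed above.
cp1252_bytes = LEFT_CURLY_QUOTE.encode("cp1252")  # single byte 0x93
utf8_bytes = LEFT_CURLY_QUOTE.encode("utf-8")     # bytes 0xE2 0x80 0x9C


def decodes_as_utf8(data: bytes) -> bool:
    """Return True if the byte string is a valid UTF-8 sequence."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Assumption: the Windows-1252 byte is dropped unchanged into a
# document that declares itself to be UTF-8.
pasted_document = b"<quote>" + cp1252_bytes + b"Hello</quote>"
correct_document = b"<quote>" + utf8_bytes + b"Hello</quote>"

print(decodes_as_utf8(pasted_document))
print(decodes_as_utf8(correct_document))
```

The first document fails to decode because a lone x93 byte is not a
valid UTF-8 sequence; the second decodes normally.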

Is it stated accurately now?  If it is correct, I will update the
summary with this version.

/Roger

-----Original Message-----
From: David Carlisle [mailto:davidc@nag.co.uk] 
Sent: Thursday, September 06, 2007 12:45 PM
To: Costello, Roger L.
Cc: xml-dev@lists.xml.org
Subject: Re: [xml-dev] [Summary] Dangers of Copying Text into an XML Document



> . In UTF-8 encoding the hex value for the left curly quote is x201C, 

No, that's the Unicode value (in hex), but in UTF-8 the character is
represented as a multi-byte sequence (the three bytes with hex values
E2 80 9C).

The document should be careful to distinguish Unicode from its
encodings as a sequence of bytes (since it is mainly encoding errors
that it is describing).

> Copying a left curly quote from a Word document and pasting it into a
> UTF-8 XML document may result in the XML document receiving an illegal
> character.

That wording makes it sound as if you'd get the same sort of error as if
you'd included a control character in the document, that is, a valid
Unicode character that is not allowed in XML. What you'd get in this
case is a byte stream that could not be decoded using UTF-8, so there
would be no characters to pass to the XML parser at all.
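
The distinction drawn above can be sketched in Python. This is only an
illustration: the function names and the returned labels are
hypothetical, and the character ranges are those of the Char production
in XML 1.0.

```python
def is_xml_char(ch: str) -> bool:
    """True if ch is allowed by the XML 1.0 Char production."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def classify(data: bytes) -> str:
    """Separate an encoding error from an illegal-character error."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        # The bytes are not UTF-8, so no characters exist to check.
        return "undecodable"
    if all(is_xml_char(ch) for ch in text):
        return "ok"
    # Decoding worked, but some character is not allowed in XML.
    return "illegal-character"


print(classify(b"<q>\x93</q>"))          # lone Windows-1252 byte
print(classify(b"<q>\x01</q>"))          # control character U+0001
print(classify(b"<q>\xe2\x80\x9c</q>"))  # UTF-8 left curly quote
```

The x93 case never reaches the character check at all, whereas the
control character decodes cleanly and is rejected only afterwards.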


David

http://people.w3.org/rishida/scripts/uniview/conversion


