OASIS Mailing List Archives


Re: [xml-dev] Copying text (curly quotes) from Word into an XML document (UTF-8): what happens?

Windows-1252 single byte character codes have curly quotes at 147 and
148 decimal  <http://en.wikipedia.org/wiki/Windows-1252>. These are
10010011 and 10010100 binary.
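The byte values above can be checked with a short Python sketch. It assumes Python's built-in "cp1252" codec as the implementation of Windows-1252; the variable names are only illustrative.

```python
# Windows-1252 byte values for the two double curly quotes (decimal 147, 148).
raw = bytes([147, 148])

# Show each byte in binary and in hex.
bits = [format(b, "08b") for b in raw]
print(bits)        # ['10010011', '10010100']
print(raw.hex())   # 9394

# Decoding as Windows-1252 yields the Unicode quote characters.
quotes = raw.decode("cp1252")
print([f"U+{ord(ch):04X}" for ch in quotes])   # ['U+201C', 'U+201D']
```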

UTF-8 multibyte characters start with as many repeated 1s in their
most significant bits as there are bytes in the sequence, then a zero,
then data bits; every following byte starts with 10. For example, code
point 147 would be 110/00010 10/010011 (slash splits control bits from
data bits). UTF-8 single byte sequences always have 0 as the most
significant bit. So 10010011 cannot be a single byte UTF-8 character
(msb is not zero) or the first byte of a multi-byte sequence (a leading
10 would indicate only one byte, which is not how a single byte is
encoded; 10 is the prefix of a continuation byte).
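The leading-bit rules described above can be sketched in Python. The helper `classify_utf8_byte` is a hypothetical name introduced for illustration; it looks only at the bit pattern and does not check the further restrictions real decoders apply (overlong forms, surrogates, the U+10FFFF limit).

```python
def classify_utf8_byte(b: int) -> str:
    """Classify one byte by its leading bits only (illustrative helper)."""
    if b & 0b10000000 == 0:
        return "single-byte character (0xxxxxxx)"
    if b & 0b11000000 == 0b10000000:
        return "continuation byte (10xxxxxx)"
    if b & 0b11100000 == 0b11000000:
        return "lead byte of a 2-byte sequence (110xxxxx)"
    if b & 0b11110000 == 0b11100000:
        return "lead byte of a 3-byte sequence (1110xxxx)"
    if b & 0b11111000 == 0b11110000:
        return "lead byte of a 4-byte sequence (11110xxx)"
    return "not a UTF-8 bit pattern"

# The Windows-1252 curly-quote byte has the continuation-byte pattern.
print(classify_utf8_byte(0x93))

# A continuation byte with no lead byte before it is rejected by the decoder.
try:
    bytes([0x93]).decode("utf-8")
    lone_byte_is_valid = True
except UnicodeDecodeError:
    lone_byte_is_valid = False
print(lone_byte_is_valid)   # False

# Code point 147 needs two bytes in UTF-8: 110/00010 10/010011.
encoded = chr(147).encode("utf-8")
print([format(b, "08b") for b in encoded])   # ['11000010', '10010011']
```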

> From: Costello, Roger L. [mailto:costello@mitre.org]
> 1. Is the curly quote a valid UTF-8 character?
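Yes. The left double curly quote is U+201C, which UTF-8 encodes as hex
E2 80 9C. Hex C2 93 is the UTF-8 encoding of code point 147 (U+0093),
which is a control character rather than the quote.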
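A quick way to check which bytes belong to which character is to ask Python's codecs directly. This is a minimal sketch, assuming the question refers to the left double curly quote (U+201C) and using Python's "cp1252" codec for Windows-1252.

```python
left_quote = "\u201c"   # LEFT DOUBLE QUOTATION MARK

# UTF-8 needs three bytes for this character.
utf8_bytes = left_quote.encode("utf-8")
print(utf8_bytes.hex(" "))    # e2 80 9c

# Windows-1252 stores the same character in one byte.
cp1252_bytes = left_quote.encode("cp1252")
print(cp1252_bytes.hex())     # 93

# The two-byte sequence C2 93 decodes to code point U+0093, not to the quote.
c1 = b"\xc2\x93".decode("utf-8")
print(f"U+{ord(c1):04X}")     # U+0093
```

So the Windows-1252 and UTF-8 byte sequences for the curly quote are different, and a lone byte 93 is not valid UTF-8.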

> 2. Word uses Windows-1252 encoding, correct?

> 3. The curly quote in Windows-1252 has a specific binary sequence, correct?
Yes, hex 93

> 4. When I copy the curly quote from Word into Notepad, the operating
> system does a straight 1-1 copy of the binary sequence, correct?
I believe the encoding of data on the clipboard is indicated by a
mechanism similar to a MIME type, and it's up to the source and target
applications to set the data and interpret it correctly.

> 5. When I copy the curly quote from Word into Notepad, there is no
> conversion or translation of the binary sequence by the operating
> system, correct?
It's up to the application.

> 6. Assuming the curly quote is a valid UTF-8 character, is the
> Windows-1252 curly quote binary sequence the same as the UTF-8 curly
> quote binary sequence?

> 7. Is the Windows-1252 curly quote binary sequence illegal in UTF-8,
> i.e. the Windows-1252 curly quote binary sequence doesn't correspond to
> any UTF-8 character?

> 8. Suppose I save the Word document as XML, and then I open the XML
> using Notepad. The curly quotes no longer appear as curly quotes;
> instead they appear as a bizarre character.  Why does the curly quote
> now look like a bizarre character in Notepad, whereas when I copied the
> curly quote from Word to Notepad it looked fine in Notepad?
Notepad doesn't understand UTF-8 encoded files.
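Whatever the exact cause in Notepad's case, the symptom is consistent with UTF-8 bytes being displayed as if they were Windows-1252. A minimal Python sketch of that misreading, assuming the saved XML is UTF-8 and the viewer assumes "cp1252":

```python
# The left double curly quote as it would be stored in a UTF-8 file.
utf8_bytes = "\u201c".encode("utf-8")     # bytes E2 80 9C

# A viewer that assumes Windows-1252 shows one character per byte.
misread = utf8_bytes.decode("cp1252")
print(misread)                            # â€œ

# Nothing is lost: reversing the wrong step recovers the quote.
recovered = misread.encode("cp1252").decode("utf-8")
print(recovered == "\u201c")              # True
```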


Copyright 1993-2007 XML.org. This site is hosted by OASIS