OASIS Mailing List Archives

Re: [xml-dev] Copying text (curly quotes) from Word into an XML document (UTF-8): what happens?

At 2007-09-02 22:46 +0100, Pete Kirkham wrote:
>Windows-1252 single byte character codes have curly quotes at 147 and
>148 decimal  <http://en.wikipedia.org/wiki/Windows-1252>. These are
>10010011 and 10010100 binary.
>UTF-8 multibyte characters start with as many repeated 1s in their
>most significant bits as there are bytes in the sequence, then a zero,
>then data bits. For example, 147 would be 110/00010 10/010011 (slash
>splits control bits from data bits). UTF-8 single byte sequences
>always have 0 as most significant bit. So 10010011 cannot be a single
>byte UTF-8 character (msb is not zero) or the first byte of a
>multi-byte sequence (10 would indicate only one byte, which is not
> > From: Costello, Roger L. [mailto:costello@mitre.org]
> >
> > 1. Is the curly quote a valid UTF-8 character?
>Yes, it has the byte sequence hex C2 93

1100 0010 1001 0011 in UTF-8 is Unicode U+0093 which is a control character:

  Unicode character data base:
  0093;<control>;Cc;0;BN;;;;;N;SET TRANSMIT STATE;;;;

The "right single quotation mark" is U+2019:


which translates into three UTF-8 bytes:

E2 80 99

1110 0010 1000 0000 1001 1001
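
The two byte sequences above can be checked with a short script. This is only a sketch, assuming Python 3 and its standard "utf-8" codec; the variable names are mine, for illustration:

```python
import unicodedata

# Bytes C2 93 are well-formed UTF-8, but they decode to the C1 control U+0093.
control = bytes([0xC2, 0x93]).decode("utf-8")
print(f"U+{ord(control):04X}", unicodedata.category(control))

# RIGHT SINGLE QUOTATION MARK (U+2019) takes three bytes in UTF-8.
quote_bytes = "\u2019".encode("utf-8")
print(" ".join(f"{b:02X}" for b in quote_bytes))   # hex form
print(" ".join(f"{b:08b}" for b in quote_bytes))   # binary form
```

The general category "Cc" reported for the first character is the same "Cc" field that appears in the character data base entry quoted above.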

BTW, you are asking here about the singular "curly quote", yet above you 
are asking about "curly quotes" ... the entries in Unicode are:


> > 2. Word uses Windows-1252 encoding, correct?

You get your choice ... when you save a text file you can specify 
"Other encoding" and select Unicode, UTF-8, UTF-7, or many others.

> > 3. The curly quote in Windows-1252 has a specific binary sequence, correct?
>Yes, hex 93

From the Wikipedia citation above, I see the following (though I 
don't see formal character names, so I'm guessing these are the Unicode names):

hex 91 is left single quotation mark
hex 92 is right single quotation mark
hex 93 is left double quotation mark
hex 94 is right double quotation mark
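
The guessed names can be confirmed against a Windows-1252 mapping table. A minimal sketch, assuming Python 3, whose "cp1252" codec I take to match the Windows code page for these four bytes; cp1252_name is a hypothetical helper, not part of any library:

```python
import unicodedata

def cp1252_name(byte_value):
    """Return the Unicode code point and name for one Windows-1252 byte."""
    ch = bytes([byte_value]).decode("cp1252")
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

for byte_value in (0x91, 0x92, 0x93, 0x94):
    print(f"hex {byte_value:02X} is {cp1252_name(byte_value)}")
```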

> > 4. When I copy the curly quote from Word into Notepad, the operating
> > system does a straight 1-1 copy of the binary sequence, correct?
>I believe the encoding of data on the clipboard is indicated by a
>mechanism similar to mimetype and its up to source and target
>applications to set the data and interpret it correctly.

Pass.  It depends on whether it is working with characters in the abstract or not.

> > 5. When I copy the curly quote from Word into Notepad, there is no
> > conversion or translation of the binary sequence by the operating
> > system, correct?
>It's up to the application.

Pass.  I thought the clipboard was Unicode based, so when you use the 
word "copy" if you are using the clipboard I would assume it would 
work.  I just copied curly quotes from Word to Notepad and when 
saving using UTF-8 I get the Unicode characters, and when saving to 
"ANSI" I get Windows 1252 characters.

So you can experiment likewise with the clipboard and get these results.

> > 6. Assuming the curly quote is a valid UTF-8 character, is the
> > Windows-1252 curly quote binary sequence the same as the UTF-8 curly
> > quote binary sequence?

Agree.  As shown above, the binary sequence 9x is a control character 
in Unicode and a displayable character in Windows-1252.
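
The difference can be demonstrated with the single byte hex 93. This is a sketch under the assumption that Python 3's "utf-8" and "cp1252" codecs behave like the encodings discussed here; is_valid_utf8 is a hypothetical helper:

```python
import unicodedata

raw = bytes([0x93])  # the lone Windows-1252 byte for the left double quote

def is_valid_utf8(data):
    """Report whether the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# On its own, 0x93 is a continuation byte with no lead byte before it.
print(is_valid_utf8(raw))

# Read as Windows-1252, the same byte is a displayable character.
print(unicodedata.name(raw.decode("cp1252")))

# The Unicode code point with the same number, U+0093, is a control.
print(unicodedata.category(chr(0x93)))
```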

> > 7. Is the Windows-1252 curly quote binary sequence illegal in UTF-8,
> > i.e. the Windows-1252 curly quote binary sequence doesn't correspond to
> > any UTF-8 character?

UTF-8 isn't designed for Windows-1252 ... I think you are conflating 
character sets with character encodings.

The test I just did appears to indicate that the abstract character at 
Windows-1252 position 146 is Unicode RIGHT SINGLE QUOTATION MARK, as 
that is what is saved as UTF-8, so it is being translated to the proper 
Unicode value.
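
The translation observed in that test can be mimicked outside Notepad. A sketch only, assuming Python 3; transcode is a hypothetical helper, and I am not claiming this is how Notepad is implemented:

```python
def transcode(data, source="cp1252", target="utf-8"):
    """Decode bytes to abstract characters, then re-encode them."""
    return data.decode(source).encode(target)

ansi = bytes([146])          # position 146 in Windows-1252
utf8 = transcode(ansi)       # comparable to saving as UTF-8
back = transcode(utf8, source="utf-8", target="cp1252")  # saving as "ANSI"

print(" ".join(f"{b:02X}" for b in utf8))
print(back == ansi)
```

The intermediate step is the point: the byte is first mapped to an abstract character, and only then written out in the target encoding.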

> > 8. Suppose I save the Word document as XML, and then I open the XML
> > using Notepad. The curly quotes no longer appear as curly quotes;
> > instead they appear as a bizarre character.  Why does the curly quote
> > now look like a bizarre character in Notepad, whereas when I copied the
> > curly quote from Word to Notepad it looked fine in Notepad?
>Notepad doesn't understand UTF-8 encoded files.

False ... I just opened Notepad and wrote out a file using UTF-8 and 
opened it up again and it was preserved.  An XML processor read the 
file and didn't complain about the encoding.  I'm running XP.
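
One plausible cause of a "bizarre character", which nothing in this thread confirms, is a program reading UTF-8 bytes as if they were Windows-1252. A sketch of that misreading, assuming Python 3 and its "cp1252" codec:

```python
# UTF-8 bytes for RIGHT SINGLE QUOTATION MARK (U+2019): E2 80 99
utf8_bytes = "\u2019".encode("utf-8")

# Misreading those three bytes as Windows-1252 yields three characters.
misread = utf8_bytes.decode("cp1252")
print(len(misread), [f"U+{ord(ch):04X}" for ch in misread])
```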

I don't know a lot about Windows applications' understanding of code 
page 1252, but I think you need to be a bit more precise when talking 
about characters in the abstract and their byte sequences in 
different encodings.  Some simple experimentation with different 
applications should answer your question, as I just did above with 
Word and Notepad.

I hope this helps.

. . . . . . . . . . . . . Ken

Upcoming public training: XSLT/XSL-FO Sep 10, UBL/code lists Oct 1
World-wide corporate, govt. & user group XML, XSL and UBL training
RSS feeds:     publicly-available developer resources and training
G. Ken Holman                 mailto:gkholman@CraneSoftwrights.com
Crane Softwrights Ltd.          http://www.CraneSoftwrights.com/x/
Box 266, Kars, Ontario CANADA K0A-2E0    +1(613)489-0999 (F:-0995)
Male Cancer Awareness Jul'07  http://www.CraneSoftwrights.com/x/bc
Legal business disclaimers:  http://www.CraneSoftwrights.com/legal

Copyright 1993-2007 XML.org. This site is hosted by OASIS