Re: [xml-dev] Copying text (curly quotes) from Word into an XML document (UTF-8): what happens?
- From: "G. Ken Holman" <gkholman@CraneSoftwrights.com>
- To: xml-dev@lists.xml.org
- Date: Sun, 02 Sep 2007 18:33:22 -0400
At 2007-09-02 22:46 +0100, Pete Kirkham wrote:
>Windows-1252 single byte character codes have curly quotes at 147 and
>148 decimal <http://en.wikipedia.org/wiki/Windows-1252>. These are
>10010011 and 10010100 binary.
>
>UTF-8 multibyte characters start with as many repeated 1s in their
>most significant bits as there are bytes in the sequence, then a zero,
>then data bits. For example, 147 would be 110/00010 10/010011 (slash
>splits control bits from data bits). UTF-8 single byte sequences
>always have 0 as most significant bit. So 10010011 cannot be a single
>byte UTF-8 character (msb is not zero) or the first byte of a
>multi-byte sequence (10 would indicate only one byte, which is not
>valid).
>
> > From: Costello, Roger L. [mailto:costello@mitre.org]
> > QUESTIONS
> >
> > 1. Is the curly quote a valid UTF-8 character?
>Yes, it has the byte sequence hex C2 93
1100 0010 1001 0011 in UTF-8 is Unicode U+0093 which is a control character:
Unicode character data base:
0093;<control>;Cc;0;BN;;;;;N;SET TRANSMIT STATE;;;;
The "right single quotation mark" is U+2019:
2019;RIGHT SINGLE QUOTATION MARK;Pf;0;ON;;;;;N;SINGLE COMMA QUOTATION MARK;;;;
which encodes in UTF-8 as a sequence of three bytes:
E2 80 99
1110 0010 1000 0000 1001 1001
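These byte sequences can be checked by letting a language runtime do the encoding. The following is a minimal Python sketch (Python is chosen here for illustration only; it is not something used in this thread) that relies only on the standard `unicodedata` module:

```python
import unicodedata

# The bytes C2 93 decode to U+0093, whose general category is Cc (control).
control = b"\xc2\x93".decode("utf-8")
print(f"U+{ord(control):04X}", unicodedata.category(control))

# U+2019 RIGHT SINGLE QUOTATION MARK encodes to three bytes in UTF-8.
quote = "\u2019"
encoded = quote.encode("utf-8")
print(unicodedata.name(quote), encoded.hex(" ").upper())

# The same three bytes shown in binary, one group of eight bits per byte.
print(" ".join(f"{b:08b}" for b in encoded))
```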
BTW, you are asking here in the singular "curly quote" yet above you
are asking "curly quotes" ... the entries in Unicode are:
2018;LEFT SINGLE QUOTATION MARK;Pi;0;ON;;;;;N;SINGLE TURNED COMMA QUOTATION MARK;;;;
2019;RIGHT SINGLE QUOTATION MARK;Pf;0;ON;;;;;N;SINGLE COMMA QUOTATION MARK;;;;
201C;LEFT DOUBLE QUOTATION MARK;Pi;0;ON;;;;;N;DOUBLE TURNED COMMA QUOTATION MARK;;;;
201D;RIGHT DOUBLE QUOTATION MARK;Pf;0;ON;;;;;N;DOUBLE COMMA QUOTATION MARK;;;;
> > 2. Word uses Windows-1252 encoding, correct?
>Pass
You get your choice ... when you save a text file you can specify
"Other encoding" and select Unicode, UTF-8, UTF-7, or many others.
> > 3. The curly quote in Windows-1252 has a specific binary sequence, correct?
>Yes, hex 93
From the Wikipedia citation above, I see the following (though I
don't see formal character names, so I'm guessing these are the Unicode names):
hex 91 is left single quotation mark
hex 92 is right single quotation mark
hex 93 is left double quotation mark
hex 94 is right double quotation mark
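The mapping from those four byte values to Unicode names can be confirmed without guessing. This is a small Python sketch, assuming Python's built-in `cp1252` codec is an acceptable stand-in for what Windows applications do; the dictionary name `cp1252_quotes` is only illustrative:

```python
import unicodedata

# Decode each Windows-1252 byte and look up the Unicode name of the result.
cp1252_quotes = {}
for byte in (0x91, 0x92, 0x93, 0x94):
    char = bytes([byte]).decode("cp1252")
    code_point = f"U+{ord(char):04X}"
    cp1252_quotes[byte] = (code_point, unicodedata.name(char))
    print(f"hex {byte:02X} -> {code_point} {unicodedata.name(char)}")
```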
> > 4. When I copy the curly quote from Word into Notepad, the operating
> > system does a straight 1-1 copy of the binary sequence, correct?
>I believe the encoding of data on the clipboard is indicated by a
>mechanism similar to mimetype and its up to source and target
>applications to set the data and interpret it correctly.
Pass. It depends on whether the copy works with characters in the
abstract or with their encoded bytes.
> > 5. When I copy the curly quote from Word into Notepad, there is no
> > conversion or translation of the binary sequence by the operating
> > system, correct?
>It's up to the application.
Pass. I thought the clipboard was Unicode-based, so if by "copy" you
mean going through the clipboard, I would assume it works. I just
copied curly quotes from Word to Notepad: when saving as UTF-8 I get
the Unicode characters, and when saving as "ANSI" I get the
Windows-1252 characters.
So you can experiment likewise with the clipboard and get these results.
> > 6. Assuming the curly quote is a valid UTF-8 character, is the
> > Windows-1252 curly quote binary sequence the same as the UTF-8 curly
> > quote binary sequence?
>No.
Agree. As shown above, the values hex 91 to 94 are control characters
in Unicode and displayable characters in Windows-1252.
> > 7. Is the Windows-1252 curly quote binary sequence illegal in UTF-8,
> > i.e. the Windows-1252 curly quote binary sequence doesn't correspond to
> > any UTF-8 character?
>Yes.
UTF-8 isn't designed for Windows-1252 ... I think you are conflating
character sets with character encodings.
The test I just did appears to indicate that the abstract character
at Windows-1252 position 146 is Unicode RIGHT SINGLE QUOTATION MARK,
since that is what gets saved as UTF-8. In other words, the character
is being translated to the proper Unicode value.
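The distinction between the raw byte and the abstract character can be sketched in Python. This is only an illustration under the assumption that Python's `cp1252` and `utf-8` codecs behave like the conversion Notepad performs; it is not a description of Notepad's internals:

```python
raw = b"\x92"  # the Windows-1252 byte at position 146

# Interpreting the byte directly as UTF-8 fails: 0x92 is a continuation
# byte with no lead byte in front of it.
try:
    raw.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

# Transcoding goes through the abstract character: decode with the
# source encoding, then encode with the target encoding.
char = raw.decode("cp1252")
utf8_bytes = char.encode("utf-8")
print(valid_utf8, f"U+{ord(char):04X}", utf8_bytes.hex(" ").upper())
```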
> > 8. Suppose I save the Word document as XML, and then I open the XML
> > using Notepad. The curly quotes no longer appear as curly quotes;
> > instead they appear as a bizarre character. Why does the curly quote
> > now look like a bizarre character in Notepad, whereas when I copied the
> > curly quote from Word to Notepad it looked fine in Notepad?
>Notepad doesn't understand UTF-8 encoded files.
False ... I just opened Notepad and wrote out a file using UTF-8 and
opened it up again and it was preserved. An XML processor read the
file and didn't complain about the encoding. I'm running XP.
I don't know a lot about how Windows applications handle code page
1252, but I think you need to be a bit more precise in distinguishing
characters in the abstract from their byte sequences in different
encodings. Some simple experimentation with different applications
should answer your question, as I just did above with Word and
Notepad.
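One possible cause of the "bizarre character" in question 8, offered here as an assumption and not something verified in this thread, is that UTF-8 bytes are being displayed as if they were Windows-1252. A minimal Python sketch of that mismatch:

```python
# Assumption: the file holds UTF-8 bytes but is read as Windows-1252.
utf8_bytes = "\u201c".encode("utf-8")  # LEFT DOUBLE QUOTATION MARK, E2 80 9C

# Each of the three bytes is shown as its own Windows-1252 character.
misread = utf8_bytes.decode("cp1252")
print(len(misread), [f"U+{ord(c):04X}" for c in misread])

# Reading the same bytes with the encoding they were written in
# recovers the original single character.
correct = utf8_bytes.decode("utf-8")
print(len(correct), f"U+{ord(correct):04X}")
```

Comparing the two results is the kind of experiment suggested above: the bytes are identical in both cases, and only the encoding used to interpret them differs.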
I hope this helps.
. . . . . . . . . . . . . Ken
--
Upcoming public training: XSLT/XSL-FO Sep 10, UBL/code lists Oct 1
World-wide corporate, govt. & user group XML, XSL and UBL training
RSS feeds: publicly-available developer resources and training
G. Ken Holman mailto:gkholman@CraneSoftwrights.com
Crane Softwrights Ltd. http://www.CraneSoftwrights.com/x/
Box 266, Kars, Ontario CANADA K0A-2E0 +1(613)489-0999 (F:-0995)
Male Cancer Awareness Jul'07 http://www.CraneSoftwrights.com/x/bc
Legal business disclaimers: http://www.CraneSoftwrights.com/legal