RE: [xml-dev] "Interoperability is getting better" ... What does that mean?

Chris Maloney wrote:

> It used to be very common to see ... smart quotes 
> being rendered as, for example, â€œGood morning, Daveâ€

That is a fabulous example Chris!

> I don't have any data to back it up

Does anyone have data to back up the assertion that interoperability is getting better (with respect to encoding and decoding of characters)?

Interoperability is of great interest to me.

Below is a summary of our discussion. Please let me know of any mistakes.

-------------------------------------------------------------------------------
Interoperability of XML (i.e., Character Encoding Interoperability)
-------------------------------------------------------------------------------
Remember when, not long ago, you would visit a web page and see strange characters like this:

    <Hal>â€œGood morning, Daveâ€</Hal>

You don't see that much anymore. 

Why?

The answer is this:

    Interoperability is getting better.

In the context of character encoding and decoding, what does that mean?

Interoperability means that you and I interpret (decode) the bytes in the same way.

Example: I create an XML file, encode all the characters in it using UTF-8, and send the XML file to you. 

Here is a graphical depiction (i.e., glyphs) of the bytes that I send to you:

    <Name>López</Name>

You receive my XML document and interpret the bytes as iso-8859-1. 

In UTF-8 the ó symbol is a graphical depiction of the "LATIN SMALL LETTER O WITH ACUTE" character and it is encoded using these two bytes: C3 B3

But in iso-8859-1, the two bytes C3 B3 are the encodings of two separate characters:

     C3 is the encoding of the Ã character (LATIN CAPITAL LETTER A WITH TILDE)
     B3 is the encoding of the ³ character (SUPERSCRIPT THREE)

Thus you interpret the XML as:

    <Name>LÃ³pez</Name>

We are interpreting the same XML document (i.e., the same set of bytes) differently.

Interoperability has failed.
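
The López example can be reproduced in a few lines of Python. This is only a sketch: the variable names are mine, and it assumes the receiver decodes the entire document as iso-8859-1.

```python
# The sender encodes the document as UTF-8.
sent_bytes = "<Name>López</Name>".encode("utf-8")

# ó (U+00F3, LATIN SMALL LETTER O WITH ACUTE) becomes two bytes in UTF-8.
o_acute_hex = " ".join(f"{b:02X}" for b in "ó".encode("utf-8"))
print(o_acute_hex)  # C3 B3

# The sender and the receiver decode the same bytes with different encodings.
as_sender = sent_bytes.decode("utf-8")
as_receiver = sent_bytes.decode("iso-8859-1")

print(as_sender)    # <Name>López</Name>
print(as_receiver)  # <Name>LÃ³pez</Name>
print(as_sender == as_receiver)  # False
```

The bytes never change. Only the table used to interpret them does.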

So when we say: 

    Interoperability is getting better.

we mean that the number of cases in which senders and receivers interpret the same bytes differently is decreasing.

Let's revisit our first example. You go to a web site and see this:

     <Hal>â€œGood morning, Daveâ€</Hal>

Here's how that happened:

I use Microsoft Word to write this XML document, and I save the web page as UTF-8:

    <Hal>“Good morning, Dave”</Hal>

Notice that I wrapped the greeting in Microsoft smart quotes. 

You visit my web page.

Suppose your browser is set to interpret all web pages as Windows-1252.

In UTF-8 the left smart quote (U+201C) is encoded as three bytes, hex: E2 80 9C

In UTF-8 the right smart quote (U+201D) is encoded as three bytes, hex: E2 80 9D

(In Windows-1252 itself the smart quotes are single bytes: hex 93 for the left one and hex 94 for the right one.)

Windows-1252 is a single-byte encoding, so your browser treats each of those UTF-8 bytes as a separate character.

So your browser displays the left smart quote as hex E2 (â) followed by hex 80 (€) followed by hex 9C (œ).

And your browser displays the right smart quote as hex E2 (â) followed by hex 80 (€). Hex 9D has no character assigned in Windows-1252, so typically nothing visible is shown for it.

The result is that you see this on your browser screen:

    <Hal>â€œGood morning, Daveâ€</Hal>
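
Here is one way the garbled greeting can arise, sketched in Python. It assumes the page bytes are UTF-8 and the reader decodes them as Windows-1252 (Python's "cp1252" codec). Python's codec treats byte 9D as undefined, so the sketch passes errors="ignore" to drop it, which roughly imitates a browser showing nothing for that byte. The variable names are mine.

```python
# The document as I typed it, with smart quotes U+201C and U+201D.
original = "<Hal>\u201cGood morning, Dave\u201d</Hal>"

# Saved as UTF-8, each smart quote becomes three bytes.
page_bytes = original.encode("utf-8")
left_hex = " ".join(f"{b:02X}" for b in "\u201c".encode("utf-8"))
right_hex = " ".join(f"{b:02X}" for b in "\u201d".encode("utf-8"))
print(left_hex)   # E2 80 9C
print(right_hex)  # E2 80 9D

# Decoded as Windows-1252, each byte is read as a separate character.
# Byte 9D is undefined in Python's cp1252 table, so it is dropped here.
garbled = page_bytes.decode("cp1252", errors="ignore")
print(garbled)    # <Hal>â€œGood morning, Daveâ€</Hal>

# For comparison, the single-byte Windows-1252 codes for the same quotes.
left_1252 = "\u201c".encode("cp1252")   # b'\x93'
right_1252 = "\u201d".encode("cp1252")  # b'\x94'
```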

/Roger


Copyright 1993-2007 XML.org. This site is hosted by OASIS