Re: [xml-dev] Quiz: How do you put a Euro sign in your data if your XML uses windows-1252 encoding and you use a numeric character reference?


On 2013 Mar 1, at 11:36, Michael Kay wrote:

>> I hinted at this months ago on this list that I believe the level of misunderstanding of encoding and Unicode concepts is both high and not self recognized.  Which is a deadly combination.
>> Is there more "the community" can do to make it clearer?
> If there is, please let me know.
> I've been advising people how to solve character encoding issues for about 100 years, but our own internal system for handling Saxon license requests still gets it wrong. It ain't easy.

For what it's worth, 1: Joel Spolsky's article on "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)" <http://www.joelonsoftware.com/articles/Unicode.html> is quite good, I think.  It's (surely) been mentioned here before, but it might be worth mentioning again in this thread.

For what it's worth, 2: on the couple of occasions when I've had to explain 'unicode' to a colleague, it's been the notion of the abstract Unicode codepoints that's turned out to be the key to the illumination.  The structure of my successful explanations has been something like this:

  * The Unicode consortium has (with much agonising and negotiation) managed to give a number to a large fraction of the characters in use.  These numbers are (jargon) called 'codepoints'.

  * 'The Letter A' has a codepoint, and this is independent of fonts.  Thus 'A' and 'a' have different codepoints, but roman, bold, italic, serif, sans serif (et very much cetera) are not distinguished.  Japanese kanji have codepoints, and even Tengwar and Klingon characters have been given (unofficial, private-use) ones (this gets attention).

  * A 'unicode string' is (conceptually) a sequence of codepoints.  This is a sequence of mathematical integers.  It does not make sense to ask whether these are bytes, 2-byte or 4-byte words; the sequence has nothing to do with computers.

  * If you want to send that sequence of integers to someone, or save it on a computer disk, you have to encode it somehow.  You could write the numbers down on a piece of paper, but let's specialise to computers at this point: to store or send the sequence on a computer, you have to transform the integers into a sequence of bytes.  There are multiple procedures for doing that, and each of these procedures is called an 'encoding'.  One of these 'encodings' is UTF-8.

  * When you 'read a Unicode file', you are starting with a sequence of bytes, on disk, and conceptually ending up with a sequence of integers.  If the 'unicode file' is indicated, somehow, to be encoded in UTF-8, then you have to decode that sequence of bytes to get the sequence of integers.  All of the subsequent operations on the 'unicode string' are defined in terms of the sequence of codepoints, and the fact that it started off, on disk, as 'UTF-8' is forgotten.
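
The steps above can be sketched in a few lines of Python.  This is only an illustration, not part of the original explanation: the sample string and the variable names are my own assumptions, and Python's str type happens to behave like the 'sequence of codepoints' described here.

```python
# A 'unicode string' is conceptually a sequence of codepoints (integers).
text = "A\u20ac"                       # 'A' followed by the Euro sign (sample data)
codepoints = [ord(ch) for ch in text]  # the abstract integers: 65 and 8364

# Saving or sending the string means choosing an encoding, here UTF-8.
encoded = text.encode("utf-8")         # a sequence of bytes

# Reading it back means decoding those bytes into codepoints again.
decoded = encoded.decode("utf-8")
roundtrip = [ord(ch) for ch in decoded]

print(codepoints)     # [65, 8364]
print(list(encoded))  # [65, 226, 130, 172]
print(roundtrip)      # [65, 8364]
```

Once decoded, nothing about the string records that it was ever UTF-8 on disk; only the integers remain.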

The key point seems to me to be making it clear that 'UTF-8' is no more than a detail -- a necessary complication occasioned by the need to save the 'unicode string' to a disk.
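
To make that concrete for the Euro sign in this thread's subject line, here is a minimal Python sketch (the variable names are mine, and "cp1252" is Python's codec name for windows-1252).  It shows one codepoint turning into different bytes under different encodings, while the numeric character reference is built from the codepoint, not from any of those bytes.

```python
euro = "\u20ac"                      # the Euro sign as a one-character string
codepoint = ord(euro)                # the abstract Unicode codepoint

# The same codepoint becomes different bytes under different encodings.
as_utf8 = euro.encode("utf-8")       # three bytes
as_cp1252 = euro.encode("cp1252")    # one byte, 0x80
as_utf16 = euro.encode("utf-16-be")  # two bytes

# An XML numeric character reference names the codepoint itself,
# whatever encoding the document is stored in.
char_ref = f"&#{codepoint};"

print(codepoint)        # 8364
print(as_utf8.hex())    # e282ac
print(as_cp1252.hex())  # 80
print(as_utf16.hex())   # 20ac
print(char_ref)         # &#8364;
```

So in a windows-1252 document the reference is still &#8364; (or &#x20AC;), not &#128;, because the reference is defined in terms of the codepoint and not the byte that windows-1252 happens to use.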

Depending on audience, it takes more or fewer words than that.  But not much more, and I think that Spolsky's explanation is still longer than it has to be.  In any case, that ordering of points works for me.

Best wishes,


Norman Gray  :  http://nxg.me.uk
SUPA School of Physics and Astronomy, University of Glasgow, UK
