xml-dev - Re: [xml-dev] Question about UTF-8


To: Gustaf Liljegren <gustaf.liljegren@xml.se>
Subject: Re: [xml-dev] Question about UTF-8
From: Rick Jelliffe <ricko@allette.com.au>
Date: Fri, 29 Aug 2003 16:39:18 +1000
Cc: xml-dev@lists.xml.org
In-reply-to: <3.0.6.32.20030828190252.018476c0@m1.858.telia.com>
References: <3.0.6.32.20030828161218.00e17738@m1.858.telia.com> <3.0.6.32.20030828161218.00e17738@m1.858.telia.com> <3.0.6.32.20030828190252.018476c0@m1.858.telia.com>
User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.3.1) Gecko/20030428

Gustaf Liljegren wrote:

>In an XML-aware editor, yes. But the question is about general
>('non-XML-aware') text editors. A general editor has no idea of the
>encoding detection mechanism in XML, so I wonder how it knows that the
>octets C3 A4 should be written 'ä' and not 'Ã¤' (or something else).
>
Operating systems (or, if you are lucky, particular user sessions) have
a setting called "locale". Among other things, this sets the default
character encoding used for processing text.

For example, in Java when you open a stream and don't specify an
encoding, Java uses the locale's default encoding. On Western PCs
(English-speaking countries and their neighbours) this encoding will be
CP1252, a superset of ISO 8859-1.
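
To make the C3 A4 example from the question concrete, here is a minimal
sketch in Python (chosen only for brevity; the variable names are
illustrative). It decodes the same two octets under two encodings and
asks the platform for its locale default, which will differ from machine
to machine.

```python
import locale

octets = bytes([0xC3, 0xA4])  # the two octets from the question

# Decoded as UTF-8, the pair is a single character: U+00E4 (a-umlaut).
as_utf8 = octets.decode("utf-8")

# Decoded as CP1252, each octet becomes its own character: U+00C3, U+00A4.
as_cp1252 = octets.decode("cp1252")

# What a tool may fall back on when no encoding is given.
# The value depends on the operating system and the user session.
default_encoding = locale.getpreferredencoding(False)

print(ascii(as_utf8), ascii(as_cp1252), default_encoding)
```

A general text editor that knows nothing about XML is in the position of
the last call: it takes whatever the locale says unless told otherwise.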

However, on older Macs, it may be MacRoman, which is different. On newer
Macs and Linux it may be ISO 8859-15, which is slightly different again.
Many modern text editors understand the Byte Order Mark that UTF-16
allows.
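
As a rough illustration of how these defaults disagree, the Python
sketch below decodes one octet (0xA4, picked here as an example) under
three of the encodings mentioned, then checks for a UTF-16 Byte Order
Mark. The codec names are Python's own; nothing here is specific to any
particular editor.

```python
import codecs

octet = bytes([0xA4])

# One octet, three different characters, depending on the assumed encoding:
# currency sign in CP1252, euro sign in ISO 8859-15, section sign in MacRoman.
for name in ("cp1252", "iso8859_15", "mac_roman"):
    print(name, ascii(octet.decode(name)))

# UTF-16 text usually starts with a Byte Order Mark, which gives an editor
# something to sniff before it falls back on the locale default.
utf16_bytes = "\u00e4".encode("utf-16")  # Python's "utf-16" codec writes a BOM
has_bom = utf16_bytes.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))
print("BOM present:", has_bom)
```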

>Many users who see 'Ã¤' when they open a UTF-8 encoded XML document in a
>text editor, prefer to use ISO 8859-1 to avoid this effect.
>
You are right that if you use an encoding that the text editor does not
understand, the results will not be satisfactory. Worse than nasty
glyphs, you may find that your data is actually corrupted. Or you can
find that some parts of an entity are in one encoding and some other
sections are in another. Unfortunately, people have this idea that all
"text editors" will be able to edit all "text": but there is no such
beast as "text"--it is always "text in a particular encoding".
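
One way the corruption can happen, sketched in Python under the
assumption that a tool reads UTF-8 data as CP1252 and later saves it as
UTF-8 (the scenario and variable names are illustrative, not taken from
any specific editor):

```python
original = "\u00e4"                  # a-umlaut
on_disk = original.encode("utf-8")   # the octets C3 A4

# The tool assumes CP1252 and sees two characters instead of one.
misread = on_disk.decode("cp1252")

# On save, the tool writes UTF-8: the damage is now in the bytes themselves.
resaved = misread.encode("utf-8")    # four octets: C3 83 C2 A4

# Reading the file back as UTF-8 no longer yields the original text.
corrupted = resaved.decode("utf-8") != original

# Repair is possible only if you know exactly which wrong turn was taken.
repaired = resaved.decode("utf-8").encode("cp1252").decode("utf-8")

print(resaved.hex(), corrupted, ascii(repaired))
```

The repair step works here only because the wrong assumption is known;
with an unknown or mixed history there is nothing reliable to undo.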

XML allows you to alter the encoding to suit your tools.  Encoding isn't
important, within reason. If one set of tools works best with a
particular encoding, transcode your data to use that encoding. And if
you are really worried, use numeric character references such as &#228;
(or entity references such as &auml;, if your DTD declares them) to
prevent stuff-ups.  You should be free to change encodings* because XML
forces you to label which encoding has been used (unless it is UTF-8 or
UTF-16); that way there can be no ambiguity--which is not to say that
there will be no confusion as you figure out which is the appropriate
encoding for your particular toolset.
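
A minimal transcoding sketch, assuming Python's standard
xml.etree.ElementTree module and a made-up two-character document. It
also shows what happens to a character that the target encoding lacks
(the case the footnote warns about): the serialiser falls back to a
numeric character reference.

```python
import xml.etree.ElementTree as ET

# Hypothetical input: a UTF-8 document containing U+00E4 (a-umlaut)
# and U+20AC (euro sign).
source = '<?xml version="1.0" encoding="UTF-8"?><p>\u00e4 \u20ac</p>'.encode("utf-8")

root = ET.fromstring(source)  # the parser honours the declared encoding

# Re-serialise as ISO 8859-1. ElementTree writes a matching XML declaration
# and emits numeric character references for characters that the target
# encoding cannot represent (here, the euro sign).
latin1 = ET.tostring(root, encoding="iso-8859-1")

print(latin1)
```

Because the output carries its own label, a conforming parser reads it
back to the same characters, whatever the locale of the receiving end.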

>Maybe the answer is to stay in ISO 8859-1 (or whatever default encoding the
>editor has), but I was hoping it was possible to recommend using UTF-8 all
>the time (for European scripts).
>  
>
Modern editors allow the user to select the encoding used.  Some
editors, <plug>such as Topologi's</plug>, have XML encoding detection
built in, but overridable. Perhaps your people should consider moving
away from non-Unicode based text editors.

When XML was being developed, many people just wanted to use
UTF-8/UTF-16 and to ignore "legacy encodings" and "legacy systems".  I
had expected that by 2002 Unicode would be so entrenched that other
encodings (in the West) would be relatively unimportant; however, it
seems that legacy applications are still very much alive and kicking,
especially in the Linux world and also in the PC world.

You might think "wouldn't it be simpler if unlabelled XML just used my
system's default encoding?" Well, how would that work unless there is
someone at the receiving end to check that the encoding you used is the
same as theirs? Ordinary users don't have the ability to check
encodings, especially with any kind of large document, and often the
receiving end may be a computer.  It is much simpler to state which
encoding was used than to rely on some guessing system, especially given
that the encoding is not always guessable, particularly when performance
matters.
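
To show how cheap reading a label is compared with guessing, here is a
simplified Python sketch. The function name sniff_xml_encoding is
hypothetical, and the code covers only Byte Order Marks and
ASCII-compatible declarations; it ignores UTF-32 and EBCDIC and is not a
full implementation of the autodetection procedure in the XML
Recommendation.

```python
import codecs
import re

# Byte Order Marks for the Unicode encodings covered by this sketch.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

# Matches the encoding pseudo-attribute of an XML declaration at the start.
_DECL = re.compile(
    rb"""^<\?xml[^>]*?encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)

def sniff_xml_encoding(data: bytes) -> str:
    """Return the encoding named by a BOM or by the XML declaration.

    Falls back to UTF-8, the default for an unlabelled entity.
    """
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    match = _DECL.match(data[:200])  # only the first few octets are needed
    if match:
        return match.group(1).decode("ascii").lower()
    return "utf-8"

print(sniff_xml_encoding(b'<?xml version="1.0" encoding="ISO-8859-1"?><a/>'))
print(sniff_xml_encoding(b"<a/>"))
```

The label is found by inspecting a couple of hundred octets at most; a
statistical guess would have to scan far more of the document and could
still be wrong.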

Cheers
Rick Jelliffe

http://www.topologi.com

* providing the characters you have used are in both character sets

