Gustaf Liljegren wrote:
>In an XML-aware editor, yes. But the question is about general
>('non-XML-aware') text editors. A general editor has no idea of the
>encoding detection mechanism in XML, so I wonder how it knows that the
>octets C3 A4 should be written 'ä' and not 'Ã¤' (or something else).
Operating systems (or, if you are lucky, particular user sessions) have
a setting called "locale". Among other things, this sets the default
character encoding used for reading and writing text. For example, in
Java when you open a stream and don't specify an encoding, Java uses
the locale's default encoding. On Western PCs (English-speaking
countries and their neighbours) this encoding will be CP1252, a
superset of ISO 8859-1. However, on older Macs, it may be MacRoman,
which is different. On newer Macs and Linux it may be ISO 8859-15,
which is slightly different again. Many modern text editors understand
the Byte Order Mark that UTF-16 allows.
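To see this locale dependence for yourself, here is a small sketch in
Java (the class name is mine, not from any particular toolkit): the
default charset Java falls back on when you give no encoding is exactly
the locale's default, so the same program reports different encodings
on different machines.

```java
import java.nio.charset.Charset;

public class DefaultEncodingDemo {
    public static void main(String[] args) {
        // This is the encoding an unlabelled stream would be read or
        // written with: typically windows-1252 on a Western Windows
        // PC, UTF-8 on a modern Linux box.
        System.out.println(Charset.defaultCharset().name());
    }
}
```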
>Many users who see 'Ã¤' when they open a UTF-8 encoded XML document in a
>text editor, prefer to use ISO 8859-1 to avoid this effect.
You are right that if you use an encoding that the text editor does not
support, the results will not be satisfactory. Worse than nasty glyphs,
you may find that your data is actually corrupted. Or you can find that
some parts of an entity are in one encoding and some other sections are
in another. Unfortunately, people have this idea that all "text
editors" will be able to edit all "text": but there is no such beast as
"text"--it is always "text in a particular encoding".
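The 'Ã¤' effect itself is easy to reproduce: 'ä' encoded as UTF-8 is
the two octets C3 A4, and an editor that assumes ISO 8859-1 shows one
character per octet. A minimal sketch (class name hypothetical):

```java
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    public static void main(String[] args) {
        // 'ä' (U+00E4) as UTF-8 is the two octets C3 A4.
        byte[] utf8 = "\u00E4".getBytes(StandardCharsets.UTF_8);
        // An ISO 8859-1 editor shows one character per octet:
        // C3 is 'Ã' and A4 is '¤', hence the familiar "Ã¤".
        String misread = new String(utf8, StandardCharsets.ISO_8859_1);
        System.out.println(misread);  // prints "Ã¤"
    }
}
```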
XML allows you to alter the encoding to suit your tools. Encoding isn't
important, within reason. If one set of tools works best with a
particular encoding, transcode your data to use that encoding. And if
you are really worried, use character references such as &#xE4; (for
'ä') to prevent stuff-ups. You should be free to change encodings*
because XML forces you to label which encoding has been used; that way
there can be no ambiguity--which is not to say that there will be no
confusion as you figure out which encoding is best for your particular
toolset.
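Both tricks above are a few lines in Java (a sketch, with hypothetical
names; remember to update the XML declaration's encoding pseudo-
attribute to match whatever you transcode to):

```java
import java.nio.charset.StandardCharsets;

public class TranscodeDemo {
    public static void main(String[] args) {
        // Bytes as they might arrive labelled ISO 8859-1 ('ä' is E4).
        byte[] latin1 = "<p>\u00E4</p>".getBytes(StandardCharsets.ISO_8859_1);

        // Transcoding: decode with the old charset, re-encode with the new.
        String text = new String(latin1, StandardCharsets.ISO_8859_1);
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8); // 'ä' becomes C3 A4

        // Or escape as a numeric character reference, safe in any encoding:
        String escaped = text.replace("\u00E4", "&#xE4;");
        System.out.println(escaped);  // prints "<p>&#xE4;</p>"
        System.out.println(utf8.length); // one byte longer than the Latin-1 form
    }
}
```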
>Maybe the answer is to stay in ISO 8859-1 (or whatever default encoding the
>editor has), but I was hoping it was possible to recommend using UTF-8 all
>the time (for European scripts).
Modern editors allow the user to select the encoding used. Some
editors, <plug>such as Topologi's</plug>, have XML encoding detection
built in, but overridable. Perhaps your people should consider moving
away from non-Unicode-based editors.
When XML was being developed, many people just wanted to use
UTF-8/UTF-16 and to ignore "legacy encodings" and "legacy systems". I
had expected that by 2002 Unicode would be so entrenched that other
encodings would be relatively unimportant; however it seems that
(especially in the PC world) the legacy applications are still very
much alive and kicking.
You might think "wouldn't it be simpler if unlabelled XML just used my
default encoding?" Well, how would that work unless there is someone at
the receiving end to check that the encoding you used is the same as
theirs? Most users don't have the ability to check encodings,
especially with any kind of large document, and often the receiving end
may be a computer. It is much simpler to state what the encoding used
is rather than to have some guessing system...especially given that
encoding is not always guessable, and guessing over a whole document is
costly for performance reasons.
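What a parser can do cheaply, by contrast, is look at the first few
octets for a Byte Order Mark before trusting (or reading) the encoding
declaration. A rough sketch of that kind of sniffing (method and class
names are mine, not any real API):

```java
public class BomSniff {
    // Inspect the first octets of a document for a BOM.
    static String sniff(byte[] b) {
        if (b.length >= 3 && (b[0] & 0xFF) == 0xEF
                && (b[1] & 0xFF) == 0xBB && (b[2] & 0xFF) == 0xBF)
            return "UTF-8 (with BOM)";
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF)
            return "UTF-16BE";
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE)
            return "UTF-16LE";
        // No BOM: fall back to the encoding declaration in the XML text.
        return "unknown; read the encoding declaration";
    }

    public static void main(String[] args) {
        byte[] doc = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<'};
        System.out.println(sniff(doc));  // prints "UTF-8 (with BOM)"
    }
}
```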
* providing the characters you have used are in both character sets