xml-dev - Re: [xml-dev] Question about UTF-8

To: Gustaf Liljegren <gustaf.liljegren@xml.se>
Subject: Re: [xml-dev] Question about UTF-8
From: Tim Bray <tbray@textuality.com>
Date: Thu, 28 Aug 2003 10:01:32 -0700
Cc: xml-dev@lists.xml.org
In-reply-to: <3.0.6.32.20030828190252.018476c0@m1.858.telia.com>
References: <3.0.6.32.20030828161218.00e17738@m1.858.telia.com> <3.0.6.32.20030828161218.00e17738@m1.858.telia.com> <3.0.6.32.20030828190252.018476c0@m1.858.telia.com>
User-agent: Mozilla/5.0 (Macintosh; U; PPC Mac OS X Mach-O; en-US; rv:1.4) Gecko/20030624

Gustaf Liljegren wrote:

> But the question is about general
> ('non-XML-aware') text editors. A general editor has no idea of the
> encoding detection mechanism in XML, so I wonder how it knows that the
> octets C3 A4 should be written 'ä' and not 'Ã¤' (or something else).

It really has no way of knowing, in theory or in practice.  This is a 
big hairy problem.  If you're living in a heterogeneous environment 
where there are multiple encodings, this is a good reason to insist on XML.
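
To make the ambiguity concrete, here is a minimal Python sketch (not part of the original exchange) that decodes the same two octets under both encodings. The variable names are only illustrative.

```python
# The same two octets, interpreted under two different encodings.
octets = bytes([0xC3, 0xA4])

as_utf8 = octets.decode("utf-8")         # one character: U+00E4 (a with diaeresis)
as_latin1 = octets.decode("iso-8859-1")  # two characters: U+00C3 and U+00A4

# Print code points rather than glyphs so the output does not itself
# depend on the terminal's encoding.
print([hex(ord(c)) for c in as_utf8])
print([hex(ord(c)) for c in as_latin1])
```

Both decodes succeed without error, which is why an editor that only sees the bytes cannot tell which reading was intended.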

> Many users who see 'Ã¤' when they open a UTF-8 encoded XML document in a
> text editor, prefer to use ISO 8859-1 to avoid this effect.

That only works until you need to use a character that isn't in 8859-1, 
such as those used by about two thirds of the world's population.
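
A rough way to see where 8859-1 runs out is to try encoding a few sample characters. This is only a sketch; `fits_in` is a hypothetical helper name and the sample strings are arbitrary.

```python
def fits_in(text: str, encoding: str) -> bool:
    """Return True if every character of text can be encoded in encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

# a with diaeresis, euro sign, Greek alpha, two kanji
samples = ["\u00e4", "\u20ac", "\u03b1", "\u65e5\u672c"]

for s in samples:
    # ascii() keeps the output independent of the terminal's encoding.
    print(ascii(s), fits_in(s, "iso-8859-1"), fits_in(s, "utf-8"))
```

Only the first sample fits in ISO 8859-1, while all four fit in UTF-8.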

> Maybe the answer is to stay in ISO 8859-1 (or whatever default encoding the
> editor has), but I was hoping it was possible to recommend using UTF-8 all
> the time (for European scripts).

The notion that you can count on never seeing non-European characters is 
a recipe for disaster in today's world.  Good solutions are: (a) as you 
suggest, use UTF-8 all the time, or (b) use XML for interchange.
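
As an illustration of option (b), the sketch below feeds two differently encoded byte streams to Python's standard `xml.etree.ElementTree` parser. Each document declares its own encoding, so the parser recovers the same character from both. The element name `w` is made up for the example.

```python
import xml.etree.ElementTree as ET

# The same one-character document, serialized in two encodings.
# Each byte string carries its own encoding declaration.
latin1_doc = b'<?xml version="1.0" encoding="ISO-8859-1"?><w>\xe4</w>'
utf8_doc = b'<?xml version="1.0" encoding="UTF-8"?><w>\xc3\xa4</w>'

latin1_text = ET.fromstring(latin1_doc).text
utf8_text = ET.fromstring(utf8_doc).text

# Both parse to the single character U+00E4.
print(latin1_text == utf8_text == "\u00e4")
```

The receiver never has to guess, because the declaration travels with the data.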

-- 
Cheers, Tim Bray
         (ongoing fragmented essay: http://www.tbray.org/ongoing/)
