Re: [xml-dev] Re: UTF's considered best practice [was: Re:[xml-dev] nextml]

From: rjelliffe <rjelliffe@allette.com.au>
To: Uche Ogbuji <uche@ogbuji.net>
Date: Fri, 10 Dec 2010 15:37:27 +1100

 On Thu, 9 Dec 2010 08:28:22 -0700, Uche Ogbuji <uche@ogbuji.net> wrote:
> On Wed, Dec 8, 2010 at 11:31 PM, Jim DeLaHunt  wrote:
>
>  Sometimes UTF-16 is a more compact representation, sometimes UTF-8
> is. It depends on the frequency distribution of characters in the
> document. But they have equivalent descriptive power; either can
> represent any sequence of Unicode characters.  If nextml adopts
> UTF-16, be aware that it can be serialised to bytes in either
> little-endian or big-endian order (UTF-16LE or UTF-16BE), so nextml
> should account for those possibilities. It should also allow for the
> special Byte-Order Mark character (BOM), which is used to distinguish
> the two.
>
> Thanks for all the great links and references.  That backs up my
> suspicion that supporting a diversity of encodings is a matter of less
> urgency than it was when XML 1.0 was born.
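
 The compactness and byte-order points quoted above can be checked with a
 short Python sketch. The two sample strings and the helper name
 encoded_sizes are illustrative assumptions, not anything from the thread;
 real documents will mix scripts and markup in other proportions.

```python
# Illustrative samples only: one ASCII-heavy string, one CJK-heavy string.
SAMPLES = {
    "ascii-heavy": "<doc>hello world</doc>",
    "cjk-heavy": "日本語のテキスト",
}


def encoded_sizes(text):
    """Return the byte count of text in UTF-8 and in UTF-16 (without a BOM)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
    }


def utf16_serialisations(text):
    """Return the little-endian and big-endian byte sequences for text."""
    return text.encode("utf-16-le"), text.encode("utf-16-be")


if __name__ == "__main__":
    for name, text in SAMPLES.items():
        print(name, encoded_sizes(text))
    # The same character yields different bytes in each byte order,
    # which is what the BOM (U+FEFF) is used to disambiguate.
    print(utf16_serialisations("A"))
    print(utf16_serialisations("\ufeff"))
```

 ASCII markup takes one byte per character in UTF-8 and two in UTF-16,
 while most CJK characters take three bytes in UTF-8 and two in UTF-16,
 so which encoding is smaller depends on the mix.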

 I would be careful about taking what Unicode conference speakers say as
 necessarily being authoritative rather than aspirational! But I probably
 do agree with them. :-)

 It would be great if the XML interlude has swept all the old encodings
 away and ushered in a Unicode-only world, but it is a gamble (a gamble
 worth taking, I think). Does it cost that much? The issue remains that
 the Windows and Java APIs have default encodings based on locale and
 language decisions.*
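
 The same hazard can be sketched in Python as an analogy (the footnoted
 link is about .NET, not Python). The helper name roundtrip and the choice
 of cp1252 as the "legacy" encoding are assumptions for illustration: the
 point is only that bytes written under one default and read under
 another do not come back as the same text.

```python
import locale


def roundtrip(text, write_encoding, read_encoding):
    """Encode text with one encoding and decode the bytes with another.

    Undecodable bytes are replaced with U+FFFD rather than raising.
    """
    data = text.encode(write_encoding)
    return data.decode(read_encoding, errors="replace")


if __name__ == "__main__":
    # The platform default varies by machine and locale settings.
    print("preferred encoding here:", locale.getpreferredencoding(False))
    # Written as UTF-8, read as Windows-1252: mojibake.
    print(roundtrip("café", "utf-8", "cp1252"))
    # Written as Windows-1252, read as UTF-8: a replacement character.
    print(roundtrip("café", "cp1252", "utf-8"))
```

 Naming the encoding explicitly on every read and write, rather than
 relying on the platform default, avoids the mismatch.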

 In 1997, a good argument against only allowing UTF-* was that people
 needed an on-ramp to Unicode. So supporting other encodings was a way of
 neutralizing the problem: paradoxically, supporting other encodings was
 a way of promoting Unicode. (Perl was a big win here: I remember reading
 that it finally moved to Unicode because XML pushed things past the
 tipping point.)

 In 2010, perhaps that argument is no longer needed: the proactive thing
 might be to provide off-ramps from the legacy encodings. Allowing new
 formats only in UTF-8 might be the better idea.

> As for the BOM, yes, that should be key in any XML successor, as it is
> in XML 1.0 itself.  In XML 1.0, you can tell the encoding even if
> it's not in the XML declaration because if not, it must either be
> UTF-8 (if there is no BOM), or UTF-8, UTF-16LE, UTF-16BE, etc.
> depending on BOM.
>
> If it's OK to say UTF only (and we banish the standalone declaration),
> then there is no need for an explicit encoding declaration beside
> optional BOM.

 Magic numbers are still useful.
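
 As a minimal sketch of treating the BOM as a magic number, assuming a
 UTF-only format as described in the quoted text: the function names
 sniff_encoding and decode_document are hypothetical, and this
 deliberately covers only UTF-8 and UTF-16. It is simpler than the
 autodetection outlined in Appendix F of XML 1.0, which also inspects
 the bytes of the XML declaration itself.

```python
import codecs

# Known BOM byte sequences, checked in order. UTF-32 is left out on
# purpose: its little-endian BOM starts with the same two bytes as the
# UTF-16LE BOM, so supporting it would need a longer-match-first check.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_encoding(data):
    """Return (encoding_name, bom_length) for a byte string.

    With no recognised BOM the document is assumed to be UTF-8.
    """
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return "utf-8", 0


def decode_document(data):
    """Strip any BOM and decode the rest with the sniffed encoding."""
    encoding, skip = sniff_encoding(data)
    return data[skip:].decode(encoding)


if __name__ == "__main__":
    doc = "<greeting>héllo</greeting>"
    for payload in (
        doc.encode("utf-8"),
        codecs.BOM_UTF8 + doc.encode("utf-8"),
        codecs.BOM_UTF16_LE + doc.encode("utf-16-le"),
        codecs.BOM_UTF16_BE + doc.encode("utf-16-be"),
    ):
        print(sniff_encoding(payload), decode_document(payload) == doc)
```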

 Cheers
 Rick Jelliffe

 * http://stackoverflow.com/questions/927652/why-encoding-default-getbytes-returns-different-results-in-vb-net-and-c

References:
- nextml
  - From: Amelia A Lewis <amyzing@talsever.com>
- Re: [xml-dev] nextml
  - From: Uche Ogbuji <uche@ogbuji.net>
- Re: UTF's considered best practice [was: Re: [xml-dev] nextml]
  - From: Uche Ogbuji <uche@ogbuji.net>
