Re: [xml-dev] Re: UTF's considered best practice [was: Re:[xml-dev] nextml]
- From: rjelliffe <rjelliffe@allette.com.au>
- To: Uche Ogbuji <uche@ogbuji.net>
- Date: Fri, 10 Dec 2010 15:37:27 +1100
On Thu, 9 Dec 2010 08:28:22 -0700, Uche Ogbuji <uche@ogbuji.net> wrote:
> On Wed, Dec 8, 2010 at 11:31 PM, Jim DeLaHunt wrote:
>
> > Sometimes UTF-16 is a more compact representation, sometimes UTF-8
> > is. It depends on the frequency distribution of characters in the
> > document. But they have equivalent descriptive power; either can
> > represent any sequence of Unicode characters. If nextml adopts
> > UTF-16, be aware that it can be serialised to bytes in either
> > little-endian or big-endian order (UTF-16LE or UTF-16BE), so nextml
> > should account for those possibilities. It should also allow for the
> > special Byte-Order Mark character (BOM), which is used to distinguish
> > the two.
>
> Thanks for all the great links and references. That backs up my
> suspicion that supporting a diversity of encodings is a matter of
> less urgency than it was when XML 1.0 was born.
I would be careful about taking what Unicode conference speakers say as
necessarily being authoritative rather than aspirational! But I probably
do agree with them. :-)
It would be great if the XML interlude has swept all the old encodings away
and ushered in a Unicode-only world, but it is a gamble (a gamble worth
taking, I think). Does it cost that much? The issue that Windows* and Java
APIs have default encodings based on locale and language decisions still
remains.
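To make the locale-default problem concrete, here is a minimal Python sketch. The function names are hypothetical, chosen only for illustration. It shows that the same text yields different bytes depending on which encoding is applied, which is what happens when an API silently picks the platform default instead of a named encoding.

```python
import locale

TEXT = "na\u00efve"  # "naive" with a diaeresis: one non-ASCII character


def encode_with_locale_default(text):
    # Result depends on the machine's locale settings, so two machines
    # can produce different bytes for the same text.
    return text.encode(locale.getpreferredencoding(False), errors="replace")


def encode_as(text, encoding):
    # Naming the encoding explicitly gives the same bytes everywhere.
    return text.encode(encoding)


if __name__ == "__main__":
    print("locale default:", locale.getpreferredencoding(False))
    print("cp1252:", encode_as(TEXT, "cp1252"))
    print("utf-8: ", encode_as(TEXT, "utf-8"))
```

A format that mandates UTF-8 sidesteps the first function entirely: the writer never has to ask the platform what it prefers.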
In 1997, a good argument against only allowing UTF-* was that people needed
an on-ramp to Unicode. So supporting other encodings was a way of
neutralizing the problem: paradoxically, supporting other encodings was a
way of promoting Unicode. (Perl was a big win here: I remember reading that
it finally moved to Unicode because XML pushed things past the tipping
point.)
In 2010, perhaps that argument is not needed: the pro-active thing might be
to provide off-ramps from the legacy encodings. New formats only in
UTF-8 might be the better idea.
> As for the BOM, yes, that should be key in any XML successor, as it is
> in XML 1.0 itself. In XML 1.0, you can tell the encoding even if
> it's not in the XML declaration because if not, it must either be
> UTF-8 (if there is no BOM), or UTF-8, UTF-16LE, UTF-16BE, etc.
> depending on BOM.
>
> If it's OK to say UTF only (and we banish the standalone declaration),
> then there is no need for an explicit encoding declaration beside
> optional BOM.
Magic numbers are still useful.
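As a rough illustration of the BOM-only detection discussed above, here is a minimal Python sketch, assuming a "UTF only" format that defaults to UTF-8 when no BOM is present. The names sniff_encoding and decode_document are hypothetical, and the inclusion of UTF-32 is my assumption (the thread only names UTF-8 and UTF-16 explicitly). It is not how any particular XML parser does it.

```python
import codecs

# Order matters: the UTF-32LE BOM begins with the same two bytes as the
# UTF-16LE BOM, so the longer marks must be tested first.
BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def sniff_encoding(data):
    """Return (encoding, bom_length); assume UTF-8 when there is no BOM."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return "utf-8", 0


def decode_document(data):
    """Strip any BOM and decode the remaining bytes accordingly."""
    encoding, skip = sniff_encoding(data)
    return data[skip:].decode(encoding)
```

For example, decode_document(codecs.BOM_UTF16_LE + "<doc/>".encode("utf-16-le")) gives back "<doc/>", and a BOM-less b"<doc/>" is read as UTF-8. A magic number or declaration at the start of the file would still let a reader confirm that it is looking at the right kind of document at all, which the BOM by itself does not tell you.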
Cheers
Rick Jelliffe
* http://stackoverflow.com/questions/927652/why-encoding-default-getbytes-returns-different-results-in-vb-net-and-c