Re: [xml-dev] relax UTF-8 default?

I doubt that specifying the encoding in a more restrictive way will help that much.  We still have lots of places where data entry or transcoding takes place that can mess up the document.  I suspect that the problems arise from people not understanding the transcoding or the default character sets involved in what they are working on.

How often do people actually explicitly specify the character set of an XML document as 8859-1 or 1252?  

On Thu, Dec 16, 2010 at 2:06 PM, Jim DeLaHunt <from.xml-dev@jdlh.com> wrote:
On Fri, 10 Dec 2010 10:43:32 +0000 Andrew Welch <andrew.j.welch@gmail.com> wrote:


 Yep - the "UTF-8/16 only" suggestion is to solve the problem of the
 potential mismatch between the encoding in the prolog and the actual
 encoding.. add to that the content-type when http is involved and you
 have 3 areas to look at to determine the encoding...

At 12:12 PM +0000 12/10/10, Dave Pawson wrote:
...

What are the alternatives... if any?
Some app to analyse/guess the encoding and propose changes/
set the encoding? Is such a beast possible?

I think the experiment on analysing/guessing encodings has been conducted, in the form of HTML files on the public Web.  As with XML, the specification allows files to be stored in a wide range of encodings; as with XML, there are in-band (and also out-of-band) ways to state the file's encoding.  Web browsers like Firefox, and web crawlers like Google's, have code to analyse and guess the encoding of the pages they encounter.

Good implementations to examine are:
* "Character Set Detection" feature of the Internationalization Classes for Unicode (ICU). <http://userguide.icu-project.org/conversion/detection>
* "Mozilla Charset Detectors" <http://www.mozilla.org/projects/intl/chardet.html>
* There are others, easily found by a web search, but none that I saw struck me as more authoritative than these two.

My impression from being in the internationalization arena is that the history of encoding declarations has been fraught with error, especially the case of documents labelled as ISO 8859-1 encoding which are actually Windows CP-1252 encoding.  Detection is also difficult and fraught with error, especially for short documents.
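To illustrate the ISO 8859-1 versus CP-1252 mislabelling, here is a minimal Python sketch. It relies on the fact that octets 0x80-0x9F are C1 control codes in ISO 8859-1 but mostly printable punctuation in CP-1252. The function name looks_like_cp1252 and the sample bytes are my own illustration, not taken from any of the detectors above, and the check is only a heuristic.

```python
def looks_like_cp1252(data: bytes) -> bool:
    """Heuristic: True if any octet falls in 0x80-0x9F.

    Those octets are control codes in ISO 8859-1, so in a document
    labelled ISO 8859-1 they usually mean the real encoding is CP-1252.
    """
    return any(0x80 <= b <= 0x9F for b in data)


# Hypothetical sample: curly quotes as a CP-1252 editor would save them.
sample = b"\x93quoted\x94"

as_latin1 = sample.decode("iso-8859-1")  # yields invisible C1 control codes
as_cp1252 = sample.decode("cp1252")      # yields U+201C ... U+201D

print(looks_like_cp1252(sample))
print([f"U+{ord(c):04X}" for c in as_latin1[:1] + as_latin1[-1:]])
print([f"U+{ord(c):04X}" for c in as_cp1252[:1] + as_cp1252[-1:]])
```

Both decodes succeed without error, which is exactly why the wrong label tends to go unnoticed until the text is displayed.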

Consider this document:
 <element>C2A0</element>
where "C2A0" stands for the two octets with values 0xC2 and 0xA0. Does C2A0 represent the UTF-8 sequence for U+00A0 "non-breaking space character" or the Windows CP-1252 characters "A with circumflex" "non-breaking space character"? The short document gives very little context for a detection algorithm to use.
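The ambiguity is easy to reproduce. This short Python sketch decodes the same two octets both ways, using Python's built-in "utf-8" and "cp1252" codecs; the variable names are just for illustration.

```python
# The two octets from the example document's element content.
octets = bytes([0xC2, 0xA0])

as_utf8 = octets.decode("utf-8")      # one character
as_cp1252 = octets.decode("cp1252")   # two characters

print([f"U+{ord(c):04X}" for c in as_utf8])    # ['U+00A0']
print([f"U+{ord(c):04X}" for c in as_cp1252])  # ['U+00C2', 'U+00A0']
```

Neither decode raises an error, so a detector has nothing but statistics to go on, and two octets are not much of a sample.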

The Unicode UTFs bypass all of these problems.  They can represent any character from the older code pages, and it is now reasonable to expect that authoring tools can save in UTF-8 or UTF-16{BE|LE}.  With UTFs as the only encoding option there is no ambiguity, and it is straightforward to distinguish between octet streams containing UTF-8 and UTF-16{BE|LE}.
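As a rough sketch of how simple that last step can be, the Python function below looks for a byte order mark and, failing that, at how the leading '<' is padded. It assumes the only candidates are UTF-8 and UTF-16, and that a document without a BOM begins with '<'. The name sniff_utf is hypothetical, and the return values are Python codec names used only as labels.

```python
import codecs


def sniff_utf(data: bytes) -> str:
    """Guess 'utf-8', 'utf-16-be' or 'utf-16-le' from the leading octets."""
    # A byte order mark, if present, settles the question.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    # No BOM: assume the document starts with '<' (0x3C). UTF-16 pads it
    # with a zero octet, before it (big-endian) or after it (little-endian).
    if data[:2] == b"\x00<":
        return "utf-16-be"
    if data[:2] == b"<\x00":
        return "utf-16-le"
    # Anything else is treated as UTF-8 under the assumptions above.
    return "utf-8"


for codec in ("utf-8", "utf-16-be", "utf-16-le"):
    print(codec, "->", sniff_utf("<element/>".encode(codec)))
```

A real parser would go on to validate the whole stream against the guessed encoding, but the guess itself needs only the first two or three octets.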

There's a reason why more than half of the public web is Unicode-encoded. My opinion is that it would be wise for NextXML to require either UTF-8 or UTF-16 encoding, and offer no other choices. The spec will be simpler, and interchange will be more reliable.
--
   --Jim DeLaHunt, jdlh@jdlh.com     http://blog.jdlh.com/ (http://jdlh.com/)
     multilingual websites consultant

     157-2906 West Broadway, Vancouver BC V6K 2G8, Canada
        Canada mobile +1-604-376-8953

