Re: [xml-dev] RE: There is a serious amount of character encoding conversions occurring inside our computers and on the Web

Argh.  Let's try that again:

> I'd be very interested to hear if any of the XML / character
> encoding gurus on this list have any comments
> or links to updates to this article (which was written in 2004).
>  I am not sure if the issues the author describes have
> been remedied or not.

In 2004, UTF-8 was a noise encoding on the Web: see <http://googleblog.blogspot.com/2012/02/unicode-over-60-percent-of-web.html>.  As of the beginning of 2012, it was more than 60% of the documents visible to Google.  If you count pure ASCII documents as UTF-8, which you can do, it's up at 80%.

If the trend line continues, which is of course not something you can count on, I'd expect to see UTF-8 rise by another 5% or so, though perhaps pure ASCII will drop by about half the same amount, leaving the total situation nearly unchanged.  In short: more than 80% of the Web is now UTF-8 one way or another, and less than 10% is Latin-1 and related encodings, leaving just about 10% for all the rest. (UTF-16 is less than 0.1%, according to Mark Davis.)  Not exactly a ringing endorsement for "publish in any encoding you want" (per the article), is it?
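
The remark that pure ASCII documents can be counted as UTF-8 holds because UTF-8 encodes the 128 ASCII characters as the same single bytes, so an ASCII file is already a valid UTF-8 file. A minimal Java sketch of that check follows; the class name AsciiIsUtf8 and the helper decodeStrictUtf8 are made up for illustration, and the sample strings are arbitrary.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class AsciiIsUtf8 {
    // Hypothetical helper: decode bytes as UTF-8 and reject malformed input
    // instead of silently substituting replacement characters.
    // Returns null when the bytes are not valid UTF-8.
    static String decodeStrictUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            return decoder.decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // Pure ASCII bytes decode cleanly as UTF-8.
        byte[] ascii = "<doc>plain text</doc>".getBytes(StandardCharsets.US_ASCII);
        System.out.println(decodeStrictUtf8(ascii) != null);   // true

        // Latin-1 bytes for "caf\u00e9" end in a lone 0xE9, which is not valid UTF-8.
        byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(decodeStrictUtf8(latin1) != null);  // false
    }
}
```

The second case shows why the same shortcut does not extend to Latin-1: any byte above 0x7F has to be part of a well-formed multi-byte sequence in UTF-8.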


On Fri, Dec 28, 2012 at 2:45 PM, Chris Maloney <voldrani@gmail.com> wrote:
Roger,

Here is a classic post from XML.com that is right in line with the
topic of character encodings that you have been posting about
recently, titled "XML on the web has failed":
http://www.xml.com/pub/a/2004/07/21/dive.html

It takes some work to really grok the problems the author is
describing, but it is well worth it, I think, and may make your head
spin (or hurt, depending).

I'd be very interested to hear if any of the XML / character encoding
gurus on this list have any comments or links to updates to this
article (which was written in 2004).  I am not sure if the issues the
author describes have been remedied or not.

Chris


On Fri, Dec 28, 2012 at 12:17 PM, David Lee <dlee@calldei.com> wrote:
> ---------
>
> You are writing about character encoding conversions as text moves from
> point to point to point.
>
>
>
> Is there a parallel with markup? Are there markup conversions as XML moves
> from point to point to point?
>
>
>
> Are there lessons learned in the character encoding community that could be
> applied to the XML community?
>
>
>
> --------
>
> Markup is text and has the same problems (and solutions).
>
> If we could start over from scratch with what we know now, there would be
> fewer problems.
>
> IMHO, my preferred solution is to stick to a single encoding everywhere (I
> vote for UTF-8 ... as it handles all Unicode code points).
>
> The next step is to make sure *every single link in the chain* uses that
> encoding.
>
> This is amazingly difficult even in "modern" languages like Java, where the
> default behavior when converting bytes to strings is to use the *system
> default encoding*, which is always an unknown.  Even in pure Java you have
> to track every single point where a byte array is converted to a String and
> vice versa, and explicitly set the encoding (or guarantee the system
> encoding is correct).
>
> Then you have to manage all places the data enters and leaves the program
> and make sure it's in the right encoding.
>
> Then you have to make sure all places that *store* the data (like a
> database) don't muck with it.
>
> XML itself cannot solve this problem alone, as an XML document is only the
> payload ...  However, the XML tools tend to be a bit more mature about
> dealing with this.
>
> But not always.
>
> Maybe in another 30 years we will have migrated all our tools to be
> consistent about encodings.
>
>
>
>
>
> ----------------------------------------
>
> David A. Lee
>
> dlee@calldei.com
>
> http://www.xmlsh.org
>
>
>
>
>
>
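
To illustrate the point in the quoted message about tracking every byte-array-to-String conversion in Java, here is a minimal sketch that names the charset at each conversion rather than relying on the platform default. The class and method names (ExplicitCharset, toUtf8, fromUtf8) are hypothetical, and UTF-8 is used only because it is the encoding the message votes for. Newer JDKs default to UTF-8, but naming the charset removes the dependency on the runtime either way.

```java
import java.nio.charset.StandardCharsets;

public class ExplicitCharset {
    // String -> bytes with a named charset; text.getBytes() with no argument
    // would use whatever the platform default happens to be.
    static byte[] toUtf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // bytes -> String with the same named charset.
    static String fromUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Written with an escape so the source file's own encoding does not matter.
        String original = "caf\u00e9";
        byte[] wire = toUtf8(original);

        // Same charset on both ends: the text survives the round trip.
        System.out.println(fromUtf8(wire).equals(original));   // true

        // One link in the chain assumes Latin-1: the text is silently corrupted.
        String garbled = new String(wire, StandardCharsets.ISO_8859_1);
        System.out.println(garbled.equals(original));          // false
    }
}
```

The same discipline applies at I/O boundaries: java.io.InputStreamReader and java.io.OutputStreamWriter both have constructors that take a Charset, so the encoding can be stated wherever data enters or leaves the program.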

_______________________________________________________________________

XML-DEV is a publicly archived, unmoderated list hosted by OASIS
to support XML implementation and development. To minimize
spam in the archives, you must subscribe before posting.

[Un]Subscribe/change address: http://www.oasis-open.org/mlmanage/
Or unsubscribe: xml-dev-unsubscribe@lists.xml.org
subscribe: xml-dev-subscribe@lists.xml.org
List archive: http://lists.xml.org/archives/xml-dev/
List Guidelines: http://www.oasis-open.org/maillists/guidelines.php




--
GMail doesn't have rotating .sigs, but you can see mine at http://www.ccil.org/~cowan/signatures



Copyright 1993-2007 XML.org. This site is hosted by OASIS