OASIS Mailing List Archives

RE: [xml-dev] An XML document is not well-formed if encoding="..." does not match the actual encoding of the characters in the document, right?

David: [re-send, including the xml-dev list]

At 2:53 AM +0000 12/30/12, David Lee wrote:
>For people who use languages which have predominantly non-latin codepoints ...
>Is UTF8 actually worse than UTF32  - file size wise ?

No, I believe not. Deducing from the definition of UTF-8 and UTF-32, 
there is no sequence of Unicode character values for which the UTF-8 
representation requires more bytes than the UTF-32 representation. On 
the contrary, in all but pathological cases the UTF-8 representation 
will require fewer bytes.
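
As a quick check of that reasoning, here is a minimal Python sketch (an illustration, not taken from either encoding's specification; the helper name `encoded_len` is hypothetical). It encodes the first and last code point of each UTF-8 length class and compares the byte counts with UTF-32. The `-be` codec variants are used so that no byte order mark is counted.

```python
def encoded_len(code_point: int, encoding: str) -> int:
    """Number of bytes one code point occupies in the given encoding."""
    return len(chr(code_point).encode(encoding))

# First and last code points of each UTF-8 length class (1 to 4 bytes).
BOUNDARIES = [0x00, 0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF]

for cp in BOUNDARIES:
    u8 = encoded_len(cp, "utf-8")
    u32 = encoded_len(cp, "utf-32-be")
    print(f"U+{cp:04X}: UTF-8 {u8} byte(s), UTF-32 {u32} bytes")
```

Only code points at or above U+10000 reach 4 bytes in UTF-8, so the two encodings tie only for text made up entirely of supplementary-plane characters.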

The best answer to the Stack Overflow question, "at all times text 
encoded in UTF-8 will never give us more than a +50% file size of the 
same text encoded in UTF-16. true / false?",

has a case study comparing the number of characters and UTF-8 bytes 
for the text content of several language versions of the Wikipedia 
"Tokyo" article.  Extending the results table there a bit, we see 
that the ratio of bytes-for-UTF-8 / bytes-for-UTF-32 ranged from a 
high of 65% (for Japanese) to a low of 26% (for English, Spanish, and 

While we're at it, note that the ratio of bytes-for-UTF-8 / 
bytes-for-UTF-16 ranged from a high of 129% (again for Japanese) to a 
low of 51% (for English).  Actually, Japanese, Korean and simplified 
Chinese were the only languages in the sample where UTF-8 took more 
bytes than UTF-16. For Traditional Chinese and all other languages in 
the sample, UTF-8 was more compact.
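
The same kind of ratio can be computed for any text. The sketch below is a hedged illustration: the function name `size_ratio` and the two tiny sample strings are assumptions, not the Wikipedia data from the case study.

```python
def size_ratio(text: str, numerator: str, denominator: str) -> float:
    """Ratio of encoded byte counts, e.g. bytes-for-UTF-8 / bytes-for-UTF-16."""
    return len(text.encode(numerator)) / len(text.encode(denominator))

# Hypothetical samples: pure ASCII, and pure CJK (U+6771 U+4EAC).
samples = {"English": "Tokyo", "Japanese": "\u6771\u4eac"}

for label, text in samples.items():
    vs16 = size_ratio(text, "utf-8", "utf-16-le")  # -le/-be: no BOM counted
    vs32 = size_ratio(text, "utf-8", "utf-32-le")
    print(f"{label}: UTF-8/UTF-16 = {vs16:.0%}, UTF-8/UTF-32 = {vs32:.0%}")
```

Pure CJK text gives the worst case of 150% against UTF-16. A real article lands lower, plausibly because it also contains ASCII spaces, digits, and punctuation, each of which takes one byte in UTF-8 and two in UTF-16.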

>And does it matter much ?

I would say, with just a little bit of snark, that anyone choosing to 
mark up their document with an XML language has already declared they 
don't care much about file size being bloated. :-)

But there are other factors in choosing a Unicode Transformation 
Format (UTF) to represent text. For some applications, UTF-32's 1:1 
mapping of code unit to character might be valuable.
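
A small Python sketch of what that 1:1 mapping buys, taking "character" to mean a Unicode code point. The helper names `code_units` and `nth_code_point_utf32` are hypothetical, and the sample string is an assumption chosen to include one supplementary-plane character (U+1D11E).

```python
def code_units(text: str, encoding: str, unit_bytes: int) -> int:
    """Count code units of the given width in the encoded text."""
    return len(text.encode(encoding)) // unit_bytes

def nth_code_point_utf32(data: bytes, n: int) -> str:
    """Fetch the n-th code point from UTF-32BE bytes by plain arithmetic."""
    return data[4 * n : 4 * n + 4].decode("utf-32-be")

text = "a\U0001D11Eb"  # 3 code points; the middle one is outside the BMP

print(code_units(text, "utf-8", 1))      # 8-bit code units
print(code_units(text, "utf-16-be", 2))  # 16-bit code units
print(code_units(text, "utf-32-be", 4))  # 32-bit code units

print(nth_code_point_utf32(text.encode("utf-32-be"), 2))
```

Only the UTF-32 count equals the number of code points, so the n-th one sits at byte offset 4 * n. In UTF-8 and UTF-16 the same lookup requires scanning from the start.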

>Considering that UTF16 is a dangerous file format,  (I agree it is ... )

Personally, I don't concede that point. It's harder to use it with 
tools that assume byte-aligned code units.  But there are many tools 
which are happy to work with 16-bit code units.
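
One concrete form of that difficulty, sketched in Python as an illustration with an assumed sample string: UTF-16 stores each ASCII character with a zero byte in its code unit, and byte-oriented tools that treat 0x00 as a string terminator can mishandle such data.

```python
text = "<doc/>"  # hypothetical sample; any ASCII text behaves the same way

utf16 = text.encode("utf-16-le")
utf8 = text.encode("utf-8")

print(utf16)             # every other byte is 0x00
print(b"\x00" in utf16)  # zero bytes present in UTF-16
print(b"\x00" in utf8)   # none in UTF-8 for this text
```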

>I dont think any convention that requires you to have read "the 
>Beginning" will consistently work with text ...
>XML suffers with this assumption as well with the XML declaration 
>declaring the encoding.
>That only works when you have an entire document to look at. ...

I very much agree with this observation.
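
To make the point concrete, here is a minimal Python sketch. The function name `declared_encoding` and the sample document are hypothetical, and the regular expression is a simplification that assumes an ASCII-compatible encoding; it is not the full detection procedure from the XML specification. It reads the encoding from the XML declaration, which it can do only when handed the start of the document.

```python
import re

# Simplified pattern: an XML declaration at the very start of the bytes.
_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)

def declared_encoding(chunk: bytes):
    """Return the declared encoding name, or None if no declaration is seen."""
    match = _DECL.match(chunk)
    return match.group(1).decode("ascii") if match else None

doc = '<?xml version="1.0" encoding="ISO-8859-1"?><p>caf\u00e9</p>'.encode(
    "iso-8859-1"
)

print(declared_encoding(doc))        # declaration is visible
print(declared_encoding(doc[-12:]))  # a fragment from the end: nothing to read
```

Given only the trailing fragment, nothing in the bytes says whether 0xE9 is an ISO-8859-1 "é" or part of some other encoding.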

     --Jim DeLaHunt, jdlh@jdlh.com     http://blog.jdlh.com/ (http://jdlh.com/)
       multilingual websites consultant

       157-2906 West Broadway, Vancouver BC V6K 2G8, Canada
          Canada mobile +1-604-376-8953
