RE: [xml-dev] An XML document is not well-formed if encoding="..." does not match the actual encoding of the characters in the document, right?

From: Hermann Stamm-Wilbrandt <STAMMW@de.ibm.com>
To: Jim DeLaHunt <from.xml-dev@jdlh.com>
Date: Sun, 30 Dec 2012 19:00:55 +0100

Hi Jim,

> While we're at it, note that the ratio of bytes-for-UTF-8 /
> bytes-for-UTF-16 ranged from a high of 129% (again for Japanese) to a
> low of 51% (for English).  Actually, Japanese, Korean and simplified
> Chinese were the only languages in the sample where UTF-8 took more
> bytes than UTF-16. For Traditional Chinese and all other languages in
> the sample, UTF-8 was more compact.
>
thanks for sharing these (interesting) numbers.

> >And does it matter much ?
>
> I would say, with just a little bit of snark, that anyone choosing to
> mark up their document with an XML language has already declared they
> don't care much about file size being bloated. :-)
>
;-)

I think one of the main reasons for using UTF-8 in XML processing systems
is that it is the default encoding in stylesheets (when encoding="..."
is missing in <xsl:output>).

I do not know the internals of other XSLT processors, but the DataPower
XSLT processor's internal encoding is UTF-8 ([1], slide 11).

Btw, proprietary extensions that allow processing non-XML data in
stylesheets can enable (XML) processing even for character encodings the
XSLT processor cannot handle directly ([2], slides 12-14; the audio
recording has the details).

[1] http://www-01.ibm.com/support/docview.wss?uid=swg27022977
[2] http://www-01.ibm.com/support/docview.wss?uid=swg27022979

Mit besten Gruessen / Best wishes,

Hermann Stamm-Wilbrandt
Level 3 support for XML Compiler team and Fixpack team lead
WebSphere DataPower SOA Appliances
https://www.ibm.com/developerworks/mydeveloperworks/blogs/HermannSW/
https://twitter.com/HermannSW/
----------------------------------------------------------------------
IBM Deutschland Research & Development GmbH
Vorsitzende des Aufsichtsrats: Martina Koederitz
Geschaeftsfuehrung: Dirk Wittkopp
Sitz der Gesellschaft: Boeblingen
Registergericht: Amtsgericht Stuttgart, HRB 243294

From:    Jim DeLaHunt <from.xml-dev@jdlh.com>
To:      David Lee <dlee@calldei.com>
Cc:      Chris Maloney <voldrani@gmail.com>, "Costello, Roger L." <costello@mitre.org>, "xml-dev@lists.xml.org" <xml-dev@lists.xml.org>
Date:    12/30/2012 06:49 AM
Subject: RE: [xml-dev] An XML document is not well-formed if encoding="..." does not match the actual encoding of the characters in the document, right?

David: [re-send, including the xml-dev list]

At 2:53 AM +0000 12/30/12, David Lee wrote:
>For people who use languages which have predominantly non-latin
>codepoints ...
>Is UTF8 actually worse than UTF32  - file size wise ?

No, I believe not. Deducing from the definition of UTF-8 and UTF-32,
there is no sequence of Unicode character values for which the UTF-8
representation requires more bytes than the UTF-32 representation. On
the contrary, in all but pathological cases the UTF-8 representation
will require fewer bytes.
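
The per-code-point byte counts behind that deduction can be checked with a short Python sketch. The helper name `byte_lengths` and the list of boundary code points are illustrative choices, not something from the thread. The only case where UTF-8 does not win is a code point at or above U+10000, where both encodings need 4 bytes.

```python
# First and last code point of each UTF-8 length class
# (the surrogate range U+D800..U+DFFF is deliberately not included).
BOUNDARIES = [0x00, 0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF]


def byte_lengths(code_point):
    """Return (UTF-8 byte count, UTF-32 byte count) for one code point."""
    ch = chr(code_point)
    # The "-le" variant is used so that no byte order mark is counted.
    return len(ch.encode("utf-8")), len(ch.encode("utf-32-le"))


for cp in BOUNDARIES:
    u8, u32 = byte_lengths(cp)
    print(f"U+{cp:06X}: UTF-8 {u8} bytes, UTF-32 {u32} bytes")
```

So a text made up entirely of supplementary-plane characters ties between the two encodings, and any other text is smaller in UTF-8.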

The best answer to the Stack Overflow question, "at all times text
encoded in UTF-8 will never give us more than a +50% file size of the
same text encoded in UTF-16. true / false?"
(http://stackoverflow.com/questions/6883434/at-all-times-text-encoded-in-utf-8-will-never-give-us-more-than-a-50-file-size),
has a case study comparing the number of characters and UTF-8 bytes
for the text content of several language versions of the Wikipedia
"Tokyo" article.  Extending the results table there a bit, we see
that the ratio of bytes-for-UTF-8 / bytes-for-UTF-32 ranged from a
high of 65% (for Japanese) to a low of 26% (for English, Spanish, and
French).

While we're at it, note that the ratio of bytes-for-UTF-8 /
bytes-for-UTF-16 ranged from a high of 129% (again for Japanese) to a
low of 51% (for English).  Actually, Japanese, Korean and simplified
Chinese were the only languages in the sample where UTF-8 took more
bytes than UTF-16. For Traditional Chinese and all other languages in
the sample, UTF-8 was more compact.
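
A ratio of this kind can be computed for any text with a few lines of Python. This is a minimal sketch: the function name `utf8_to_utf16_ratio` and the two sample sentences are made up for illustration and are not the Wikipedia data from the case study.

```python
def utf8_to_utf16_ratio(text):
    """Return bytes-for-UTF-8 divided by bytes-for-UTF-16 for the given text."""
    # "utf-16-le" avoids counting a byte order mark.
    return len(text.encode("utf-8")) / len(text.encode("utf-16-le"))


# Hypothetical sample sentences, not taken from the case study.
samples = {
    "English": "Tokyo is the capital of Japan.",
    "Japanese": "\u6771\u4eac\u306f\u65e5\u672c\u306e\u9996\u90fd\u3067\u3059\u3002",
}

for language, text in samples.items():
    print(f"{language}: {utf8_to_utf16_ratio(text):.0%}")
```

Pure ASCII text gives 50% and pure BMP CJK text gives 150%. Measured ratios for real articles fall between those bounds, presumably because real text mixes in ASCII digits, spaces and punctuation.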

>And does it matter much ?

I would say, with just a little bit of snark, that anyone choosing to
mark up their document with an XML language has already declared they
don't care much about file size being bloated. :-)

But there are other factors in choosing a Unicode Transformation
Format (UTF) to represent text. For some applications, UTF-32's 1:1
mapping of code unit to character might be valuable.
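
As a rough illustration of that 1:1 mapping, here is a Python sketch in which the n-th code point of a UTF-32 buffer is found by plain offset arithmetic. The helper name `nth_code_point_utf32` and the sample string are assumptions made for the example. "Character" here means Unicode code point; a user-perceived character may still span several code points.

```python
import struct


def nth_code_point_utf32(data, n):
    """Return the n-th code point (0-based) of little-endian UTF-32 bytes."""
    # Every code point occupies exactly 4 bytes, so the offset is 4 * n.
    (value,) = struct.unpack_from("<I", data, 4 * n)
    return chr(value)


# "a", EURO SIGN, GRINNING FACE (outside the BMP), "b"
text = "a\u20ac\U0001F600b"
utf32 = text.encode("utf-32-le")
utf16 = text.encode("utf-16-le")

print(nth_code_point_utf32(utf32, 3))      # direct lookup of the 4th code point
print(len(utf32) // 4, "UTF-32 code units")
print(len(utf16) // 2, "UTF-16 code units")  # one more, because of the surrogate pair
```

In UTF-16 or UTF-8 the same lookup needs a scan from the start of the buffer, because the number of code units per code point varies.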

>Considering that UTF16 is a dangerous file format,  (I agree it is ... )

Personally, I don't concede that point. It's harder to use it with
tools that assume byte-aligned code units.  But there are many tools
which are happy to work with 16-bit code units.

>I dont think any convention that requires you to have read "the
>Beginning" will consistently work with text ...
>XML suffers with this assumption as well with the XML declaration
>declaring the encoding.
>That only works when you have an entire document to look at. ...

I very much agree with this observation.
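
A small Python sketch of the problem, using the standard library's xml.etree.ElementTree (the helper name `try_parse` and the sample bytes are made up for the example): the same ISO-8859-1 bytes parse when the XML declaration comes with them, and fail when only the later part of the document is available, because the parser then falls back to UTF-8.

```python
import xml.etree.ElementTree as ET

# An element whose content is one ISO-8859-1 byte (0xE9, LATIN SMALL LETTER E WITH ACUTE).
BODY = "<a>\u00e9</a>".encode("iso-8859-1")
DECLARATION = b'<?xml version="1.0" encoding="ISO-8859-1"?>'


def try_parse(data):
    """Return the root element's text, or None if the bytes are not well-formed."""
    try:
        return ET.fromstring(data).text
    except ET.ParseError:
        return None


print(try_parse(DECLARATION + BODY))  # the declaration tells the parser how to decode
print(try_parse(BODY))                # no declaration: 0xE9 is not valid UTF-8 here
```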

--
     --Jim DeLaHunt, jdlh@jdlh.com     http://blog.jdlh.com/ (http://jdlh.com/)
       multilingual websites consultant

       157-2906 West Broadway, Vancouver BC V6K 2G8, Canada
          Canada mobile +1-604-376-8953

_______________________________________________________________________

XML-DEV is a publicly archived, unmoderated list hosted by OASIS
to support XML implementation and development. To minimize
spam in the archives, you must subscribe before posting.

[Un]Subscribe/change address: http://www.oasis-open.org/mlmanage/
Or unsubscribe: xml-dev-unsubscribe@lists.xml.org
subscribe: xml-dev-subscribe@lists.xml.org
List archive: http://lists.xml.org/archives/xml-dev/
List Guidelines: http://www.oasis-open.org/maillists/guidelines.php

Follow-Ups:
- RE: [xml-dev] An XML document is not well-formed if encoding="..." does not match the actual encoding of the characters in the document, right?
  - From: Liam R E Quin <liam@w3.org>

References:
- RE: [xml-dev] An XML document is not well-formed if encoding="..." does not match the actual encoding of the characters in the document, right?
  - From: Jim DeLaHunt <from.xml-dev@jdlh.com>
