Re: [xml-dev] An XML document is not well-formed if encoding="..." does not match the actual encoding of the characters in the document, right?

Roger,

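Running the modified file through an identity transform produces the
error you were looking for, see below. The reason is that the byte 0xF3
(the ISO-8859-1 encoding of "ó") starts a multi-byte sequence in UTF-8,
and the next byte, 0x70 ("p"), is not a valid 2nd byte of a UTF-8
sequence; continuation bytes are of the form "10xxxxxx".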
http://en.wikipedia.org/wiki/Utf-8#Description
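
The byte-level check can be sketched in a few lines of Python. This is
only an illustration of the UTF-8 bit patterns, not what any of the
parsers below actually run; the helper names is_continuation and
expected_length are made up for this example.

```python
def is_continuation(byte: int) -> bool:
    # UTF-8 continuation bytes have the bit pattern 10xxxxxx (0x80..0xBF).
    return (byte & 0b1100_0000) == 0b1000_0000


def expected_length(lead: int) -> int:
    """Sequence length announced by a lead byte, or 0 if it cannot start one."""
    if lead < 0x80:
        return 1          # plain ASCII
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0              # continuation byte or byte never used as a lead


# Content of <Name> as stored in the ISO-8859-1 encoded file.
name = b"L\xf3pez"
lead, second = name[1], name[2]

print(hex(lead), expected_length(lead))      # 0xf3 4
print(hex(second), is_continuation(second))  # 0x70 False
```

So 0xF3 announces a four-byte sequence, and the very next byte already
breaks it.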

But you have no guarantee that the failure happens.
Take for example the two-character sequence "Ã¤": it is "C3 A4" when
encoded in ISO-8859-1. If you now repeat your "utf-8" relabeling
experiment, these two bytes will be interpreted as the valid two-byte
UTF-8 encoding of the single character "ä".
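
This ambiguity is easy to check with any language that exposes both
decoders. A minimal Python sketch follows; the variable names are
arbitrary.

```python
# The same two bytes, decoded under the two encodings discussed above.
raw = bytes([0xC3, 0xA4])

as_latin1 = raw.decode("iso-8859-1")  # every byte maps to one character
as_utf8 = raw.decode("utf-8")         # C3 A4 is one two-byte sequence

print(len(as_latin1), [hex(ord(c)) for c in as_latin1])  # 2 ['0xc3', '0xa4']
print(len(as_utf8), [hex(ord(c)) for c in as_utf8])      # 1 ['0xe4']
```

Both decodes succeed, so a parser has no way to notice that the label is
wrong; it just sees different characters.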


$ od -Ax -tcx1 Lopez.modified.xml
000000   <   ?   x   m   l       v   e   r   s   i   o   n   =   "   1
        3c  3f  78  6d  6c  20  76  65  72  73  69  6f  6e  3d  22  31
000010   .   0   "       e   n   c   o   d   i   n   g   =   "   u   t
        2e  30  22  20  65  6e  63  6f  64  69  6e  67  3d  22  75  74
000020   f   -   8   "                       ?   >  \n   <   N   a   m
        66  2d  38  22  20  20  20  20  20  3f  3e  0a  3c  4e  61  6d
000030   e   >   L 363   p   e   z   <   /   N   a   m   e   >  \n
        65  3e  4c  f3  70  65  7a  3c  2f  4e  61  6d  65  3e  0a
00003f
$


$ xsltproc identity.xsl Lopez.modified.xml
Lopez.modified.xml:2: parser error : Input is not proper UTF-8, indicate encoding !
Bytes: 0xF3 0x70 0x65 0x7A
<Name>L�pez</Name>
       ^
unable to parse Lopez.modified.xml
$
$ saxon-6.5.5 Lopez.modified.xml identity.xsl
Error at byte 10 of file:/home/stammw/Lopez/Lopez.modified.xml:
  Error reported by XML parser: bad continuation of multi-byte UTF-8 sequence (code: 0x70)
Transformation failed: Run-time errors were reported
$
$ xalan identity.xsl -IN Lopez.modified.xml

(Location of error unknown)XSLT Error (javax.xml.transform.TransformerException): com.ibm.xtq.common.utils.WrappedRuntimeException: An invalid XML character (Unicode: 0xffffffff) was found in the element content of the document.
Exception in thread "main" java.lang.RuntimeException: com.ibm.xtq.common.utils.WrappedRuntimeException: An invalid XML character (Unicode: 0xffffffff) was found in the element content of the document.
	at org.apache.xalan.xslt.Process.doExit(Unknown Source)
	at org.apache.xalan.xslt.Process.main(Unknown Source)
$
$ cat identity.xsl
<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
>
  <xsl:output method="xml"/>

  <xsl:template match="/">
    <xsl:copy-of select="."/>
  </xsl:template>

</xsl:stylesheet>
$


Best wishes,

Hermann Stamm-Wilbrandt
Level 3 support for XML Compiler team and Fixpack team lead
WebSphere DataPower SOA Appliances
https://www.ibm.com/developerworks/mydeveloperworks/blogs/HermannSW/
https://twitter.com/HermannSW/
----------------------------------------------------------------------
IBM Deutschland Research & Development GmbH
Chair of the Supervisory Board: Martina Koederitz
Management: Dirk Wittkopp
Registered office: Boeblingen
Court of registration: Amtsgericht Stuttgart, HRB 243294


From:    "Costello, Roger L." <costello@mitre.org>
To:      "xml-dev@lists.xml.org" <xml-dev@lists.xml.org>
Date:    12/28/2012 09:39 PM
Subject: [xml-dev] An XML document is not well-formed if encoding="..." does not match the actual encoding of the characters in the document, right?





Thanks Chris for pointing us to that article: XML on the Web has Failed

I am making my way through it.

This statement in the article piqued my interest:

    ... determining the actual character encoding of an
    XML document is a prerequisite for determining its
    well-formedness ...

I decided to do an experiment.

I created this XML document and encoded each character in the document
using the iso-8859-1 encoding and in the encoding="..." I asserted that I
am using the iso-8859-1 encoding:

<?xml version="1.0" encoding="iso-8859-1"?>
<Name>López</Name>

I checked the document for well-formedness and the XML parser said it is
well-formed.

Good.

Then I changed encoding="iso-8859-1" to encoding="utf-8":

<?xml version="1.0" encoding="utf-8"?>
<Name>López</Name>

I checked it for well-formedness and the parser said it is still
well-formed.

Huh?

Shouldn't I have gotten a well-formedness error?

I did my experiment using the latest version of Oxygen XML. I think that it
uses the Xerces XML Parser, right?

Is this a bug in Xerces?
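
One way to cross-check the experiment with a different parser is
sketched below. Note the assumptions: it uses Python's
xml.etree.ElementTree (which sits on top of expat), not Oxygen or
Xerces, so it says nothing about how Xerces behaves, and the helper name
well_formed is made up for this example. The document bytes match the
experiment above, with "ó" stored as the single ISO-8859-1 byte F3.

```python
import xml.etree.ElementTree as ET

# Element content exactly as stored on disk: "ó" is the one byte 0xF3.
BODY = b"<Name>L\xf3pez</Name>\n"


def well_formed(declared: str) -> bool:
    """Parse BODY behind an XML declaration that claims `declared`."""
    decl = b'<?xml version="1.0" encoding="' + declared.encode("ascii") + b'"?>\n'
    try:
        ET.fromstring(decl + BODY)
        return True
    except ET.ParseError:
        return False


print(well_formed("iso-8859-1"))  # True
print(well_formed("utf-8"))       # False
```

Passing bytes rather than a str matters here: it leaves the decoding to
the parser, so the encoding declaration is what decides how byte F3 is
read.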

/Roger






