Re: [xml-dev] An XML document is not well-formed if encoding="..." does not match the actual encoding of the characters in the document, right?
- From: Hermann Stamm-Wilbrandt <STAMMW@de.ibm.com>
- To: "Costello, Roger L." <costello@mitre.org>
- Date: Sat, 29 Dec 2012 03:13:21 +0100
Roger,
running the modified file through an identity transform produces the
error you were looking for; see below. The reason is that the byte "F3"
starts a multi-byte UTF-8 sequence, and the following "70" is not a
valid 2nd byte of one: continuation bytes have the form "10xxxxxx".
http://en.wikipedia.org/wiki/Utf-8#Description
But you have no guarantee that the failure happens.
Take for example the two-character sequence "Ã¤": it is "C3 A4" if
encoded in ISO-8859-1. If you now do your "utf-8" encoding
modification experiment, these two bytes will be interpreted as the
valid two-byte UTF-8 encoding of the "ä" character.
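Both cases can be checked at the byte level. What follows is only a minimal Python sketch, not what any of the parsers below actually do internally; the helper names `is_continuation_byte` and `decodes_as_utf8` are made up for illustration, and the "López" bytes are copied from the od dump further down.

```python
# Bytes of "López" as stored in ISO-8859-1 (from the od dump: 4c f3 70 65 7a).
LOPEZ_LATIN1 = bytes([0x4C, 0xF3, 0x70, 0x65, 0x7A])

# The two characters "Ã¤" (U+00C3, U+00A4) encoded in ISO-8859-1: C3 A4.
A_TILDE_CURRENCY_LATIN1 = "\u00c3\u00a4".encode("iso-8859-1")


def is_continuation_byte(b: int) -> bool:
    """True if the byte has the bit pattern 10xxxxxx."""
    return (b & 0xC0) == 0x80


def decodes_as_utf8(data: bytes) -> bool:
    """True if the byte string is a valid UTF-8 sequence."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# 0xF3 opens a multi-byte sequence, but 0x70 cannot continue it.
print(is_continuation_byte(0x70))        # False
print(decodes_as_utf8(LOPEZ_LATIN1))     # False

# C3 A4 happens to be valid UTF-8, so the mislabeling goes unnoticed.
print(decodes_as_utf8(A_TILDE_CURRENCY_LATIN1))           # True
print(A_TILDE_CURRENCY_LATIN1.decode("utf-8") == "\u00e4")  # True
```

So a wrong encoding label is only detected when the bytes happen to be invalid in the declared encoding.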
$ od -Ax -tcx1 Lopez.modified.xml
000000   <   ?   x   m   l       v   e   r   s   i   o   n   =   "   1
        3c  3f  78  6d  6c  20  76  65  72  73  69  6f  6e  3d  22  31
000010   .   0   "       e   n   c   o   d   i   n   g   =   "   u   t
        2e  30  22  20  65  6e  63  6f  64  69  6e  67  3d  22  75  74
000020   f   -   8   "                       ?   >  \n   <   N   a   m
        66  2d  38  22  20  20  20  20  20  3f  3e  0a  3c  4e  61  6d
000030   e   >   L 363   p   e   z   <   /   N   a   m   e   >  \n
        65  3e  4c  f3  70  65  7a  3c  2f  4e  61  6d  65  3e  0a
00003f
$
$ xsltproc identity.xsl Lopez.modified.xml
Lopez.modified.xml:2: parser error : Input is not proper UTF-8, indicate
encoding !
Bytes: 0xF3 0x70 0x65 0x7A
<Name>L�pez</Name>
^
unable to parse Lopez.modified.xml
$
$ saxon-6.5.5 Lopez.modified.xml identity.xsl
Error at byte 10 of file:/home/stammw/Lopez/Lopez.modified.xml:
Error reported by XML parser: bad continuation of multi-byte UTF-8
sequence (code: 0x70)
Transformation failed: Run-time errors were reported
$
$ xalan identity.xsl -IN Lopez.modified.xml
(Location of error unknown)XSLT Error
(javax.xml.transform.TransformerException):
com.ibm.xtq.common.utils.WrappedRuntimeException: An invalid XML character
(Unicode: 0xffffffff) was found in the element content of the document.
Exception in thread "main" java.lang.RuntimeException:
com.ibm.xtq.common.utils.WrappedRuntimeException: An invalid XML character
(Unicode: 0xffffffff) was found in the element content of the document.
at org.apache.xalan.xslt.Process.doExit(Unknown Source)
at org.apache.xalan.xslt.Process.main(Unknown Source)
$
$ cat identity.xsl
<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
>
  <xsl:output method="xml"/>
  <xsl:template match="/">
    <xsl:copy-of select="."/>
  </xsl:template>
</xsl:stylesheet>
$
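The same experiment can be repeated without an XSLT processor. The following is a rough sketch using Python's standard xml.etree.ElementTree module (which is backed by expat, a parser not tested above); the names `LATIN1_DOC`, `UTF8_LABEL_DOC` and `check` are made up for illustration. The second document reproduces the 63 bytes of Lopez.modified.xml from the od dump, including the five blanks left over from overwriting "iso-8859-1" with "utf-8".

```python
import xml.etree.ElementTree as ET

# Original document: ISO-8859-1 bytes, correctly labeled.
LATIN1_DOC = (
    b'<?xml version="1.0" encoding="iso-8859-1"?>\n'
    b"<Name>L\xf3pez</Name>\n"
)

# Modified document: same ISO-8859-1 bytes, but labeled utf-8.
UTF8_LABEL_DOC = (
    b'<?xml version="1.0" encoding="utf-8"     ?>\n'
    b"<Name>L\xf3pez</Name>\n"
)


def check(doc: bytes) -> str:
    """Parse the raw bytes and report whether the parser accepted them."""
    try:
        root = ET.fromstring(doc)
    except ET.ParseError as err:
        return f"not well-formed: {err}"
    return f"well-formed, text = {root.text!r}"


print(check(LATIN1_DOC))      # accepted, text is 'López'
print(check(UTF8_LABEL_DOC))  # rejected because of the F3 70 byte pair
```

It is important to pass the raw bytes to the parser; decoding the file to a string first would hide the mismatch from it.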
Best wishes,
Hermann Stamm-Wilbrandt
Level 3 support for XML Compiler team and Fixpack team lead
WebSphere DataPower SOA Appliances
https://www.ibm.com/developerworks/mydeveloperworks/blogs/HermannSW/
https://twitter.com/HermannSW/
----------------------------------------------------------------------
IBM Deutschland Research & Development GmbH
Chairwoman of the Supervisory Board: Martina Koederitz
Management: Dirk Wittkopp
Registered office: Boeblingen
Court of registration: Amtsgericht Stuttgart, HRB 243294
From: "Costello, Roger L." <costello@mitre.org>
To: "xml-dev@lists.xml.org" <xml-dev@lists.xml.org>
Date: 12/28/2012 09:39 PM
Subject: [xml-dev] An XML document is not well-formed if encoding="..." does not match the actual encoding of the characters in the document, right?
Thanks Chris for pointing us to that article: XML on the Web has Failed
I am making my way through it.
This statement in the article piqued my interest:
... determining the actual character encoding of an
XML document is a prerequisite for determining its
well-formedness ...
I decided to do an experiment.
I created this XML document and encoded each character in the document
using the iso-8859-1 encoding and in the encoding="..." I asserted that I
am using the iso-8859-1 encoding:
<?xml version="1.0" encoding="iso-8859-1"?>
<Name>López</Name>
I checked the document for well-formedness and the XML parser said it is
well-formed.
Good.
Then I changed encoding="iso-8859-1" to encoding="utf-8":
<?xml version="1.0" encoding="utf-8"?>
<Name>López</Name>
I checked it for well-formedness and the parser said it is still
well-formed.
Huh?
Shouldn't I have gotten a well-formedness error?
I did my experiment using the latest version of Oxygen XML. I think that it
uses the Xerces XML Parser, right?
Is this a bug in Xerces?
/Roger
_______________________________________________________________________
XML-DEV is a publicly archived, unmoderated list hosted by OASIS
to support XML implementation and development. To minimize
spam in the archives, you must subscribe before posting.
[Un]Subscribe/change address: http://www.oasis-open.org/mlmanage/
Or unsubscribe: xml-dev-unsubscribe@lists.xml.org
subscribe: xml-dev-subscribe@lists.xml.org
List archive: http://lists.xml.org/archives/xml-dev/
List Guidelines: http://www.oasis-open.org/maillists/guidelines.php