RE: [xml-dev] An XML document is not well-formed if encoding="..." does not match the actual encoding of the characters in the document, right?

From: "Costello, Roger L." <costello@mitre.org>
To: "xml-dev@lists.xml.org" <xml-dev@lists.xml.org>
Date: Sat, 29 Dec 2012 14:38:10 +0000

Michael Sokolov wrote:

    We can't really tell what's going on without 
    access to your entire tool chain.  (It's unlikely 
    that the encoding of the characters in this 
    email is byte-identical with the files you 
    created.) It's possible that your editor changed 
    the character encoding of your text when you 
    changed the XML declaration (emacs does this)!

Thanks Michael, Michael, and Hermann. You guys nailed the problem. A whole bunch of encoding transformations must be happening under the hood inside Oxygen XML.

I followed Hermann's lead and performed an identity transformation from a DOS command line and I got the desired error message.

--------------------
Lessons Learned
--------------------
1. If encoding="..." does not match the actual encoding of the characters in the XML document, then the XML parser should raise an error.

2. Integrated Development Environments (IDEs) may perform character encoding conversions behind the scenes, so the mismatch never reaches the parser and no error is raised.
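
A minimal sketch of lesson 1, in Python rather than the Oxygen/Xerces setup used in this thread (an assumption made purely for illustration). It builds the bytes directly, so no editor can re-encode them, and hands them to the standard-library parser, which is backed by Expat, not Xerces. The helper name parse_name is hypothetical, and the name is assumed to be López.

```python
import xml.etree.ElementTree as ET

# "López" as ISO-8859-1 bytes: the ó (U+00F3) is the single byte 0xF3.
BODY = "<Name>L\u00f3pez</Name>".encode("iso-8859-1")


def parse_name(declared_encoding: str) -> str:
    """Parse BODY behind an XML declaration naming declared_encoding.

    parse_name is a hypothetical helper for this sketch. The bytes of
    BODY never change; only the encoding="..." label does.
    """
    declaration = f'<?xml version="1.0" encoding="{declared_encoding}"?>\n'
    document = declaration.encode("ascii") + BODY
    return ET.fromstring(document).text


# Label matches the bytes: the document parses.
print(ascii(parse_name("iso-8859-1")))

# Label says utf-8, but 0xF3 followed by "p" is not valid UTF-8,
# so the parser rejects the document.
try:
    parse_name("utf-8")
except ET.ParseError as err:
    print("not well-formed:", err)
```

Passing bytes rather than a str matters here: once text has been decoded into a string, the original byte-level mismatch is gone, which is roughly what happens when an IDE converts the file before the parser sees it.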

--------------------------------------------------------------------------
Mind-bogglingly Fascinating Statements Made In This Thread
--------------------------------------------------------------------------
    ... around 1972 ... "Why is character coding so complicated?" ... it hasn't 
    become any simpler in the intervening 40 years.

    To be able to track end-to-end the path of conversions and validate that 
    your application from authoring through to storage through to search 
    and retrieval is completely correct is amazingly difficult ... it's a skill 
    far too few programmers have, or even recognize that they do not have.

    We can't really tell what's going on without access to your entire tool chain.  

    It's possible that your editor changed the character encoding of your text 
    when you changed the XML declaration (emacs does this)!

    It's unlikely that the encoding of the characters in this email is byte-identical 
    with the files you created.

    ... my preferred solution is to stick to a single encoding everywhere ...

    ... I vote for UTF-8 ... 

    ... make sure *every single link in the chain* uses that encoding.

-----------
Question
-----------
This outstanding discussion has awakened me to the problems caused by the multiplicity of character encodings and the huge number of character encoding conversions taking place behind the scenes.

Is the solution to simply eliminate the need for conversions by mandating that every application, every IDE, every text editor, and every system worldwide adopt one character encoding, UTF-8? Is that a realistic solution? If so, what is the timeframe in which it could be achieved?

/Roger


-----Original Message-----
From: Michael Sokolov [mailto:sokolov@ifactory.com] 
Sent: Friday, December 28, 2012 4:29 PM
To: Costello, Roger L.
Cc: xml-dev@lists.xml.org
Subject: Re: [xml-dev] An XML document is not well-formed if encoding="..." does not match the actual encoding of the characters in the document, right?

Your experiment illustrates David Lee's point regarding the difficulty 
of this whole problem.  We can't really tell what's going on without 
access to your entire toolchain.  (It's unlikely that the encoding of 
the characters in this email is byte-identical with the files you 
created.) It's possible that your editor changed the character encoding 
of your text when you changed the XML declaration (emacs does this)!

It's also possible (I haven't checked) that the bytes in your text are 
valid UTF-8 *and* valid ISO-8859-1, although they would represent 
different characters in the two systems.

-Mike
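
A small sketch of the "valid in both encodings" case described above, in Python (an assumption; nothing in the thread uses it). Every byte sequence decodes under ISO-8859-1, so bytes that are valid UTF-8 are always valid ISO-8859-1 too, just as different characters. The reverse does not hold in general: assuming the name in the experiment is López stored as ISO-8859-1, its 0xF3 byte followed by "p" is not valid UTF-8.

```python
# "López" as UTF-8 bytes: the ó (U+00F3) becomes the two bytes 0xC3 0xB3.
data = "L\u00f3pez".encode("utf-8")

# The same bytes decode without error under both encodings...
as_utf8 = data.decode("utf-8")
as_latin1 = data.decode("iso-8859-1")

# ...but they yield different characters: 0xC3 0xB3 is one character in
# UTF-8 and two characters (U+00C3, U+00B3) in ISO-8859-1.
print(ascii(as_utf8))
print(ascii(as_latin1))
print(len(as_utf8), len(as_latin1))
```

Because nothing fails to decode in this direction, a parser handed UTF-8 bytes under an iso-8859-1 label has no byte-level evidence of a mismatch and can only produce the wrong characters silently.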

On 12/28/2012 3:37 PM, Costello, Roger L. wrote:
> Thanks Chris for pointing us to that article: XML on the Web has Failed
>
> I am making my way through it.
>
> This statement in the article piqued my interest:
>
>      ... determining the actual character encoding of an
>      XML document is a prerequisite for determining its
>      well-formedness ...
>
> I decided to do an experiment.
>
> I created this XML document, encoded every character in it as iso-8859-1, and declared encoding="iso-8859-1" in the XML declaration:
>
> <?xml version="1.0" encoding="iso-8859-1"?>
> <Name>López</Name>
>
> I checked the document for well-formedness and the XML parser said it is well-formed.
>
> Good.
>
> Then I changed encoding="iso-8859-1" to encoding="utf-8":
>
> <?xml version="1.0" encoding="utf-8"?>
> <Name>López</Name>
>
> I checked it for well-formedness and the parser said it is still well-formed.
>
> Huh?
>
> Shouldn't I have gotten a well-formedness error?
>
> I did my experiment using the latest version of Oxygen XML. I think that it uses the Xerces XML Parser, right?
>
> Is this a bug in Xerces?
>
