Re: [xml-dev] Here's how to process XML documents written in German

[Date Prev] | [Thread Prev] | [Thread Next] | [Date Next] -- [Date Index] | [Thread Index]

From: Jim Melton <jim.melton@oracle.com>
To: "Costello, Roger L." <costello@mitre.org>
Date: Thu, 31 Jan 2013 15:23:36 -0700

Roger,

I'd like to observe that this is in no way an 
"XML problem", nor should an XML solution (e.g., 
identity transforms) be pursued.  This is a 
character encoding problem, specifically one of 
Unicode normalization forms.  If one deals with 
information sources that arbitrarily mix 
precomposed and decomposed forms of characters, or 
that choose characters based on their appearance 
in one medium or another, then one must decide 
how to resolve the problem based on those facts.

It's awfully difficult to devise an automatic way 
to determine whether a zero (0) in an identifier 
-- or in data -- was meant to be a capital O or 
not, so I won't pretend that there is a solution 
to this kind of carelessness.  Information 
sources that careless are as likely to use 
lowercase letter l (ell) for digit 1 (one), which 
would make, for example, addition mightily difficult.

However, the problem of characters sometimes 
presented in composed form and other times in 
decomposed form can be addressed by running the 
data (including data in XML representation) 
through a character normalization step before 
processing that data in any (other) 
application.  In my opinion, it is better to 
ensure that data received from any source (all 
sources) is correct than to write reams of XML 
code to compensate for bad data.
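
As a minimal sketch of such a normalization step, assuming Python and 
its standard unicodedata and xml.etree.ElementTree modules: the sample 
document is rebuilt from the contract quoted below, and the function 
name sum_euro_items is hypothetical.  The incoming text is normalized 
to NFC once, after which ordinary attribute lookups suffice.

```python
import unicodedata
import xml.etree.ElementTree as ET
from decimal import Decimal

# Hypothetical sample input: the second "währung" is written in
# decomposed form ('a' followed by U+0308 COMBINING DIAERESIS).
RAW = (
    "<Kontrakt>"
    "<Posten w\u00e4hrung='EUR'>23.45</Posten>"
    "<Posten wa\u0308hrung='EUR'>45.00</Posten>"
    "<Posten w\u00e4hrung='USD'>39.99</Posten>"
    "<Posten>99.00</Posten>"
    "<Posten monet\u00e4r-allianz='EUR'>66.66</Posten>"
    "</Kontrakt>"
)


def sum_euro_items(xml_text: str) -> Decimal:
    """Normalize the whole document to NFC, then sum the EUR items."""
    normalized = unicodedata.normalize("NFC", xml_text)
    root = ET.fromstring(normalized)
    key = "w\u00e4hrung"  # precomposed (NFC) spelling of the attribute name
    return sum(
        (Decimal(p.text) for p in root.iter("Posten") if p.get(key) == "EUR"),
        Decimal("0"),
    )


print(sum_euro_items(RAW))
```

The point of the design is that the normalization happens once, at the 
boundary, so nothing downstream needs to know about it.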

XML is merely a tool, not the answer to every 
problem on the planet.  It can be abused just as 
surely as a hammer can be used for murder 
(although one hopes not with similar effect!).

Hope this helps,
    Jim

On Jan 30, 2013, at 10:49 AM, "Costello, Roger L." <costello@mitre.org> wrote:

> > Hi Folks,
> >
> > Thanks to Wolfgang Laun for some German translations.
> >
> > Scenario: Your application has just received this XML document
> > containing a contract written in German:
> >
> >    <?xml version="1.0" encoding="UTF-8"?>
> >    <Kontrakt>
> >              <Posten währung="EUR">23.45</Posten>
> >              <Posten währung="EUR">45.00</Posten>
> >              <Posten währung="USD">39.99</Posten>
> >              <Posten>99.00</Posten>
> >              <Posten monetär-allianz="EUR">66.66</Posten>
> >    </Kontrakt>
> >
> > Your application wants to compute the sum of all the items (Posten)
> > with currency (währung) in Euros.
> >
> > Clearly the result should be:
> >
> >    23.45 + 45.00 = 68.45
> >
> > The application applies this XPath expression to the XML document:
> >
> >    sum(//Posten[@währung eq 'EUR'])
> >
> > The output is this:
> >
> >    23.45
> >
> > Wrong result!
> >
> > What happened?
> >
> > The XPath seems pretty straightforward:
> >
> >    Give me the sum of all Posten
> >    elements that have an attribute
> >    währung equal to 'EUR'.
> >
> > We need to dig into this a bit to see exactly what is going on.
> >
> > First some background information:
> >
> > According to Unicode the character ä can be represented in these
> > equivalent ways:
> >
> > 1. As just ä (this is called a precomposed character)
> >
> > 2. As a combination of 'a' plus a "combining diaeresis" character
> >
> > Visualization tools display both ways identically.
> >
> > So even though these two attributes appear identical:
> >
> >    währung="EUR"
> >    währung="EUR"
> >
> > inside the computer the bytes are very different:
> >
> >    ä is represented in the computer as
> >    these bytes: C3 A4
> >
> >    'a' + combining diaeresis character is
> >    represented in the computer as these
> >    bytes: 61 CC 88
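
The byte sequences quoted above can be checked with a short sketch, 
assuming Python 3.8 or later and UTF-8 as the encoding; the variable 
names are illustrative only.

```python
import unicodedata

precomposed = "\u00e4"   # ä as a single code point
decomposed = "a\u0308"   # 'a' followed by COMBINING DIAERESIS

# UTF-8 bytes of each form, printed as uppercase hex pairs
print(precomposed.encode("utf-8").hex(" ").upper())
print(decomposed.encode("utf-8").hex(" ").upper())

# The two strings are not equal until they are normalized
print(precomposed == decomposed)
print(unicodedata.normalize("NFC", decomposed) == precomposed)
```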
> >
> > "So what?" you ask. Well, the XPath engine compares strings code
> > point by code point (in effect, byte by byte), and the XPath
> > expression used the precomposed character, so the XPath tool found
> > one währung="EUR" but not the other.
> >
> > How do we design the XPath to find all occurrences of währung="EUR"
> > regardless of the Unicode form that is used?
> >
> > Our XPath expression needs to express this:
> >
> >    Give me the sum of all Posten
> >    elements that have an attribute
> >    whose name after normalization
> >    is währung and has a value equal
> >    to 'EUR'.
> >
> > This XPath expression does the job:
> >
> >    sum(//Posten[@*[normalize-unicode(name(.)) eq normalize-unicode('währung')][. eq 'EUR']])
> >
> > The normalize-unicode() function converts a string (here, the
> > attribute name) into a standard, canonical form (NFC by default).
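
For comparison, a rough Python analogue of that XPath expression might 
look like the sketch below.  It assumes the standard unicodedata and 
xml.etree.ElementTree modules; the helper names nfc and euro_items and 
the sample DOC are hypothetical.  Like the XPath, it leaves the 
document untouched and normalizes only the names being compared.

```python
import unicodedata
import xml.etree.ElementTree as ET

# Hypothetical sample: the second attribute name is in decomposed form.
DOC = (
    "<Kontrakt>"
    "<Posten w\u00e4hrung='EUR'>23.45</Posten>"
    "<Posten wa\u0308hrung='EUR'>45.00</Posten>"
    "<Posten w\u00e4hrung='USD'>39.99</Posten>"
    "</Kontrakt>"
)


def nfc(text: str) -> str:
    """Return the NFC form of text."""
    return unicodedata.normalize("NFC", text)


def euro_items(root: ET.Element, attr_name: str = "w\u00e4hrung") -> list:
    """Posten elements having an attribute whose normalized name
    equals the normalized attr_name and whose value is 'EUR'."""
    target = nfc(attr_name)
    return [
        posten
        for posten in root.iter("Posten")
        if any(
            nfc(name) == target and value == "EUR"
            for name, value in posten.attrib.items()
        )
    ]


matches = euro_items(ET.fromstring(DOC))
print([p.text for p in matches])
```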
> >
> > Lesson Learned:
> >
> > When processing markup with diacritical marks, beware that two
> > characters may visually appear the same but inside the computer
> > they are represented very differently. Design XPath expressions
> > accordingly -- use normalize-unicode() to convert markup into
> > canonical form.
> >
> > /Roger

========================================================================
Jim Melton --- Editor of ISO/IEC 9075-* (SQL)     Phone: +1.801.942.0144
   Chair, ISO/IEC JTC1/SC32 and W3C XML Query WG    Fax : +1.801.942.3345
Oracle Corporation        Oracle Email: jim dot melton at oracle dot com
1930 Viscounti Drive      Alternate email: jim dot melton at acm dot org
Sandy, UT 84093-1063 USA  Personal email: SheltieJim at xmission dot com
========================================================================
=  Facts are facts.   But any opinions expressed are the opinions      =
=  only of myself and may or may not reflect the opinions of anybody   =
=  else with whom I may or may not have discussed the issues at hand.  =
========================================================================

References:
- Here's how to process XML documents written in German
  - From: "Costello, Roger L." <costello@mitre.org>
- Re: [xml-dev] Here's how to process XML documents written in German
  - From: David Lee <dlee@calldei.com>
