Re: [xml-dev] Here's how to process XML documents written in German

From: Michael Kay <mike@saxonica.com>
To: xml-dev@lists.xml.org
Date: Wed, 30 Jan 2013 20:39:04 +0000

The odd thing is, people have known about this problem for years, but 
I've never once seen it happen in the real world. There are plenty of 
character encoding problems that cause a lot more hassle in real life 
than this one, which, as David Lee points out, is easily avoided by 
applying schema validation to the input (and using a schema-aware 
stylesheet, if you really want to be safe).

Michael Kay
Saxonica

On 30/01/2013 18:47, Costello, Roger L. wrote:
> Hi Folks,
>
> Thanks to Wolfgang Laun for some German translations.
>
> Scenario: Your application has just received this XML document containing a contract written in German:
>
> 	<?xml version="1.0" encoding="UTF-8"?>
> 	<Kontrakt>
>      	      <Posten währung="EUR">23.45</Posten>
>      	      <Posten währung="EUR">45.00</Posten>
>      	      <Posten währung="USD">39.99</Posten>
>      	      <Posten>99.00</Posten>
>      	      <Posten monetär-allianz="EUR">66.66</Posten>
> 	</Kontrakt>
>
> Your application wants to compute the sum of all the items (Posten) with currency (währung) in Euros.
>
> Clearly the result should be:
>
> 	23.45 + 45.00 = 68.45
>
> The application applies this XPath expression to the XML document:
>
> 	sum(//Posten[@währung eq 'EUR'])
>
> The output is this:
>
> 	23.45
>
> Wrong result!
>
> What happened?
>
> The XPath seems pretty straightforward:
>
>     	Give me the sum of all Posten
> 	elements that have an attribute
> 	währung equal to 'EUR'.
>
> We need to dig into this a bit to see exactly what is going on.
>
> First some background information:
>
> According to Unicode, the character ä can be represented in two canonically equivalent ways:
>
> 1. As just ä (a precomposed character, U+00E4)
>
> 2. As 'a' (U+0061) followed by a "combining diaeresis" character (U+0308)
>
> Visualization tools display both ways identically.
>
> So even though these two attributes appear identical:
>
> 	währung="EUR"
> 	währung="EUR"
>
> inside the computer the bytes are different. In UTF-8:
>
> 	ä (U+00E4) is represented as
> 	these bytes: C3 A4
>
> 	'a' (U+0061) + combining diaeresis (U+0308) is
> 	represented as these
> 	bytes: 61 CC 88
>
> "So what?" you ask. Well, the XPath engine matches names by comparing them code point by code point, without normalizing them first. The XPath expression used the precomposed character, so the XPath tool found one währung="EUR" but not the other.
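
The mismatch described above can be reproduced outside XPath. The following is a minimal Python sketch, not part of the original post; the variable names are illustrative, and the escapes \u00e4 and a\u0308 are used so that the two spellings stay distinguishable in the source text.

```python
import unicodedata

precomposed = "w\u00e4hrung"   # ä as the single code point U+00E4
decomposed = "wa\u0308hrung"   # 'a' (U+0061) followed by combining diaeresis (U+0308)

# The two spellings are different code point sequences, so a plain
# string comparison reports them as unequal.
same_as_typed = precomposed == decomposed

# Their UTF-8 encodings differ in exactly the bytes quoted in the post.
precomposed_bytes = "\u00e4".encode("utf-8").hex(" ").upper()
decomposed_bytes = "a\u0308".encode("utf-8").hex(" ").upper()

# After normalization to NFC both spellings become the same string.
same_after_nfc = (
    unicodedata.normalize("NFC", precomposed)
    == unicodedata.normalize("NFC", decomposed)
)

print(same_as_typed, precomposed_bytes, decomposed_bytes, same_after_nfc)
```

The comparison only succeeds once both sides have been brought into the same normalization form, which is the idea behind the normalize-unicode() approach in the post.
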
>
> How do we design the XPath to find all occurrences of währung="EUR" regardless of the Unicode form that is used?
>
> Our XPath expression needs to express this:
>
>     	Give me the sum of all Posten
> 	elements that have an attribute
> 	whose name after normalization
> 	is währung and has a value equal
>     	to 'EUR'.
>
> This XPath expression does the job:
>
> sum(//Posten[@*[normalize-unicode(name(.)) eq normalize-unicode('währung')][. eq 'EUR']])
>
> The normalize-unicode() function converts a string to a Unicode normalization form (NFC by default), so both spellings of the attribute name compare equal.
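
For readers without an XPath 2.0 processor at hand, the same idea can be sketched in Python with only the standard library. This is an illustration under stated assumptions, not the method from the post: the helper names nfc and sum_items are hypothetical, and the document below assumes that the first Posten uses the precomposed ä and the second the decomposed form, which is consistent with the reported output of 23.45.

```python
import unicodedata
import xml.etree.ElementTree as ET
from decimal import Decimal

# Assumption: first Posten uses precomposed ä (U+00E4), second uses
# 'a' + combining diaeresis (U+0308). Escapes keep the difference visible.
DOC = (
    "<Kontrakt>"
    '<Posten w\u00e4hrung="EUR">23.45</Posten>'
    '<Posten wa\u0308hrung="EUR">45.00</Posten>'
    '<Posten w\u00e4hrung="USD">39.99</Posten>'
    "<Posten>99.00</Posten>"
    '<Posten monet\u00e4r-allianz="EUR">66.66</Posten>'
    "</Kontrakt>"
)


def nfc(text):
    """Return text in Unicode Normalization Form C."""
    return unicodedata.normalize("NFC", text)


def sum_items(root, attr_name, value, normalize_names):
    """Sum Posten elements having attribute attr_name equal to value.

    With normalize_names=True, attribute names are compared after NFC
    normalization; otherwise they are compared exactly as written.
    """
    wanted = nfc(attr_name) if normalize_names else attr_name
    total = Decimal("0")
    for item in root.iter("Posten"):
        for name, attr_value in item.attrib.items():
            candidate = nfc(name) if normalize_names else name
            if candidate == wanted and attr_value == value:
                total += Decimal(item.text)
                break  # count each Posten at most once
    return total


root = ET.fromstring(DOC)
naive_total = sum_items(root, "w\u00e4hrung", "EUR", normalize_names=False)
normalized_total = sum_items(root, "w\u00e4hrung", "EUR", normalize_names=True)
print(naive_total, normalized_total)
```

Decimal is used instead of float so that the amounts add up exactly. The naive comparison sees only the precomposed attribute, while the normalized comparison picks up both.
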
>
> Lesson Learned:
>
> When processing markup with diacritical marks, beware that two names may look identical on screen yet be stored as different code point sequences. Design XPath expressions accordingly -- use normalize-unicode() to bring names into a common normalization form before comparing them.
>
> /Roger

Follow-Ups:
- Re: [xml-dev] Here's how to process XML documents written in German
  - From: "Simon St.Laurent" <simonstl@simonstl.com>

References:
- Here's how to process XML documents written in German
  - From: "Costello, Roger L." <costello@mitre.org>
