Here's how to process XML documents written in German


From: "Costello, Roger L." <costello@mitre.org>
To: "xml-dev@lists.xml.org" <xml-dev@lists.xml.org>
Date: Wed, 30 Jan 2013 18:47:05 +0000

Hi Folks,

Thanks to Wolfgang Laun for some German translations. 

Scenario: Your application has just received this XML document containing a contract written in German:

	<?xml version="1.0" encoding="UTF-8"?>
	<Kontrakt>
    	      <Posten währung="EUR">23.45</Posten>
    	      <Posten währung="EUR">45.00</Posten>
    	      <Posten währung="USD">39.99</Posten>
    	      <Posten>99.00</Posten>
    	      <Posten monetär-allianz="EUR">66.66</Posten>
	</Kontrakt>

Your application wants to compute the sum of all the items (Posten) with currency (währung) in Euros. 

Clearly the result should be:

	23.45 + 45.00 = 68.45

The application applies this XPath expression to the XML document:

	sum(//Posten[@währung eq 'EUR'])

The output is this:

	23.45

Wrong result! 

What happened? 

The XPath seems pretty straightforward:

   	Give me the sum of all Posten
	elements that have an attribute 
	währung equal to 'EUR'.

We need to dig into this a bit to see exactly what is going on.

First some background information:

According to Unicode the character ä can be represented in these equivalent ways:

1. As just ä (this is called a precomposed character)

2. As a combination of 'a' plus a "combining diaeresis" character

Visualization tools display both ways identically. 

So even though these two attributes appear identical:

	währung="EUR"
	währung="EUR"

inside the computer the bytes (in UTF-8) are quite different:

	ä (precomposed, U+00E4) is represented 
	in the computer as these bytes: C3 A4

	'a' + combining diaeresis character 
	(U+0061 U+0308) is represented in the 
	computer as these bytes: 61 CC 88
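
To check this for yourself, here is a small Python sketch (Python rather than XPath, purely for illustration). It builds the two spellings of ä described above from explicit escapes and prints their UTF-8 bytes. The variable names precomposed, decomposed and nfc_equal are my own and appear nowhere in the scenario.

```python
import unicodedata

precomposed = "\u00e4"    # ä as a single code point
decomposed = "a\u0308"    # 'a' followed by COMBINING DIAERESIS

# The two strings render the same but differ code point for code point.
print(precomposed == decomposed)                      # False

# Their UTF-8 encodings are the byte sequences quoted above.
print(precomposed.encode("utf-8").hex(" ").upper())   # C3 A4
print(decomposed.encode("utf-8").hex(" ").upper())    # 61 CC 88

# NFC normalization maps both spellings to the precomposed form.
nfc_equal = (unicodedata.normalize("NFC", precomposed)
             == unicodedata.normalize("NFC", decomposed))
print(nfc_equal)                                      # True
```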

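"So what?" you ask. Well, the XPath engine matches names by comparing them code point by code point (which here amounts to comparing the bytes), and it does no normalization first. The XPath expression used the precomposed character, so the XPath engine found one währung="EUR" but not the other.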

How do we design the XPath to find all occurrences of währung="EUR" regardless of the Unicode form that is used?

Our XPath expression needs to express this:

   	Give me the sum of all Posten
	elements that have an attribute 
	whose name after normalization 
	is währung and has a value equal 
   	to 'EUR'.

This XPath expression does the job:

	sum(//Posten[@*[normalize-unicode(name(.)) eq normalize-unicode('währung')][. eq 'EUR']])

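The normalize-unicode() function converts a string (here, the attribute name) into a Unicode normalization form. With no second argument it uses NFC, in which 'a' plus combining diaeresis becomes the precomposed ä, so both spellings compare equal. Note that normalize-unicode() and the eq operator require XPath 2.0 or later.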
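
If you are not working in XPath 2.0, the same idea can be sketched in Python with the standard library. This is only an illustration under some assumptions of mine: the document below is a reconstruction in which the second Posten uses the decomposed spelling (the original does not say which one does), and the names DOC, nfc, sum_naive and sum_normalized are hypothetical helpers, not part of any library.

```python
import unicodedata
import xml.etree.ElementTree as ET
from decimal import Decimal

# Assumed reconstruction of the contract: the first and third währung use
# the precomposed ä (U+00E4); the second uses 'a' + U+0308.
DOC = (
    "<Kontrakt>"
    '<Posten w\u00e4hrung="EUR">23.45</Posten>'
    '<Posten wa\u0308hrung="EUR">45.00</Posten>'
    '<Posten w\u00e4hrung="USD">39.99</Posten>'
    "<Posten>99.00</Posten>"
    '<Posten monet\u00e4r-allianz="EUR">66.66</Posten>'
    "</Kontrakt>"
)

def nfc(s):
    """Return s in Unicode Normalization Form C."""
    return unicodedata.normalize("NFC", s)

def sum_naive(root, attr="w\u00e4hrung", value="EUR"):
    """Match the attribute name exactly, code point for code point."""
    return sum(
        (Decimal(p.text) for p in root.iter("Posten") if p.get(attr) == value),
        Decimal(0),
    )

def sum_normalized(root, attr="w\u00e4hrung", value="EUR"):
    """Match the attribute name after NFC-normalizing both sides."""
    target = nfc(attr)
    return sum(
        (Decimal(p.text) for p in root.iter("Posten")
         if any(nfc(name) == target and val == value
                for name, val in p.attrib.items())),
        Decimal(0),
    )

root = ET.fromstring(DOC.encode("utf-8"))
naive_total = sum_naive(root)
normalized_total = sum_normalized(root)
print(naive_total)       # 23.45
print(normalized_total)  # 68.45
```

Decimal is used instead of float so that the totals print exactly as the amounts are written in the contract.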

Lesson Learned:

When processing markup with diacritical marks, beware that two strings may look identical on screen yet be represented inside the computer by different sequences of code points. Design XPath expressions accordingly -- use normalize-unicode() to convert names (and, where needed, values) to a common normalization form before comparing them.

/Roger

Follow-Ups:
- Re: [xml-dev] Here's how to process XML documents written in German
  - From: "Tony Graham" <tgraham@mentea.net>
- Re: [xml-dev] Here's how to process XML documents written in German
  - From: Michael Kay <mike@saxonica.com>
- Re: [xml-dev] Here's how to process XML documents written in German
  - From: David Lee <dlee@calldei.com>
