Here's how to process XML documents written in German
- From: "Costello, Roger L." <costello@mitre.org>
- To: "xml-dev@lists.xml.org" <xml-dev@lists.xml.org>
- Date: Wed, 30 Jan 2013 18:47:05 +0000
Hi Folks,
Thanks to Wolfgang Laun for some German translations.
Scenario: Your application has just received this XML document containing a contract written in German:
<?xml version="1.0" encoding="UTF-8"?>
<Kontrakt>
<Posten währung="EUR">23.45</Posten>
<Posten währung="EUR">45.00</Posten>
<Posten währung="USD">39.99</Posten>
<Posten>99.00</Posten>
<Posten monetär-allianz="EUR">66.66</Posten>
</Kontrakt>
Your application wants to compute the sum of all the items (Posten) with currency (währung) in Euros.
Clearly the result should be:
23.45 + 45.00 = 68.45
The application applies this XPath expression to the XML document:
sum(//Posten[@währung eq 'EUR'])
The output is this:
23.45
Wrong result!
What happened?
The XPath seems pretty straightforward:
Give me the sum of all Posten
elements that have an attribute
währung equal to 'EUR'.
We need to dig into this a bit to see exactly what is going on.
First some background information:
According to Unicode, the character ä can be represented in these two canonically equivalent ways:
1. As just ä, the single code point U+00E4 (this is called a precomposed character)
2. As a combination of 'a' (U+0061) plus a "combining diaeresis" character (U+0308)
Visualization tools display both ways identically.
So even though these two attributes appear identical:
währung="EUR"
währung="EUR"
inside the computer their UTF-8 bytes are very different:
ä (precomposed) is represented in the computer as
these bytes: C3 A4
'a' + combining diaeresis character is
represented in the computer as these
bytes: 61 CC 88
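Here is a small Python sketch of the byte-level difference just described (Python is my choice for illustration; it is not part of the original scenario). It uses only the standard unicodedata module, and the variable and function names are made up for this example.

```python
import unicodedata

# The two spellings of "ä" discussed above.
precomposed = "\u00e4"   # single code point: LATIN SMALL LETTER A WITH DIAERESIS
decomposed = "a\u0308"   # 'a' followed by COMBINING DIAERESIS


def hex_bytes(text):
    """Return the UTF-8 bytes of text as space-separated uppercase hex."""
    return " ".join(f"{b:02X}" for b in text.encode("utf-8"))


print(hex_bytes(precomposed))   # C3 A4
print(hex_bytes(decomposed))    # 61 CC 88

# A plain string comparison sees two different strings ...
same_raw = precomposed == decomposed

# ... but after normalizing both to NFC they compare equal.
same_nfc = (unicodedata.normalize("NFC", precomposed)
            == unicodedata.normalize("NFC", decomposed))

print(same_raw, same_nfc)       # False True
```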
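"So what?" you ask. Well, the XPath engine matches names by comparing them code point by code point, with no Unicode normalization. The XPath expression used the precomposed character, so the XPath tool found the währung="EUR" whose name uses the precomposed form but not the one whose name uses 'a' plus the combining diaeresis.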
How do we design the XPath to find all occurrences of währung="EUR" regardless of the Unicode form that is used?
Our XPath expression needs to express this:
Give me the sum of all Posten
elements that have an attribute
whose name after normalization
is währung and has a value equal
to 'EUR'.
This XPath expression does the job:
sum(//Posten[@*[normalize-unicode(name(.)) eq normalize-unicode('währung')][. eq 'EUR']])
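The normalize-unicode() function converts a string into a Unicode normalization form (NFC by default). Applying it to both the attribute name and the name we are looking for makes the two spellings of währung compare equal.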
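For readers who are not using an XPath 2.0 processor, the same idea can be sketched in Python with the standard library. This is only an illustration under my own assumptions: the names DOC, naive_sum and normalized_sum are hypothetical, and I am assuming that the second Posten is the one whose attribute name uses the decomposed form (that matches the 23.45 result above).

```python
import unicodedata
import xml.etree.ElementTree as ET
from decimal import Decimal

# Sample contract. Assumption: the second Posten spells its attribute name
# with 'a' + U+0308 (decomposed); the others use precomposed U+00E4.
DOC = (
    "<Kontrakt>"
    '<Posten w\u00e4hrung="EUR">23.45</Posten>'
    '<Posten wa\u0308hrung="EUR">45.00</Posten>'
    '<Posten w\u00e4hrung="USD">39.99</Posten>'
    "<Posten>99.00</Posten>"
    '<Posten monet\u00e4r-allianz="EUR">66.66</Posten>'
    "</Kontrakt>"
)


def naive_sum(xml_text, attr_name="w\u00e4hrung", wanted="EUR"):
    """Sum Posten values, matching the attribute name exactly as written."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    total = Decimal("0")
    for posten in root.iter("Posten"):
        if posten.get(attr_name) == wanted:
            total += Decimal(posten.text)
    return total


def normalized_sum(xml_text, attr_name="w\u00e4hrung", wanted="EUR"):
    """Sum Posten values, comparing attribute names after NFC normalization."""
    target = unicodedata.normalize("NFC", attr_name)
    root = ET.fromstring(xml_text.encode("utf-8"))
    total = Decimal("0")
    for posten in root.iter("Posten"):
        for name, value in posten.attrib.items():
            if unicodedata.normalize("NFC", name) == target and value == wanted:
                total += Decimal(posten.text)
                break  # count each Posten at most once
    return total


print(naive_sum(DOC))        # 23.45
print(normalized_sum(DOC))   # 68.45
```

Decimal is used instead of float so that the sum prints exactly as 68.45.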
Lesson Learned:
When processing markup with diacritical marks, beware that two strings may look identical on screen yet be stored as different sequences of code points. Design XPath expressions accordingly: use normalize-unicode() to convert names into a canonical form before comparing them.
/Roger