XML.org
OASIS Mailing List Archives

 


Re: [xml-dev] Here's how to process XML documents written in German

IMHO an excellent analogy for what happens if you start focusing on how characters look is the Whitespace language.






Sent from my iPad (excuse the terseness) 
David A Lee


On Jan 30, 2013, at 11:45 AM, "David Lee" <dlee@calldei.com> wrote:

IMHO this is misdirection, and the hurt is not solved by your suggestion.
The real solution is to use XML properly in the first place, by using consistent QNames.
"A fence on the hill or an ambulance down in the valley"

Validation would catch this error ...
It is entirely irrelevant whether the visual glyphs of two codepoints look the same.
It's also irrelevant if 2 codepoint sequences produce the same "character".
XML is defined wrt codepoints.  Like it or not.
It is fundamentally the same with most programming languages, regardless of whether the printer makes O and 0 look similar, or l and 1.   Typographic similarity does not a codepoint equality make.

Trying to get around that will just cause more pain than learning how the language is designed.
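
To make the validation point concrete, here is a minimal Python sketch. It is not a real schema validator (a DTD, XSD or RELAX NG schema would normally do this job); it only checks attribute names against an allowed vocabulary, codepoint for codepoint. The names ALLOWED_ATTRIBUTES and unexpected_attributes are hypothetical, and the sample document assumes the second Posten spells its attribute with 'a' + U+0308.

```python
import xml.etree.ElementTree as ET

# Hypothetical vocabulary: the only attribute a Posten may carry,
# spelled with the precomposed ä (U+00E4).
ALLOWED_ATTRIBUTES = {"w\u00e4hrung"}

def unexpected_attributes(xml_text):
    """Return attribute names on Posten elements that are not in the vocabulary."""
    root = ET.fromstring(xml_text)
    found = []
    for posten in root.iter("Posten"):
        for name in posten.attrib:
            # Plain codepoint comparison, no normalization: this mirrors
            # how XML itself compares names.
            if name not in ALLOWED_ATTRIBUTES:
                found.append(name)
    return found

# Assumed sample: the second attribute name uses 'a' + combining diaeresis.
SAMPLE = (
    "<Kontrakt>"
    '<Posten w\u00e4hrung="EUR">23.45</Posten>'
    '<Posten wa\u0308hrung="EUR">45.00</Posten>'
    "</Kontrakt>"
)

bad_names = unexpected_attributes(SAMPLE)
print(len(bad_names))  # 1: the decomposed spelling is rejected
```

The decomposed spelling is reported as an unknown attribute at the point of receipt, instead of being silently skipped later by a query.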





Sent from my iPad (excuse the terseness)
David A Lee
dlee@calldei.com


On Jan 30, 2013, at 10:49 AM, "Costello, Roger L." <costello@mitre.org> wrote:

Hi Folks,

Thanks to Wolfgang Laun for some German translations.

Scenario: Your application has just received this XML document containing a contract written in German:

  <?xml version="1.0" encoding="UTF-8"?>
  <Kontrakt>
            <Posten währung="EUR">23.45</Posten>
            <Posten währung="EUR">45.00</Posten>
            <Posten währung="USD">39.99</Posten>
            <Posten>99.00</Posten>
            <Posten monetär-allianz="EUR">66.66</Posten>
  </Kontrakt>

Your application wants to compute the sum of all the items (Posten) with currency (währung) in Euros.

Clearly the result should be:

  23.45 + 45.00 = 68.45

The application applies this XPath expression to the XML document:

  sum(//Posten[@währung eq 'EUR'])

The output is this:

  23.45

Wrong result!

What happened?

The XPath seems pretty straightforward:

  Give me the sum of all Posten
  elements that have an attribute
  währung equal to 'EUR'.

We need to dig into this a bit to see exactly what is going on.

First some background information:

According to Unicode the character ä can be represented in these equivalent ways:

1. As just ä (this is called a precomposed character)

2. As a combination of 'a' plus a "combining diaeresis" character

Visualization tools display both ways identically.

So even though these two attributes appear identical:

  währung="EUR"
  währung="EUR"

inside the computer the bytes (shown here in UTF-8) are very different:

  precomposed ä (U+00E4) is represented
  in the computer as these bytes: C3 A4

  'a' + combining diaeresis character (U+0308)
  is represented in the computer as these
  bytes: 61 CC 88
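
The byte values above can be checked with a few lines of Python. This is only an illustrative sketch using the standard unicodedata module; the variable names are mine.

```python
import unicodedata

precomposed = "\u00e4"   # ä as a single codepoint, U+00E4
decomposed = "a\u0308"   # 'a' followed by U+0308 COMBINING DIAERESIS

# The two strings usually render alike but are different codepoint sequences.
same_codepoints = precomposed == decomposed

# Their UTF-8 encodings differ as well.
precomposed_bytes = precomposed.encode("utf-8")   # C3 A4
decomposed_bytes = decomposed.encode("utf-8")     # 61 CC 88

# Normalizing to NFC composes 'a' + U+0308 into U+00E4.
same_after_nfc = unicodedata.normalize("NFC", decomposed) == precomposed

print(same_codepoints, same_after_nfc)
```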

"So what?" you ask. Well, the XPath engine matches names codepoint by codepoint, without any normalization, so to it the two spellings are different names. The XPath expression used the precomposed character, so the XPath tool found one währung="EUR" but not the other.
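
The same mismatch can be reproduced outside XPath. The sketch below uses Python's xml.etree.ElementTree rather than an XPath 2.0 engine, and it assumes the first Posten uses the precomposed ä while the second uses 'a' + U+0308 (the original message does not say which is which). The function name naive_sum is hypothetical.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal

# Assumption: first währung is precomposed (U+00E4), second is 'a' + U+0308.
DOC = (
    "<Kontrakt>"
    '<Posten w\u00e4hrung="EUR">23.45</Posten>'
    '<Posten wa\u0308hrung="EUR">45.00</Posten>'
    '<Posten w\u00e4hrung="USD">39.99</Posten>'
    "<Posten>99.00</Posten>"
    '<Posten monet\u00e4r-allianz="EUR">66.66</Posten>'
    "</Kontrakt>"
)

def naive_sum(xml_text):
    """Sum Posten values whose währung attribute (precomposed spelling) is EUR."""
    root = ET.fromstring(xml_text)
    total = Decimal("0")
    for posten in root.iter("Posten"):
        # Attribute lookup is an exact codepoint match on the name,
        # so the decomposed spelling is not found.
        if posten.get("w\u00e4hrung") == "EUR":
            total += Decimal(posten.text)
    return total

naive_total = naive_sum(DOC)
print(naive_total)  # 23.45
```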

How do we design the XPath to find all occurrences of währung="EUR" regardless of the Unicode form that is used?

Our XPath expression needs to express this:

  Give me the sum of all Posten
  elements that have an attribute
  whose name after normalization
  is währung and has a value equal
  to 'EUR'.

This XPath expression does the job:

sum(//Posten[@*[normalize-unicode(name(.)) eq normalize-unicode('währung')][. eq 'EUR']])

The normalize-unicode() function converts a string, here the attribute name, into a standard normalized form (NFC by default), so both spellings of währung compare equal.
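
For readers without an XPath 2.0 engine, the same idea can be sketched in Python with unicodedata.normalize. This is an approximation of the XPath above, not a translation of it: the function name normalized_sum and its parameters are hypothetical, and the sample document assumes the second Posten uses 'a' + U+0308.

```python
import unicodedata
import xml.etree.ElementTree as ET
from decimal import Decimal

def normalized_sum(xml_text, attr_name="w\u00e4hrung", wanted="EUR"):
    """Sum Posten values having an attribute whose NFC-normalized name
    equals attr_name and whose value equals wanted."""
    target = unicodedata.normalize("NFC", attr_name)
    root = ET.fromstring(xml_text)
    total = Decimal("0")
    for posten in root.iter("Posten"):
        # Count each Posten at most once, as the XPath predicate does.
        if any(
            unicodedata.normalize("NFC", name) == target and value == wanted
            for name, value in posten.attrib.items()
        ):
            total += Decimal(posten.text)
    return total

# Assumption: first währung is precomposed (U+00E4), second is 'a' + U+0308.
CONTRACT = (
    "<Kontrakt>"
    '<Posten w\u00e4hrung="EUR">23.45</Posten>'
    '<Posten wa\u0308hrung="EUR">45.00</Posten>'
    '<Posten w\u00e4hrung="USD">39.99</Posten>'
    "<Posten>99.00</Posten>"
    '<Posten monet\u00e4r-allianz="EUR">66.66</Posten>'
    "</Kontrakt>"
)

normalized_total = normalized_sum(CONTRACT)
print(normalized_total)  # 68.45
```

Decimal is used instead of float so that the total prints exactly as in the worked sum.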

Lesson Learned:

When processing markup with diacritical marks, beware that two characters may visually appear the same but inside the computer they are represented very differently. Design XPath expressions accordingly -- use normalize-unicode() to convert markup into canonical form.

/Roger

