Caution! XML parsers behave differently with whitespace specified directly in an attribute value versus whitespace specified via an ENTITY

Hi Folks,

I created a schema which declares an element “test” whose value must be the string Column#1, followed by a tab character (hex 9), followed by Column#2:

<xs:element name="test">
    <xs:simpleType>
        <xs:restriction base="xs:string">
            <xs:pattern value="Column#1&#x09;Column#2" />
        </xs:restriction>
    </xs:simpleType>
</xs:element>

This XML document conforms to the schema:

<test>Column#1&#x09;Column#2</test>

Good.

Next, I decided to do some abstraction: I created an ENTITY for the tab character and then used the entity in the declaration of the “test” element:

<!DOCTYPE xs:schema [
    <!ENTITY TAB '&#x09;'>
]>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
    <xs:element name="test">
        <xs:simpleType>
            <xs:restriction base="xs:string">
                <xs:pattern value="Column#1&TAB;Column#2" />
            </xs:restriction>
        </xs:simpleType>
    </xs:element>
</xs:schema>

I validated the above XML document against the new schema and got this error message:

                The instance document has the content Column#1\tColumn#2,
                which does not match the pattern facet Column#1 Column#2.

Huh? My schema didn’t specify a space character between Column#1 and Column#2.

Michael Kay and Ken Holman filled me in on what’s happening. In the second schema (the one using the ENTITY) the pattern facet’s attribute value (Column#1&TAB;Column#2) is being “normalized” by the XML parser: the tab character that arrives via the entity’s replacement text is replaced by a space character. Oddly, in the first schema the tab survives, because a tab written directly in the attribute value as a character reference (&#x09;) is exempt from that replacement. That seemingly arbitrary behavior comes from the attribute-value normalization rules in the XML specification. [Lessons learned: (1) Writing a good specification is really, really hard. (2) When writing a specification you must nail down every last detail.] Michael Kay explains it this way:

            It's called attribute value normalization, and is described in the
            XML specification. It's part of the bizarreness of XML not being able
            to define consistently whether and when whitespace is significant.
            If you write a newline character entity explicitly in an attribute
            value, then it decides you probably intended it, but if a newline
            gets in there by expanding an entity reference, it decides that you
            probably didn't.

Yikes!
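
To see the normalization without involving a schema validator at all, here is a minimal sketch in Python using the standard library's xml.etree.ElementTree (which sits on top of the expat parser). It is not part of the original experiment: the bare "pattern" element, the helper name pattern_value, and the variable names are made up for illustration. The third case, which declares the entity as '&#38;#x09;', is an assumption-labeled workaround based on how the XML specification treats character references inside entity declarations; it was not something discussed in the thread.

```python
import xml.etree.ElementTree as ET


def pattern_value(xml_text):
    """Parse a one-element document and return its 'value' attribute.

    Hypothetical helper for this illustration only.
    """
    return ET.fromstring(xml_text).get("value")


# Case 1: the tab is written directly as a character reference.
direct = '<pattern value="Column#1&#x09;Column#2"/>'

# Case 2: the tab arrives through an entity. The entity's replacement
# text holds a literal tab, which normalization turns into a space.
via_entity = """<!DOCTYPE pattern [
    <!ENTITY TAB '&#x09;'>
]>
<pattern value="Column#1&TAB;Column#2"/>"""

# Case 3 (assumed workaround, not from the thread): escape the ampersand
# so the replacement text is the six characters &#x09; and the character
# reference is only resolved inside the attribute value.
via_escaped_entity = """<!DOCTYPE pattern [
    <!ENTITY TAB '&#38;#x09;'>
]>
<pattern value="Column#1&TAB;Column#2"/>"""

direct_value = pattern_value(direct)
entity_value = pattern_value(via_entity)
escaped_value = pattern_value(via_escaped_entity)

for label, value in [
    ("direct", direct_value),
    ("via entity", entity_value),
    ("escaped entity", escaped_value),
]:
    # repr() makes the difference between '\t' and ' ' visible.
    print(f"{label:<16} {value!r}")
```

With a conforming parser, the first and third values keep the tab while the second contains a space, which is exactly the mismatch the validator reported.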

/Roger

Copyright 1993-2007 XML.org. This site is hosted by OASIS