Why does validation fail with a named ENTITY for carriage return and line feed?
- From: "Costello, Roger L." <costello@mitre.org>
- To: "xml-dev@lists.xml.org" <xml-dev@lists.xml.org>
- Date: Wed, 24 Oct 2012 16:17:04 +0000
Hi Folks,
I want to specify the format of a "From:" field for email messages. The requirement is:
1. It starts with the literal "From:"
2. Then there are one or more characters, a - z
3. Then the @ symbol
4. Then there are one or more characters, a - z or a period
5. Then there is a carriage return (decimal 13) followed by a line feed (decimal 10)
A regular expression in the XML Schema pattern facet is well-suited for expressing that requirement:
<xs:element name="from">
    <xs:simpleType>
        <xs:restriction base="xs:string">
            <xs:pattern value="From:[a-z]+@[a-z\.]+&#13;&#10;"/>
        </xs:restriction>
    </xs:simpleType>
</xs:element>
Great.
Here is a sample instance document:
<from>From:jdoe@machine.example&#13;&#10;</from>
That validates beautifully against the XML Schema.
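As a rough cross-check outside any schema validator, the pattern can be tried with Python's re module. This is only a sketch under assumptions: Python's regex dialect is not the XML Schema one, re.fullmatch stands in for the implicit whole-value anchoring of the pattern facet, and the names FROM_PATTERN and sample are made up for illustration.

```python
import re

# The pattern facet as a schema processor would see it after XML parsing:
# the two character references have become a real CR and a real LF.
FROM_PATTERN = re.compile("From:[a-z]+@[a-z\\.]+\r\n")

# Content of <from> after parsing: again a real CR followed by a real LF.
sample = "From:jdoe@machine.example\r\n"

# A pattern facet must match the whole value; fullmatch approximates that.
matches = FROM_PATTERN.fullmatch(sample) is not None
print(matches)
```

With both the pattern and the value ending in CR LF, the match succeeds, which agrees with what the validator reports.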
Now, many email fields must end with CRLF so I declared an XML ENTITY that I can reuse:
<!ENTITY CRLF "&#13;&#10;">
I then changed the pattern facet to reference the named ENTITY:
<xs:element name="from">
    <xs:simpleType>
        <xs:restriction base="xs:string">
            <xs:pattern value="From:[a-z]+@[a-z\.]+&CRLF;"/>
        </xs:restriction>
    </xs:simpleType>
</xs:element>
When I validate the above instance document I get this error:
The content "From:jdoe@machine.example\r\n"
of element <from> does not match the required
simple type. Value "From:jdoe@machine.example\r\n"
contravenes the pattern facet "From:[a-z]+@[a-z\.]+  "
of the type of element <from>.
Huh?
What's going on?
Why does the instance document validate when the character references are written directly in the pattern facet, but fail validation when a named ENTITY is used in the pattern facet instead?
The problem is not with the XML Schema validator. The problem is at a lower level. The problem is with the XML Parser.
Look again at the pattern facet:
<xs:pattern value="From:[a-z]+@[a-z\.]+&CRLF;"/>
Ignore the fact that it is XML Schema stuff. It is XML. We have an element <xs:pattern> and it has one attribute, value, which has this value: From:[a-z]+@[a-z\.]+&CRLF;
What does an XML parser do to attribute values? Answer: it normalizes attribute values. (http://www.w3.org/TR/REC-xml/#AVNormalize)
The XML normalization algorithm says this:
For an entity reference, recursively apply step 3
of this algorithm to the replacement text of the entity.
Okay, let's replace &CRLF; with its replacement text:
<xs:pattern value="From:[a-z]+@[a-z\.]+&#13;&#10;"/>
The normalization algorithm then says:
For a white space character (#32, #13, #10, #9),
append a space character (#32) to the normalized value.
Okay, that yields:
<xs:pattern value="From:[a-z]+@[a-z\.]+  "/>
Note the two spaces at the end of the regular expression.
So normalization of this:
<xs:pattern value="From:[a-z]+@[a-z\.]+&CRLF;"/>
produces this:
<xs:pattern value="From:[a-z]+@[a-z\.]+  "/>
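The step-3 rules of the normalization algorithm can be sketched in a few lines of Python. This is not a real XML parser, only an illustration under assumptions: the names normalize_attr, _TOKEN and entities are hypothetical, only the CDATA attribute case is covered, line-end normalization of the raw input is not modeled, and the replacement text of CRLF is assumed to be the literal CR and LF characters (the character references in the entity declaration having already been expanded). Besides the two rules quoted above, the sketch includes the specification's rule for character references, which appends the referenced character itself.

```python
import re

# One token per step-3 item: a hex or decimal character reference,
# a named entity reference, or any other single character.
_TOKEN = re.compile(
    r"&#x([0-9A-Fa-f]+);|&#([0-9]+);|&([A-Za-z_][\w.\-]*);|(.)",
    re.DOTALL,
)

def normalize_attr(raw, entities):
    """Sketch of step 3 of XML attribute-value normalization (CDATA case)."""
    out = []
    for hex_ref, dec_ref, name, char in _TOKEN.findall(raw):
        if hex_ref:
            # Character reference: append the referenced character as-is.
            out.append(chr(int(hex_ref, 16)))
        elif dec_ref:
            out.append(chr(int(dec_ref)))
        elif name:
            # Entity reference: recursively apply step 3 to the replacement text.
            out.append(normalize_attr(entities[name], entities))
        elif char in " \r\n\t":
            # Literal white space character: append a single space.
            out.append(" ")
        else:
            out.append(char)
    return "".join(out)

# Assumed replacement text of the CRLF entity: a literal CR and LF.
entities = {"CRLF": "\r\n"}
normalized = normalize_attr("From:[a-z]+@[a-z\\.]+&CRLF;", entities)
print(repr(normalized))
```

Under these assumptions the normalized value ends in two space characters, matching the pattern facet quoted in the error message.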
Hold on!
Why doesn't this:
<xs:pattern value="From:[a-z]+@[a-z\.]+&#13;&#10;"/>
also normalize to this:
<xs:pattern value="From:[a-z]+@[a-z\.]+  "/>
I'm confused. Why does validation fail with named ENTITIES and succeed with character references?
/Roger