Re: [xml-dev] Caution! XML parsers behave differently with whitespace specified directly in attribute value versus whitespace specified via an ENTITY
- From: "G. Ken Holman" <gkholman@CraneSoftwrights.com>
- To: "Costello, Roger L." <costello@mitre.org>,"xml-dev@lists.xml.org" <xml-dev@lists.xml.org>
- Date: Fri, 08 Apr 2016 09:56:45 -0400
I think you've posted an incomplete story here, Roger, for those who
are not following the discussion on the XSL List.
At 2016-04-08 11:49 +0000, Costello, Roger L. wrote:
I created a schema which declares an element "test" whose value must
be the string: Column#1 tab (hex 9) Column#2:
<xs:element name="test">
<xs:simpleType>
<xs:restriction base="xs:string">
<xs:pattern value="Column#1&#x09;Column#2" />
Fine ... the sequence you want at the time it is being parsed is
"&#x09;" ... that's six characters.
</xs:restriction>
</xs:simpleType>
</xs:element>
This XML document conforms to the schema:
<test>Column#1	Column#2</test>
Good.
Next, I decided to do some abstraction: I created an ENTITY for the
tab character and then used the entity in the declaration of the
"test" element:
<!DOCTYPE xs:schema [
<!ENTITY TAB '&#x09;'>
You've only defined a replacement string of one character there, not six.
If you want six, you need to define six:
<!ENTITY TAB '&#38;#x09;'>
]>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xs:element name="test">
<xs:simpleType>
<xs:restriction base="xs:string">
<xs:pattern value="Column#1&TAB;Column#2" />
</xs:restriction>
</xs:simpleType>
</xs:element>
</xs:schema>
I validated the above XML document against the new schema and got
this error message:
The instance document has the content Column#1\tColumn#2,
which does not match the pattern facet Column#1 Column#2.
Huh? My schema didn't specify a space character between Column#1 and Column#2.
Michael Kay and Ken Holman filled me in on what's happening. In the
second schema (the one using the ENTITY) the pattern facet's
attribute value (Column#1&TAB;Column#2) is being "normalized" by the
XML parser. That is, the tab symbol is being replaced by the space symbol.
Yes.
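The failure can be reproduced without a schema validator. The sketch below is only an illustration under stated assumptions: it uses Python's expat-based xml.etree.ElementTree as the XML parser, a hypothetical one-element document standing in for the schema, and Python's re module as a rough stand-in for the XSD pattern facet (the two regex dialects differ, but not for this literal pattern).

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the schema: just the pattern facet, with the
# TAB entity declared the way the original schema declared it.
SCHEMA_WITH_ENTITY = """<!DOCTYPE pattern [
<!ENTITY TAB '&#x09;'>
]>
<pattern value="Column#1&TAB;Column#2"/>"""

# Content of the instance document: the two columns separated by a real tab.
INSTANCE_TEXT = "Column#1\tColumn#2"


def pattern_value(schema_text):
    """Return the pattern attribute as the XML parser reports it."""
    return ET.fromstring(schema_text).get("value")


pattern = pattern_value(SCHEMA_WITH_ENTITY)
matches = re.fullmatch(pattern, INSTANCE_TEXT) is not None

print(repr(pattern))  # the separator the validator actually sees
print(matches)        # whether the tab-separated instance satisfies it
```

The validator never sees the tab: by the time the attribute value reaches it, the parser has already replaced the tab from the entity's replacement text with a space.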
Oddly, in the first schema the pattern facet's attribute value is
not normalized.
Actually, it is being normalized because preserving numeric character
references is part of normalization. I agree the numeric character
reference for the tab character isn't being translated, but
normalization is explicit about preserving the references.
That seemingly arbitrary behavior has to do with an incomplete
specification in the XML specification.
I disagree! https://www.w3.org/TR/REC-xml/#AVNormalize is quite
complete and doesn't leave anything ambiguous about the normalization
process. All processors are supposed to behave the same way.
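To see the rules of that section side by side, here is a minimal sketch, again assuming Python's expat-based xml.etree.ElementTree and a hypothetical element named p. It compares a literal tab, a numeric character reference, and an entity whose declaration uses the double-escaped form &#38;#x09; so that its replacement text is the six-character reference rather than a single tab.

```python
import xml.etree.ElementTree as ET


def parsed_value(document_text):
    """Return the 'value' attribute after the parser has normalized it."""
    return ET.fromstring(document_text).get("value")


# A real tab typed into the attribute (Python's \t puts a tab in the markup).
LITERAL_TAB = '<p value="a\tb"/>'

# A numeric character reference written directly in the attribute.
CHAR_REF = '<p value="a&#x09;b"/>'

# An entity whose replacement text is the six characters of the reference,
# because &#38; expands to "&" when the declaration itself is parsed.
ESCAPED_ENTITY = """<!DOCTYPE p [
<!ENTITY TAB '&#38;#x09;'>
]>
<p value="a&TAB;b"/>"""

for label, document_text in [
    ("literal tab", LITERAL_TAB),
    ("character reference", CHAR_REF),
    ("escaped entity", ESCAPED_ENTITY),
]:
    print(label, repr(parsed_value(document_text)))
```

With a conforming parser the literal tab comes back as a space, while both the direct reference and the escaped entity come back as a tab, which is the behavior the normalization rules prescribe.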
[Lessons learned: (1) Writing a good specification is really, really
hard. (2) When writing a specification you must nail down every last
detail.] Michael Kay explains it this way:
It's called attribute value normalization, and is described in the
XML specification. It's part of the bizarreness of XML not being able
to define consistently whether and when whitespace is significant.
If you write a newline character entity explicitly in an attribute
value, then it decides you probably intended it, but if a newline
gets in there by expanding an entity reference, it decides that you
probably didn't.
Yikes!
That is an appropriate description of the thinking, but that doesn't
imply that the specification is loose. The specification is very
tight in this regard, given the assumptions stated by Michael.
Entities have an important role in expressing what the author wants
in a way that is independent of any processing. I think voices that
claim entities are outdated are missing that argument.
. . . . . . . . Ken
--
Check our site for free XML, XSLT, XSL-FO and UBL developer resources |
Streaming hands-on XSLT/XPath 2 training @US$45: http://goo.gl/Dd9qBK |
Crane Softwrights Ltd. _ _ _ _ _ _ http://www.CraneSoftwrights.com/x/ |
G Ken Holman _ _ _ _ _ _ _ _ _ _ mailto:gkholman@CraneSoftwrights.com |
Google+ blog _ _ _ _ _ http://plus.google.com/+GKenHolman-Crane/posts |
Legal business disclaimers: _ _ http://www.CraneSoftwrights.com/legal |