XML Schema Datatypes - token, string, normalizedString

XML Schema Datatypes - token, string, normalizedString - any difference?

From: "Costello, Roger L." <costello@mitre.org>
To: <xml-dev@lists.xml.org>
Date: Fri, 16 Feb 2007 07:29:39 -0500

Hi Folks,

Below is a rather intriguing thing I discovered yesterday about the XML
Schema datatypes token, string, and normalizedString.

Consider these three declarations of element Title:
 
      <element name="Title" type="token"/>

      <element name="Title" type="string"/>

      <element name="Title" type="normalizedString"/>

Note that each declaration uses a different datatype - token, string,
normalizedString.

Consider this instance of Title:

      <Title>_______</Title>

Will the above declarations produce any differences with regard to
validation?  That is, are there certain values that yield "valid" with
one datatype, but "invalid" with the other datatypes?

Scroll down to see the answer ....

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

One would intuitively think, "Of course they produce different
validation results.  After all, why have different datatypes if they
produce the same results?"

In fact they all yield the same validation results!

For example, this yields "valid"

       <Title>My Life and 
       Times</Title>

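for all three declarations. And this is rejected

      <Title>&#x0;</Title>

for all three declarations.  (Strictly speaking, #x0 is not a legal XML
character, so this document is not even well-formed; it is rejected by
the XML parser before any datatype is consulted.)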
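
A quick way to probe these two examples, as a sketch that uses only
Python's standard xml.etree.ElementTree parser (a plain XML parser, not
a schema validator; the helper name is_well_formed is made up for
illustration).  It suggests that the &#x0; case is stopped by the parser
itself, whichever datatype the schema declares:

```python
import xml.etree.ElementTree as ET


def is_well_formed(doc: str) -> bool:
    """Return True if the XML parser accepts the document (no schema involved)."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        # Raised for any well-formedness error, including an illegal
        # character reference.
        return False


# A Title that spans two lines: fine for the parser.
print(is_well_formed("<Title>My Life and \n       Times</Title>"))

# A character reference to #x0: refused by the parser.
print(is_well_formed("<Title>&#x0;</Title>"))
```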

No matter what value you put within <Title> all three datatypes yield
the same result.

Pretty weird, right?

Want to know why all three datatypes yield the same validation result?
Scroll down ...



















Here's what the datatypes specification says:

normalizedString: The value space of normalizedString is the set of
strings that do not contain the carriage return (#xD), line feed (#xA)
nor tab (#x9) characters.

token: The value space of token is the set of strings that do not
contain the carriage return (#xD), line feed (#xA) nor tab (#x9)
characters, that have no leading or trailing spaces (#x20) and that
have no internal sequences of two or more spaces.
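
Read literally, the two definitions above can be written as predicates
over a raw string.  This is only a sketch in Python; the function names
are mine, not the specification's, and no whitespace processing is
applied yet:

```python
def in_normalized_string_value_space(s: str) -> bool:
    # No carriage return, line feed, or tab anywhere in the string.
    return not any(ch in "\r\n\t" for ch in s)


def in_token_value_space(s: str) -> bool:
    # Everything normalizedString requires, plus: no leading or trailing
    # space and no run of two or more spaces.
    return (
        in_normalized_string_value_space(s)
        and not s.startswith(" ")
        and not s.endswith(" ")
        and "  " not in s
    )


print(in_normalized_string_value_space("My Life and \nTimes"))  # False
print(in_token_value_space("  My Life and Times   "))           # False
print(in_token_value_space("My Life and Times"))                # True
```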

From these descriptions, you might conclude that this is an illegal
normalizedString value

       <Title>My Life and 
       Times</Title>

since a normalizedString cannot contain a line break (carriage return
or line feed).

And you might conclude that this is an illegal token

       <Title>  My Life and Times   </Title>

since a token cannot have leading/trailing spaces.

However, they are both legal.  Here's why:

The default value of the whiteSpace facet for normalizedString is
"replace".  This means that before validating this instance

       <Title>My Life and 
       Times</Title>

the line break is replaced with a space (any extra spaces are kept, but
they are harmless in a normalizedString), to produce

       <Title>My Life and Times</Title>

And this is clearly a valid normalizedString.
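
The "replace" step can be sketched in a few lines of Python.  The
function name ws_replace and the sample value (one line feed, no
indentation) are assumptions for illustration:

```python
def ws_replace(value: str) -> str:
    """whiteSpace="replace": each tab, line feed, and carriage return
    becomes one space; nothing else is touched."""
    table = str.maketrans({"\t": " ", "\n": " ", "\r": " "})
    return value.translate(table)


raw = "My Life and\nTimes"  # assumed content: one line feed, no indentation
normalized = ws_replace(raw)

print(repr(normalized))  # 'My Life and Times'

# The normalized value contains none of the forbidden characters.
print(not any(ch in "\t\n\r" for ch in normalized))  # True
```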

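Likewise, the default value of the whiteSpace facet for token is
"collapse" (the same replacement, after which runs of spaces are reduced
to a single space and leading/trailing spaces are dropped).  This means
that before validating this instance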

       <Title>  My Life and Times   </Title>

the leading/trailing spaces are removed, to produce

       <Title>My Life and Times</Title>

And this is clearly a valid token.
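
The "collapse" step, again as a Python sketch with a made-up function
name (ws_collapse).  It first does the "replace" substitution, then
squeezes runs of spaces and trims both ends:

```python
import re


def ws_collapse(value: str) -> str:
    """whiteSpace="collapse": replace tab/LF/CR with spaces, reduce runs
    of spaces to one space, and drop leading and trailing spaces."""
    replaced = re.sub(r"[\t\n\r]", " ", value)
    return re.sub(r" {2,}", " ", replaced).strip(" ")


print(repr(ws_collapse("  My Life and Times   ")))        # 'My Life and Times'
print(repr(ws_collapse("My Life and \n       Times")))    # 'My Life and Times'
```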

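Summary:

The string datatype needs no such normalization: its whiteSpace facet is
"preserve" and its value space contains every string of legal XML
characters.  So, with regard to validation, these three forms are
identical: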

      <element name="Title" type="token"/>

      <element name="Title" type="string"/>

      <element name="Title" type="normalizedString"/>
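
To convince yourself, here is a small Python sketch that models each
datatype as "whitespace processing, then a value-space check".  The
names (WHITE_SPACE, normalize, in_value_space, accepts) and the sample
values are mine, and the sketch assumes the content has already passed
the XML parser, i.e. it contains only legal XML characters:

```python
import re

# whiteSpace facet value for each of the three datatypes.
WHITE_SPACE = {
    "string": "preserve",
    "normalizedString": "replace",
    "token": "collapse",
}


def normalize(value: str, mode: str) -> str:
    """Apply the whitespace processing named by mode."""
    if mode == "preserve":
        return value
    value = re.sub(r"[\t\n\r]", " ", value)  # the "replace" step
    if mode == "collapse":
        value = re.sub(r" {2,}", " ", value).strip(" ")
    return value


def in_value_space(value: str, datatype: str) -> bool:
    """Check the (already normalized) value against the value space."""
    if datatype == "string":
        return True
    if re.search(r"[\t\n\r]", value):
        return False
    if datatype == "token":
        return not (value.startswith(" ") or value.endswith(" ") or "  " in value)
    return True


def accepts(datatype: str, raw: str) -> bool:
    return in_value_space(normalize(raw, WHITE_SPACE[datatype]), datatype)


samples = ["My Life and \n       Times", "  My Life and Times   ", "", "\t\r\n"]
for raw in samples:
    print(repr(raw), [accepts(datatype, raw) for datatype in WHITE_SPACE])
```

Every sample comes out accepted by all three datatypes, because the
normalization always lands the value inside the value space before the
check is made.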

Note: the PSVI is different for each of the three forms (i.e., the PSVI
will tell you in the first case that the datatype is token, it will tell
you in the second case that the datatype is string, and so forth).  If
you are using RELAX NG (which does not generate a PSVI) then the three
forms are identical in every way.

Thanks to Michael Kay, George Bina and Jerry Sheehan for explaining
this to me.

/Roger

Follow-Ups:
- Re: [xml-dev] XML Schema Datatypes - token, string, normalizedString - any difference?
  - From: "derek denny-brown" <zuligag@gmail.com>
