xml-dev - Re: [xml-dev] Guidelines for handling of elements' content?


To: Ralf <ralfml@alfray.com>
Subject: Re: [xml-dev] Guidelines for handling of elements' content?
From: Jeni Tennison <jeni@jenitennison.com>
Date: Wed, 24 Sep 2003 10:57:26 +0100
Cc: xml-dev@lists.xml.org
Envelope-to: xml-dev@lists.xml.org
In-reply-to: <3F712BA5.8040202@alfray.com>
Organization: Jeni Tennison Consulting Ltd
References: <3F712BA5.8040202@alfray.com>
Reply-to: Jeni Tennison <jeni@jenitennison.com>

Hi Ralf,

> A typical example that confuses me:
>
> <text>    this
>    is a text   .
>      </text>
>
> What should I interpret here? One straight line, one line with one
> \n in the middle, or one with three \n ? What about the spaces before
> "is" and "this"?

You're talking here about text within a text-only element (one that
doesn't contain any other elements).

An XML Schema schema (or a RELAX NG schema, if you're using the XML
Schema datatypes) should make it clear how the whitespace in a
text-only element should be interpreted by an application. Or, put
another way, the Post Schema-Validation Infoset will contain a
"normalized value" for the element; an application should use this
normalized value rather than the original set of characters in the
element's content.

[Interestingly, the value of the xml:space attribute seems to be
completely ignored by XML Schema when constructing a normalized
value...]

How whitespace is treated is determined through the whiteSpace facet
of a datatype, which has three possible values:

  - preserve: keep all the whitespace

  - replace: replace all whitespace characters with spaces
  
  - collapse: as replace, then strip leading and trailing whitespace
    and collapse runs of spaces to a single space

Suppose that, under preserve, the value of the <text> element was:

  "   this&#xA;&#x9;is a text   .&#xA;&#x9;  "

[Note &#xA; is newline, &#x9; is tab.]
  
Under replace, it would be:

  "   this  is a text   .    "

and under collapse, it would be:

  "this is a text ."

[These kinds of whitespace normalization are inherited from the
treatment of whitespace in attributes in XML: the values of CDATA
attributes (and undeclared attributes) undergo 'replace' normalization
while the values of other types of attributes undergo 'collapse'
normalization.]
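
As a rough illustration, the three whiteSpace values described above
could be sketched in Python as follows. The function names are made up
for this example and are not part of any XML Schema API; the sketch
assumes the element content has already been extracted as a string.

```python
import re


def ws_preserve(value: str) -> str:
    """whiteSpace=preserve: leave the value untouched."""
    return value


def ws_replace(value: str) -> str:
    """whiteSpace=replace: turn each tab, newline and carriage return
    into a single space (one space per character replaced)."""
    return value.translate({0x9: " ", 0xA: " ", 0xD: " "})


def ws_collapse(value: str) -> str:
    """whiteSpace=collapse: do replace, then squeeze runs of spaces
    to one space and strip leading and trailing spaces."""
    return re.sub(" +", " ", ws_replace(value)).strip(" ")


# The preserved value from the example above (&#xA; is "\n", &#x9; is "\t").
raw = "   this\n\tis a text   .\n\t  "

print(repr(ws_replace(raw)))   # '   this  is a text   .    '
print(repr(ws_collapse(raw)))  # 'this is a text .'
```

Only the four XML whitespace characters are touched here, which is
why the sketch avoids Python's str.split() and str.strip() defaults
(those also treat other Unicode whitespace as whitespace).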
    
Of the XML Schema datatypes, most have whiteSpace=collapse, which
means that if you have an element like:

  <date xsi:type="xs:date">
    2003-11-24
  </date>

then the leading and trailing whitespace around the xs:date value is
ignored. (If it weren't ignored, the content wouldn't be a legal
xs:date value, because leading and trailing whitespace isn't allowed
in lexical representations of xs:date values.)
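
To make that concrete, here is a minimal sketch of an application
collapsing the content of the <date> element before parsing it. The
helper names are hypothetical, and the parser is a simplification: it
only handles plain YYYY-MM-DD values, not the timezone suffixes or
negative years that xs:date also allows.

```python
import re
from datetime import date


def collapse(value: str) -> str:
    """Collapse XML whitespace: runs of space, tab, LF and CR become
    one space, and leading and trailing spaces are dropped."""
    return re.sub(r"[ \t\n\r]+", " ", value).strip(" ")


def parse_xs_date(content: str) -> date:
    """Parse element content as a simple xs:date (YYYY-MM-DD only),
    applying the collapse normalization first."""
    return date.fromisoformat(collapse(content))


# Content of the <date> element above, including its line breaks and indent.
raw = "\n    2003-11-24\n  "
print(parse_xs_date(raw))  # 2003-11-24
```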

[Note that XPath's normalize-space() function also does collapse; to
do replace you need to use the translate() function.]

This default collapsing of whitespace is what you want when an element
contains a number or date. When you're looking at elements that
contain strings, you have three main choices of datatype that
correspond to the three levels of whitespace processing:

  - xs:string: preserves
  - xs:normalizedString: replaces
  - xs:token: collapses

What I'd advise is to usually use xs:token (collapsing whitespace).
When you have elements that contain things like people's names or a
brief description, you don't really care about whitespace. When
whitespace *is* significant, for example if you have a <poem> element
or a <code> element, in both of which lines and indents are
meaningful, then use the type xs:string (preserving whitespace). I
have yet to think of a good reason to use xs:normalizedString, but
perhaps someone else here can think of one.

Obviously the three levels of whitespace processing that are offered
by XML Schema aren't the only ones that you could possibly choose. For
example, the processing that you've adopted is:

> Currently what I do is ignore any whitespace (space, \t and \n)
> before and after the non-whitespace content. Everything inside I
> keep.

and in the "Holus Bolus" datatype library [1], John Cowan suggested
having a "remove" option, in which all whitespace gets removed.
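
For comparison, these two non-standard options might look like this
in Python. This is only a sketch: the function names are invented
here, and the "remove" behaviour is inferred from the one-line
description above rather than taken from the Holus Bolus proposal
itself.

```python
import re

# The whitespace characters XML recognises: space, tab, LF, CR.
XML_WS = " \t\n\r"


def trim_only(value: str) -> str:
    """Drop whitespace before and after the content, but keep every
    whitespace character inside it."""
    return value.strip(XML_WS)


def remove_all(value: str) -> str:
    """Delete every whitespace character, wherever it appears."""
    return re.sub(r"[ \t\n\r]", "", value)


raw = "   this\n\tis a text   .\n\t  "

print(repr(trim_only(raw)))   # 'this\n\tis a text   .'
print(repr(remove_all(raw)))  # 'thisisatext.'
```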

There's nothing wrong with these other kinds of whitespace processing,
if that's what you need for a particular element. However, personally
I'd stick with using one of the three standard methods of whitespace
normalization when processing text-only elements (and I'd usually use
'collapse' rather than 'replace' if I wasn't preserving all
whitespace).

Cheers,

Jeni

[1] http://lists.usefulinc.com/pipermail/lextypes/2003-July/000004.html

---
Jeni Tennison
http://www.jenitennison.com/
