Re: [xml-dev] Re: why whitespace counts as a node?

From: Mukul Gandhi <gandhi.mukul@gmail.com>
To: xml-dev@lists.xml.org
Date: Sun, 14 Nov 2010 21:01:54 +0530

I think the treatment of white-space in XML documents gets
interesting when the documents are validated against XML schemas.

Here are the cases I can think of where white-space matters:

1) If the XML document is parsed by a SAX parser, then the call-back
method "characters" (which gets notification of character data) will
receive all the characters in the character data (including the white-space).

When XML documents are parsed by a DOM parser, the text nodes likewise
contain all the white-space content.

Therefore XML parsing preserves white-space content in the infoset
instance that the parsing process produces. I think this is desirable in
plain XML parsing, since applications may want to do something
with the white-space too.
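
A quick way to check this is a minimal sketch with Python's standard
library (xml.sax and xml.dom.minidom; the choice of Python and the class
name Collector are mine, purely for illustration). It parses a small
document with both APIs and collects the character data each one reports:

```python
import xml.sax
from xml.dom import minidom

DOC = "<x>\n   100\n</x>"


class Collector(xml.sax.ContentHandler):
    """Hypothetical handler that records everything passed to characters()."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        # A parser may deliver one run of text in several chunks.
        self.chunks.append(content)


handler = Collector()
xml.sax.parseString(DOC.encode("utf-8"), handler)
sax_text = "".join(handler.chunks)

root = minidom.parseString(DOC).documentElement
dom_text = "".join(
    node.data for node in root.childNodes if node.nodeType == node.TEXT_NODE
)

print(repr(sax_text))
print(repr(dom_text))
```

Both strings still carry the line breaks and the indentation around 100.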

2) Things get a little more interesting when XML documents are validated
against, say, XML schema documents. Here are a few examples:

a)
<x>
   100
</x>

Here the content of element "x" is numeric, but there are boundary
white-spaces around the numeric value 100.

This will be successfully validated by the following XML schema fragment:

<xs:element name="x" type="xs:integer" />

b)
<x>
   hello world
</x>

Here there are boundary white-spaces within element "x".

c)
<x>hello world</x>

Here there are no boundary white-spaces within "x".

The following XML schema fragment,

<xs:element name="x">
    <xs:simpleType>
	 <xs:restriction base="xs:string">
	      <xs:maxLength value="11" />
	 </xs:restriction>
    </xs:simpleType>
</xs:element>

would report XML document (b) as invalid and (c) as valid. This is
because xs:string has the whiteSpace facet value "preserve", so the
boundary white-space in (b) counts towards the length of the value (16
characters as written above, against a maxLength of 11) and therefore
affects validity. Numeric types such as xs:integer have the whiteSpace
facet value "collapse", so an XML schema validator strips the leading
and trailing white-space before it checks the value.
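
The difference between the two types can be modelled with a small Python
sketch. This is not a schema validator: normalize() is a hypothetical
helper that only imitates the three values of the whiteSpace facet, and
the strings assume the documents contain exactly the white-space shown
in (a), (b) and (c) above.

```python
import re


def normalize(value, white_space):
    """Imitate the XSD whiteSpace facet (hypothetical helper, not a real API)."""
    if white_space == "preserve":
        return value
    # "replace": every tab, line feed and carriage return becomes a space.
    replaced = re.sub(r"[\t\n\r]", " ", value)
    if white_space == "replace":
        return replaced
    if white_space == "collapse":
        # "collapse": also squeeze runs of spaces and trim both ends.
        return re.sub(r" +", " ", replaced).strip(" ")
    raise ValueError("unknown whiteSpace value: " + white_space)


content_a = "\n   100\n"          # content of <x> in example (a)
content_b = "\n   hello world\n"  # content of <x> in example (b)
content_c = "hello world"         # content of <x> in example (c)

MAX_LENGTH = 11

# xs:integer collapses white-space before the value is checked.
int_value = int(normalize(content_a, "collapse"))

# xs:string preserves white-space, so it counts towards the length.
len_b = len(normalize(content_b, "preserve"))
len_c = len(normalize(content_c, "preserve"))
b_valid = len_b <= MAX_LENGTH
c_valid = len_c <= MAX_LENGTH

print(int_value, len_b, b_valid, len_c, c_valid)
```

If the restriction were based on xs:token instead of xs:string, the
collapse rule would apply and (b) would fit within the limit as well.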

On Sun, Nov 14, 2010 at 6:41 PM, Michael Kay <mike@saxonica.com> wrote:
>
>> Ok, so it does serve a purpose.  However, even in xhtml, if you want
>> white space in a paragraph of text, then you can put that whitespace
>> between tags.  I'm sure it's my lack of experience, but, for example,
>> when do you need that white space?
>>
> Once you accept the usefulness of inline markup like this:
>
> <p>I just <i>love</i> <place>London</place></p>
>
> then you have to accept that the space between "love" and "London" is just
> as significant as the one between "I" and "just".
>
> Some of the XML specs do try and recognize that whitespace in mixed content
> needs to be treated differently from whitespace in "element-only content"
> (like database dumps). But part of the XML philosophy is that XML instances
> can be used without having a schema or DTD, which means you don't always
> know whether it's mixed content or not. So you have to treat it as
> significant.
>
> This is one of the reasons it's best to avoid "non-standard" uses of mixed
> content like this:
>
> <date-of-birth>
> <source>birth-certificate</source>
>  1920-03-04
> </date-of-birth>
>
> Michael Kay
> Saxonica
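
Michael's inline-markup example can be tried out with a minimal Python
sketch (xml.etree.ElementTree from the standard library; the variable
names are mine). It compares the text of the paragraph with and without
the white-space-only text between the two inline elements:

```python
import xml.etree.ElementTree as ET

p = ET.fromstring("<p>I just <i>love</i> <place>London</place></p>")

# All character data in document order, white-space included.
kept = "".join(p.itertext())

# The same, but with white-space-only text runs thrown away.
stripped = "".join(text for text in p.itertext() if text.strip())

print(repr(kept))
print(repr(stripped))
```

Dropping the white-space-only text joins "love" and "London" into one
word, which is why a parser without a schema or DTD cannot safely
discard it.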

-- 
Regards,
Mukul Gandhi

Follow-Ups:
- RE: [xml-dev] Re: why whitespace counts as a node?
  - From: "Michael Sokolov" <sokolov@ifactory.com>

References:
- why whitespace counts as a node?
  - From: Thufir Hawat <hawat.thufir@gmail.com>
- Re: [xml-dev] why whitespace counts as a node?
  - From: David Carlisle <davidc@nag.co.uk>
- Re: why whitespace counts as a node?
  - From: Thufir Hawat <hawat.thufir@gmail.com>
- Re: [xml-dev] Re: why whitespace counts as a node?
  - From: Michael Kay <mike@saxonica.com>
