XML.org: OASIS Mailing List Archives

 


Retain or discard whitespace surrounding an element?

[Definition] Lexer: a tool that inputs a linear sequence of characters and assembles them into meaningful groups (tokens). A lexer is also called a scanner or a tokenizer.

Hi Folks,

In the following XML document, what is the content of the <Document> element? 

<Document>
    <Test>Hello, world</Test>
</Document>

Is it: 

(a) Just the <Test> element?
(b) The whitespace following <Document>, plus the <Test> element, plus the whitespace (newline) following </Test>?

Should a lexer discard or retain the whitespace surrounding the <Test> element?

The answer is this: The content of the <Document> element could be either (a) or (b). Depending on the context, a lexer should either retain or discard the whitespace surrounding the <Test> element. The document alone does not say which. It is ambiguous.

Yikes!
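
To make the two readings concrete, here is a small sketch using Python's standard xml.etree.ElementTree. The helper name document_content and its keep_whitespace flag are made up for this illustration. ElementTree itself reports the whitespace (as .text and .tail) and leaves the decision to the caller.

```python
import xml.etree.ElementTree as ET

DOC = "<Document>\n    <Test>Hello, world</Test>\n</Document>"


def document_content(xml_text, keep_whitespace):
    """List the content of the root element as text runs and child tags.

    keep_whitespace is a hypothetical flag: True gives reading (b),
    False gives reading (a).
    """
    root = ET.fromstring(xml_text)
    items = []

    def add_text(text):
        # ElementTree uses None when there is no text at all.
        if text and (keep_whitespace or text.strip()):
            items.append(text)

    add_text(root.text)              # text before the first child
    for child in root:
        items.append(f"<{child.tag}>")
        add_text(child.tail)         # text after the child's end tag
    return items


print(document_content(DOC, keep_whitespace=False))
print(document_content(DOC, keep_whitespace=True))
```

With keep_whitespace=False the list holds only the <Test> element. With keep_whitespace=True it also holds the newline-plus-indent before <Test> and the newline after </Test>.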

If the XML document must conform to this XML Schema:

<xs:element name="Document">
    <xs:complexType>
        <xs:sequence>
            <xs:element name="Test" type="xs:string" />
        </xs:sequence>
    </xs:complexType>
</xs:element>

then the answer is (a). A lexer may safely discard the whitespace surrounding the <Test> element. The whitespace is not significant. Presumably the whitespace was placed there to make it easier for humans to read the document.

If the XML document must conform to this XML Schema:

<xs:element name="Document">
    <xs:complexType mixed="true">  <!-- Notice mixed="true" -->
        <xs:sequence>
            <xs:element name="Test" type="xs:string" />
        </xs:sequence>
    </xs:complexType>
</xs:element>

then the answer is (b). A lexer may not discard the whitespace surrounding the <Test> element. The whitespace is significant. Presumably the whitespace has some special meaning to applications that process the XML document.
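
The difference between the two schemas can be read mechanically. The sketch below is only an illustration: the function name allows_mixed_content is hypothetical, and it assumes the fragments above are wrapped in an xs:schema element that binds the xs prefix to the standard XML Schema namespace (the fragments in this message omit that wrapper). It handles only top-level element declarations with an anonymous complex type.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"

ELEMENT_ONLY = """
<xs:element name="Document">
    <xs:complexType>
        <xs:sequence>
            <xs:element name="Test" type="xs:string" />
        </xs:sequence>
    </xs:complexType>
</xs:element>
"""

MIXED = ELEMENT_ONLY.replace("<xs:complexType>", '<xs:complexType mixed="true">')


def wrap(fragment):
    # Assumption: supply the enclosing xs:schema element the fragments lack.
    return f'<xs:schema xmlns:xs="{XS}">{fragment}</xs:schema>'


def allows_mixed_content(xsd_text, element_name):
    """Return True if the named top-level element is declared mixed."""
    schema = ET.fromstring(xsd_text)
    for decl in schema.findall(f"{{{XS}}}element"):
        if decl.get("name") == element_name:
            ctype = decl.find(f"{{{XS}}}complexType")
            # xs:boolean accepts both "true" and "1".
            return ctype is not None and ctype.get("mixed") in ("true", "1")
    raise KeyError(element_name)


print(allows_mixed_content(wrap(ELEMENT_ONLY), "Document"))
print(allows_mixed_content(wrap(MIXED), "Document"))
```

The first call reports that <Document> has element-only content, the second that it has mixed content. That one bit is the out-of-band knowledge that decides between (a) and (b).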

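If the XML document is not associated with a schema (XSD, DTD, or RNG), then nothing declares the whitespace significant, and the answer is usually taken to be (a): the application may discard it. Strictly speaking, an XML processor still passes the whitespace through to the application (XML 1.0, section 2.10), so the discarding is the application's choice.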

So, sometimes the content of <Document> is one thing, sometimes it's another thing. This complicates lexers (and parsers) because they must have external, out-of-band knowledge about the document. Is that good language design?
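
Here is a rough sketch of what that dependence looks like inside a lexer. Everything in it is an assumption made for illustration: the name lex, the element_only parameter (the set of element names whose content is element-only, which the caller has to obtain from somewhere outside the document), and the deliberately naive regular expression, which ignores comments, CDATA sections, processing instructions, and ">" inside attribute values.

```python
import re

# Naive tokenizer: a tag is "<...>", anything else is a run of text.
TOKEN = re.compile(r"<[^>]+>|[^<]+")


def lex(xml_text, element_only):
    """Tokenize xml_text, dropping whitespace-only text inside the
    elements named in element_only (the out-of-band knowledge)."""
    tokens = []
    stack = []  # names of the currently open elements
    for match in TOKEN.finditer(xml_text):
        tok = match.group()
        if tok.startswith("</"):
            stack.pop()
            tokens.append(tok)
        elif tok.startswith("<"):
            name = tok[1:-1].split()[0].rstrip("/")
            if not tok.endswith("/>"):
                stack.append(name)
            tokens.append(tok)
        else:
            discardable = bool(stack) and stack[-1] in element_only
            if tok.strip() or not discardable:
                tokens.append(tok)
    return tokens


DOC = "<Document>\n    <Test>Hello, world</Test>\n</Document>"

print(lex(DOC, element_only={"Document"}))  # first schema: reading (a)
print(lex(DOC, element_only=set()))         # second schema: reading (b)
```

The same input produces five tokens under the first schema and seven under the second. The only thing that changed is the argument the lexer cannot derive from the document itself.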

/Roger

