Retain or discard whitespace surrounding an element?
- From: Roger L Costello <costello@mitre.org>
- To: "xml-dev@lists.xml.org" <xml-dev@lists.xml.org>
- Date: Mon, 27 Dec 2021 12:03:58 +0000
[Definition] Lexer: a tool that inputs a linear sequence of characters and assembles them into meaningful groups (tokens). A lexer is also called a scanner or a tokenizer.
Hi Folks,
In the following XML document, what is the content of the <Document> element?
<Document>
<Test>Hello, world</Test>
</Document>
Is it:
(a) Just the <Test> element?
(b) The whitespace following <Document>, plus the <Test> element, plus the whitespace (newline) following </Test>?
Should a lexer discard or retain the whitespace surrounding the <Test> element?
The answer is this: The content of the <Document> element could be either (a) or (b). Whether a lexer should retain or discard the whitespace surrounding the <Test> element cannot be decided from the document alone. It is ambiguous.
Yikes!
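To make the two readings concrete, here is a minimal sketch of a lexer in Python. It is only an illustration, not a conforming XML tokenizer: the TOKEN pattern, the lex() function, and the keep_whitespace flag are names I made up, and the pattern ignores comments, CDATA sections, and attribute values containing ">".

```python
import re

# Hypothetical token pattern: either a tag, or a run of character data between tags.
TOKEN = re.compile(r"<[^>]+>|[^<]+")

def lex(xml, keep_whitespace):
    """Split xml into tag and text tokens, optionally dropping whitespace-only text."""
    tokens = []
    for match in TOKEN.finditer(xml):
        tok = match.group()
        is_blank_text = not tok.startswith("<") and tok.strip() == ""
        if is_blank_text and not keep_whitespace:
            continue  # reading (a): whitespace between tags is discarded
        tokens.append(tok)  # reading (b) keeps every token, blank or not
    return tokens

DOC = "<Document>\n<Test>Hello, world</Test>\n</Document>"

tokens_a = lex(DOC, keep_whitespace=False)
tokens_b = lex(DOC, keep_whitespace=True)
print(tokens_a)
print(tokens_b)
```

With keep_whitespace=False the two newline tokens disappear and <Document> contains only the <Test> element. With keep_whitespace=True they are retained as separate text tokens. The lexer itself has no way of choosing the flag; someone has to tell it.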
If the XML document must conform to this XML Schema:
<xs:element name="Document">
    <xs:complexType>
        <xs:sequence>
            <xs:element name="Test" type="xs:string" />
        </xs:sequence>
    </xs:complexType>
</xs:element>
then the answer is (a). A lexer may safely discard the whitespace surrounding the <Test> element. The whitespace is not significant. Presumably the whitespace was placed there to make it easier for humans to read the document.
If the XML document must conform to this XML Schema:
<xs:element name="Document">
    <xs:complexType mixed="true"> <!-- Notice mixed="true" -->
        <xs:sequence>
            <xs:element name="Test" type="xs:string" />
        </xs:sequence>
    </xs:complexType>
</xs:element>
then the answer is (b). A lexer may not discard the whitespace surrounding the <Test> element. The whitespace is significant. Presumably the whitespace has some special meaning to applications that process the XML document.
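The two schema cases can be sketched with Python's standard xml.dom.minidom module. This is a rough illustration under stated assumptions: minidom does not read an XSD, so the mixed parameter below is a hypothetical stand-in for the knowledge that a schema would supply, and document_content() is a helper name I invented.

```python
from xml.dom import minidom, Node

DOC = "<Document>\n<Test>Hello, world</Test>\n</Document>"

def document_content(xml, mixed):
    """Return the children of the root element as a list of strings.

    mixed is a hypothetical flag standing in for the schema's mixed="true".
    Text nodes are reported by their character data, elements by tag name.
    """
    root = minidom.parseString(xml).documentElement
    children = list(root.childNodes)
    if not mixed:
        # Element-only content: whitespace-only text nodes are not significant.
        children = [
            n for n in children
            if not (n.nodeType == Node.TEXT_NODE and n.data.strip() == "")
        ]
    return [n.data if n.nodeType == Node.TEXT_NODE else n.tagName for n in children]

content_a = document_content(DOC, mixed=False)
content_b = document_content(DOC, mixed=True)
print(content_a)
print(content_b)
```

The same input yields a one-item list for the first schema and a three-item list (newline, Test, newline) for the second. The only thing that changed is the value passed in from outside the document.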
If the XML document is not associated with a schema (XSD, DTD, or RNG), then the answer is always (a) and the whitespace may be safely discarded.
So, sometimes the content of <Document> is one thing, sometimes it's another thing. This complicates lexers (and parsers) because they must have external, out-of-band knowledge about the document. Is that good language design?
/Roger