Re: [xml-dev] When did you vanish, o' CDATA section wrapper?


From: Norman Gray <norman@astro.gla.ac.uk>
To: "Costello, Roger L." <costello@mitre.org>
Date: Tue, 21 Aug 2012 18:56:04 +0100

Roger, hello.

On 2012 Aug 21, at 18:13, Costello, Roger L. wrote:

> I gave the document to an XML parser. The XML parser ingested the document and then gave its output to an XML application (e.g., an XSLT processor or an XML Schema validator).
>
> document ---> XML Parser ---> XML Application

When something like an XML document is parsed, the sequence is roughly as follows. This applies to a much larger class of documents than XML, and the details vary (quite a lot, in the case of XML parsers), but it is broadly right.

document
--stream-of-bytes-->
lexer
--stream-of-tokens-->
parser
--stream-of-events-->
application

The lexer does 'lexical analysis', and the parser does 'syntactic analysis'.

The 'lexical analysis' is the process which examines a sequence of bytes, spots important things, and passes an abstraction up the chain. The lexer will spot a '<' and create a 'start-tag-open' token, or spot a '</' pair and create an 'end-tag-open' token, or spot a run of ordinary characters and create a 'string' token, and so on. If it finds a sequence of characters like "<='foo" hello, world" it won't care a bit, but will generate the corresponding sequence of tokens and pass them on. The only thing a lexer will object to is a _character_ which isn't allowed in XML, such as the control character ^G.

The syntactic analysis receives that stream of tokens and decides if it's a legal stream. For example, a start-tag-open token must be followed by a name, so if it's followed by an 'equals-sign' token, it's this layer which will object. If it receives a legal stream of tokens, it turns them into higher-level abstractions which it passes on to the application. For example (taking Java as a convenient example), the org.xml.sax.ContentHandler interface <http://docs.oracle.com/javase/7/docs/api/org/xml/sax/ContentHandler.html> illustrates the set of things which an XML parser might tell an application. In this example, a parser might tell an application "I've seen an element start-tag corresponding to element X, with attributes Y", or "I've seen a list of characters". Note that nowhere in this interface is there any mention of CDATA sections or of expanded entity references (the one entity-related callback, skippedEntity, is only for entities the parser did not expand), because the parser, and a fortiori the application, simply doesn't see them. Your CDATA section has disappeared before this.
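
To make the two layers concrete, here is a toy sketch in Python. It is not how any real XML parser is written; the token names, the functions lex and check_syntax, and the simplified "illegal character" rule are all made up for this illustration.

```python
import re

# Hypothetical token names, invented for this sketch only.
TOKEN_RE = re.compile(r"""
    (?P<end_tag_open>   </ )
  | (?P<start_tag_open> <  )
  | (?P<tag_close>      >  )
  | (?P<equals_sign>    =  )
  | (?P<string>         [^<>=]+ )
""", re.VERBOSE)

def lex(text):
    """Lexical layer: characters in, (kind, value) tokens out.

    It only objects to characters that are not allowed at all.
    Assumption: "not allowed" is simplified here to control
    characters other than tab, newline and carriage return.
    """
    for ch in text:
        if ord(ch) < 0x20 and ch not in "\t\n\r":
            raise ValueError(f"illegal character {ch!r}")
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(text)]

def check_syntax(tokens):
    """Syntax layer: decide whether the token stream is legal.

    Only one rule is checked in this sketch: a start-tag-open
    token must be followed by a string (the element name).
    """
    for i, (kind, _) in enumerate(tokens):
        if kind == "start_tag_open":
            following = tokens[i + 1][0] if i + 1 < len(tokens) else None
            if following != "string":
                raise SyntaxError(f"expected a name after '<', got {following}")
    return True
```

With this sketch, lex() happily tokenizes the nonsense sequence "<='foo" hello, world" into a start-tag-open, an equals-sign and a string, and it is check_syntax() which then objects. Only something like lex("\x07") makes the lexical layer itself complain.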

It's the lexical analysis which consumes it. If I write in a document "a&lt;b" or "<![CDATA[a<b]]>", it's the lexical analyser's job to handle the escaped character, in the first case, and the CDATA section in the second, and in _both_ cases to generate a string "a<b" which it passes up to the syntax layer. The syntax layer doesn't have to care how this was represented in the bytes of the document -- all it gets is a string.
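
You can check this from the application's side of the fence. The following is a minimal sketch using Python's xml.sax module, whose ContentHandler is modelled on the same SAX interface as the Java one; the class TextCollector and the helper text_of are names I've made up for the example.

```python
import xml.sax

class TextCollector(xml.sax.ContentHandler):
    """Hypothetical handler that records only the character data it is told about."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        # A parser may deliver one run of text in several calls,
        # so collect the pieces and join them afterwards.
        self.chunks.append(content)

def text_of(document):
    """Parse a small document and return the text the application was given."""
    handler = TextCollector()
    xml.sax.parseString(document.encode("utf-8"), handler)
    return "".join(handler.chunks)

escaped = text_of("<doc>a&lt;b</doc>")
wrapped = text_of("<doc><![CDATA[a<b]]></doc>")

# Both spellings reach the application as the same string.
print(escaped == wrapped)
```

By the time characters() is called, nothing in what the handler receives tells it which of the two spellings was in the document.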

Thus CDATA sections are consumed/interpreted at the same layer as entity references are consumed/interpreted.

(Because of the way that XML is defined, the boundary between the lexer and the parser is slightly blurrier than I've suggested here, but there'll be something corresponding to this distinction in just about any XML parser.)

All the best,

Norman

--
Norman Gray : http://nxg.me.uk
SUPA School of Physics and Astronomy, University of Glasgow, UK
