Re: [xml-dev] A few questions about building an XML parser

On Wed, 9 Mar 2022 at 13:02, Roger L Costello <costello@mitre.org> wrote:

Hi Folks,

For learning purposes (and for fun) I want to build an XML parser.

While an XML parser is not a compiler, I think that an XML parser performs the same steps as the front end of a compiler.

I am reading a compiler book [1] and it says this:

---------------------------------------------------

The front end can be divided into lexical analyzer, syntax analyzer, and semantic analyzer. The lexical analyzer, sometimes also called the scanner, carries out the simplest level of structural analysis. It will group the individual symbols of the source program text into their logical entities. Thus the sequence of characters ‘W’, ‘H’, ‘I’, ‘L’, and ‘E’ would be identified as the word ‘WHILE’ and the sequence of characters ‘1’, ‘.’, and ‘0’ would be identified as the floating-point number 1.0.

The syntax analyzer, often also called the parser, analyzes the overall structure of the whole program, grouping the simple entities identified by the scanner into the larger constructs, such as statements, loops, and routines, that make up the complete program.

Once the structure of the program has been determined we can then analyze its meaning (or semantics). We can determine which variables are to hold integers, and which to hold floating point numbers, we can check that the size of all arrays is defined and so on.

---------------------------------------------------
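
As a rough illustration of the scanner described in that passage, here is a minimal Python sketch that groups characters into tokens such as the word WHILE and the floating-point number 1.0. The token classes, the keyword list, and the function name scan are assumptions made for this example; they are not taken from the book.

```python
import re

# Hypothetical token classes for a toy language (not from the book).
# Order matters: FLOAT before INT, KEYWORD before IDENT.
TOKEN_SPEC = [
    ("FLOAT",   r"\d+\.\d+"),
    ("INT",     r"\d+"),
    ("KEYWORD", r"\b(?:WHILE|DO|END)\b"),
    ("IDENT",   r"[A-Za-z_]\w*"),
    ("OP",      r"[<>=+\-*/]"),
    ("SKIP",    r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def scan(text):
    """Group individual characters into (kind, lexeme) tokens."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = MASTER.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at offset {pos}")
        if match.lastgroup != "SKIP":      # whitespace separates tokens but is not one
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


print(scan("WHILE x < 1.0"))
# [('KEYWORD', 'WHILE'), ('IDENT', 'x'), ('OP', '<'), ('FLOAT', '1.0')]
```

The scanner only classifies character sequences; it knows nothing about whether the tokens form a valid loop, which is left to the syntax analyzer.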

Okay, back to XML. Consider this non-well-formed XML:

<Publisher>Harper&amp;Row</Publsher>

(The end-tag is misspelled)

At what stage should the entity &amp; be converted to &?

Lexical analysis stage
Syntax analysis stage
Semantic analysis stage

What stage should detect that the <Publisher> start-tag does not have a matching end-tag?

Lexical analysis stage
Syntax analysis stage
Semantic analysis stage

Not shown in the example, but what stage should convert <![CDATA[Hello, World]]> to Hello, World?

Lexical analysis stage
Syntax analysis stage
Semantic analysis stage

Some background information: Flex is a lexer generator; that is, it is a tool for generating lexical analyzers. The Flex manual shows an example [2] of a lexer that scans a string which is enclosed in quotes. For this input:

    "Hello\040World"

the lexical analyzer generates this token:

    Hello World

Notice that the octal entity ( \040 ) has been resolved to its character (the space character). That example leads me to conclude that a lexical analyzer is responsible for converting XML entities, e.g.,

    The lexical analyzer converts &amp; to &
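
For comparison, the kind of escape handling in the Flex string example can be sketched in Python. This is a minimal illustration and not the code from the Flex manual: the function name scan_quoted_string is hypothetical, and copying any non-octal escaped character literally is a simplifying assumption.

```python
def scan_quoted_string(source):
    """Scan a double-quoted string and return the token text with escapes resolved."""
    if not source.startswith('"'):
        raise ValueError("expected opening quote")
    out = []
    i = 1
    while i < len(source):
        ch = source[i]
        if ch == '"':                      # closing quote ends the token
            return "".join(out)
        if ch == "\\":
            # Collect up to three octal digits, as in \040.
            j = i + 1
            while j < len(source) and j < i + 4 and source[j] in "01234567":
                j += 1
            if j > i + 1:
                out.append(chr(int(source[i + 1:j], 8)))
                i = j
                continue
            # Assumption: any other escaped character is copied literally.
            if i + 1 >= len(source):
                raise ValueError("dangling backslash")
            out.append(source[i + 1])
            i += 2
            continue
        out.append(ch)
        i += 1
    raise ValueError("unterminated string")


print(scan_quoted_string('"Hello\\040World"'))
# Hello World
```

Here the escape is resolved while the token is being built, so later stages never see the backslash sequence.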

However, the Flex manual showed that a lexer “could” resolve an octal entity, but the manual didn’t say that the lexer “should” resolve entities, so I don’t know whether it is appropriate for the lexer to convert XML entities. What are your thoughts on this?

The \040 example is more like the following (although the comparison is still possibly misleading)

a &#38; b

with a character (not entity) reference.

An entity reference is a named reference to a typically user-defined entity:

<!ENTITY wibble "1<b>2</b>3" >

....

a &wibble; b

is more like

wibble="1<b>2</b>3"

"a " + wibble + " b"

so it is not resolved by the parser, or at least certainly not by the lexical analysis. amp happens to be a predefined entity, but that doesn't really make much difference to the lexical analysis of the entity reference &amp;: it is structurally the same as a reference to a document-defined entity.
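
One way to picture that separation is the Python sketch below. The tokenizer recognises references by shape only and emits them as tokens. A later step looks names up in an entity table, treating amp exactly like wibble. The names tokenize_text and expand, the token kinds, and the simplified name pattern are all assumptions for illustration. Unlike a real XML processor, the sketch substitutes replacement text without re-parsing it as markup and does not report a stray & as an error.

```python
import re

# Character references (&#38; or &#x26;) and entity references (&name;)
# are matched by shape only; nothing is looked up at this stage.
# The name pattern is a simplification of the XML Name production.
REFERENCE = re.compile(
    r"&#x(?P<hex>[0-9A-Fa-f]+);|&#(?P<dec>[0-9]+);|&(?P<name>[A-Za-z_][\w.\-]*);"
)

PREDEFINED = {"amp": "&", "lt": "<", "gt": ">", "apos": "'", "quot": '"'}


def tokenize_text(text):
    """Split character data into TEXT, CHAR_REF and ENTITY_REF tokens."""
    tokens = []
    pos = 0
    for m in REFERENCE.finditer(text):
        if m.start() > pos:
            tokens.append(("TEXT", text[pos:m.start()]))
        if m.group("name") is not None:
            tokens.append(("ENTITY_REF", m.group("name")))
        elif m.group("hex") is not None:
            tokens.append(("CHAR_REF", int(m.group("hex"), 16)))
        else:
            tokens.append(("CHAR_REF", int(m.group("dec"), 10)))
        pos = m.end()
    if pos < len(text):
        tokens.append(("TEXT", text[pos:]))
    return tokens


def expand(tokens, declared):
    """Later stage: resolve references against predefined and declared entities."""
    entities = {**declared, **PREDEFINED}
    parts = []
    for kind, value in tokens:
        if kind == "TEXT":
            parts.append(value)
        elif kind == "CHAR_REF":
            parts.append(chr(value))
        else:
            if value not in entities:
                raise KeyError(f"undeclared entity: {value}")
            parts.append(entities[value])
    return "".join(parts)


tokens = tokenize_text("a &wibble; b")
print(tokens)
# [('TEXT', 'a '), ('ENTITY_REF', 'wibble'), ('TEXT', ' b')]
print(expand(tokens, {"wibble": "1<b>2</b>3"}))
# a 1<b>2</b>3 b
```

With this split, tokenize_text("Harper&amp;Row") yields an ENTITY_REF token for amp in the same way as for wibble, and only expand needs to know what either name stands for.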

David

/Roger

[1] “Introduction to Compiling Techniques” by J.P. Bennett

[2] See page 24, https://epaperpress.com/lexandyacc/download/flex.pdf