

Re: [xml-dev] A question for parsing experts: How to recognize that '<' denotes the beginning of a start tag?

From: "Liam R. E. Quin" <liam@fromoldbooks.org>
To: Roger L Costello <costello@mitre.org>, "xml-dev@lists.xml.org" <xml-dev@lists.xml.org>
Date: Tue, 16 Feb 2021 15:20:02 -0500

On Tue, 2021-02-16 at 17:52 +0000, Roger L Costello wrote:
> 
> 
> In the scanning process, you encounter a less-than ( '<' ) symbol.
> You must determine if it denotes the beginning of a start tag.

Well, I did badly on the last parsing question, let's see if I can do
badly here too :) again before coffee!

> 
> Let c = the character currently being examined.
> Let nextchar = the character following c
> 
> if c == '<' and nextchar != '/' and nextchar != '!' and nextchar !=
> '?' then we are at the beginning of a start tag
> 
> Do you agree? Am I missing any checks?

You need to apply the test in the right place - you're not going to see
a start tag inside an attribute value or comment or CDATA section or in
the internal subset outside of an entity replacement value (< is
not allowed unescaped in system or public identifiers).

If you do encounter a < in those other contexts, the input is not well-
formed. In places (e.g. public identifiers) the grammar enforces this;
elsewhere (e.g. system identifiers) it's made explicit in the prose.

In entity replacement texts, you don't want to tokenize until the
entity is actually used.
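
As a rough illustration of "apply the test in the right place", here is
a minimal Python sketch of a scanner that skips comments, CDATA sections,
processing instructions and quoted attribute values before testing for a
start tag. The names find_start_tags, skip_tag and is_simple_name_start
are made up for this example, the name-start test is only an
approximation of the spec's NameStartChar, and the sketch assumes input
with no DOCTYPE declaration (so no internal subset or entity
declarations). It silently skips a '<' inside an attribute value; a real
parser would report that as a well-formedness error.

```python
def is_simple_name_start(ch):
    # Approximation only: the XML spec defines NameStartChar by explicit ranges.
    return ch.isalpha() or ch in "_:"


def skip_tag(text, i):
    """Return the offset just past the '>' that closes the tag starting before i.

    '<' and '>' inside quoted attribute values are ignored.
    """
    quote = None
    n = len(text)
    while i < n:
        ch = text[i]
        if quote:
            if ch == quote:
                quote = None          # closing quote of the attribute value
        elif ch in "\"'":
            quote = ch                # opening quote of an attribute value
        elif ch == ">":
            return i + 1
        i += 1
    return n                          # unterminated tag: stop at end of input


def find_start_tags(text):
    """Return the offsets of each '<' that opens a start tag or empty-element tag."""
    positions = []
    i, n = 0, len(text)
    while i < n:
        if text.startswith("<!--", i):            # comment
            end = text.find("-->", i + 4)
            i = n if end == -1 else end + 3
        elif text.startswith("<![CDATA[", i):     # CDATA section
            end = text.find("]]>", i + 9)
            i = n if end == -1 else end + 3
        elif text.startswith("<?", i):            # processing instruction
            end = text.find("?>", i + 2)
            i = n if end == -1 else end + 2
        elif text[i] == "<" and i + 1 < n and is_simple_name_start(text[i + 1]):
            positions.append(i)                   # start tag in content
            i = skip_tag(text, i + 1)
        else:
            i += 1                                # character data, end tags, etc.
    return positions
```

For example, given
'<a href="x<y>z"><!-- <b> --><![CDATA[<c>]]><?pi <d>?><e/></a>'
only the '<' of <a and the '<' of <e/> are reported.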

Also, you only have a start-tag (as the spec calls them) if nextchar is
a name start character. For example, <
boy
>
is not allowed, but
<girl
>
is fine, as is
<enby>
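
A minimal Python sketch of that check, using the NameStartChar ranges
from production [4] of XML 1.0 (Fifth Edition). The function names are
invented for this example, and it assumes you have already established
that you are scanning content rather than, say, a comment or an
attribute value. Note that '/', '!' and '?' are not name start
characters, so this one test also covers the three checks in the
original question.

```python
# NameStartChar ranges, XML 1.0 (Fifth Edition), production [4].
NAME_START_RANGES = [
    (0x3A, 0x3A),        # ':'
    (0x41, 0x5A),        # A-Z
    (0x5F, 0x5F),        # '_'
    (0x61, 0x7A),        # a-z
    (0xC0, 0xD6), (0xD8, 0xF6), (0xF8, 0x2FF),
    (0x370, 0x37D), (0x37F, 0x1FFF), (0x200C, 0x200D),
    (0x2070, 0x218F), (0x2C00, 0x2FEF), (0x3001, 0xD7FF),
    (0xF900, 0xFDCF), (0xFDF0, 0xFFFD), (0x10000, 0xEFFFF),
]


def is_name_start_char(ch):
    """True if ch may begin an XML name."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in NAME_START_RANGES)


def looks_like_start_tag(text, i):
    """True if text[i] is '<' immediately followed by a name start character."""
    if i + 1 >= len(text) or text[i] != "<":
        return False
    return is_name_start_char(text[i + 1])
```

With this, looks_like_start_tag("<girl\n>", 0) and
looks_like_start_tag("<enby>", 0) are true, while
looks_like_start_tag("<\nboy\n>", 0) is false because the character
after '<' is a newline.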

Liam

-- 
Liam Quin, https://www.delightfulcomputing.com/
Available for XML/Document/Information Architecture/XSLT/
XSL/XQuery/Web/Text Processing/A11Y training, work & consulting.
Barefoot Web-slave, antique illustrations:  http://www.fromoldbooks.org

References:
- A question for parsing experts: How to recognize that '<' denotes the beginning of a start tag?
  - From: Roger L Costello <costello@mitre.org>
