Re: [xml-dev] A question for parsing experts: How to recognize that '<' denotes the beginning of a start tag?
- From: "Liam R. E. Quin" <liam@fromoldbooks.org>
- To: Roger L Costello <costello@mitre.org>, "xml-dev@lists.xml.org" <xml-dev@lists.xml.org>
- Date: Tue, 16 Feb 2021 15:20:02 -0500
On Tue, 2021-02-16 at 17:52 +0000, Roger L Costello wrote:
>
>
> In the scanning process, you encounter a less than ( '<' ) symbol
> You must determine if it denotes the beginning of a start tag.
Well, I did badly on the last parsing question; let's see if I can do
badly here too :) Again before coffee!
>
> Let c = the character currently being examined.
> Let nextchar = the character following c
>
> if c == '<' and nextchar != '/' and nextchar != '!' and nextchar !=
> '?' then we are at the beginning of a start tag
>
> Do you agree? Am I missing any checks?
You need to apply the test in the right place: you're not going to see
a start tag inside an attribute value or comment or CDATA section, or in
the internal subset outside of an entity replacement value (< is
not allowed unescaped in system or public identifiers).
If you do encounter a < in those other contexts, the input is not well-
formed. In places (e.g. public identifiers) the grammar enforces this;
elsewhere (e.g. system identifiers) it's made explicit in the prose.
In entity replacement texts, you don't want to tokenize until the
entity is actually used.
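To illustrate the "right place" point, here is a minimal Python sketch of a scanner that only reports a '<' as a start tag when it is not inside a comment, CDATA section, processing instruction, or quoted attribute value. The function name is made up for illustration, the name-start test is a rough approximation (isalpha plus '_' and ':') rather than the spec's production, and DOCTYPE declarations and entity replacement text are assumed absent, so treat it as a sketch and not a conforming tokenizer.

```python
def start_tag_offsets(text):
    """Yield offsets of '<' characters that open a start tag.

    Skips comments, CDATA sections and processing instructions, and
    steps over quoted attribute values. DOCTYPE declarations are not
    handled: this sketch assumes the input has none.
    """
    i, n = 0, len(text)
    while i < n:
        if text.startswith("<!--", i):
            end = text.find("-->", i + 4)
            i = n if end < 0 else end + 3
        elif text.startswith("<![CDATA[", i):
            end = text.find("]]>", i + 9)
            i = n if end < 0 else end + 3
        elif text.startswith("<?", i):
            end = text.find("?>", i + 2)
            i = n if end < 0 else end + 2
        elif (text[i] == "<" and i + 1 < n
              and (text[i + 1].isalpha() or text[i + 1] in "_:")):
            # Approximate name-start test; see the assumption above.
            yield i
            i += 1
            # Walk to the closing '>' of this tag, skipping quoted values.
            while i < n and text[i] != ">":
                if text[i] in "\"'":
                    close = text.find(text[i], i + 1)
                    i = n if close < 0 else close + 1
                else:
                    i += 1
        else:
            i += 1
```

Run over `<a x="<b>"><!-- <c> --><![CDATA[<d>]]><?pi <e>?><f/></a>`, this reports only the offsets of `<a` and `<f`; the other '<' characters sit in contexts where no start tag can begin. (The '<' inside the attribute value is itself a well-formedness error, which this sketch silently steps over.)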
Also, you only have a start-tag (as the spec calls them) if nextchar is
a name start character. For example,
< boy >
is not allowed, but
<girl >
is fine, as is
<enby>
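The name-start-character rule can be sketched in Python as follows. The ranges are taken from the NameStartChar production of XML 1.0 (Fifth Edition); the function names are illustrative, and the check assumes the scanner is already somewhere a start tag could legally begin.

```python
# Code point ranges from the NameStartChar production, XML 1.0 (Fifth Edition).
_NAME_START_RANGES = [
    (0x3A, 0x3A), (0x41, 0x5A), (0x5F, 0x5F), (0x61, 0x7A),
    (0xC0, 0xD6), (0xD8, 0xF6), (0xF8, 0x2FF), (0x370, 0x37D),
    (0x37F, 0x1FFF), (0x200C, 0x200D), (0x2070, 0x218F),
    (0x2C00, 0x2FEF), (0x3001, 0xD7FF), (0xF900, 0xFDCF),
    (0xFDF0, 0xFFFD), (0x10000, 0xEFFFF),
]


def is_name_start_char(ch):
    """True if ch may begin an XML name."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in _NAME_START_RANGES)


def starts_start_tag(text, i):
    """True if text[i] is '<' immediately followed by a name start character."""
    return (
        i + 1 < len(text)
        and text[i] == "<"
        and is_name_start_char(text[i + 1])
    )
```

Since '/', '!' and '?' are not name start characters, this one test also covers the three exclusions in the original question.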
Liam
--
Liam Quin, https://www.delightfulcomputing.com/
Available for XML/Document/Information Architecture/XSLT/
XSL/XQuery/Web/Text Processing/A11Y training, work & consulting.
Barefoot Web-slave, antique illustrations: http://www.fromoldbooks.org