Re: [xml-dev] Parser/Parser generator in XSLT 3

On Sat, 23 Jun 2018 at 11:08 am, Andrew Sales <andrew@andrewsales.com> wrote:

Hi Tom, it's an interesting idea, but what is the driver for using XSLT to do this (not against, just curious)?

The SAX API callbacks provide access to DTD information, so tools such as DTDinst[1] and dtd2xml[2] produce an XML document containing that information. Entities are resolved by the parser.
Other similar tools may be available.

A transform can then consume such a document if it needs to.
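
As a rough illustration of the callback idea, here is a minimal sketch. It is an assumption-laden stand-in: it uses the declaration callbacks of Python's xml.parsers.expat rather than SAX, it is not how DTDinst or dtd2xml work, and the output vocabulary (dtd, element, attribute, entity) and the function name dtd_to_xml are invented for the example. It only reads the internal subset and ignores content models.

```python
import xml.parsers.expat
from xml.sax.saxutils import quoteattr

# Hypothetical sample document with an internal DTD subset.
SAMPLE = """<!DOCTYPE note [
  <!ELEMENT note (#PCDATA)>
  <!ATTLIST note lang CDATA #IMPLIED>
  <!ENTITY company "Example Ltd">
]>
<note lang="en">&company;</note>"""


def dtd_to_xml(document: str) -> str:
    """Report the declarations seen in the internal subset as a small XML document."""
    rows = []

    def element_decl(name, model):
        # The content model is ignored in this sketch.
        rows.append(f"  <element name={quoteattr(name)}/>")

    def attlist_decl(elname, attname, type_, default, required):
        rows.append(
            f"  <attribute element={quoteattr(elname)}"
            f" name={quoteattr(attname)} type={quoteattr(type_)}/>"
        )

    def entity_decl(name, is_parameter, value, base, system_id, public_id, notation):
        # value is only set for internal entities, whose replacement text is known.
        if value is not None:
            rows.append(f"  <entity name={quoteattr(name)} value={quoteattr(value)}/>")

    parser = xml.parsers.expat.ParserCreate()
    parser.ElementDeclHandler = element_decl
    parser.AttlistDeclHandler = attlist_decl
    parser.EntityDeclHandler = entity_decl
    parser.Parse(document, True)
    return "<dtd>\n" + "\n".join(rows) + "\n</dtd>"
```

Calling dtd_to_xml(SAMPLE) returns a string that a transform could then read like any other XML input.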

Regards,
Andrew

[1] http://www.thaiopensource.com/dtdinst/
[2] https://github.com/AndrewSales/dtd2xml

On 23 June 2018 at 09:50, yamahito <yamahito@gmail.com> wrote:
Hi Folks,

I find myself in need of an XSLT parser for DTDs for a side project I'm playing with. My background isn't really in computer science and I lack a formal education in parsers, so I hope someone here can tell me if my thoughts make sense, or point me at some resources to fill in the gaps!
In the short term, I want to write a DTD parser; longer term, I think it would be interesting to make a parser generator.

There is a parser generator out there already that can create XSLT parsers from EBNF grammars (http://www.bottlecaps.de/rex/ by Gunther Rademacher), which seems very good: Jirka Kosek and Steven Pemberton have talked about using it at Balisage and XML London, e.g. to check the validity of non-XML fragments using Schematron. I think there are two areas for improvement, however:
1. The XSLT produced is a good example of functional programming; it uses functions that pass a state variable between them, etc. However, I wonder if there isn't a lot of opportunity to make use of the underlying declarative paradigm of XSLT, particularly with the new data structures and abilities of XSLT 3.
2. Because of the above, it's harder to extend the parser, e.g. by inclusion in another XSLT. My use case for this is a DTD parser which resolves entities.

As I understand it, most parsers include or depend upon a lexer, which tokenises the text; the parser then builds the tokens into a hierarchy. It does this by working from a transition table that states, for a given context, which tokens may validly occur next.
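
To make that picture concrete, here is a minimal sketch in Python of a lexer feeding a table-driven recogniser. Everything in it is an assumption for illustration: the token names, the state names and the toy grammar (a sequence of internal entity declarations of the form <!ENTITY name "value">) are invented, and a real generated parser would normally use a stack and a much richer table than this finite-state one.

```python
import re

# Lexer: each token type is a named regular expression.
TOKEN_SPEC = [
    ("OPEN", r"<!ENTITY"),
    ("STRING", r'"[^"]*"'),
    ("CLOSE", r">"),
    ("NAME", r"[A-Za-z_][\w.-]*"),
    ("WS", r"\s+"),
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(text):
    """Yield (token type, token text) pairs, skipping whitespace."""
    pos = 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character at offset {pos}")
        pos = match.end()
        if match.lastgroup != "WS":
            yield match.lastgroup, match.group()


# Transition table: for each state (context), the tokens allowed next
# and the state each one leads to.
TRANSITIONS = {
    "start": {"OPEN": "want_name"},
    "want_name": {"NAME": "want_value"},
    "want_value": {"STRING": "want_close"},
    "want_close": {"CLOSE": "start"},
}


def parse_entities(text):
    """Return {entity name: replacement text} for the toy grammar above."""
    state, name, entities = "start", None, {}
    for kind, value in tokenize(text):
        allowed = TRANSITIONS[state]
        if kind not in allowed:
            raise SyntaxError(f"{kind} not allowed in state {state}")
        if kind == "NAME":
            name = value
        elif kind == "STRING":
            entities[name] = value[1:-1]  # strip the surrounding quotes
        state = allowed[kind]
    if state != "start":
        raise SyntaxError("unexpected end of input")
    return entities
```

The table is the only place that knows the grammar; tokenize and the loop in parse_entities stay the same if the table changes.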

It seems to me that a separate lexer and a transition table approach is trying to replicate functionality that's already in place in XSLT; what problems can people foresee with using modes and template matches rather than a transition table?
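
For comparison, here is a rough sketch of the mode-and-match style, written in Python only as an analogy and not as XSLT. The assumptions are loud ones: the template decorator, the mode names and apply_templates are all invented for this example, rules match string prefixes with regular expressions (real template rules match nodes), and the toy grammar is again a sequence of <!ENTITY name "value"> declarations.

```python
import re

RULES = {}  # mode name -> list of (compiled pattern, handler)


def template(mode, pattern):
    """Register a handler for input matching `pattern` while in `mode`."""
    def register(handler):
        RULES.setdefault(mode, []).append((re.compile(pattern), handler))
        return handler
    return register


# Each handler consumes its match and returns the mode to continue in.
@template("dtd", r"\s+")
def skip_space(match, result):
    return "dtd"


@template("dtd", r"<!ENTITY\s+")
def open_entity(match, result):
    return "entity-name"


@template("entity-name", r"([A-Za-z_][\w.-]*)\s+")
def entity_name(match, result):
    result.append([match.group(1), None])
    return "entity-value"


@template("entity-value", r'"([^"]*)"\s*>')
def entity_value(match, result):
    result[-1][1] = match.group(1)
    return "dtd"


def apply_templates(text, mode="dtd"):
    """Repeatedly apply the first rule of the current mode that matches."""
    result, pos = [], 0
    while pos < len(text):
        for pattern, handler in RULES.get(mode, []):
            match = pattern.match(text, pos)
            if match:
                mode = handler(match, result)
                pos = match.end()
                break
        else:
            raise SyntaxError(f"no rule in mode {mode!r} matches at offset {pos}")
    if mode != "dtd":
        raise SyntaxError("unexpected end of input")
    return {name: value for name, value in result}
```

There is no separate lexer or table here: the mode plays the part of the state, and adding a rule from another module is just another registration. One difference from XSLT worth noting is that this sketch picks the first rule registered, whereas XSLT resolves competing templates by import precedence and priority.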

Thanks,
Tom