   Re: Classification: XML Parser Features

  • From: Tim Bray <tbray@textuality.com>
  • To: David Megginson <ak117@freenet.carleton.ca>
  • Date: Fri, 12 Dec 1997 17:08:09 -0800

At 12:17 PM 12/12/97 -0500, David Megginson wrote:
>Creating a truly well-formed parser is very, very difficult, because
>of the enormous number of constraints imposed both explicitly and
>implicitly by the grammar (I could probably write a full SGML parser
>with about the same level of effort, especially if I limited myself to
>a single, simple SGML declaration).

To start with, "full SGML parser" is directly contradictory to "a single
SGML declaration" - abstract syntax in fact being one of the things
that makes a full parser hard to write.

As to David's main point, that a WF parser is hard to write, I don't
agree; most of the work can be done in the low-level lexer, and the number
of constraints that require ad-hoc code is pretty small.  Two things
are in fact hard, it seems:

1. handling multiple input encodings, and
2. making it run real fast while you're doing #1.
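
[Editorial illustration, not part of the original message.] To make point 1 concrete, here is a minimal Python sketch of the first step in handling multiple input encodings: guessing the encoding from the leading bytes of the document. It is loosely modelled on the byte-order-mark and "<?" autodetection idea described in the XML specification; the function names `sniff_encoding` and `decode_document` are hypothetical, and this is not a description of how Lark does it. A real parser would also honour the encoding declaration and handle many more encodings.

```python
import codecs
from typing import Tuple

# Byte-order marks, with the 4-byte UTF-32 marks checked first so that
# a UTF-32-LE mark is not mistaken for the shorter UTF-16-LE one.
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

# With no byte-order mark, guess from how "<?" is laid out in the first bytes.
_NO_BOM_PATTERNS = [
    (b"\x00<\x00?", "utf-16-be"),
    (b"<\x00?\x00", "utf-16-le"),
]


def sniff_encoding(data: bytes) -> Tuple[str, int]:
    """Return (codec name, number of leading BOM bytes to skip)."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    for pattern, name in _NO_BOM_PATTERNS:
        if data.startswith(pattern):
            return name, 0
    # Assumption for this sketch: anything else is treated as UTF-8.
    return "utf-8", 0


def decode_document(data: bytes) -> str:
    """Decode a whole document using the sniffed encoding."""
    name, skip = sniff_encoding(data)
    return data[skip:].decode(name)
```

Decoding the whole input up front, as `decode_document` does, is the simple approach; doing the same work incrementally inside the lexer is where point 2 starts to bite.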

These don't really bother me that much, as we are in the infancy of
learning the right way to build truly internationalized
software.  For example, I can parse the UTF-16 Japanese version of the
XML spec in a few seconds; then it takes the best part of a minute
to load the .ttf for the Unicode font so you can look at anything.
So we have a few problems in this area.

Having said that, I am now in the middle of coding up validation for
Lark, and there are a TREMENDOUS NUMBER of irritating little
details involved.  No rocket science at all, but the code is going
to be substantially larger than the rest of Lark, and it's all real
code, whereas more than half of Lark is compressed parser tables.

Mind you, the validator is in a separate package and can be bypassed, so
Lark effectively need be no larger.  But still, I wonder whether validation
is intrinsically hard, or whether we could have found a better 80/20 point. -Tim
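
[Editorial illustration, not part of the original message.] The split described above, a small well-formedness core with validation as a separate layer that can be bypassed, might be sketched roughly as follows in Python. Everything here is an assumption made for illustration: the names `events`, `parse`, and `make_validator` are hypothetical, the lexer handles only bare tags without attributes, and the "validator" checks only which child names are allowed under which parent, which is far simpler than real DTD content models. It does not reflect Lark's actual design.

```python
import re
from typing import Callable, Dict, Iterator, List, Optional, Set, Tuple

# Toy lexer pattern: start tags, end tags, and empty-element tags, no attributes.
_TAG = re.compile(r"<(/?)([A-Za-z_][\w.-]*)\s*(/?)>")

Event = Tuple[str, str]  # ("start" or "end", element name)
Validator = Callable[[Optional[str], str], None]


def events(text: str) -> Iterator[Event]:
    """Low-level lexer: turn tags into start/end events."""
    for m in _TAG.finditer(text):
        closing, name, empty = m.group(1), m.group(2), m.group(3)
        if closing:
            yield ("end", name)
        else:
            yield ("start", name)
            if empty:  # <x/> counts as a start immediately followed by an end
                yield ("end", name)


def parse(text: str, validator: Optional[Validator] = None) -> List[str]:
    """Check tag nesting; call the validator only if one was supplied."""
    stack: List[str] = []
    seen: List[str] = []
    for kind, name in events(text):
        if kind == "start":
            if validator is not None:
                validator(stack[-1] if stack else None, name)
            stack.append(name)
            seen.append(name)
        elif not stack or stack.pop() != name:
            raise ValueError(f"mismatched end tag </{name}>")
    if stack:
        raise ValueError(f"unclosed element <{stack[-1]}>")
    return seen


def make_validator(allowed: Dict[Optional[str], Set[str]]) -> Validator:
    """Build a validator from a parent -> permitted-children map (None = root)."""
    def check(parent: Optional[str], child: str) -> None:
        if child not in allowed.get(parent, set()):
            raise ValueError(f"<{child}> not allowed inside {parent!r}")
    return check
```

Calling `parse(text)` with no validator runs only the nesting check, so the validation layer costs nothing when it is bypassed; passing `make_validator({None: {"doc"}, "doc": {"p"}})` adds the extra constraint on top of the same event stream.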

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev@ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/
To (un)subscribe, mailto:majordomo@ic.ac.uk the following message;
(un)subscribe xml-dev
To subscribe to the digests, mailto:majordomo@ic.ac.uk the following message;
subscribe xml-dev-digest
List coordinator, Henry Rzepa (mailto:rzepa@ic.ac.uk)





 


Copyright 2001 XML.org. This site is hosted by OASIS