OASIS Mailing List ArchivesView the OASIS mailing list archive below
or browse/search using MarkMail.


Help: OASIS Mailing Lists Help | MarkMail Help

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

re: Request for info about parser construction details.

Jose Luis Sierra Rodriguez writes:

 > I would like to find some information about technical details
 > regarding the developing of parsers for SGML and XML. Of course
 > that the obvious solution is to examine the sources of some
 > existing parsers ;), but I would like to avoid myself this effort
 > by reading some higher-level documentation.  In particular, I'm
 > looking for the sort of things that you can find in a general
 > textbook about compiler construction (for instance, that of Aho,
 > Sethi and Ullman) but applied to the particular features of
 > SGML/XML. Any pointer on this subject will be wellcome.

Unfortunately, we all end up learning by doing.  Here are a few
pseudo-random tips that jumped to my fingers:

1. Recursive descent is just as fast as finite state and results in
small, easy-to-understand code; however, unlike a finite state parser,
a recursive-descent parser cannot easily save its state, return to the
caller, and then start parsing again where it was.  That means that
recursive-descent is suitable for a push interface (like SAX), but not
for a pull interface.

2. Too much copying can be brutal; if speed is your concern, try
working directly from your input buffer as much as possible rather
than copying parts of the buffer into strings or new arrays.  

3. A large input buffer makes for faster batch performance, but a
small buffer makes for better interactive performance (as always).

4. As a general principle, leave the heap alone and concentrate on the
stack.  Try to allocate everything you'll need at initialization time,
rather than doing it over and over in main loop (since the main loop
may go through tens of thousands of iterations in a second or two).
This is especially important for garbage-collection languages like
Java, since too much GC can grind your parser almost to a halt, but
the heap is expensive even in C/C++.

5. If you're doing Namespace processing, you have to read *all* of the
attributes before you can perform any processing on the element name
or qualified attribute names, since the xmlns* attribute could be the
last one.

6. Don't be afraid to place an upper limit on the amount of
character-data content you'll process at once.  You could end up with
the pathological case of an XML document containing a start tag, 1GB
of text, and an end tag -- there's nothing wrong with breaking the
text up into arbitrary segments.

7. It's a good idea to layer your parser into an event stream (SAX or
otherwise) and a tree API (DOM or otherwise), rather than having the
low-level parser build the tree directly.  Build tiny layers that are
easy to maintain and test.

8. Profile, profile, profile.  You'll be amazed at where the
bottlenecks turn up.

Of course, the standard advice -- document heavily, incorporate unit
tests into your standard build procedures, etc. -- also applies.

All the best,


David Megginson                 david@megginson.com