OASIS Mailing List Archives



Re: Request for info about parser construction details.

From: Jose Luis Sierra Rodriguez <jlsierra@sip.ucm.es>

>> I would like to find some information about technical details regarding
>> developing of parsers for SGML and XML.

Writing a general parser for SGML is difficult because SGML is more like a
compiler-compiler (like YACC) than a language per se.

First there is an "SGML declaration" language in which you specify which
character sets and mappings you are using, allocate characters to various
abstract roles (which ones can be separator characters, which can be used in
references), define various concrete delimiters (such as start-tag open) and
set the general parsing features of the language (if you leave out certain
tags, or just have them in reduced form such as <> or </>, what rules does
the parser use to fill in the gaps?).

Then there is a second language for markup declarations (DTDs), which tells
you not only which elements and attributes can go where, but also which
elements can have start- or end-tags omitted and at which point
"short-reference" maps come into play (where some string of characters you
specified in the SGML declaration can be used in place of an entity
reference and add some tag).  It also declares entities and the attributes
that can appear on entities (keyed by the entity notation).

Finally comes the instance language itself, and it's also not
straightforward: the entity structure is not synchronous with the element
structure, and potentially you can have subdocuments with completely
different DTDs nested inside (like namespaces but with their own ID scope).
You also have to keep track of more things, such as the current value of
attributes marked #CURRENT (the attribute takes its most recent value in
document order if it is not specified) and global exclusions (such as that
an <a> cannot contain an <a>), which the DTD has special structures for.
And in both the DTD and the instance there are entity references, which will
not necessarily fit in with the simple-minded approach to grammars that a
beginner might hope for.
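
To make the #CURRENT bookkeeping concrete, here is a minimal Python sketch
(not taken from any real SGML parser).  The function name resolve_current
and the data shapes are made up for illustration, and keying the remembered
value by (element, attribute) is a simplification.  It also assumes that
omitting the attribute before any value has been seen is an error.

```python
def resolve_current(elements, current_attrs):
    """Fill in attributes declared #CURRENT with their most recent value.

    elements: list of (element_name, {attr: value}) in document order.
    current_attrs: set of (element_name, attr) pairs declared #CURRENT.
    Both shapes are hypothetical, chosen only for this sketch.
    """
    last_seen = {}   # (element, attr) -> most recent specified value
    resolved = []
    for name, attrs in elements:
        filled = dict(attrs)
        for elem, attr in current_attrs:
            if elem != name:
                continue
            if attr in filled:
                # An explicit value becomes the new "current" value.
                last_seen[(elem, attr)] = filled[attr]
            elif (elem, attr) in last_seen:
                # Not specified: inherit the most recent value.
                filled[attr] = last_seen[(elem, attr)]
            else:
                # Assumption: the first occurrence must give a value.
                raise ValueError(f"no current value yet for {elem}/@{attr}")
        resolved.append((name, filled))
    return resolved


doc = [("p", {"lang": "en"}), ("p", {}), ("p", {"lang": "fr"}), ("p", {})]
result = resolve_current(doc, {("p", "lang")})
```

The point is that the instance parser cannot treat each start-tag in
isolation: it needs both a parameter from the DTD (which attributes are
#CURRENT) and state carried along in document order.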

So full SGML really requires three separate parsers, each supplying lots of
parameters to the next.  This is because full SGML was designed to allow
clear description of lots of different kinds of markup languages, not just
well-formed ones.  (Actually, you do not need to support variations in the
SGML declaration to be conforming SGML, as long as you document what you
provide in an SGML declaration and support at least the minimum Concrete
Reference Syntax, which gives default rules that are closer to HTML's
requirements and are too restrictive.  An XML system is not a "conforming
SGML system" but it is an "SGML system"; these are technical terms of
conformance which bore and confuse people.)
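
As a rough illustration of "each supplying parameters to the next", the toy
Python sketch below has an instance parser that takes its delimiters from a
declaration stage and its end-tag omission rules from a DTD stage.  The
names Syntax, Dtd and parse_instance are hypothetical.  The sketch assumes
the end-tag-open delimiter begins with the start-tag-open delimiter, and it
infers omitted end-tags with a crude rule; real SGML infers them from
content models.

```python
from dataclasses import dataclass, field


@dataclass
class Syntax:
    """Stand-in for what an SGML declaration parser would hand on."""
    stago: str = "<"    # start-tag open
    etago: str = "</"   # end-tag open (assumed to begin with stago)
    tagc: str = ">"     # tag close


@dataclass
class Dtd:
    """Stand-in for what a DTD parser would hand on."""
    omissible_end: set = field(default_factory=set)


def parse_instance(text, syntax, dtd):
    """Return ('start'|'end', name) events, inferring omitted end-tags."""
    events, stack, pos = [], [], 0

    def close_until(name):
        # Close open elements until `name` is on top of the stack.
        while stack and stack[-1] != name:
            if stack[-1] not in dtd.omissible_end:
                raise ValueError(f"end tag required for {stack[-1]}")
            events.append(("end", stack.pop()))

    while (start := text.find(syntax.stago, pos)) != -1:
        close = text.find(syntax.tagc, start)
        if close == -1:
            raise ValueError("unterminated tag")
        is_end = text.startswith(syntax.etago, start)
        opener = syntax.etago if is_end else syntax.stago
        name = text[start + len(opener):close].strip()
        if is_end:
            close_until(name)
            if not stack:
                raise ValueError(f"unexpected end tag for {name}")
            events.append(("end", stack.pop()))
        else:
            # Toy rule: a repeated start-tag closes its open sibling.
            if stack and stack[-1] == name and name in dtd.omissible_end:
                events.append(("end", stack.pop()))
            stack.append(name)
            events.append(("start", name))
        pos = close + len(syntax.tagc)
    close_until(None)  # close whatever is still open at end of input
    return events


dtd = Dtd(omissible_end={"li"})
default_events = parse_instance("<ul><li>one<li>two</ul>", Syntax(), dtd)
bracket_events = parse_instance("[ul][li]one[li]two[/ul]",
                                Syntax("[", "[/", "]"), dtd)
```

Both calls yield the same event list, but only because the instance parser
was told which delimiters to use and which end-tags may be left out.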

So can you see why XML was invented?  Instead of Charles Goldfarb's unhappy
and forced starting position that people could never agree on syntaxes (see
MS' versions of HTML dumped from recent software, and SML-DEV, for recent
evidence of this), Jon Bosak started from the idea "what if we could
get everyone to standardize on a particular profile of SGML...then we
wouldn't need highly parameterized document description languages (or at
least the description would be made once for all by the profile-creators, not
by every user) and simple parsers could be written". The breakthrough in XML
is not the technology (lots of people have been doing stripped-down SGML for
years) but the consensus Jon was able to get up.  (Of course, Jon could not
have gotten that agreement without there being a lot of lessons learned from
full SGML concerning which features are most useful.)

XML says freeze the SGML declaration (see James Clark's note at W3C for
this). Have character encoding handled by the entity manager and adopt
Unicode as the document character set.  Get rid of any features which
require the instance parser to accept parameters from the markup
declarations.  Make the markup declarations optional to use.  Make entities
nest with elements.  And so on.

XML is designed to be straightforward to implement, with little connection
between its two languages (declarations and elements).
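
One way to see this in practice is a short Python sketch using the standard
library's expat binding.  The helper name element_events is made up for
illustration.  The parser is given only the document text: no delimiter
table, no DTD, no omission rules.

```python
import xml.parsers.expat


def element_events(xml_text):
    """Return ('start'|'end', name) events for a well-formed XML string."""
    events = []
    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = (
        lambda name, attrs: events.append(("start", name)))
    parser.EndElementHandler = (
        lambda name: events.append(("end", name)))
    parser.Parse(xml_text, True)  # True: this is the final chunk of input
    return events


events = element_events("<ul><li>one</li><li>two</li></ul>")

# With nothing to say which end-tags may be omitted, an SGML-style
# shortcut is simply an error.
try:
    element_events("<ul><li>one<li>two</ul>")
    rejected = False
except xml.parsers.expat.ExpatError:
    rejected = True
```

Because every end-tag must be present and the delimiters are fixed, the
element structure can be recovered from the instance alone.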

So if you want a one- or two-week project, implement XML. If you want a
six-month to one-year project, implement SGML.  Don't try to implement SGML
unless you have Goldfarb's "The SGML Handbook" (and you will probably
understand more of XML with that too).

Rick Jelliffe