xml-dev - Re: [xml-dev] The subsetting has begun


To: XML DEV <xml-dev@lists.xml.org>
Subject: Re: [xml-dev] The subsetting has begun
From: "W. E. Perry" <wperry@fiduciary.com>
Date: Wed, 26 Feb 2003 15:52:44 -0500
Organization: Fiduciary Automation
References: <5.1.0.14.0.20030226200111.01e9bd48@mail.propylon.com>

Sean McGrath wrote:

> The instance is UnicodeWithAngleBrackets for sure. But an XML compliant
> parser must turn this mixture into a tree. If it can't, surely, the instance
> is not WF? I don't see how a parser can match production [1] of the XML spec
> without turning the UnicodeWithAngleBrackets into a tree. The tree might be
> communicated in its entirety to the application (a la DOM) or in a stream of
> events (a la SAX) but there is always a tree there.

An XML-compliant parser is *not* obliged to turn an XML instance into any one
particular output. Production [1] is the syntactic criterion that *input* to an
XML parser must meet to be accepted as a document, as the Rec requires. Nothing in that
production (nor in any other) says anything about the form of output that a
parser must give to an entity matching this definition of document. And that is
just the point. Different styles of parsers natively produce different styles
of output. It could not be otherwise; parsers are processors like any other and
like all processors give a form to their output which reflects a particular
understanding of it. Subsequent users of that output are not obliged to bring
that same particular understanding to their own processing of it. To say that a
processor is general-purpose is to say that the form which it gives to its
output does not preclude any subsequent use of that same output understood in
entirely different terms. In practice, there will be uses of parser output
which will be precluded by the form which the parser has given to that output.
This is inevitable in the specific implementation of processors, whether
parsers or any other. In such cases, that particular parser will not be
sufficiently general purpose for that particular subsequent process, but the
difficulty can be cured by changing to a different style of parser whose native
output is sufficiently general to the subsequent process required. Just such
considerations will often decide whether a SAX or a DOM or some other style of
parser is appropriate to a particular case. It does not mean, however, that
'there is always a tree there'. Perhaps in either the case of SAX or DOM a tree
can be built if that is what a process subsequent to parsing chooses to do, but
in terms of processing the input XML instance a SAX parser emits SAX events and
a DOM parser renders a data structure defined by its particular DOM.
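
To make the contrast concrete, here is a minimal sketch using the SAX and DOM
implementations in the Python standard library. The sample document and the
names DOC, EventLog, sax_events and dom_item_texts are assumptions made up for
this illustration; they are not part of either API. The same input yields a
flat sequence of events from one parser and a navigable data structure from
the other.

```python
import xml.sax
from xml.dom import minidom

# Hypothetical sample instance, chosen only for illustration.
DOC = "<order><item>pen</item><item>ink</item></order>"


class EventLog(xml.sax.ContentHandler):
    """Records SAX callbacks in arrival order; it builds no tree."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def characters(self, content):
        # A SAX parser may deliver character data in more than one chunk.
        self.events.append(("chars", content))

    def endElement(self, name):
        self.events.append(("end", name))


def sax_events(text):
    """Native SAX output: a linear list of events."""
    handler = EventLog()
    xml.sax.parseString(text.encode("utf-8"), handler)
    return handler.events


def dom_item_texts(text):
    """Native DOM output: a data structure that the caller navigates."""
    doc = minidom.parseString(text)
    return [node.firstChild.data for node in doc.getElementsByTagName("item")]


if __name__ == "__main__":
    for event in sax_events(DOC):
        print(event)
    print(dom_item_texts(DOC))
```

A consumer of the event list is free to build a tree from it, or to count
elements, or to ignore nesting altogether; that choice belongs to the process
after parsing, not to the parser.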

> At one level of interpretation - mid-parse as it were - prior to entity
> expansion, the parser's internal model might have shared sub-trees given that
> the same entity can occur more than once. But, past entity resolution - the
> stuff passed on to the application - is a tree.

I would argue that there is no visible 'mid-parse' which we might reasonably
discuss. There is only the input XML instance and, if it survives draconian
error handling, there is the particular output in the particular style of the
parser. That output is the transitional state, and though in many cases a tree
might be instantiated upon it, it is itself of a form native to the style of
parser.

> The beauty of always starting with the UnicodeWithAngleBrackets is that it
> forces a separation between the process-specific and that which is innate in
> the data.

Amen. And from the *parser's* perspective (as opposed to that of some
subsequent processor) all that is innate in the data is compliance with WFCs,
perhaps VCs, or the lack of it.
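
As a rough sketch of that point, a non-validating parser can be used purely as
a yes/no test of well-formedness, discarding whatever output it would
otherwise produce. The helper name is_well_formed is an assumption for this
example; the calls are to the expat binding in the Python standard library,
which checks WFCs but not VCs.

```python
from xml.parsers import expat


def is_well_formed(text):
    """Return True if expat accepts the text as a well-formed document."""
    parser = expat.ParserCreate()
    try:
        # The second argument marks this as the final chunk of input.
        parser.Parse(text, True)
    except expat.ExpatError:
        # Draconian handling: the first well-formedness error ends the parse.
        return False
    return True


if __name__ == "__main__":
    print(is_well_formed("<a><b/></a>"))
    print(is_well_formed("<a><b></a>"))
```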

> In SGML, we had a name for the latter "markup-aware" as distinct from
> "structure controlled".

I believe that I understand the distinction. In performing its job qua parser
the parser is necessarily markup-aware. It cannot be structure controlled
because the structure which you are expecting to find is instantiated on the
output of parsing, not inherent in the (pre-parsed!) input instance. The
distinction which you make is really a distinction in what various post-parsing
processes should operate upon, given their particular natures.

Respectfully,

Walter Perry

References:
- Re: [xml-dev] The subsetting has begun
  - From: Sean McGrath <sean.mcgrath@propylon.com>
