- From: "F. Chahuneau - General Manager" <fcha@Berger-Levrault.fr>
- To: firstname.lastname@example.org
- Date: Sat, 1 Mar 1997 19:19:24 +0100 (MET)
> ESIS doesn't retain everything from the original document(s) and I've
> been asking the experts what gets lost.
In case someone wants to get even more precise information, ESIS (Element
Structure Information Set) is fully defined in annex G of document
ISO/IEC/JTC1/SC18/WG8/N1035: Recommendations for a Possible Revision of ISO
8879 (SGML). You can find an exact replication of this passage in Charles
Goldfarb's "SGML Handbook" (Clarendon Press, 1990), pp 588 to 591.
> My rough summary is that
> XML->ESIS loses:
> - comments (this matters if you want to edit the document or have
> it read by humans. However comments should not be used
> by machines - simply passed through)
> - entities. If your document includes entities such as &chapter1;
> these may be expanded and replaced by their contents. In
> this way some of the structure may be less clear
It's actually more complex than that.
SGML *text* entity references, whether entities are "internal" or
"external", are indeed fully expanded and you are not even notified of this in
the ESIS event stream. Therefore, ESIS does not convey the "entity
structure" of an SGML document. This is, by the way, irrelevant to most
applications ... except for those, such as some SGML editors, whose purpose
is seen as being able to manipulate SGML documents without arbitrarily
altering their entity structure (in addition to their element structure).
External data entity references, internal SDATA and PI entity references
are signaled in the ESIS, while CDATA internal entity references are
expanded without being reported. This may appear to be a bizarre design
choice, but there is something even more disturbing: in the case of
internal SDATA entity references, only the entity "replacement value" is
passed, not the entity "name". This is one of the reasons why ESIS
information alone does not allow one to implement an "identity
transformation" for SGML documents, even when you don't care about the
physical decomposition of the document into several files (SGML entities).
Note that SDATA entities disappear in XML, so that THIS PROBLEM DISAPPEARS AS
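The entity handling described above can be sketched in a few lines of
Python. This is only an illustration under stated assumptions: the
EntityRef class, the to_esis function and the event names are hypothetical
and come from neither ISO 8879 nor any real parser, and a text entity is
assumed to contain plain character data only (no markup of its own).

```python
from dataclasses import dataclass


@dataclass
class EntityRef:
    """Hypothetical stand-in for an entity reference in document content."""
    name: str   # e.g. "chapter1"
    kind: str   # "text", "cdata", "sdata", "pi" or "external-data"
    value: str  # replacement text (unused for external data entities)


def to_esis(content):
    """Flatten mixed content into ESIS-like (kind, value) events."""
    events = []
    for item in content:
        if isinstance(item, str):
            events.append(("data", item))
        elif item.kind in ("text", "cdata"):
            # Expanded silently: nothing records that a reference was here.
            events.append(("data", item.value))
        elif item.kind == "sdata":
            # Signaled, but only the replacement value survives, not the name.
            events.append(("sdata", item.value))
        elif item.kind == "pi":
            events.append(("pi", item.value))
        elif item.kind == "external-data":
            # Signaled by name.
            events.append(("external-data", item.name))
        else:
            raise ValueError(f"unknown entity kind: {item.kind}")
    return events


content = [
    "See ",
    EntityRef("chapter1", "text", "Chapter One"),
    " and ",
    EntityRef("alpha", "sdata", "[alpha ]"),
]
events = to_esis(content)
```

Neither "chapter1" nor "alpha" appears anywhere in the resulting events, so
a program that sees only this stream cannot write the original references
back out.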
> - conditional markup. If you use INCLUDE and/or IGNORE then the
> IGNORE'd sections won't come through and the INCLUDE'd
> ones won't be marked as such
> [I think that processing instructions come through OK?
> And that you can determine whether an attribute value was defaulted
> or not?]
Unfortunately not. This information is unavailable in ESIS, and you would
need to access some "DTD information set" to be able to recover it. Besides
attribute names and de facto values, the only side information you have in
ESIS is when the value for an #IMPLIED attribute has not been specified.
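A small Python sketch of what this means for attributes. The IMPLIED marker
and the esis_attributes function are hypothetical names chosen for
illustration; the dictionary of declarations stands in for the "DTD
information set" that an ESIS consumer does not get to see.

```python
IMPLIED = object()  # hypothetical marker: "#IMPLIED and not specified"


def esis_attributes(declarations, specified):
    """Return the attribute values an ESIS consumer would see.

    declarations maps each attribute name to its default value, or to
    IMPLIED when the attribute is declared #IMPLIED.
    specified maps attribute names to values written in the start tag.
    """
    seen = {}
    for name, default in declarations.items():
        if name in specified:
            seen[name] = specified[name]
        elif default is IMPLIED:
            seen[name] = IMPLIED   # the one piece of side information
        else:
            seen[name] = default   # defaulted, but not flagged as such
    return seen


declarations = {"align": "left", "id": IMPLIED}
explicit = esis_attributes(declarations, {"align": "left"})
defaulted = esis_attributes(declarations, {})
```

Here explicit and defaulted are equal: from the consumer's side, an
attribute the author wrote as align="left" looks exactly like one the
parser filled in from the default.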
There is one more piece of information missing from ESIS, which makes it a
problem to implement an "identity transformation" for plain SGML documents:
you don't know WHICH ELEMENTS HAVE BEEN DECLARED EMPTY in the DTD. You
may know when an element has null content, but you don't know whether this
is because it happens to be so (optional content) or because it can't have
any (declared EMPTY). Therefore, you do not know whether you should output
an end tag for it or not. Again, you would need some "DTD information" to
disambiguate. Maybe not everyone has realized it yet, but this *is* the one
and only reason why XML introduces this explicit <EMPTY/> syntax for empty
elements. This, again, makes this problem disappear with XML.
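The end-tag ambiguity can be shown with a toy serializer in Python. The
event tuples and the serialize_sgml function are hypothetical, attributes
are left out, and no escaping is done; the only point is that the output
depends on a set of element names that the event stream itself does not
carry.

```python
def serialize_sgml(events, declared_empty):
    """Write (kind, value) events back out as SGML text.

    declared_empty is the set of element names declared EMPTY in the DTD.
    It has to be supplied separately, because the events do not say so.
    """
    out = []
    for kind, value in events:
        if kind == "start":
            out.append(f"<{value}>")
        elif kind == "end":
            # An end tag is only written for elements not declared EMPTY.
            if value not in declared_empty:
                out.append(f"</{value}>")
        else:  # character data
            out.append(value)
    return "".join(out)


events = [
    ("start", "p"),
    ("data", "line one"),
    ("start", "br"), ("end", "br"),      # empty because it is declared so
    ("start", "note"), ("end", "note"),  # empty because it happens to be
    ("end", "p"),
]

with_dtd = serialize_sgml(events, {"br"})
without_dtd = serialize_sgml(events, set())
```

The br and note elements look identical in the event list. With the DTD
information the serializer omits the br end tag; without it, it has to
guess, and here it writes an end tag for br that the DTD does not allow.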
All in all, you can see that some design decisions in XML were precisely
motivated by the desire to make an ESIS event stream sufficient to
implement an identity transformation, even with no access to DTD
information. This is, of course, totally consistent with the idea that DTDs
should not be systematically needed for processing XML fragments.
Whether you work with an event stream or an abstract tree(*) is orthogonal
to this discussion: we are discussing the *available* information, not the
way it is represented. This does not mean that I see abstract trees as
useless, quite the contrary (see my previous mail).
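A rough Python sketch of why the two representations are equivalent in
information content. The function names and the dictionary-based node
layout are hypothetical, and attributes are omitted to keep it short:
converting events to a tree and back yields the same events, so nothing is
gained or lost by choosing one form over the other.

```python
def events_to_tree(events):
    """Build a list of nodes (dicts) and strings from (kind, value) events."""
    root = {"name": None, "children": []}
    stack = [root]
    for kind, value in events:
        if kind == "start":
            node = {"name": value, "children": []}
            stack[-1]["children"].append(node)
            stack.append(node)
        elif kind == "end":
            stack.pop()
        else:  # character data
            stack[-1]["children"].append(value)
    return root["children"]


def tree_to_events(children):
    """Walk the tree in document order and regenerate the events."""
    events = []
    for child in children:
        if isinstance(child, str):
            events.append(("data", child))
        else:
            events.append(("start", child["name"]))
            events.extend(tree_to_events(child["children"]))
            events.append(("end", child["name"]))
    return events


events = [
    ("start", "doc"),
    ("start", "title"), ("data", "ESIS"), ("end", "title"),
    ("start", "p"), ("data", "Some text."), ("end", "p"),
    ("end", "doc"),
]
tree = events_to_tree(events)
round_trip = tree_to_events(tree)  # same list as events
```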
I hope I helped clarify what ESIS was.
(*): I use the term "abstract tree" instead of "parse tree" to designate
the "tree of typed nodes with attributes" (you could also say "SGML object
tree", but this term seems to be somewhat overloaded these days...). From an
SGML parser's point of view, an SGML "parse tree" would have distinct nodes
for start tags and end tags, which is not what you are looking for when you
want a useful representation that allows cutting and pasting SGML elements
(seen as atomic, typed text objects with attached properties).
_/ François CHAHUNEAU phone: [+33] 1 40 64 43 00 _/
_/ Directeur Général/General Manager _/
_/ AIS S.A. FAX: [+33] 1 40 64 43 10 _/
_/ 15-17 rue Rémy Dumoncel email: email@example.com _/
_/ 75014, Paris, FRANCE WWW: http://www.berger-levrault.fr _/
xml-dev: A list for W3C XML Developers
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/
List coordinator, Henry Rzepa (email@example.com)