- From: Peter Murray-Rust <email@example.com>
- To: "'XML-DEV'" <firstname.lastname@example.org>
- Date: Sat, 01 Apr 2000 00:09:02 +0100
I am forwarding the following to XML-DEV on behalf of the W3C-AF activity.
>From: "A.V.Ril" <email@example.com>
>Subject: Abbreviated format
>Date: Thu, 30 Mar 2000 13:48:16 +0100
>X-Priority: 3 (Normal)
>X-Mailer: Microsoft Outlook CWS, Build 9.0.2416 (9.0.2910.0)
>X-MimeOLE: Produced By Microsoft
> Please could you forward the following to XML-DEV as I am not subscribed.
> Many Thanks
>The W3C's XML Activities include an Abbreviation Format Activity which
>has been preparing a draft specification for XML compression. Normally this
>activity is for members only at this stage, but in response to the
>XML-DEV we have decided to make the first release of the draft available at:
> Please mail firstname.lastname@example.org with comments or queries.
> A. Veronica. Ril
I have checked this site but it is still password-protected. I have mailed
but it will take a little while to unprotect so I have her permission to
Since "XML *is* SGML" it is possible to use SGML minimisation techniques in
reducing the number of markup characters in a document. As all XML documents
are well-formed, there is no explicit need for end-tags, and these can be
replaced by newlines (technically REs or RSs in SGML - they are essentially
the same, but subtly different). Because this may make the element nesting
ambiguous, a DTD is prepended which defines unambiguous content models for
every tag.
Since every end tag is replaced by a newline, all start tags are found at
the start of lines and therefore the STAGO and STAGC characters ("pointy
brackets") can be
removed (remember that GIs cannot contain whitespace). This is an
"XML shall be human-readable and reasonably clear"
since there are no angle brackets. Thus a document of the form:
is compressed to:
greetings Hello World!
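
As a rough illustration of the step just described, the following Python
sketch abbreviates a single childless element and expands it again. The
input string is an assumption reconstructed from the compressed form shown
above, and the names abbreviate_leaf and expand_leaf are hypothetical; they
are not taken from the draft or from nsgmls.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def abbreviate_leaf(xml_text: str) -> str:
    """Drop the angle brackets and replace the end tag with a newline."""
    elem = ET.fromstring(xml_text)
    if len(elem) or elem.attrib:
        raise ValueError("sketch only handles a childless element without attributes")
    return f"{elem.tag} {elem.text or ''}\n"


def expand_leaf(line: str) -> str:
    """Rebuild the element; a GI cannot contain whitespace, so the first space ends it."""
    name, _, text = line.rstrip("\n").partition(" ")
    return f"<{name}>{escape(text)}</{name}>"


# Assumed input, reconstructed from the abbreviated form in the post.
short = abbreviate_leaf("<greetings>Hello World!</greetings>")
print(short, end="")       # greetings Hello World!
print(expand_leaf(short))  # <greetings>Hello World!</greetings>
```

Nested elements would need the content models from the DTD to recover the
nesting, which this sketch does not attempt.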
For a document of the form:
we transform to:
<!DOCTYPE foo [
<!ELEMENT foo (bar)>
<!ELEMENT bar (#PCDATA)>
]>
bar bar content
The nesting is unambiguous because of the content model. It can be proved
by forest automata that all documents can be reduced to unambiguous forms
and a suitable DTD written. Software to parse and maximise documents of
this sort into WF-XML already exists (nsgmls). Creating the compressed
representation is merely a matter of
running nsgmls backwards. Since James Clark has made the code OpenSource
it is a trivial matter to reorder the code in the reverse direction and
slmgsn. (There is no need to try to *understand* the code, which is beyond
We also create an "SGML declaration" in a separate file. This is also
compact since there are no vowels in it and all strings are <= 6 characters.
It is therefore
since vowels are redundant.
The process therefore consists of:
XML document --> slmgsn --> XML-AF over the wire --> nsgmls --> XML document
The document is therefore compressed automatically (while still remaining
as valid SGML)
and then reconstituted into its original form by the pre-parser (nsgmls).
Further minimisation is possible. Since SGML forbids duplicated enumerated
values, the names of attributes can be omitted. By reverse compiling all
enumerations in the DTD, no names, no equals signs and no quotes need be
transmitted, which saves a great deal of traffic. Thus:
can be minimised to:
<!DOCTYPE foo [
<!ELEMENT foo (#PCDATA)>
<!ATTLIST foo
att1 (a|b) #IMPLIED
att2 (c|d) #REQUIRED>
]>
foo d fudge
again with an increase in human readability. The DTD can be used to
differentiate the content from the attribute value.
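
The lookup this relies on can be sketched in Python as follows. The
attribute names att1 and att2, the ENUMERATIONS table and the function
expand_line are assumptions made for illustration; a real tool would read
the enumerations from the DTD rather than hard-code them.

```python
from xml.sax.saxutils import escape, quoteattr

# Hypothetical enumerations, standing in for the attribute list of "foo".
ENUMERATIONS = {"att1": ("a", "b"), "att2": ("c", "d")}

# "Reverse compile" the enumerations: each value maps back to one attribute name.
VALUE_TO_NAME = {v: n for n, values in ENUMERATIONS.items() for v in values}
if len(VALUE_TO_NAME) != sum(len(values) for values in ENUMERATIONS.values()):
    raise ValueError("an enumerated value is shared by two attributes")


def expand_line(line: str) -> str:
    """Turn 'foo d fudge' back into an element with named attributes."""
    tokens = line.split(" ")
    name, rest = tokens[0], tokens[1:]
    attrs = {}
    # Leading tokens that are enumerated values are taken to be attribute values.
    while rest and rest[0] in VALUE_TO_NAME:
        value = rest.pop(0)
        attrs[VALUE_TO_NAME[value]] = value
    attr_text = "".join(f" {k}={quoteattr(v)}" for k, v in attrs.items())
    return f"<{name}{attr_text}>{escape(' '.join(rest))}</{name}>"


print(expand_line("foo d fudge"))  # <foo att2="d">fudge</foo>
```

Note that the sketch cannot tell an attribute value from content that
happens to begin with the same token (for example the word "a").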
Note that in this case the document has to contain a "foo" element, so the
can be omitted. The document now looks like:
which is about as short as we can get. The human reader can easily work out
the tags and attribute names from the DTD.
When long words occur repeatedly in the text, they can be minimised through
entities. Thus a long word like "internationalisation" can be defined as a
text entity in the DTD and referred to in the text as &i18;. This is another
great saving in transmission and, because of the shorter volume of text, it
is clearly more readable.
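
A minimal Python sketch of this substitution, assuming a one-entry entity
table. ENTITIES, minimise and expand are hypothetical names; in the scheme
described the entity would be declared in the DTD.

```python
import re

# Hypothetical entity set; the name "i18" follows the reference used in the post.
ENTITIES = {"i18": "internationalisation"}


def minimise(text: str) -> str:
    """Replace each long word with its entity reference."""
    for name, replacement in ENTITIES.items():
        text = text.replace(replacement, f"&{name};")
    return text


def expand(text: str) -> str:
    """Replace known entity references with their text; leave unknown ones alone."""
    return re.sub(r"&(\w+);", lambda m: ENTITIES.get(m.group(1), m.group(0)), text)


original = "internationalisation needs internationalisation testing"
short = minimise(original)
print(short)  # &i18; needs &i18; testing
assert expand(short) == original
```

The sketch does not handle text that already contains a literal "&", which
would have to be escaped before minimising.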
It may appear that the document has become minimised at the expense of the
have suggested a clever way round this. Common text entities are collected
into "entity sets" and these can be pre-distributed with XML parsers,
browsers and other client-side software.
dictionaries have been engineered in this fashion. Similarly, common DTDs
and Schemas will be enhanced as XML-AF and since most of these will be
built into the
users will only need to send the minimised XML document.
To manage the entity sets, schemas and DTDs XML-AF have suggested the use of
a "catalog". This catalog can use URIs or FPIs to reference the entities to
be used. By careful use of FPIs the actual entity sets need only be
referenced, not
Some documents, especially purchase orders, will become very common. In
this case substantial parts of the purchase order will be "boilerplate" XML
and will be invariant. These can be defined as larger entities and
pre-installed on clients. In this way many documents will consist of a few
entity references, enormously cutting down on traffic. Indeed, for repeat
orders it is only necessary to send the URL for the DTD and a single entity
reference.
I shall certainly be developing a CML version of this. Rather than
transmitting complete molecules over the WWW, I now only need to transmit
an entity reference on the assumption that every client will have (or be
able to download) a DTD describing that molecule as an
Note that a purchase order might now look like:
It combines extreme terseness - no characters are wasted - with complete
Everyone will be using the same schema for purchase orders (the one in the
document, since no one has yet managed to work out how to write other
ones). The document above is therefore unambiguous. A really exciting
possibility is that schemas themselves can be similarly compressed. Since a
schema *is* XML and since XML *is* SGML,
be compressed to human-readable length.
This is, of course, only suggested as a compressed transfer format. However
its other virtues (readability and compatibility with SGML) mean that it may
even start replacing XML V1.0 in critical places.
Note, of course, that conventional compression techniques (ZIP, LZW, etc.)
can still be applied to the result, which will normally be only a few bytes.
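
For anyone wishing to try this, a small Python sketch using the standard
zlib module (DEFLATE, the method commonly used inside ZIP files) is given
below. The sample input is the abbreviated greeting from earlier in this
message; the variable names are illustrative only. For inputs this short
the fixed overhead of the compressed stream can outweigh any saving, so the
sketch prints both sizes rather than assuming a result.

```python
import zlib

# Abbreviated document from the earlier example, encoded for the wire.
abbreviated = "greetings Hello World!\n".encode("utf-8")

# Compress at the highest level and check that the round trip is lossless.
packed = zlib.compress(abbreviated, 9)
restored = zlib.decompress(packed)
assert restored == abbreviated

# Print both sizes so the effect on a very short input can be inspected.
print(len(abbreviated), len(packed))
```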
I commend the work of the XML-AF activity and look forward to seeing