xml-dev - (Fwd) Abbreviated Format for XML

From: Peter Murray-Rust <peter@ursus.demon.co.uk>
To: "'XML-DEV'" <xml-dev@xml.org>
Date: Sat, 01 Apr 2000 00:09:02 +0100
I am forwarding the following to XML-DEV on behalf of the W3C-AF activity.

>Reply-To: <avr@w3c.org>
>From: "A.V.Ril" <avril@w3.org>
>To: "Peter Murray-Rust" <peter@ursus.demon.co.uk>
>Subject: Abbreviated format
>Date: Thu, 30 Mar 2000 13:48:16 +0100
>Message-ID: <001501bf9a49$b44e8c80$9999a8c0@p300>
>MIME-Version: 1.0
>Content-Type: text/plain; charset="iso-8859-1"
>Content-Transfer-Encoding: 7bit
>X-Priority: 3 (Normal)
>X-MSMail-Priority: Normal
>X-Mailer: Microsoft Outlook CWS, Build 9.0.2416 (9.0.2910.0)
>X-MimeOLE: Produced By Microsoft MimeOLE V5.00.2314.1300
>In-Reply-To: <38E276C2.2404C26B@mitre.org>
>Importance: Normal
>Precedence: bulk
>
>Peter
>
>	Please could you forward the following to XML-DEV as I am not subscribed.
>
>	Many Thanks
>
>	Veronica
>--------------------------8X-------------------------
>
>The W3C's XML Activities include an Abbreviation Format Activity which
>has been preparing a draft specification for XML compression. Normally this
>activity is for members only at this stage, but in response to the discussion
>on XML-DEV we have decided to make the first release of the draft available at:
>
>http://www.w3.org/NOTE/XML-AF-2000-04-01.html
>
> Please mail xml-af-list@w3.org with comments or queries.
>
> A. Veronica. Ril

I have checked this site but it is still password-protected. I have mailed
Veronica, but it will take a little while to unprotect, so I have her
permission to summarise.

Since "XML *is* SGML", it is possible to use SGML minimisation techniques to
reduce the number of markup characters in a document. As all XML documents
are well-formed, there is no explicit need for end-tags, and these can be
replaced by newlines (technically REs or RSs in SGML - they are essentially
the same, but subtly different). Because this may make the element nesting
ambiguous, a DTD is prepended which defines unambiguous content models for
every tag.

Since every end-tag is replaced by a newline, all start-tags are found at the
start of lines and therefore the STAGO and TAGC characters ("pointy
brackets") can be removed (remember that GIs cannot contain whitespace). This
is an improvement, IMO, towards

"XML shall be human-readable and reasonably clear"

since there are no angle brackets. Thus a document of the form:

<greetings>Hello World!</greetings>

is compressed to:

greetings Hello World!

For a document of the form:

<foo><bar>bar content</bar></foo>

we transform to:

<!DOCTYPE foo [
<!ELEMENT foo (bar)>
<!ELEMENT bar (#PCDATA)>
]>

foo
bar bar content

The nesting is unambiguous because of the content model. It can be proved by
forest automata that all documents can be reduced to unambiguous forms and a
suitable DTD written. Software to parse and maximise documents of this sort
into WF-XML already exists (nsgmls). Creating the compressed representation
is merely a matter of running nsgmls backwards. Since James Clark has made
the code open source, it is a trivial matter to reorder the code in the
reverse direction and recompile to slmgsn. (There is no need to try to
*understand* the code, which is beyond mortals like me anyway!)
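
For readers who would like to see the minimisation step spelled out, here is
a minimal sketch in Python. It is not slmgsn and is not derived from nsgmls;
the function name abbreviate is hypothetical, and the sketch assumes elements
without attributes, mixed content or trailing text. It only reproduces the
two examples above: one element per line, GI first, no end-tags.

```python
import xml.etree.ElementTree as ET


def abbreviate(xml_text):
    """Drop end-tags and angle brackets: one element per line, GI first."""
    lines = []

    def walk(elem):
        text = (elem.text or "").strip()
        # The GI starts the line; any character data follows after a space.
        lines.append(f"{elem.tag} {text}" if text else elem.tag)
        for child in elem:
            walk(child)

    walk(ET.fromstring(xml_text))
    return "\n".join(lines)


print(abbreviate("<greetings>Hello World!</greetings>"))
print(abbreviate("<foo><bar>bar content</bar></foo>"))
```

The DTD that makes the nesting recoverable is not generated here; it would
have to be written or derived separately.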

We also create an "SGML declaration" in a separate file. This is also compact
since there are no vowels in it and all strings are <= 6 characters. It is
therefore very readable, since vowels are redundant.

The process therefore consists of:

XML document --> slmgsn -->  XML-AF over the wire --> nsgmls --> XML document

The document is therefore compressed automatically (while still remaining as
valid SGML) and then reconstituted into its original form by the pre-parser
(nsgmls).
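
The reconstitution step can be sketched in the same spirit. This is an
illustration only, not what nsgmls does internally: the CONTENT_MODELS table
is a hypothetical stand-in for the DTD, the function name maximise is
invented, and only strict sequences of child elements or (#PCDATA) are
handled.

```python
from xml.sax.saxutils import escape

# Hypothetical stand-in for the DTD: each GI maps to the ordered list of
# child GIs it must contain, or to None for (#PCDATA).
CONTENT_MODELS = {"foo": ["bar"], "bar": None}


def maximise(abbreviated, models, root):
    """Rebuild start- and end-tags from the abbreviated lines."""
    lines = [ln for ln in abbreviated.splitlines() if ln.strip()]
    pos = 0

    def build(gi):
        nonlocal pos
        name, _, text = lines[pos].partition(" ")
        if name != gi:
            raise ValueError(f"expected {gi!r}, found {name!r}")
        pos += 1
        children = models[gi]
        if children is None:
            body = escape(text)  # character data only
        else:
            # The content model tells us which elements must follow.
            body = "".join(build(child) for child in children)
        return f"<{gi}>{body}</{gi}>"

    return build(root)


print(maximise("foo\nbar bar content", CONTENT_MODELS, "foo"))
```

The content model does all the work here: without it the parser could not
tell whether "bar" is a child of "foo" or a sibling.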

Further minimisation is possible. Since SGML forbids duplicated enumerated
attribute values, the names of attributes can be omitted. By reverse
compiling all attributes into enumerations in the DTD, no names, no equals
signs and no quotes need be included, which saves a great deal of traffic.
Thus:

<foo att2="d">fudge</foo>

can be minimised to:

<!DOCTYPE foo [
<!ELEMENT foo (#PCDATA)>
<!ATTLIST foo 
	att1 (a|b) #IMPLIED 
	att2 (c|d) #REQUIRED
]>

foo d fudge

again with an increase in human readability. The DTD can be used to
differentiate the content from the attribute value.
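
A rough sketch of how a receiver might put the attribute names back, assuming
the second enumeration belongs to att2 as in the example document. The
ENUMERATIONS table and the function name restore_attributes are hypothetical;
the table stands in for what would be read from the ATTLIST.

```python
# Hypothetical table read from the ATTLIST: attribute name -> allowed tokens.
# The tokens must be unique across attributes for the lookup to work.
ENUMERATIONS = {"att1": ("a", "b"), "att2": ("c", "d")}


def restore_attributes(line, enumerations):
    """Turn 'foo d fudge' back into a tagged element with named attributes."""
    value_to_name = {
        value: name for name, values in enumerations.items() for value in values
    }
    gi, *tokens = line.split(" ")
    attrs = []
    # Leading tokens that match an enumerated value are taken as attributes.
    while tokens and tokens[0] in value_to_name:
        value = tokens.pop(0)
        attrs.append(f' {value_to_name[value]}="{value}"')
    content = " ".join(tokens)
    return f"<{gi}{''.join(attrs)}>{content}</{gi}>"


print(restore_attributes("foo d fudge", ENUMERATIONS))
```

Note the weakness of the lookup: content that happens to begin with one of
the enumerated tokens (for example "a fudge") would be misread as an
attribute value, so the tokens have to be chosen with care.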

Note that in this case the document has to contain a "foo" element, so the
element name can be omitted. The document now looks like:

d fudge

which is about as short as we can get. The human reader can easily work out
the tags and attribute names from the DTD.

When long words occur repeatedly in the text, they can be minimised through
entities. Thus a long word like "internationalisation" can be defined as a
text entity in the DTD and referred to in the text as &i18;. This is another
great saving in transmission and, because of the smaller volume of text, it
is clearly more readable.
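
The entity substitution can be sketched as a pair of functions. The ENTITIES
table and both function names are hypothetical; in practice the table would
come from the entity declarations in the DTD.

```python
import re

# Hypothetical entity table: entity name -> replacement text.
ENTITIES = {"i18": "internationalisation"}


def minimise_words(text, entities):
    """Replace each whole-word occurrence of a long word by its entity reference."""
    for name, word in entities.items():
        text = re.sub(rf"\b{re.escape(word)}\b", f"&{name};", text)
    return text


def expand_entities(text, entities):
    """Expand known entity references; leave unknown ones untouched."""
    return re.sub(
        r"&(\w+);", lambda m: entities.get(m.group(1), m.group(0)), text
    )


original = "internationalisation is hard; internationalisation is long"
short = minimise_words(original, ENTITIES)
print(short)
print(expand_entities(short, ENTITIES) == original)
```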

It may appear that the document has become minimised at the expense of the
DTD, but XML-AF have suggested a clever way round this. Common text entities
are collected into "entity sets" and these can be pre-distributed with XML
parsers, browsers and other client-side software. Various multilingual
dictionaries have been engineered in this fashion. Similarly, common DTDs and
Schemas will be enhanced as XML-AF and, since most of these will be built
into the browser anyway, users will only need to send the minimised XML
document.

To manage the entity sets, schemas and DTDs, XML-AF have suggested the
concept of a "catalog". This catalog can use URIs or FPIs to reference the
entities to be used. By careful use of FPIs the actual entity sets need only
be referenced, not distributed.
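
As an illustration of the catalog idea, the sketch below resolves an FPI to a
locally installed file. The PUBLIC entries loosely follow the style of
SGML Open catalogs, but the FPIs, the file paths and the function names are
all invented for this example and are not taken from the XML-AF draft.

```python
import shlex

# Hypothetical catalog: the FPIs and local paths are made up for illustration.
CATALOG_TEXT = '''
PUBLIC "-//EXAMPLE//ENTITIES Common Words//EN" "entities/common-words.ent"
PUBLIC "-//EXAMPLE//DTD Purchase Order//EN" "dtds/purchase-order.dtd"
'''


def parse_catalog(text):
    """Read 'PUBLIC "fpi" "system-id"' lines into a dictionary."""
    catalog = {}
    for line in text.splitlines():
        parts = shlex.split(line)  # honours the double-quoted strings
        if len(parts) == 3 and parts[0] == "PUBLIC":
            catalog[parts[1]] = parts[2]
    return catalog


def resolve(fpi, catalog):
    """Return the pre-installed file for an FPI, or None if it is unknown."""
    return catalog.get(fpi)


catalog = parse_catalog(CATALOG_TEXT)
print(resolve("-//EXAMPLE//DTD Purchase Order//EN", catalog))
```

A document then only needs to carry the FPI; the client looks it up and never
fetches the entity set over the wire.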

Some documents, especially purchase orders, will become very common. In this
case substantial parts of the purchase order will be "boilerplate" XML and
will be invariant. These can be defined as larger entities and pre-installed
on clients. In this way many documents will consist of a few entity
references, enormously cutting down on traffic. Indeed, for repeat orders it
is only necessary to send the URL for the DTD and a single entity reference.

I shall certainly be developing a CML version of this. Rather than
transmitting complete molecules over the WWW, I now only need to transmit an
entity reference on the assumption that every client will have (or be able to
download) a DTD describing that molecule as an entity.

Note that a purchase order might now look like:

Thomas Pynchon
3
Acme
01/02/03

It combines extreme terseness - no characters are wasted - with complete
human readability. Everyone will be using the same schema for purchase orders
(the one in the current schema-0 document, since no one has yet managed to
work out how to write other ones). The document above is therefore
unambiguous. A really exciting possibility is that schemas themselves can be
similarly compressed. Since a schema *is* XML and since XML *is* SGML,
schemas will be compressed to human-readable length.

This is, of course, only suggested as a compressed transfer format. However,
its other virtues (readability and compatibility with SGML) mean that it may
even start replacing XML V1.0 in critical places.

Note, of course, that conventional compression techniques (ZIP, LZW, etc.)
can still be applied to the result, which will normally be only a few bytes.
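
A short sketch of that final step, using Python's standard zlib module as one
example of a conventional compressor. The function names pack and unpack are
hypothetical, and the choice of zlib is an assumption; the draft is not
quoted as naming a particular library.

```python
import zlib


def pack(abbreviated: str) -> bytes:
    """Apply conventional (DEFLATE) compression to the abbreviated document."""
    return zlib.compress(abbreviated.encode("utf-8"))


def unpack(packed: bytes) -> str:
    """Reverse pack() and return the abbreviated document."""
    return zlib.decompress(packed).decode("utf-8")


document = "d fudge"
packed = pack(document)
assert unpack(packed) == document
print(len(document.encode("utf-8")), "bytes before,", len(packed), "bytes after")
```

For inputs of only a few bytes the zlib header and checksum usually outweigh
any saving, so it is worth measuring before and after.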

I commend the work of the XML-AF activity and look forward to seeing
implementations.

	P.

***************************************************************************
This is xml-dev, the mailing list for XML developers.
To unsubscribe, mailto:majordomo@xml.org&BODY=unsubscribe%20xml-dev
List archives are available at http://xml.org/archives/xml-dev/
***************************************************************************