xml-dev - Re: Schemas and Other Crucial XML Questions

From: Tyler Baker <tyler@infinet.com>
Date: Mon, 10 Aug 1998 14:56:16 -0400

David Megginson wrote:

> Sam Gentile writes:
>
>  > > Also, we have been hearing rumors of a "short" XML notation. Is
>  > > there one?  We have a need to reduce the size of our buffers.
>
> No, there is no such thing.  XML's parent, SGML, included extensive
> facilities for markup minimisation and has suffered badly for it,
> since SGML tools are far too difficult to write (there is still not a
> single Java-based SGML parser, beside probably more than a dozen
> Java-based XML parsers).
>
> There are, however, alternatives: for example, you could compile the
> XML to a compact binary format for internal storage then decompile it
> back to a verbose format for export -- there's no requirement to store
> it internally as text.

Even very simple compression algorithms, Huffman encoding for instance, do very
well with XML documents, because each Name production (used for identifying tags,
among other things) can be converted to a short binary symbol that serves as an
index for looking up the actual name.  In fact, you could do all of this with
entities: take all of the Names specified in the DTD, put them into a list, and
declare an entity for each one.

You could index the list with base-10 digits, or use something as high as base 64
to encode the array references.

<!ENTITY % 0 "Foo">
<!ENTITY % 1 "Bar">

Then, for a document whose element types are named "Foo" and "Bar", occurrences
of:

<Foo></Foo>
<Bar></Bar>

would be converted to:

<0></0>
<1></1>
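
As a rough sketch of the substitution above, here is one way it could look in
Python. The function names (encode_index, decode_index, compact, expand) and the
64-character alphabet are my own assumptions for illustration, not anything
defined by XML. It also assumes every tag name appears in the name list and that
the document has no CDATA sections containing "<". Note that the compact form is
a private encoding rather than well-formed XML, since XML names cannot begin
with a digit.

```python
import re
import string

# Digit alphabet for indices up to base 64 (an assumption; any 64 distinct
# characters that cannot be confused with markup would do).
ALPHABET = string.digits + string.ascii_uppercase + string.ascii_lowercase + "-."

def encode_index(n, base=10):
    """Write the non-negative integer n in the given base (2..64)."""
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n > 0:
        n, r = divmod(n, base)
        digits.append(ALPHABET[r])
    return "".join(reversed(digits))

def decode_index(s, base=10):
    """Inverse of encode_index."""
    n = 0
    for ch in s:
        n = n * base + ALPHABET.index(ch)
    return n

# Matches the name in a start tag or end tag: "<Foo" or "</Foo".
TAG_NAME = re.compile(r"<(/?)([A-Za-z_][\w.\-]*)")
# Matches an encoded index in the same position: "<0" or "</0".
TAG_INDEX = re.compile(r"<(/?)([0-9A-Za-z.\-]+)")

def compact(doc, names, base=10):
    """Replace each tag name with its position in the name list."""
    index = {name: encode_index(i, base) for i, name in enumerate(names)}
    return TAG_NAME.sub(lambda m: "<" + m.group(1) + index[m.group(2)], doc)

def expand(doc, names, base=10):
    """Replace each encoded index with the name it refers to."""
    return TAG_INDEX.sub(
        lambda m: "<" + m.group(1) + names[decode_index(m.group(2), base)], doc
    )

names = ["Foo", "Bar"]  # the Names collected from the DTD, in list order
original = "<Foo></Foo>\n<Bar></Bar>"
small = compact(original, names)
restored = expand(small, names)
```

With the two-name list above, compact() produces the "<0></0>" and "<1></1>"
form shown, and expand() gets back the original text.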

For small documents, CDF for instance, these sorts of techniques may turn out to
be counter-productive.

Tyler

BTW, on a side note, I am having trouble understanding whether the external
subset or the internal subset should be parsed first.  I would assume that the
external subset goes first, but in that case INCLUDE and IGNORE sections would be
pretty useless.  As far as I can tell this is not clarified in the 1.0 spec, so
if someone could explain how a parser should handle it, I would greatly
appreciate it.

Thanx in advance...
