OASIS Mailing List ArchivesView the OASIS mailing list archive below
or browse/search using MarkMail.

 


Help: OASIS Mailing Lists Help | MarkMail Help

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: STAX Re: Abbreviated Tag Names



This inspired the following idea for a human readable (with some
counting) compression scheme:

Background:

1. What we are trying to compress is tag names and attribute names but
no data. (PCDATA or attribute values)
2. Names in XML can not start with digits.   [5] Name ::= (Letter | '_'
| ':') (NameChar)*

When writing the document to a stream the following algorithm is used:

Every time a Name is to be written, except for end tags wich are always
written as </>, the following rules apply in order:

1. If it is the same name as the last name that was written, use the
empty string. otherwise continue to [2]
2. If the name has been written earlier in the document, replace the
name with the number for the 'order of appearance' of the name. ie, the
first name to appear in the document, typically the element name of the
document element will get the number 0. (zero based indexing is assumed) 
3. If it is the first time the Name is used, write the whole name
unchanged and add it to the table of used names together with it's
number of appearance, and increase the appearance number counter.

This goes for both element and attribute names.


When parsing the compressed document the following is done:

Each time a valid Name is parsed (not starting with a digit) add it to
an array of names
Each time a number is parsed in the place of a name, look the name up in
the array using the number as the index

My experience is that the number of different Names in an xml document
is usually not that large and I think that the implementation should be
quite simple.

As an example the following example taken from the XML Schema Primer is
compressed:
	      <?xml version="1.0"?>
              <purchaseOrder orderDate="1999-10-20">
                  <shipTo country="US">
                      <name>Alice Smith</name>
                      <street>123 Maple Street</street>
                      <city>Mill Valley</city>
                      <state>CA</state>
                      <zip>90952</zip>
                  </shipTo>
                  <billTo country="US">
                      <name>Robert Smith</name>
                      <street>8 Oak Avenue</street>
                      <city>Old Town</city>
                      <state>PA</state>
                      <zip>95819</zip>
                  </billTo>
                  <comment>Hurry, my lawn is going wild!</comment>
                  <items>
                      <item partNum="872-AA">
                          <productName>Lawnmower</productName>
                          <quantity>1</quantity>
                          <USPrice>148.95</USPrice>
                          <comment>Confirm this is electric</comment>
                      </item>
                      <item partNum="926-AA">
                          <productName>Baby Monitor</productName>
                          <quantity>1</quantity>
                          <USPrice>39.98</USPrice>
                          <shipDate>1999-05-21</shipDate>
                      </item>
                  </items>
              </purchaseOrder>

The following names were found in order of appearance:
purchaseOrder 0
orderDate 1
shipTo 2
country 3
name 4
street 5
city 6
state 7
zip 8
billTo 9
comment 10
items 11
item 12
partNum 13
productName 14
quantity 15
USPrice 16
shipDate 17

The compressed document:
<?xml version="1.0"?>
<?staxx?>
<purchaseOrder orderDate="1999-10-20">
	<shipTo country="US">
		<name>Alice Smith</>
		<street>123 Maple Street</>
		<city>Mill Valley</>
		<state>CA</>
		<zip>90952</>
	</>
	<billTo 3="US">
		<4>Robert Smith</>
		<5>8 Oak Avenue</>
		<6>Old Town</>
		<7>PA</>
		<8>95819</>
	</>
	<comment>Hurry, my lawn is going wild!</>
        <items>
                      <item partNum="872-AA">
                          	<productName>Lawnmower</>
                          	<quantity>1</>
                          	<USPrice>148.95</>
                          	<10>Confirm this is electric</>
                      </>
                      <12 13="926-AA">
                          	<14>Baby Monitor</>
                          	<15>1</>
                          	<16>39.98</>
                          	<shipDate>1999-05-21</>
                      </>
        </>
</>

By compression we're saving the following text from being written:
name, street, city, state, zip, shipTo, country, name, name, street,
street, city, city, state, state, zip, zip, billTo, comment,
productName, quantity, USPrice, comment, comment, item, partNum,
productName, productName, quantity, quantity, USPrice, USPrice,
shipDate, item, items, purchaseOrder [229 characters]

18 characters for the numbers were added, saving a total of 211
characters out of originally 722. 
the processing instruction <?staxx?> was also added, 9 chars

The transfer was reduced with about 30% in this case.

Compared to the original compression scheme this should compress a bit
better, especially for longer documents where there are more than a few
occurences of the same Name.
It does however require a stack for handling the end tags, as well as
dynamic lookup tables during writing and parsing.  
I don't know if it is still SGML compliant, but I guess that there are
enough SGML expertise on xml-dev to help with that issue.

Please comment on anything I might have missed (It seemed a bit too easy
to actually work)


 
 
Rick Jelliffe wrote:
> 


> If anyone is interested in taking this further, I think it would be good.
> And it is probably the kind of small infrastructure upgrades that could be
> fun and doable for open-source and collaborative development.
> 
> Cheers
> Rick Jelliffe

-- 
Fredrik Lindgren
Goyada AB