[
Lists Home |
Date Index |
Thread Index
]
- From: Clark Evans <clark.evans@manhattanproject.com>
- To: xml-dev@ic.ac.uk
- Date: Wed, 10 Feb 1999 00:28:57 +0000
Abstract:
By fixing the content of an XML file, a
position based index mechanism can be added
to XML files, allowing fractal parsing.
Introduction:
In a thin-client/server environment, especially those
implemented in an interpreted language, like Java,
is important to minimise client-side processing by
doing server-side pre-processing.
For example, suppose that an on-line shopping web
site has a thin-client ordering java applet. It could
quickly download, and start accepting customer
information, and other input. Simoutanenously,
it could be downloading a 250K+ file(s) containing
the package and product list, authorized shipping
agents, tax calculation tables, etc. Advanced
versions of the applet would "cashe" a copy of the
catalog locally, and only download deltas.
Several pre-processing items could occur, the most
obvious being a translation of the normalized schema:
PRODUCT_CATEGORY (CATEGORY_ID, CATEGORY_NAME)
BUNDLE_OF_PRODUCTS (BUNDLE_ID, BUNDLE_NAME, BUNDLE_PRICE)
VENDOR (VENDOR_ID, VENDOR_NAME)
BUNDLE-PRODUCT (BUNDLE_ID,PRODUCT_ID)
PRODUCT (PRODUCT_ID, PRODUCT_NAME,
CATEGORY_ID, INVIDUAL_SALE_FLAG,
PRICE_IF_SOLD_INDIVIDUALLY )
PRODUCT-VENDOR (PRODUCT_ID,VENDOR_ID)
BUNDLE-VENDOR (BUNDLE_ID,VENDOR_ID)
into a hierarchical drill-down that better meets
the particular needs of the order-entry client:
<catalog>
<product-category>
<product-bundle>
<product>
<vendor>
<individual-product>
<vendor>
In this example, several joins are interwoven into a
a single hierarchical "snapshot" to support the
the drill down requirements in the order-entry client.
Notice, that product-bundles, products, and vendors
*will* be duplicated with this scheme, this de-normalization
is exactly what is required since it makes the processing
on the client simpler. Here XML complements the
relational database by providing a de-normalized
stream of data instead of a normalized repository.
For another example, suppose a roaming-sales person
receives an update every morning in his e-mail with
new products, discontinued products, changes in pricing,
packaging, etc. Then, during the day, the sales peson
goes "door-to-door" selling the products and taking orders.
The orders are collected on his/her hard drive untill
the evening, when they are uploaded to the server for
approval.
I see XML as a great move forward in a standard transport
layer for this form of communication. Each order could
be a simple e-mail message, leveraging existing POP3/SMTP
standards. The messages would be queued during the day,
and send after the sales person is connected to the
network. In a similar way, the updates to the product
could be sent as via e-mail (xml-mail anyone?) as well.
THUS, we have moved the join from the client to the
server, but now, we have *increased* the parsing
requirements of the client... also, with a _large_
catelog file (3+MB?), it is unreasonable to think
that a collection of objects in memory would
be the result of the parsing.
THEREFORE, some form of storage/retrieval is necessary
on the client. This can be in a local database,
but that just increases the footprint and processing.
Instead of making a client-side database, and
re-normalizing the information, I suggest that
indexing the XML file may be a better alternative.
A way to do this, is to "fix" the XML file's binary
representaion, and build a physical index detailing
the "exact" location of an element within the file.
Requirement for such an index:
a) It should be embeddable inside XML, and should follow
XML if possible (perhaps it is a notation?)
b) It should allow indexing on arbitrary element attributes.
c) It should be created so that a change in one part of the
file has minimal impact on the rest of the XML file. Thus,
although a change to a child may require a re-adjustment
of information about it's parent, it shouldn't require
re-adjustment of information about each sibling.
d) It should take advantage of the "hierarchy" built
into the XML file, since the thin-client usage will
directly correspond to the "hierachy"
e) It should support typed entities and attributes
"Archetecutres", so that different attribute names
of sub-types can be indexed together.
f) Indexing an element based upon it's child elements
may not be required. If an index like this is needed,
perhaps a re-write adds an attribute with the
computed value and then this is indexed instead.
g) Working with linking is purely optional, and may
not be important to support. <opinion> If you are
using linking with transaction-oriented documents,
you should be using a relational database instead.
I see XML as bringing back the Hierarchical database
to *complement* relational technology, not to
*replace* it.</opinion>
================================================
What I propose is a "fractal" index inter-woven
into the XML data. First, here is the file to
be indexed:
<catalog date="03-FEB-1999" company="Acme Tools" >
<product-category name="Household" type="Domestic">
<individual-product name="Hammer" price="13.95"/>
<individual-product name="Driver, 1/4 inch" price="6.95"/>
<individual-product name="Driver, 1/8 inch" price="7.95"/>
<individual-product name="Allen-Wrench Set" price="11.55"/>
<product-bundle name="Household-Starter" price = "23.99" />
<bundled-product name="Hammer"/>
<bundled-product name="Driver, 1/4 inch"/>
<bundled-product name="Driver, 1/8 inch"/>
...
</product-bundle>
...
</product-category>
<product-category type="Commercial" name="Light-Industry" >
<individual-product name="Hammer" price="13.95"/>
<individual-product name="Versa Screw(tm)" price="66.95"/>
...
</product-category>
...
</catalog>
Here is the "indexed" example, I use line numbers for
the demonstration since it is easier to show in e-mail
form, however, I would see it being done by position instead.
I also use <!-- to comment stuff. -->
0001 <!-- other-information-before-the-catelog -->
...
0009 <catalog date="03-FEB-1999" company="Acme Tools" >
0010 <product-category name="Household" type="Domestic">
0011 <individual-product name="Hammer" price="13.95"/>
0012 <individual-product name="Driver, 1/4 inch" price="6.95"/>
0013 <individual-product name="Driver, 1/8 inch" price="7.95"/>
0014 <individual-product name="Allen-Wrench Set" price="1.55"/>
0015 <product-bundle name="Household-Starter" price = "23.99" />
0016 <bundled-product name="Hammer"/>
0017 <bundled-product name="Driver, 1/4 inch"/>
0018 <bundled-product name="Driver, 1/8 inch"/>
...
0033 </product-bundle>
...
0533 <index <!-- an index for "Household" category
-->
0534 name="Price" <!-- the listing is asending by price
-->
0535 index-start=525 <!-- (535-10), relative begining of index
-->
0536 delimiter="|" <!-- Hmm, possibly for readability
-->
0536 position-width=4 <!-- Length for each position, lpad="0"
-->
0537 length=100 <!-- Length of index
-->
0538 >
0539 <index-column name="name" width=30 align="left" rpad=" ">
0540 <index-element element="individual-product" attribute="price"
/>
0541 <index-element element="product-bundle" attribute="price" />
0542 </index-column>
0543 0004|Allen-Wrench Set | <!-- First item... -->
...
05?? 0005|Household-Starter |
...
05?? 0008|Allen-Wrench Set |
...
0632 </index>
0633 <index
0634 name="Price" <!-- the index is asending by price -->
0635 index-start=625 <!-- (635-10), relative begining of index -->
0636 delimiter="|"
0636 position-width=4
0637 length=100
0638 >
0639 <index-column name="price" width=5 align="right" lpad="0">
0640 <index-element element="individual-product" attribute="price"
/>
0641 <index-element element="product-bundle" attribute="price" />
0642 </index-column>
0643 0433|01.23 <!-- Cheapest item... -->
...
06?? 0002|06.95 <!-- Refers to line 10+2=12 -->
...
06?? 0005|23.99 <!-- Referrs to line 10+5=15 -->
...
0732 </index>
....
???? </product-category>
???? <product-category type="Commercial" name="Light-Industry" >
???? <individual-product name="Hammer" price="13.95"/>
???? <individual-product name="Versa Screw(tm)" price="66.95"/>
...
???? <index name="Price"
....
???? <index name=""
....
???? </product-category>
....
????
I'm getting tired here, but now, you would have
the next level of indexes, here indexing the
product categories, instead of the products
in the product category.
???? </catalog>
-------------------------------------------
Anyway, I wrote this up about 2 months ago
and have implemented something *very* similar
to it successfully in a client/server order
entry system about 6 years ago using flat files
and a proprietary file format. I did this
beacuse the client parsing times were getting
to be large and the frequent updates were
starting to cause problems. Anyway, the
speedup obtained by a multi-attribute index
was hudge and well worth the added complexity.
Since the index is fractal at the element
level, a change in the file means re-computing
the indexes for all parents transitively.
Re-computation of the sibling indexes is avoided!
As an added advantage, carrage returns
were used so that "upgrades" of the
files could be distribued merged with "diff".
Just thinking outloud here....
Clark Evans
xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev@ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ and on CD-ROM/ISBN 981-02-3594-1
To (un)subscribe, mailto:majordomo@ic.ac.uk the following message;
(un)subscribe xml-dev
To subscribe to the digests, mailto:majordomo@ic.ac.uk the following message;
subscribe xml-dev-digest
List coordinator, Henry Rzepa (mailto:rzepa@ic.ac.uk)
|