Re: Abbreviated Tag Names
- From: Charles Reitzel <firstname.lastname@example.org>
- To: email@example.com
- Date: Tue, 23 Jan 2001 23:23:55 -0500 (EST)
Synopsis: 1) tag compression - there's usually a better way
2) tag "mapping" is useful generally
I have a Perl application that searches a text file of tags and values. It
is extremely performance-sensitive, so no XML. Perl regexes work well on some
perhaps pathological queries that were killing Sybase. Anyway, due to data
growth we tune this thing on a regular basis. I found that tags were about
25% of the file and delimiters (tabs and newlines) about 5%. XML would
cost about twice as much on the same data.
All previous points about I/O and text conversion *not* being the
bottlenecks are true. In this case, the bottleneck was Perl regex processing.
Anyway, I was considering tag compression (fixed-length tag names: a two-
digit base 36 number w/ no delimiter) which would be a) much smaller and b)
easier for Perl regexes to chew on. Wary of getting too weird, however, I
opted against this approach and put the whole thing into dbm files w/ some
text indexing techniques, and we picked up an order-of-magnitude speed
improvement (often better). Even so, careful attention to I/O coding proved
worthwhile. The moral: use the right algorithm.
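For the curious, the tag-compression idea above can be sketched as follows. This is a hypothetical illustration (the original was Perl; names here are made up), showing why fixed-width two-digit base-36 codes need no delimiter:

```python
# Hypothetical sketch of fixed-length tag compression: each tag name maps
# to a two-digit base-36 code (00..ZZ, 1296 possible tags), so a record
# needs no delimiter between tag and value -- a parser can simply slice
# the first two characters.
import string

DIGITS = string.digits + string.ascii_uppercase  # base-36 alphabet, 0-9 A-Z

def encode_tag(index):
    """Encode a tag index (0..1295) as a fixed two-digit base-36 code."""
    if not 0 <= index < 36 * 36:
        raise ValueError("tag index out of range")
    return DIGITS[index // 36] + DIGITS[index % 36]

def decode_tag(code):
    """Decode a two-digit base-36 code back to the tag index."""
    return DIGITS.index(code[0]) * 36 + DIGITS.index(code[1])

# Example: tag #37 encodes to "11"; with fixed width, record[:2] is the
# tag and record[2:] is the value, no delimiter required.
```

The fixed width is the whole point: a regex (or plain slicing) never has to scan for a delimiter.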
All that said, we regularly map our meta-data to alternate "tag" names
(occasionally in XML) and have devised standard techniques for doing so, not
altogether different from what Simon has suggested, i.e. a lightweight,
locally defined interface for mapping tags and values coming in or going out.
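A minimal sketch of that kind of lightweight mapping interface (the tag names and function names here are hypothetical, purely for illustration):

```python
# A locally defined tag-mapping interface: one lookup table, applied in
# one direction on the way out and inverted on the way in.
TAG_MAP = {"customer-name": "CN", "order-date": "OD"}       # long -> short
INVERSE = {short: long for long, short in TAG_MAP.items()}  # short -> long

def map_out(record):
    """Rename tags to their short external names on the way out."""
    return {TAG_MAP.get(tag, tag): value for tag, value in record.items()}

def map_in(record):
    """Restore the long local names on the way in."""
    return {INVERSE.get(tag, tag): value for tag, value in record.items()}
```

Unknown tags pass through untouched, so the table only has to cover the names that actually differ.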
>At 04:38 PM 1/21/01 -0800, Don Park wrote:
>So, I have been thinking about abbreviated tag names
>and wanted your thoughts on the subject. There are many
>aspects to this issue:
>1) should schemas be expanded or an alternate version
> be used?
Varies. Along w/ different names, sometimes you have format and actual
processing that must take place to complete the conversion. In which case,
use an alternate schema.
>2) should a new namespace be defined or old namespace
> be reused?
The NS goes w/ the schema 99/100 of the time.
>3) what role does RDDL play?
Could be useful. Since the application "owns" it, you can embed data into
the arcrole attribute to identify the target format and the necessary
transformation inputs, etc. These will be app-specific and could include
things like references to XSLT style sheets to transform instance documents
of a given root element type to a desired schema. You might have multiple
root element types and target XSchema schemas and, thus, need to distinguish
by arcrole. But, if it isn't actually a URL, be sure to use the "urn:" URI
scheme. The data embedding is OK for an in-house type thing. Probably not
a good thing for any standard tool. I.e. the RDDL standard is used to
implement an in-house tool.
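Something like the following hedged sketch of a RDDL directory entry, where the arcrole carries the app-specific target-format identifier (the URN, href, and role values here are invented for illustration):

```xml
<!-- Hypothetical RDDL entry: an in-house tool reads the arcrole to pick
     the right transform for a given target schema. -->
<rddl:resource xlink:type="simple"
               xlink:href="to-short-tags.xsl"
               xlink:role="http://www.w3.org/1999/XSL/Transform"
               xlink:arcrole="urn:example:map:short-tag-schema">
  <p>XSLT transform mapping instance documents to the short-tag schema.</p>
</rddl:resource>
```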
>4) should there be a dynamic abbreviation mechanism?
> [no, imho]
When considering tag compression, I thought this one over several times. I
actually ended up at "auto or not at all". Conceivably, both the schema and
the instance documents could be mapped upon transmission (in or out) using a
common dictionary. Thus, there is no maintenance. Otherwise, entropy will
kick in. Put the other way, the maintenance cost is enough to rule out a
manual mechanism.
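The "auto or not at all" idea can be sketched like this (a hypothetical illustration; the function and tag names are made up): derive the dictionary mechanically from the schema's tag list, so sender and receiver regenerate the same dictionary and nothing is maintained by hand.

```python
# Derive the abbreviation dictionary from the schema itself, in a fixed,
# reproducible order. Schema and instance documents then share one
# dictionary that never needs manual upkeep.
def build_dictionary(schema_tags):
    """Assign each schema tag a short code, deterministically."""
    return {tag: "t%d" % i for i, tag in enumerate(sorted(schema_tags))}

tags = ["order-date", "customer-name", "amount"]
abbrev = build_dictionary(tags)
# Both ends rebuild the identical dictionary from the schema, so instance
# documents can be mapped on transmission in either direction.
```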
>5) how should abbreviated version of existing standards
> be created?
If the standard defines a compressed format, then no problem (entropy be
danged). Otherwise, IMO, the compressed messages are for private use.
E.g. doesn't XML/EDI define short and long tags for message segments?
>6) should there be standard rules for abbreviating
> tag names?
Stax looks pretty slick. It is lossless and avoids the need for a lookup
dictionary for tag numbers or suchlike.
For mapping tags generally, see #1. Often a simple lookup table will do it.
When it won't, you'll need to write some code. E.g. date format conversion.
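To illustrate the point above (a hedged sketch; the tag names and formats are invented): a plain lookup table handles renames, but something like a date-format change needs actual code.

```python
# A rename is just a table lookup; a value conversion like a date-format
# change needs a function. Tags not in CONVERTERS pass through unchanged.
from datetime import datetime

def convert_date(value):
    """Convert MM/DD/YYYY to ISO 8601 YYYY-MM-DD."""
    return datetime.strptime(value, "%m/%d/%Y").strftime("%Y-%m-%d")

CONVERTERS = {"order-date": convert_date}  # tags needing more than a rename

def map_value(tag, value):
    return CONVERTERS.get(tag, lambda v: v)(value)
```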
my $0.02 worth,
On Mon, 22 Jan 2001, Simon St.Laurent wrote:
>RDDL's gotten me thinking again about dictionary resources,
>and I'm doing a presentation on transformations next week.
>It seems like there are a substantial number of cases where
>1-1 equivalence actually happens in the world - abbreviation
>and translation being the two largest. I'm pondering
>(haven't yet built) a thesaurus processor, which
>lets you feed in a set of rules and specify which set
>applies, and then run it over documents.
>It does less than XSLT and carries less freight than XML
>Schema equivalence classes, which seems like a good thing
>to me. I suspect it won't be that hard to implement as a
>SAX filter, XSLT transform, or DOM processor, though
>I'm still getting started.
>Dictionary files add more weight, of course, but there
>might be ways to get around that for a lot of projects.
>I wasn't planning on mentioning it until I had something
>to show, but since