xml-dev - Re: [xml-dev] String interning (Was: [xml-dev] Binary XML == "spawn of t

To: "Hunsberger Peter" <Peter.Hunsberger@stjude.org>
Subject: Re: [xml-dev] String interning (Was: [xml-dev] Binary XML == "spawn of the devil" ?)
From: "Chiusano Joseph" <chiusano_joseph@bah.com>
Date: Mon, 04 Aug 2003 14:43:51 -0400
Cc: xml-dev@lists.xml.org
Organization: Booz Allen Hamilton
References: <1E0CC447E59C974CA5C7160D2A2854EC0983B3@SJMEMXMB04.stjude.sjcrh.local>

<Quote>
Essentially we have a large project that reuses (and reuses) various XML
fragments from many different sources in many different combinations
many times (controlled by some small number of parameters).  Think of
this as a cache of parsed XML that can subsequently be consumed by XSLT
transforms.  On occasion some portions of this cache will be invalidated
and replaced.
</Quote>

This sounds very much like a combination of two emerging standards,
the first being UN/CEFACT Core Components [1]. If you're interested in
learning more, please also see a recent presentation I gave to a
federal working group on "Core Components and ebXML Registry" [2], which
also discusses the incorporation of Core Components into the ebXML
Registry architecture (a process which I am heading up; see slides
43/44).

The second standard is the OASIS Content Assembly Mechanism (CAM)
specification [3], described on slide 41 of the presentation.

Kind Regards,
Joe Chiusano
Booz | Allen | Hamilton

[1] http://xml.coverpages.org/CCTS-V1pt85-20020930.pdf
[2] http://xml.gov/presentations/bah/ebXMLcore.ppt
[3] http://www.oasis-open.org/committees/cam


"Hunsberger, Peter" wrote:
> 
> On Thursday, July 31, 2003 8:58 PM Tyler Close <tyler@waterken.com>
> wrote:
> 
> >
> > On Thursday 31 July 2003 19:36, Mike Champion wrote:
> > > On Thu, 31 Jul 2003 17:46:32 -0400, Tyler Close <tyler@waterken.com>
> > > wrote:
> > > > For an example of a binary format that supports efficient string
> > > > interning, without a penalty to generality, see:
> > > >
> > > > http://www.waterken.com/dev/Doc/code/
> > >
> > > Very interesting point/idea.
> >
> > Thanks.
> >
> > > AFAIK much of the overhead of XML text
> > > parsing that the binary infoset advocates complain about is in the
> > > Unicode encoding/decoding and raw string processing (e.g., looking
> > > at every character to see where an element ends rather than having
> > > a stored length).
> >
> > The Waterken(TM) Doc code format uses a chunked
> > representation for encoding a string.  This provides the
> > speed benefits of a length prefix without creating an
> > unlimited buffering requirement.
> >
> > > Likewise, a number of alternative infoset serializations use the
> > > "stream of SAX events" metaphor, which sounds a bit like what that
> > > document describes.
> >
> > Same basic idea.
> >
> > > But that doesn't sound like "string interning" to me (and
> > > "interning" is not mentioned in that document).
> >
> > Notice that all the metadata (i.e., the string identifiers)
> > are stored in a set of string registers. Subsequent uses of a
> > string specify the index of the string to use. This results
> > in each string identifier being instantiated just once. The
> > singleton instance is the interned instance.
> >
> > > I thought "interning" was more of a
> > > technique for keeping compiled code small by referencing redundant
> > > strings via their hash values.
> >
> > It's more to do with fast lookup than memory savings. The
> > hash only gets computed once and equality checks are just
> > pointer comparisons. Same thinking is at work in the Doc code format.
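
The register idea described in the quoted text above can be sketched
roughly as follows. This is only an illustration of the general
technique, not the actual Waterken Doc code format: the DEF/REF event
names and the encode/decode functions are hypothetical, chosen for this
example. The first use of a string defines a new register; later uses
carry only the register index, so the decoder ends up with a single
instance of each string.

```python
def encode(names):
    """Replace repeated strings with references to a register index.

    Hypothetical event shapes, assumed for this sketch:
      ("DEF", text)  -> first use; the decoder stores text in the next register
      ("REF", index) -> later use; the decoder reuses the stored instance
    """
    registers = {}  # string -> register index
    events = []
    for name in names:
        if name in registers:
            events.append(("REF", registers[name]))
        else:
            registers[name] = len(registers)
            events.append(("DEF", name))
    return events


def decode(events):
    """Rebuild the sequence of names; each distinct string exists once."""
    table = []  # register index -> the single (interned) instance
    names = []
    for op, payload in events:
        if op == "DEF":
            table.append(payload)
            names.append(payload)
        else:
            names.append(table[payload])
    return names


names = ["item", "price", "item", "price", "item"]
events = encode(names)
decoded = decode(events)
```

Because every later use resolves to the same stored object, comparing
two decoded names for equality can be an identity check (`is` in
Python) rather than a character-by-character comparison.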
> 
> <snip/>
> 
> One of my vague long term projects is to look at ways of building and
> utilizing a sort of PSVI database.  (Binary XML that never leaves the
> building...)  Essentially we have a large project that reuses (and
> reuses) various XML fragments from many different sources in many
> different combinations many times (controlled by some small number of
> parameters).  Think of this as a cache of parsed XML that can
> subsequently be consumed by XSLT transforms.  On occasion some portions
> of this cache will be invalidated and replaced.
> 
> So the question becomes: do you think any of this work could form a
> basis for such a database?  Would it be efficient to parse XML to this
> format, then feed (multiple chained) XSLT transforms from this format?
> 
> I'd spend some time examining the code, but we're in the middle of a
> release and more than swamped at the moment... (For the Cocoon-dev
> lurkers on this list, yes, this is related to the discussion on long
> term caching models.)
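
A minimal sketch of the kind of cache described in the quoted message,
assuming the fragments are parsed once into in-memory trees and handed
out again on later requests. The FragmentCache class, its loader
callback, and the sample "header" fragment are all hypothetical names
made up for this illustration; it uses Python's standard
xml.etree.ElementTree parser and says nothing about how Cocoon or any
PSVI store actually works.

```python
import xml.etree.ElementTree as ET


class FragmentCache:
    """Hypothetical cache of parsed XML fragments, keyed by a fragment name."""

    def __init__(self, loader):
        self._loader = loader  # callable: key -> XML text (assumed source)
        self._trees = {}       # key -> parsed Element
        self.parse_count = 0   # how many times a fragment was actually parsed

    def get(self, key):
        """Return the parsed fragment, parsing it only on a cache miss."""
        tree = self._trees.get(key)
        if tree is None:
            tree = ET.fromstring(self._loader(key))
            self._trees[key] = tree
            self.parse_count += 1
        return tree

    def invalidate(self, key):
        """Drop one fragment so the next get() re-parses it."""
        self._trees.pop(key, None)


sources = {"header": "<h><title>Report</title></h>"}
cache = FragmentCache(lambda key: sources[key])

first = cache.get("header")    # parsed on first request
second = cache.get("header")   # served from the cache, no re-parse

sources["header"] = "<h><title>Revised</title></h>"
cache.invalidate("header")     # the changed portion is replaced on next use
third = cache.get("header")
```

Whether the cached form should be a parsed tree, a stream of events, or
a binary encoding is exactly the open question in the thread; the
sketch only shows the reuse and invalidation pattern.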
> 

