To: Bruce.Cox@USPTO.GOV
Subject: Re: [xml-dev] A standard approach to glueing together reusable XML fragments in prose?
From: "Chiusano Joseph" <chiusano_joseph@bah.com>
Date: Fri, 22 Aug 2003 08:56:35 -0400
Cc: xml-dev@lists.xml.org
Organization: Booz Allen Hamilton
References: <86A9A7F2AFF74941A3885A174686E48D0D9158F1@uspto-is-104.uspto.gov>

<Quote>
For us, then, it is unlikely that there ever would be a practical
application for reusable content on anything other than a fairly small
scale.
</Quote>

Yes - and I'll bet there would be a high value in reusable metadata -
e.g. schemas - for patent specifications.

Kind Regards,
Joe Chiusano
Booz | Allen | Hamilton

Bruce.Cox@USPTO.GOV wrote:
> 
> As a rule, there is little or no reusable content in patent specifications.
> Not surprising, since they are *supposed* to be unique.  There is reusable
> content in many of the publications produced by the USPTO that explain how
> to file a patent, etc., but there are only a few dozen such documents as
> opposed to about 6.5 million published patent grants.  (Only about half of
> those are available as text, starting in the 1970's, and only those
> published since 1999-04-13 are available as SGML/XML.  If we convert the
> backfile to our current XML DTD, we expect to need no more than a few
> variations of the DTD to accommodate differences in publishing practice over
> the period 1790 to the present.)
> 
> Next year, we will begin developing means to process patent applications and
> correspondence with applicants in XML.  The current application backlog is
> about 500,000, and with a minimum of, say, four or five messages each, the number
> of transactions is fairly large.  Here, there is reusable content (some few
> hundred "form paragraphs") that examiners pick from cascading menus,
> depending on the nature of the correspondence.  (This is not random letter
> writing, but highly ritualized gesture based on statute, rules, and past
> litigation.)  Once the correspondence is sent, however, it is static, and
> never changes.  The same is true for published grants and published
> applications, that is, they are static.
> 
> As for searching, we use OpenText's BRS Search (which does not support XML
> at present).
> 
> For us, then, it is unlikely that there ever would be a practical
> application for reusable content on anything other than a fairly small
> scale.
> 
> Bruce B. Cox
> SA4XMLT
> USPTO/OCIO/AETS
> 703-306-2606
> 
> -----Original Message-----
> From: dbexcom [mailto:lbradshaw@dbex.com]
> Sent: Tuesday, August 19, 2003 11:47 AM
> To: mitch.amiano@softwareadjuvant.com; xml-dev@lists.xml.org
> Subject: Re: [xml-dev] A standard approach to glueing together reusable XML
> fragments in prose?
> 
> I am concerned to hear this approach, and others here, discussed, without
> comment as to scaling issues regarding very large datastores (in XML
> documents or in relational dbms) that might be ten to several hundred
> terabytes in size.
> 
> Specifically, in the following respects:
> 1- sheer size problems such as disk access time, out of memory conditions,
> and processor time to parse very large XML documents (say, 1,000 documents
> of 1 terabyte each) or a very large number of XML documents of smaller size
> (say, 5,000,000 5MB docs).
> 2- maintenance issues driven by the smallest of interface changes or
> presentation changes, that result in hundreds of thousands if not millions of
> manual static schema modifications, rippling across either a very large
> number of smaller XML documents and their specific schemas or through as
> many as a thousand or so documents of 1 terabyte each in size. Even if such
> ripple effect maintenance can be automated, the processing time required to
> update, say,  5,000,000 XML doc files of 5MB each cannot be said to be real
> time, so perhaps weeks of processing time is required before the interface
> mods can be subject to just one full test.
> 3- consistency across versions, releases, XML standards and tool sets (MS,
> SQL Server, MySQL, Oracle, etc) considering that a very large scale project
> will take some time to mature (possibly years), and that a lack of backward
> compatibility could drive massive changes into the basic XML design
> structure and overall document architecture.
> 4- transmission time across interchanges - whether LAN, web or intranet
> based, the time to transmit and parse result sets to XQuery is often very
> large, and for very large XML documents this processing time is unacceptably
> long. People want results in five to eleven seconds, not minutes, not hours.
> 
> I have specific experience in very large paper based, and relational
> database systems. From time to time, I see folks scale up systems that work
> fine, up to a point, past which they are forced to redesign from scratch.
> 
> While I agree that broadly generalized discussions are the most common form
> of technical exchange of information, having seen several of these pilot
> efforts crash and burn, I feel a moral obligation to suggest that some
> comment be made as to scaling issues, known propagation or ripple effects,
> and sheer size problems that come into play when viable "average"
> architectures are scaled beyond their design parameters.
> 
> In reference to this specific method, I submit that when dealing with a very
> large repository of prose, that a very large number of "profile documents"
> is possible, and that the number of possible "profile documents"
> correlates to some index of the context and the subject matter and the usage
> purposes (inquiry / result pairs), a result that to my mind increases or
> scales up as the number of prose entities scales up. I will go further and
> say that, for instance, for all articles ever published in the scientific
> journal "Nature", or perhaps all items in the U.S. Library of Congress or
> all pending applications and issued patent files in the U.S.
> Patent Office, this number of possible "profile documents" becomes very
> large indeed. Though it may be possible to satisfy as much as a majority of
> inquiries with a small number of such structures, the rest of the inquiries,
> it seems to me, will require an ever increasing number of "profile
> documents" to satisfy so that satisfying the last 1 percent of such
> inquiries might require several thousands of such "profile documents", if
> not tens of thousands or hundreds of thousands.
> 
> So, I am interested to hear about practical applications using XML-only
> implementations (XQuery, XML, XSLT, XPath, etc) that deal with wide ranging
> subject matter, such as is found  in the scientific journal "Nature", or
> perhaps all items in the U.S. Library of Congress or all pending
> applications and issued patent files in the U.S. Patent Office, to a very
> broad audience, across scientific disciplines and cultures (and possibly
> languages), for a very large data repository of mixed content (prose,
> graphics, slides, photos, video, sound, other streaming data sources or
> media) measured in tens or hundreds of terabytes.
> 
> While XML is superb at document mark up, in my experience almost as good as
> TeX, it does not strike me as the best tool for the job when dealing with
> very large scale data repositories. Still, I have an open mind and perhaps
> someone here can enlighten me.
> 
> Thank you.
> 
> At 10:28 PM 8/18/2003 -0400, you wrote:
> >One of the difficulties in considering factoring out functionally
> >dependent entities from prose, is that the block of prose may itself
> >not be worth reusing. That is, the prose may be a one-shot document
> >whose original intent is simply to present information, not to act as a
> >reliable container for access by clients with a variety of intents.
> >One thing I've done is to try to identify those concepts which are best
> >understood, are most firmly established, and which serve as the focus
> >of the stakeholders' activities and communications.  Then design a
> >profile document for each of these high-level concepts, which provide
> >context for making pointers and for generating identifiers. The
> >profiles are designed to provide some elements which are rigidly
> >structured, and other elements which are prose with mixed content. In
> >one case at least, this allowed me (with a stylesheet) to resolve most
> >cross references internal to the document itself, minimizing calls to
> >scan external documents. Also, depending upon the nature of your data
> >and your validation techniques, you may be able to use the mixed
> >content prose as the source of the definitive information, rather than just
> >as glue.
> >It is certainly something a good CMS can help with, but I've also used
> >DSSSL and XSLT/XPath for doing just this sort of thing with reasonable
> >results. You might also want to check out DITA by Michael Priestley et al.
> >of IBM, which I think intends to facilitate topical reuse.
> >
> >Roger L. Costello wrote:
> >
> >>Hi Folks,
> >>I am working with some people who wish to migrate from an all-prose
> >>format to a prose-plus-reusable-XML-fragments format.
> >>They have some data in prose that is useable in many contexts.  They
> >>want to break out that reusable data  into XML fragments.  However,
> >>they want to continue to provide the prose style.
> >>For example, consider this prose data:
> >><para>The city of Miami, Florida (pop. 1,234,000) is a sprawling city
> >>with many attractions.  Miami Beach is a popular attraction.  The
> >>spring tide is ... The neap tide is ... </para>
> >>Examining this prose we can extract reusable info about the city of
> >>Miami:
> >><City id="Miami">
> >>     <state>Florida</state>
> >>     <population>1,234,000</population>
> >></City>
> >>We can also extract reusable info about tide data on Miami Beach:
> >><TideData id="MiamiBeachTides">
> >>     <springTide>...</springTide>
> >>     <neapTide>...</neapTide>
> >></TideData>
> >>The problem now is to create a framework which allows the prose to
> >>bring-together the independent, reusable XML components.
> >>Conceptually, what is desired is a "glue framework" like this:
> >><para>The <ref href="Miami.xml"/> is a sprawling city with many
> >>attractions.  Miami Beach is a popular attraction.  The tides are <ref
> >>href="MiamiBeachTides.xml"/></para>
> >>Thus, the prose is "glueing" together the XML fragments.
> >>Is this a problem that you have experience with?  What "glue
> >>framework" have you used?  What strategy did you use to merge the XML
> >>fragments with the prose?  Is there a standard way of combining
> >>semi-structured data with structured data?
> >>/Roger
> >>
> >>-----------------------------------------------------------------
> >>The xml-dev list is sponsored by XML.org <http://www.xml.org>, an
> >>initiative of OASIS <http://www.oasis-open.org> The list archives are
> >>at http://lists.xml.org/archives/xml-dev/
> >>To subscribe or unsubscribe from this list use the subscription
> >>manager: <http://lists.xml.org/ob/adm.pl>
> 
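
For readers following the thread, the "glue framework" in Roger Costello's original question can be illustrated with a small Python sketch. It is not something proposed by anyone in the thread and is not a standard. It assumes that each <ref> in the prose is an empty element whose href names a fragment, that the fragments are held in an in-memory dictionary standing in for Miami.xml and MiamiBeachTides.xml, and that each fragment type has a hand-written renderer. The names FRAGMENTS, RENDERERS, render_city, render_tides and resolve_refs are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical in-memory store standing in for the files Miami.xml and
# MiamiBeachTides.xml from the original question. The "..." values are
# left elided exactly as they were in the email.
FRAGMENTS = {
    "Miami.xml": (
        '<City id="Miami"><state>Florida</state>'
        "<population>1,234,000</population></City>"
    ),
    "MiamiBeachTides.xml": (
        '<TideData id="MiamiBeachTides"><springTide>...</springTide>'
        "<neapTide>...</neapTide></TideData>"
    ),
}


def render_city(city):
    """Turn a <City> fragment into a prose phrase (assumed wording)."""
    return "city of {}, {} (pop. {})".format(
        city.get("id"), city.findtext("state"), city.findtext("population")
    )


def render_tides(tides):
    """Turn a <TideData> fragment into a prose phrase (assumed wording)."""
    return "spring {}, neap {}".format(
        tides.findtext("springTide"), tides.findtext("neapTide")
    )


# One renderer per fragment root element; an assumption of this sketch.
RENDERERS = {"City": render_city, "TideData": render_tides}


def resolve_refs(para_xml, fragments=FRAGMENTS):
    """Replace every direct <ref href="..."/> child with rendered fragment text."""
    para = ET.fromstring(para_xml)
    parts = [para.text or ""]
    for child in para:
        if child.tag == "ref":
            fragment = ET.fromstring(fragments[child.get("href")])
            parts.append(RENDERERS[fragment.tag](fragment))
        else:
            # Any other inline markup is flattened to its text content.
            parts.append("".join(child.itertext()))
        # Text that follows the child element inside the paragraph.
        parts.append(child.tail or "")
    return "".join(parts)


GLUE = (
    '<para>The <ref href="Miami.xml"/> is a sprawling city with many '
    "attractions.  Miami Beach is a popular attraction.  The tides are "
    '<ref href="MiamiBeachTides.xml"/>.</para>'
)

if __name__ == "__main__":
    print(resolve_refs(GLUE))
```

The sketch keeps the prose as the controlling document and treats the fragments as lookups, which is one of several possible designs. An XSLT stylesheet using document(), as Mitch Amiano's reply suggests, would play the same role as resolve_refs here.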
