OASIS Mailing List Archives

Re: [xml-dev] Granularity

On 06/01/12 15:46, Cox, Bruce wrote:
> When developing our Reference Document Management Service, we asked the editors.

The problem we faced when developing the original search interface for 
the CELT documents (celt.ucc.ie) was identifying whom to satisfy. Now 
(20 years on), we know more about the user population and their 
requirements, but it was assumed at the start that returning adequate 
context would be essential. The original interface is no longer 
available (I have it on a SunOS 4.1.3 system disk that won't boot, and 
somewhere on a chain of 50 QIC tapes, so I will retrieve it one day :-)

The documents are transcriptions of early manuscripts, varying from 
continuous narrative (very long "paragraphs") to annals where an entry 
may be a TEI <p> element containing two words. Because we were using an 
SGML search tool (PAT), and the documents are well-marked with numbering 
systems and milestones, retrieving a fully-formed reference for each hit 
was, if not trivial, at least straightforward, so we could peg each hit 
as occurring in entry x at date y in para z and upwards through the 
chain of folios, pages, sections, etc.

But that still left us with the problem of what context and how much 
context to display. A large amount of the text was very heavily marked 
with critical and analytical apparatus, with character data occurring 
(in extreme cases) up to 11 levels deep -- more if it was in a document 
embedded inside another, such as a letter quoted in its entirety. We 
used a crude dividing line between mixed content and element content: 
regardless of how deep the hit occurred, identify the closest ancestor 
which occurred in element content; if there was at least one sibling of 
the same type which contained character data (no matter how deep), then 
go no further; otherwise take the parent and try again.
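As a rough illustration only, the ancestor-selection rule described above might be sketched like this in Python (the element names, the parent-map trick, and the exact mixed-/element-content test are my assumptions for the sketch, not the actual PAT-era code):

```python
import xml.etree.ElementTree as ET

def in_element_content(elem, parent_map):
    """True if elem's parent holds no non-whitespace character data
    directly, i.e. the parent has element content, not mixed content."""
    parent = parent_map.get(elem)
    if parent is None:
        return True  # treat the root as element content
    if parent.text and parent.text.strip():
        return False
    return all(not (c.tail and c.tail.strip()) for c in parent)

def has_char_data(elem):
    """True if elem contains any non-whitespace text at any depth."""
    return any(t.strip() for t in elem.itertext())

def context_ancestor(hit, parent_map):
    """Climb from the element containing the hit to the closest
    ancestor in element content; stop there if a same-type sibling
    also carries character data, otherwise take the parent and retry."""
    node = hit
    while True:
        # closest ancestor (or node itself) occurring in element content
        while not in_element_content(node, parent_map):
            node = parent_map[node]
        parent = parent_map.get(node)
        if parent is None:
            return node
        siblings = [s for s in parent if s is not node and s.tag == node.tag]
        if any(has_char_data(s) for s in siblings):
            return node
        node = parent

# tiny demo: a hit inside mixed content inside a short annal entry
root = ET.fromstring(
    "<text><div><p>annal entry one</p>"
    "<p>annal entry <hi>two</hi></p></div></text>")
parent_map = {c: p for p in root.iter() for c in p}
hit = root.find(".//hi")
ctx = context_ancestor(hit, parent_map)
```

Here the `<hi>` hit sits in mixed content, so the walk climbs to its `<p>`, which is in element content and has a same-type sibling with character data, so the `<p>` is the context unit.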

For display, the target content was stripped of markup and the first hit 
within it measured for its distance in characters from the start of its 
element-content ancestor container and the distance either side to the 
nearest sentence boundary (if such a thing was discernible). Ellipses 
were used to truncate fore and aft if necessary, so that no context more 
than (I think) 50 words would appear -- but in measuring this, we *did* 
trespass across parent boundaries when the hit was very close to the 
start or end of its element-content ancestor container, because the 
preceding or following element was regarded as important for the context.
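That measuring step, minus the trespass across parent boundaries, might look roughly like the following sketch (the sentence-boundary regex and the word-cap centring are my own guesses at the mechanics, not the original implementation):

```python
import re

def make_snippet(text, hit_start, hit_end, max_words=50):
    """Return a short context window around a hit in markup-stripped
    text: expand to the nearest sentence boundary on each side, then
    truncate fore and aft with ellipses if over max_words."""
    # nearest sentence boundary before the hit (else start of text)
    last = 0
    for m in re.finditer(r'[.!?]\s+', text[:hit_start]):
        last = m.end()
    # nearest sentence boundary after the hit (else end of text)
    m = re.search(r'[.!?](\s|$)', text[hit_end:])
    end = hit_end + m.end() if m else len(text)
    snippet = text[last:end].strip()
    words = snippet.split()
    if len(words) <= max_words:
        return snippet
    # over the cap: centre the window on the hit and add ellipses
    n_before = len(text[last:hit_start].split())
    lo = max(0, n_before - max_words // 2)
    hi = min(len(words), lo + max_words)
    out = " ".join(words[lo:hi])
    if lo > 0:
        out = "... " + out
    if hi < len(words):
        out = out + " ..."
    return out

# demo: the hit's sentence is short, so it comes back whole
text = ("First sentence here. The word granularity appears in "
        "this sentence. Final sentence.")
i = text.index("granularity")
snip = make_snippet(text, i, i + len("granularity"))
```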

Extra conditions were applied when the hit was in an embedded document 
as mentioned above, so that it could be seen to be such; and for the 
occasional very small single-paragraph document (usually manuscript 

This seemed to work, and allowed scholars (the primary audience) to find 
the words they were looking for and easily discard those hits which 
weren't relevant for their purpose. It was also pretty slow, being coded 
in early CGI script form. PAT provided sub-second retrieval, but the 
subsequent poking around really chewed up the time.

It failed miserably when it became clear that a large number of accesses 
were coming from Irish Americans (and others) searching for their family 
names, not realising that they would not occur in a recognisable form in 
8th century Latin or Irish (and sometimes turning up words whose 
spelling was liable to be misconstrued if taken out of context :-)

It was abandoned when we realised that the actual goal of the scholars 
was to identify the documents they wanted, and then download them for 
local use or just read them in their entirety in their browser. A lot of 
users did like seeing exactly where a hit occurred: it gave them 
confidence that the system was doing something meaningful and sensible; 
but the net result was always finding the right documents and reading or 
downloading them. We could have saved ourselves a lot of time by using 
grep on the stripped text :-) but, hey, it was a learning curve.

Moral: identify the use cases first :-)




Copyright 1993-2007 XML.org. This site is hosted by OASIS