Re: [xml-dev] Granularity
- From: Peter Flynn <peter@silmaril.ie>
- To: xml-dev@lists.xml.org
- Date: Wed, 11 Jan 2012 22:32:14 +0000
On 06/01/12 15:46, Cox, Bruce wrote:
> When developing our Reference Document Management Service, we asked the editors.
The problem we faced when developing the original search interface for
the CELT documents (celt.ucc.ie) was identifying whom to satisfy. Now
(20 years on), we know more about the user population and their
requirements, but it was assumed at the start that returning adequate
context would be essential. The original interface is no longer
available (I have it on a SunOS 4.1.3 system disk that won't boot, and
somewhere on a chain of 50 QIC tapes, so I will retrieve it one day :-)
The documents are transcriptions of early manuscripts, varying from
continuous narrative (very long "paragraphs") to annals where an entry
may be a TEI <p> element containing two words. Because we were using an
SGML search tool (PAT), and the documents are well-marked with numbering
systems and milestones, retrieving a fully-formed reference for each hit
was, if not trivial, at least straightforward, so we could peg each hit
as occurring in entry x at date y in para z and upwards through the
chain of folios, pages, sections, etc.
But that still left us with the problem of what context and how much
context to display. A large amount of the text was very heavily marked
with critical and analytical apparatus, with character data occurring
(in extreme cases) up to 11 levels deep -- more if it was in a document
embedded inside another, such as a letter quoted in its entirety. We
used a crude dividing line between mixed content and element content:
regardless of how deep the hit occurred, identify the closest ancestor
that occurred in element content; if it had at least one sibling of the
same type containing character data (no matter how deep), go no further
and use that ancestor as the context container; otherwise take the
parent and try again.
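In modern terms the rule looks something like the following minimal
Python sketch (using lxml for its getparent(); the original was a PAT
query driven by a CGI script, so every name here is mine and this is a
reconstruction from memory, not the original code):

    from lxml import etree

    def in_element_content(el):
        # An element sits in element content if its parent carries no
        # character data of its own (no text, no tails between
        # children).
        parent = el.getparent()
        if parent is None:
            return True                     # the document element
        chunks = [parent.text] + [c.tail for c in parent]
        return not any(s and s.strip() for s in chunks)

    def has_char_data(el):
        # True if character data occurs anywhere beneath el.
        return bool("".join(el.itertext()).strip())

    def context_ancestor(hit_el):
        # Climb from the element containing the hit: stop at the
        # closest ancestor in element content with at least one
        # same-type sibling containing character data; otherwise
        # take the parent and try again.
        el = hit_el
        while el.getparent() is not None:
            if in_element_content(el):
                sibs = [s for s in el.getparent()
                        if s.tag == el.tag and s is not el]
                if any(has_char_data(s) for s in sibs):
                    return el
            el = el.getparent()
        return el                           # hit the top: whole document

    # Hypothetical usage:
    #   el = etree.parse("annal.xml").find(".//p")
    #   container = context_ancestor(el)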
For display, the target content was stripped of markup and the first hit
within it measured for its distance in characters from the start of its
element-content ancestor container and the distance either side to the
nearest sentence boundary (if such a thing was discernible). Ellipses
were used to truncate fore and aft if necessary, so that no more than
(I think) 50 words of context would appear -- but in measuring this, we *did*
trespass across parent boundaries when the hit was very close to the
start or end of its element-content ancestor container, because the
preceding or following element was regarded as important for the context.
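The truncation step, again from memory, amounted to something like
this (plain Python over the markup-stripped text; the 50-word limit
and the sentence-boundary regex are my reconstruction, and the
deliberate trespass across parent boundaries is left out for brevity):

    import re

    def make_snippet(text, hit_start, hit_end, max_words=50):
        # Expand the hit to the nearest sentence boundary either side,
        # where one is discernible; otherwise use the container edges.
        starts = [m.end()
                  for m in re.finditer(r'[.!?]\s+', text[:hit_start])]
        start = starts[-1] if starts else 0
        m = re.search(r'[.!?](?=\s|$)', text[hit_end:])
        end = hit_end + m.end() if m else len(text)

        words = text[start:end].split()
        if len(words) <= max_words:
            return " ".join(words)

        # Too long: keep a max_words window roughly centred on the
        # hit, truncating fore and aft with ellipses.
        hit_word = len(text[start:hit_start].split())
        lo = max(0, hit_word - max_words // 2)
        hi = min(len(words), lo + max_words)
        return (("... " if lo else "")
                + " ".join(words[lo:hi])
                + (" ..." if hi < len(words) else ""))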
Extra conditions were applied when the hit was in an embedded document
(as mentioned above), so that it could be seen to be one; and for the
occasional very small single-paragraph document (usually a manuscript
fragment).
This seemed to work, and allowed scholars (the primary audience) to find
the words they were looking for and easily discard those hits which
weren't relevant for their purpose. It was also pretty slow, being coded
in early CGI script form. PAT provided sub-second retrieval, but the
subsequent poking around really chewed up the time.
It failed miserably when it became clear that a large number of accesses
were coming from Irish Americans (and others) searching for their family
names, not realising that they would not occur in a recognisable form in
8th century Latin or Irish (and sometimes turning up words whose
spelling was liable to be misconstrued if taken out of context :-)
It was abandoned when we realised that the actual goal of the scholars
was to identify the documents they wanted, and then download them for
local use or just read them in their entirety in their browser. A lot of
users did like seeing exactly where a hit occurred: it gave them
confidence that the system was doing something meaningful and sensible;
but the net result was always finding the right documents and reading or
downloading them. We could have saved ourselves a lot of time by using
grep on the stripped text :-) but, hey, it was a learning curve.
Moral: identify the use cases first :-)
///Peter