Re: [xml-dev] MarkMail: now archiving xml-dev
- From: "Edward C. Zimmermann" <edz@bsn.com>
- To: Jason Hunter <jhunter@acm.org>
- Date: Wed, 28 Nov 2007 12:46:20 +0100
Quoting Jason Hunter <jhunter@acm.org>:
> Edward C. Zimmermann wrote:
> > Quoting Elliotte Rusty Harold <elharo@metalab.unc.edu>:
> >
> >> Jason Hunter wrote:
> >>
> >>> What if they start consuming
> >>> disk or thrashing the disk IO? When you query against hundreds of gigs
> >>> of content, you don't have to be malicious to mess things up.
> >
> > It's not 100s of GB. Mailing lists are not that large.
>
> Apache's messages in raw mbox format weigh in just shy of 60 Gigs.
If you say so, although I'm really quite amused that there could
be 60 GB of text in their lists.
> Converting mbox emails to enriched XML involves an expansion.
When I index mail I don't bother. Why parse and tag mail only to parse it
again as XML when one can parse it directly into the "internal" structures
that one is using anyway? Direct parsing also makes a lot of sense given that
mail contains overlapping context structures such as lines and sentences, and
that one wants to see the mail as given, noting the use of physical position
to convey meaning as if it were e.e. cummings.
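As a rough illustration of what "overlapping context structures" means here,
the sketch below records lines and sentences as independent character-offset
spans over the untouched message text. The function names and the naive
sentence rule (split on ., ! or ?) are assumptions made for this example, not
a description of any particular indexer.

```python
import re

def line_spans(text):
    """Return (start, end) offsets of each physical line, newline excluded."""
    spans, pos = [], 0
    for line in text.splitlines(keepends=True):
        spans.append((pos, pos + len(line.rstrip("\r\n"))))
        pos += len(line)
    return spans

def sentence_spans(text):
    """Return (start, end) offsets of each sentence.

    Naive assumption: a sentence runs from a non-space character up to
    and including the next run of '.', '!' or '?'.
    """
    return [(m.start(), m.end())
            for m in re.finditer(r"\S[^.!?]*[.!?]+", text)]

body = "One thread may be\npart of another. Bits split off.\n"
lines = line_spans(body)
sentences = sentence_spans(body)
```

Because both structures are only offsets into the same text, the first
sentence can cross a line boundary without either structure having to nest
inside the other, and the original layout stays available for display.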
There is, of course, the context of one message within the larger context,
but that too is more complex. One thread may be part of another thread,
and bits split off, going partially or completely off-topic before again
becoming part of a topic with some other grand-siblings. Part of IR should be
to distinguish between declared threads (announced via Message-ID and
References headers, or even subject content) and information threads. Even
declared threads overlap.
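For the declared side only, a minimal sketch of recovering parent/child links
from the headers might look like the following. It assumes the last entry in
References is the direct parent and falls back to In-Reply-To (a header not
mentioned above, added here as an assumption); the function names are
hypothetical. It says nothing about information threads, which need content
analysis.

```python
import re
from collections import defaultdict
from email import message_from_string

MSGID = re.compile(r"<[^<>\s]+>")

def declared_parent(msg):
    """Return the Message-ID this message declares as its parent, or None."""
    refs = MSGID.findall(msg.get("References", ""))
    if refs:
        return refs[-1]  # assumed: last reference is the direct parent
    reply = MSGID.findall(msg.get("In-Reply-To", ""))
    return reply[0] if reply else None

def build_declared_threads(raw_messages):
    """Map each parent Message-ID (None for roots) to its child Message-IDs."""
    children = defaultdict(list)
    for raw in raw_messages:
        msg = message_from_string(raw)
        ids = MSGID.findall(msg.get("Message-ID", ""))
        if not ids:
            continue  # skip messages without a usable Message-ID
        children[declared_parent(msg)].append(ids[0])
    return dict(children)

raws = [
    "Message-ID: <a@x>\nSubject: root\n\nbody",
    "Message-ID: <b@x>\nReferences: <a@x>\nSubject: Re: root\n\nbody",
    "Message-ID: <c@x>\nReferences: <a@x> <b@x>\nIn-Reply-To: <b@x>\n\nbody",
]
threads = build_declared_threads(raws)
```

A message that changes topic while keeping its References chain would still
land in the declared tree here, which is exactly where the declared and
information views diverge.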
>
> So, in fact, it's 100+ Gigs of XML content.
Do you index it in one big lump or is it segmented?
>
> -jh-
>
--
E. Zimmermann, BSn/Munich R&D Unit
Leopoldstrasse 53-55, D-80802 Munich,
Federal Republic of Germany
http://www.nonmonotonic.net