Re: [xml-dev] MarkMail: now archiving xml-dev


From: Jason Hunter <jhunter@acm.org>
To: "Edward C. Zimmermann" <edz@bsn.com>
Date: Wed, 28 Nov 2007 11:35:07 -0800

Edward C. Zimmermann wrote:
> Quoting Jason Hunter <jhunter@acm.org>:
> 
>> Edward C. Zimmermann wrote:
>>> Quoting Elliotte Rusty Harold <elharo@metalab.unc.edu>:
>>>
>>>> Jason Hunter wrote:
>>>>
>>>>> What if they start consuming 
>>>>> disk or thrashing the disk IO?  When you query against hundreds of gigs 
>>>>> of content, you don't have to be malicious to mess things up.
>>> Its not 100s of GB. Mailing lists are not that large.
>> Apache's messages in raw mbox format weigh in just shy of 60 Gigs. 
> 
> If you say so--- although I'm really quite amused that the there could
> be 60 GB of text in their lists..

If you divide 60 Gigs by 4,000,000 emails that's 15k per email.  That's 
bigger than I would have guessed an average email to be, but you have to 
take into account the full headers and the influence of the (relatively 
few) binary attachments.
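
A quick sanity check of that division, as a minimal Python sketch. It assumes "Gigs" means decimal gigabytes (10^9 bytes) and treats the 4,000,000 message count as exact; both are simplifications for illustration.

```python
# Back-of-the-envelope check of the figures quoted above.
total_bytes = 60 * 10**9      # "just shy of 60 Gigs", taken as decimal gigabytes
email_count = 4_000_000       # message count used in the division above

avg_bytes = total_bytes / email_count
avg_kb = avg_bytes / 1000     # decimal kilobytes

print(f"{avg_kb:.0f} KB per email")
```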

>> Converting mbox emails to enriched XML involves an expansion.
> 
> When I index mail I don't bother.

Well, we probably have different goals and infrastructure technologies.
I want to have access to the hierarchical internal structure of each
email body, and to help me accomplish that I have a tool that thinks in
XML, so it's a natural representation.

Of course, with MarkLogic you don't store XML files on disk, any more
than Oracle stores CSV files on disk.  XML is just the data model used
to represent the content.

> Why parse and tag mail to then parse it as
> XML when one can parse it directly (which makes also a lot of sense given the
> observation that mail contains overlapping context structures such as lines
> and sentences) into the "internal" structures that one is using anyway
> (especially given that one wants to see the mail as given, noting the
> use of physical position to convey meaning as-if ee.cummings)? 

If I only fetched mail by id, then I could parse it on the fly for
rendering.  But if I'm to use the structure in the query, it needs to
exist in the database in its enriched format.
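
To illustrate why the structure has to exist before the query runs, here is a minimal sketch using only the Python standard library. The element names (`message`, `body`, `quote`, `para`), the `enrich` helper, and the rule that a leading ">" marks quoted text are all assumptions made for this example; it is not MarkMail's or MarkLogic's actual schema or API.

```python
import xml.etree.ElementTree as ET
from email import message_from_string

def enrich(raw: str) -> ET.Element:
    """Turn one raw message into a small XML tree (hypothetical element names)."""
    msg = message_from_string(raw)
    root = ET.Element("message")
    ET.SubElement(root, "from").text = msg["From"]
    ET.SubElement(root, "subject").text = msg["Subject"]
    body = ET.SubElement(root, "body")
    for line in msg.get_payload().splitlines():
        if not line.strip():
            continue
        # Lines starting with ">" are treated as quoted text, the rest as new text.
        tag = "quote" if line.startswith(">") else "para"
        ET.SubElement(body, tag).text = line.lstrip("> ").strip()
    return root

raw = """From: a@example.org
Subject: index size

> Mailing lists are not that large.
The raw mbox is close to 60 Gigs.
"""

store = [enrich(raw)]  # stand-in for a database of pre-enriched messages

# Structural query: messages where the word "large" occurs in quoted text only.
hits = [m for m in store
        if any("large" in (q.text or "") for q in m.iter("quote"))]
```

The query distinguishes quoted text from new text without re-parsing any message, which is only possible because the enrichment happened at load time.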

>> So, in fact, it's 100+ Gigs of XML content.
> 
> Do you index it in one big lump or is it segmented?

It operates in many ways like a database.  Every new email that arrives
is incorporated into the index immediately.  The indexes are able to do
that while also keeping performance up, using an index merging model.
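
One common way to get that behavior is to write each new document as its own small index segment and merge segments in the background. The toy sketch below shows the general idea only; the `MergingIndex` class, its `max_segments` threshold, and the merge-everything policy are assumptions for illustration, not a description of MarkLogic's internals.

```python
from collections import defaultdict

class MergingIndex:
    """Toy inverted index made of small segments that are merged periodically."""

    def __init__(self, max_segments: int = 4):
        self.segments = []          # each segment maps term -> set of doc ids
        self.max_segments = max_segments

    def add(self, doc_id: int, text: str) -> None:
        # A new document becomes its own small segment, so it is searchable at once.
        segment = defaultdict(set)
        for term in text.lower().split():
            segment[term].add(doc_id)
        self.segments.append(segment)
        if len(self.segments) > self.max_segments:
            self._merge()

    def _merge(self) -> None:
        # Fold all segments into one so queries have fewer places to look.
        merged = defaultdict(set)
        for segment in self.segments:
            for term, ids in segment.items():
                merged[term] |= ids
        self.segments = [merged]

    def search(self, term: str) -> set:
        # A query unions the postings from every live segment.
        result = set()
        for segment in self.segments:
            result |= segment.get(term.lower(), set())
        return result

index = MergingIndex(max_segments=2)
index.add(1, "MarkMail now archiving xml-dev")
index.add(2, "index merging keeps performance up")
index.add(3, "archiving continues")   # third segment exceeds the limit, so a merge runs
```

Adding a document is cheap because it never rewrites the existing index, and merging keeps the number of segments a query must consult small.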

-jh-

