Re: [xml-dev] MarkMail: now archiving xml-dev
- From: Jason Hunter <jhunter@acm.org>
- To: "Edward C. Zimmermann" <edz@bsn.com>
- Date: Wed, 28 Nov 2007 11:35:07 -0800
Edward C. Zimmermann wrote:
> Quoting Jason Hunter <jhunter@acm.org>:
>
>> Edward C. Zimmermann wrote:
>>> Quoting Elliotte Rusty Harold <elharo@metalab.unc.edu>:
>>>
>>>> Jason Hunter wrote:
>>>>
>>>>> What if they start consuming
>>>>> disk or thrashing the disk IO? When you query against hundreds of gigs
>>>>> of content, you don't have to be malicious to mess things up.
>>> Its not 100s of GB. Mailing lists are not that large.
>> Apache's messages in raw mbox format weigh in just shy of 60 Gigs.
>
> If you say so--- although I'm really quite amused that the there could
> be 60 GB of text in their lists..
If you divide 60 Gigs by 4,000,000 emails that's 15k per email. That's
bigger than I would have guessed an average email to be, but you have to
take into account the full headers and the influence of the (relatively
few) binary attachments.
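
As a quick sanity check on that division, here is a minimal sketch. It assumes "60 Gigs" means decimal gigabytes and uses the round 4,000,000 message count from above; the variable names are only illustrative.

```python
# Back-of-the-envelope check using the figures quoted in this message.
total_bytes = 60 * 10**9      # "just shy of 60 Gigs", assumed to be decimal GB
email_count = 4_000_000       # approximate number of archived emails

avg_bytes = total_bytes / email_count
avg_kb = avg_bytes / 1000     # decimal kilobytes, to match the GB assumption

print(f"{avg_kb:.0f} KB per email")
```

With binary units (GiB and KiB) the result shifts slightly, but it stays in the same range.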
>> Converting mbox emails to enriched XML involves an expansion.
>
> When I index mail I don't bother.
Well, we probably have different goals and infrastructure technologies.
I want to have access to the hierarchical internal structure of each
email body, and to help me accomplish that I have a tool that thinks in
XML so it's a natural representation.
Of course with MarkLogic you don't store XML files on disk, any more
than Oracle stores CSV files on disk. XML is just the data model used
to represent the content.
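
To make the idea of "enriched XML" concrete, here is a rough sketch of turning one raw message into a small XML tree using only the Python standard library. This is not the MarkMail pipeline: the element names (`message`, `headers`, `header`, `body`, `quote`, `line`), the sample message, and the helpers `quote_depth` and `email_to_xml` are all assumptions made up for illustration.

```python
from email import message_from_string
import xml.etree.ElementTree as ET

# A made-up sample message; addresses are placeholders.
RAW = """\
From: a@example.org
To: b@example.org
Subject: Re: test

> quoted line
reply line
"""

def quote_depth(line):
    """Return (depth, text) where depth counts leading '>' quote markers."""
    depth = 0
    rest = line
    while rest.startswith(">"):
        depth += 1
        rest = rest[1:].lstrip(" ")
    return depth, rest

def email_to_xml(raw):
    """Build a hypothetical XML representation of a single-part text email."""
    msg = message_from_string(raw)
    root = ET.Element("message")

    headers = ET.SubElement(root, "headers")
    for name, value in msg.items():
        header = ET.SubElement(headers, "header", name=name)
        header.text = value

    body = ET.SubElement(root, "body")
    for line in msg.get_payload().splitlines():
        if not line.strip():
            continue  # skip blank lines in this simple sketch
        depth, text = quote_depth(line)
        if depth:
            elem = ET.SubElement(body, "quote", depth=str(depth))
        else:
            elem = ET.SubElement(body, "line")
        elem.text = text
    return root

root = email_to_xml(RAW)
print(ET.tostring(root, encoding="unicode"))
```

Once quoted text and reply text are separate elements, a query can target one and not the other, which is the kind of structure a plain mbox file does not expose.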
> Why parse and tag mail to then parse it as
> XML when one can parse it directly (which makes also a lot of sense given the
> observation that mail contains overlapping context structures such as lines
> and sentences) into the "internal" structures that one is using anyway
> (especially given that one wants to see the mail as given, noting the
> use of physical position to convey meaning as-if ee.cummings)?
If I only fetched mail by id, then I could parse it on the fly for
rendering. But if I'm to use the structure in the query, it needs to
exist in the database in its enriched format.
>> So, in fact, it's 100+ Gigs of XML content.
>
> Do you index it in one big lump or is it segmented?
It operates in many ways like a database. Every new email that arrives
is incorporated into the index immediately. The indexes are able to do
that while also keeping performance up, using an index merging model.
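
For readers unfamiliar with index merging in general, here is a toy sketch of the idea. It is not how MarkLogic implements it; the class `MergingIndex`, the `max_segments` threshold, and the whitespace tokenizer are assumptions chosen to keep the example short. Each new document becomes a small index segment right away, and segments are merged into one when there are too many of them.

```python
from collections import defaultdict

class MergingIndex:
    """Toy inverted index: small per-document segments, merged periodically."""

    def __init__(self, max_segments=4):
        self.segments = []            # each segment maps term -> set of doc ids
        self.max_segments = max_segments

    def add(self, doc_id, text):
        # The new document is searchable as soon as its segment is appended.
        segment = defaultdict(set)
        for term in text.lower().split():
            segment[term].add(doc_id)
        self.segments.append(dict(segment))
        if len(self.segments) > self.max_segments:
            self._merge()

    def _merge(self):
        # Combine all segments into one so queries touch fewer structures.
        merged = defaultdict(set)
        for segment in self.segments:
            for term, doc_ids in segment.items():
                merged[term] |= doc_ids
        self.segments = [dict(merged)]

    def search(self, term):
        hits = set()
        for segment in self.segments:
            hits |= segment.get(term.lower(), set())
        return hits

index = MergingIndex(max_segments=2)
index.add(1, "XML data model")
index.add(2, "mbox to XML conversion")
index.add(3, "index merging model")   # third add triggers a merge
print(sorted(index.search("xml")), len(index.segments))
```

The trade-off is the usual one: inserts stay cheap because they only write a small segment, while merging keeps the number of segments a query must consult bounded.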
-jh-