Re: [xml-dev] MarkMail: now archiving xml-dev

Quoting Jason Hunter <jhunter@acm.org>:

> If you divide 60 Gigs by 4,000,000 emails that's 15k per email.  That's 
> bigger than I would have guessed an average email to be, but you have to 
> take into account the full headers and the influence of the (relatively 
> few) binary attachments.

Even with "full headers" I think a 15k average message size (excluding
attachments) is suspect. A good chunk of the email headers could -- if one is
bothering to clean things up -- be excluded, since they describe the path of
email transmission and not content. In a service it's not really of interest
to anyone how the mail arrived and got bounced around in one's own network ---
and often we don't even want to publish such information.
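To make that concrete, here is a minimal sketch of such header pruning with Python's stdlib email package; the set of transport headers is an illustrative choice, not a complete policy:

```python
# Sketch: drop transmission-path headers before archiving, keeping only
# the content-bearing ones. The TRANSPORT set below is illustrative.
from email import message_from_string

RAW = """\
Received: from relay.example.org by mx.example.net; Mon, 1 Jan 2007 10:00:00 +0000
Received: from sender.example.com by relay.example.org; Mon, 1 Jan 2007 09:59:58 +0000
From: alice@example.com
To: xml-dev@lists.xml.org
Subject: test
Message-ID: <abc@example.com>

Hello world.
"""

# Headers that describe how the mail was routed, not what it says.
TRANSPORT = {"received", "return-path", "x-mailer", "delivered-to"}

msg = message_from_string(RAW)
for name in list(msg.keys()):
    if name.lower() in TRANSPORT:
        del msg[name]  # removes all occurrences of that header

print(sorted(msg.keys()))  # -> ['From', 'Message-ID', 'Subject', 'To']
```

On real archives this alone shaves a noticeable fraction off the average message size, since each relay hop adds its own Received line.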

> >> Converting mbox emails to enriched XML involves an expansion.
> > 
> > When I index mail I don't bother.
> Well, we probably have different goals and infrastructure technologies. 
>   I want to have access to the hierarchical internal structure of each 
> email body, and to help me accomplish that I have a tool that thinks in 
> XML so it's a natural representation.

Who says that one can't have access to the "hierarchical internal structure of
each email body"? When I parse emails I identify on the fly (and model
as internal structure) the header meta-data (including parsing the special
types such as dates, arrival times, content length, priority, etc.) and the
typical body structure: lines, sentences, paragraphs (and pages). Mailing
lists are a bit more hierarchical --- especially digest formats -- with
sub-messages, in turn with their own meta-data bits and their own lines,
sentences and paragraphs. Through the same process that one would use to
auto-tag a mail folder into a glob of XML, one could just as well go directly
to the internal data representation and save a parse-puke-parse cycle (aside
from the observation that some of the structures in mail overlap and have
other characteristics that demand much "arm twisting" in XML).
--- and since I have the structure I can, should I so desire, puke
XML on the fly (and since we can select at search time whatever unit of
retrieval we desire, not just a message, I think we have even more to gain).
Mail is constantly in flux as new messages define new ....
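A minimal sketch of that direct-parse approach in Python: the mail goes straight into an internal structure (header fields plus paragraphs of lines), and XML is emitted only on demand. The element names ("message", "para", "line") are illustrative, not a real schema:

```python
# Sketch: parse a mail directly into an internal model and serialize
# to XML only when asked, skipping the parse-tag-reparse round trip.
import xml.etree.ElementTree as ET
from email import message_from_string

RAW = """\
From: alice@example.com
Subject: demo

First paragraph,
still the first.

Second paragraph.
"""

msg = message_from_string(RAW)
# Internal model: a list of paragraphs, each a list of lines.
paras = [blk.splitlines()
         for blk in msg.get_payload().split("\n\n") if blk.strip()]

def to_xml(msg, paras):
    """Emit XML on the fly from the internal structure."""
    root = ET.Element("message")
    head = ET.SubElement(root, "head")
    for name, value in msg.items():
        ET.SubElement(head, "field", name=name).text = value
    body = ET.SubElement(root, "body")
    for lines in paras:
        p = ET.SubElement(body, "para")
        for line in lines:
            ET.SubElement(p, "line").text = line
    return ET.tostring(root, encoding="unicode")

print(to_xml(msg, paras))
```

Sentence splitting and digest sub-messages would hang off the same model; only the XML view is generated per request.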

> Of course with MarkLogic you don't store XML files on disk, any more 
> than Oracle stores CSV files on disk.  XML is just the representation 
> data model.

My philosophy is to try to tackle whatever representation model is thrown
at me. Mail is one such model. This way I can throw XML, mail and all kinds of
other inputs into a big heap, search them (exploiting their structure),
retrieve bits (exploiting their structure as the unit of retrieval) and,
should I desire, convert them on the fly into other representations. With a
semantic crosswalk one can do some really, really wacky things :-)

> > Why parse and tag mail to then parse it as
> > XML when one can parse it directly (which makes also a lot of sense given
> > the observation that mail contains overlapping context structures such as
> > lines and sentences) into the "internal" structures that one is using anyway
> > (especially given that one wants to see the mail as given, noting the
> > use of physical position to convey meaning as-if ee.cummings)?
> If you only fetched mail by id, then I could parse it on the fly for 
> rendering.  But if I'm to use the structure in the query, it needs to 
> exist in the database in its enriched format.

Absolutely not. I do have records (for example, an email message) but I'm not
bound by them as the unit of retrieval. I can fetch mail by context. One might
want to fetch mail by id, as that's a legitimate activity, but one might also
be interested in a single message within a digest as the "relevant" bit of
information for a query -- or, for that matter, a relevant "bit" might be a
whole thread or locus of messages.
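A toy sketch of sub-message retrieval, assuming an RFC 1153-style digest whose sub-messages are separated by a line of 30 hyphens (real digest conventions vary):

```python
# Sketch: split a digest into sub-message units so that a query can
# return the single relevant sub-message rather than the whole digest.
SEP = "-" * 30  # RFC 1153 digest separator; illustrative assumption

DIGEST = f"""\
From: carol@example.com
Subject: list digest V1 #42

About indexing.
{SEP}
From: bob@example.com
Subject: about retrieval

Retrieval by context.
"""

def sub_messages(digest_text):
    """Split a digest into sub-message units."""
    return [part.strip() for part in digest_text.split(SEP)]

def fetch_by_context(units, term):
    """Return only the units whose text contains the query term."""
    return [u for u in units if term in u]

units = sub_messages(DIGEST)
hits = fetch_by_context(units, "Retrieval")
print(len(units), len(hits))  # 2 units, 1 matching sub-message
```

The same selection could just as well return a whole thread: the unit of retrieval is a choice made at query time, not at storage time.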

> >> So, in fact, it's 100+ Gigs of XML content.
> > 
> > Do you index it in one big lump or is it segmented?
> It operates in many ways like a database.  Every new email that arrives 
> is incorporated into the index immediately.  The index model is able 
> to do that while also keeping performance up, using an index merging model.

Sure. Mailing lists are easy since it's just add/merge (we throw things in a
queue so as not to start an index run for each and every mail that arrives,
keeping system impact to a minimum without loss of effective functionality).
We're even doing the same with RSS/Atom/CAP feeds, and keeping things
synchronized there is a bit wackier since it's delete/add/merge and garbage
collect. We're indexing about 600 active news feeds and many update or change
their stories at very frequent rates (we also keep track of these changes
since they can be interesting over time).
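The queue-then-merge idea can be sketched with a toy inverted index (the batch size and data structures are illustrative, not what any real indexer uses):

```python
# Sketch: arriving messages accumulate in a queue and are folded into
# the index in batches, instead of triggering an index run per message.
from collections import defaultdict, deque

queue = deque()
index = defaultdict(set)  # term -> set of message ids
BATCH = 3                 # merge once this many mails are queued

def merge():
    """Fold all queued messages into the index in one pass."""
    while queue:
        msg_id, text = queue.popleft()
        for term in text.lower().split():
            index[term].add(msg_id)

def enqueue(msg_id, text):
    """Queue an arriving message; merge when the batch is full."""
    queue.append((msg_id, text))
    if len(queue) >= BATCH:
        merge()

enqueue(1, "xml parsing")
enqueue(2, "mail indexing")
enqueue(3, "xml indexing")   # third arrival triggers the batch merge
print(sorted(index["xml"]))  # -> [1, 3]
```

Feeds that update stories need the extra delete step before the add/merge, which is where the garbage collection comes in.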

> -jh-

  E. Zimmermann, BSn/Munich R&D Unit
  Leopoldstrasse 53-55, D-80802 Munich,
  Federal Republic of Germany



Copyright 1993-2007 XML.org. This site is hosted by OASIS