
Hammering on the ham, a new approach to reading (lots of) emails.

Sun, 07 Jan 2007

Email these days is all about fighting spam, and for good reason. But if you work in IT you are, or probably should be, subscribed to several mailing lists, news feeds and whatnot, because knowing what's going on isn't an option, it's an obligation. So even once your spam has been compressed, turned into small bits and flushed down /dev/null, your inbox still counts about 300 new messages a day. In that case, as soon as things get busy at work and in life in general, keeping up becomes a problem, and that's what happened to me, again. After 3 weeks away I'm trying to catch up and it's a pain: too many emails. full-disclosure was literally flooded with XSS reports, and finding the interesting vulnerability reports and discussions has been a mission.
This got me thinking about something very obvious and very old: not all ham is good. That's pretty much the idea behind scoring, where different emails carry different weights. But how do you score an email? For people who, like me, use procmail/maildrop, it means a series of rules that match headers or text and then manipulate a score header or do something else. That's what I've done recently to get rid of all those nasty XSS SAs from full-disclosure and bugtraq. But this method doesn't really work, in the same way that matching "Viagra" in an email's subject is no longer a good way to catch Viagra spam. So couldn't you turn the comparison around and say that advanced checks, like those in antispam applications, are what's really needed to filter ham?
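To make the rule-based scoring described above concrete, here is a minimal sketch in Python rather than procmail. The rule list, the weights and the `X-Ham-Score` header name are all assumptions made for illustration, not something taken from an existing setup.

```python
import re
from email import message_from_string

# Hypothetical rules: (header to inspect, pattern, score delta).
RULES = [
    ("Subject", re.compile(r"\bxss\b|cross[- ]site scripting", re.I), -5),
    ("List-Id", re.compile(r"full-disclosure", re.I), 1),
    ("Subject", re.compile(r"\bkernel\b|\bpostgres", re.I), 3),
]

def score_message(raw):
    """Parse a raw message, sum the deltas of all matching rules and
    record the total in a (hypothetical) X-Ham-Score header."""
    msg = message_from_string(raw)
    score = 0
    for header, pattern, delta in RULES:
        if pattern.search(msg.get(header, "")):
            score += delta
    msg["X-Ham-Score"] = str(score)
    return msg

raw = (
    "From: someone@example.org\n"
    "List-Id: <full-disclosure.lists.example.org>\n"
    "Subject: [Full-disclosure] XSS in some web forum\n"
    "\n"
    "body\n"
)
print(score_message(raw)["X-Ham-Score"])  # -5 for XSS, +1 for the list: -4
```

The weakness is the one noted above: every rule is a fixed pattern, so it only catches what you thought of in advance.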
But what is good ham in the first place? I'd define it as interesting emails, and an interesting email is:

  1. about a topic you care about
  2. from a sender you know to be a high-quality poster
  3. well formed, for example free of swearwords and other silly writing styles
  4. not similar to another email in the same thread, or to a very recent one
Classification is really important because it's the primary criterion we should sort emails by. Let's think of full-disclosure again: it's a mailing list about IT security and vulnerability disclosure, but within that realm there are dozens of sub-realms, e.g. the Windows and Linux ones, the web-related one, the database one, and so on. If your job is Linux sysadmin for several database clusters, guess what, Windows and web-related vulnerabilities aren't of much interest to you. So really, why do we waste even a second reading the subject before moving on? I agree, the majority of us probably deal with a very limited amount of ham, but for all the others, what are we waiting for? Ham filtering is a must-have!
Classification brings in the concept of tags, so we need something that supports email tagging, ideally with the tagging done automatically (Bayesian techniques) and with manual learning and tag refinement available. And while we're at it, let's face the uselessness of separate mailboxes for separate lists: as should be obvious by now, when reading an email it's the content that matters, and where it comes from is unimportant. At this point someone will be tempted to mention Gmail, and indeed it's a very good attempt at what I'm describing, but the tagging, called "labelling" in Gmail, is manual and not very flexible.
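As a rough illustration of automatic tagging with manual training, here is a tiny naive Bayes classifier in plain Python. It is only a sketch: the class name, the tag names and the training sentences are invented, it picks a single best tag per message, and a real setup would train on far more mail.

```python
import math
import re
from collections import Counter, defaultdict

class TagTrainer:
    """Minimal multinomial naive Bayes with add-one smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # tag -> word -> count
        self.doc_counts = Counter()              # tag -> trained messages

    @staticmethod
    def tokens(text):
        return re.findall(r"[a-z0-9]+", text.lower())

    def train(self, tag, text):
        """Manual learning step: tell the trainer that text belongs to tag."""
        self.word_counts[tag].update(self.tokens(text))
        self.doc_counts[tag] += 1

    def best_tag(self, text):
        """Return the tag with the highest log-probability for text."""
        vocab = {w for counts in self.word_counts.values() for w in counts}
        total_docs = sum(self.doc_counts.values())
        best, best_lp = None, -math.inf
        for tag, counts in self.word_counts.items():
            lp = math.log(self.doc_counts[tag] / total_docs)
            denom = sum(counts.values()) + len(vocab)
            for word in self.tokens(text):
                lp += math.log((counts[word] + 1) / denom)
            if lp > best_lp:
                best, best_lp = tag, lp
        return best

trainer = TagTrainer()
trainer.train("web", "XSS vulnerability in PHP forum login page")
trainer.train("web", "SQL injection in web shop cookie handling")
trainer.train("linux", "Linux kernel local privilege escalation")
trainer.train("linux", "kernel race condition in ext3 filesystem driver")

print(trainer.best_tag("new kernel privilege escalation exploit"))  # linux
print(trainer.best_tag("PHP forum XSS"))                            # web
```

Tag refinement then amounts to calling `train` again with the tag you think a misfiled message should have had.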
Item 4 is also very important, and there's no real implementation of it available these days. The closest thing around, and it's far from common, is the following procmail recipe:
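# Check the Message-ID against a 64 KB cache; formail succeeds on a duplicate
:0 Wh: msgid.lock
| formail -D 65536 .msgid.cache
# If the previous recipe succeeded (a duplicate was found), file it under dup/
:0 a
dup/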
That eliminates strictly duplicated emails, but come on, how many times do all the emails in a thread say more or less the same thing? Those could easily be considered duplicates as well. Once again we could use some Bayesian filter to measure how much messages differ and decide whether one is unique or not.
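One simple way to experiment with "more or less the same thing" is sketched below. Note that it is not a Bayesian filter: it swaps in plain Jaccard similarity on word sets, ignores quoted lines, and uses an arbitrary threshold of 0.6, so treat every name and number in it as an assumption.

```python
import re

def content_words(body):
    """Lower-cased set of words in the body, skipping quoted ('>') lines."""
    lines = [l for l in body.splitlines() if not l.lstrip().startswith(">")]
    return set(re.findall(r"[a-z0-9']+", " ".join(lines).lower()))

def similarity(a, b):
    """Jaccard similarity of the two bodies' word sets, from 0.0 to 1.0."""
    wa, wb = content_words(a), content_words(b)
    if not wa and not wb:
        return 1.0
    return len(wa & wb) / len(wa | wb)

def is_near_duplicate(body, earlier_bodies, threshold=0.6):
    """True if body is at least threshold-similar to any earlier message."""
    return any(similarity(body, e) >= threshold for e in earlier_bodies)

first = "The patch fixes the overflow in the parser."
echo = "> The patch fixes the overflow in the parser.\nYes, the patch fixes the parser overflow."
other = "Has anyone tested this on BSD systems?"

print(is_near_duplicate(echo, [first]))   # True  (5 shared words out of 7)
print(is_near_duplicate(other, [first]))  # False (no shared words)
```

Comparing only against messages from the same thread, or from the last few hours, would keep the number of comparisons small.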
Looking for a solution, and for info to implement my own, I concluded that SpamAssassin already has all, or almost all, that I need: the words added to the X-Spam header could be my tags, and I could easily get it to add other custom headers if needed. Plus I'd get training and whitelisting for free, and the infrastructure is already in place for antispam purposes.
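As a sketch of how those words could be read back as tags, the snippet below pulls the rule names out of the `tests=` field of an `X-Spam-Status` header. The exact header layout depends on the SpamAssassin version and configuration, so the sample header is an assumption, and `MY_HAM_SQL` is a made-up custom rule name.

```python
import re
from email import message_from_string

# Rule names are upper-case identifiers separated by commas; a folded
# header may put a line break and whitespace after a comma.
TESTS_RE = re.compile(r"tests=([A-Z0-9_]+(?:,\s*[A-Z0-9_]+)*)")

def spamassassin_tags(raw):
    """Return the list of rule names found in X-Spam-Status, or []."""
    msg = message_from_string(raw)
    match = TESTS_RE.search(msg.get("X-Spam-Status", ""))
    if not match:
        return []  # header missing, or something like tests=none
    return [name.strip() for name in match.group(1).split(",")]

raw = (
    "Subject: index corruption after upgrade\n"
    "X-Spam-Status: No, score=-2.6 required=5.0 tests=BAYES_00,\n"
    "\tMY_HAM_SQL autolearn=ham version=3.1.7\n"
    "\n"
    "body\n"
)
print(spamassassin_tags(raw))  # ['BAYES_00', 'MY_HAM_SQL']
```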
Once you've got tags and a score applied to your email messages, it becomes a matter of taste and imagination. One could think of a tag-based email reader in which mailboxes are tags or combinations thereof. Personally, to simplify things and avoid patching MUAs, I'm thinking about keeping the same email directory structure I've got now (a dir per List-Id) and creating additional ones with the list of tags, sorted by name, appended, after throwing away strict matches on tags I don't care about.
That's it. Hopefully I can get SA to do all of that and solve my too-much-ham problem once and for all.
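The directory scheme could look something like the sketch below. It rests on one reading of "strict matches": a message is thrown away only when every one of its tags is in the ignore set. The separator characters, the ignore set and the function name are likewise assumptions.

```python
import re

def mailbox_for(list_id, tags, ignored=frozenset({"web", "windows"})):
    """Return the directory name for a message, or None to throw it away."""
    # Use the part between angle brackets if the List-Id header has one.
    match = re.search(r"<([^>]+)>", list_id)
    base = match.group(1) if match else list_id.strip()
    tagset = {t.lower() for t in tags}
    if tagset and tagset <= ignored:
        return None  # every tag is one I don't care about
    if not tagset:
        return base  # untagged mail stays in the plain per-list directory
    return base + "." + "+".join(sorted(tagset))

list_id = "Full Disclosure <full-disclosure.lists.example.org>"
print(mailbox_for(list_id, ["linux", "database"]))
# full-disclosure.lists.example.org.database+linux
print(mailbox_for(list_id, ["web"]))  # None
```

Sorting the tags means the same combination always lands in the same directory, whatever order the tags were assigned in.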




SpikeLab.org is a Filippo Spike Morelli copyright 2005-2008
This work is licensed under Creative Commons Att-SA License.