Hammering on the ham, a new approach to reading (lots of) emails.
Email these days is all about fighting spam, and for good reasons. But
if you work in IT you are, or probably should be, subscribed to several mailing
lists, news feeds and whatnot, because knowing what's going on isn't an option,
it's an obligation. And once your spam has been compressed and turned into
small bits, and flushed down /dev/null, your inbox still counts about 300 new
messages a day. If that's the case, as soon as it gets busy at
work and with life in general, keeping up becomes a problem, and that's what
happened to me, again. After 3 weeks away I'm trying to catch up and it's a pain:
too many emails. full-disclosure was literally flooded with XSS reports, and
finding the interesting vulnerability reports and discussions has been a mission.
This got me thinking about something very obvious and very old: not all the ham
is good. That's pretty much the idea behind scoring, so different emails can
have different weights. But how do you score an email? For people who, like me,
use procmail/maildrop, that means a series of rules matching headers or text and
manipulating a score header or doing something else. And that's what I've done
recently to get rid of all those nasty XSS advisories from full-disclosure and
bugtraq. But this method doesn't really work, just as matching "Viagra" in an
email's subject is no longer a good way to catch
Viagra spam. But couldn't you reverse the comparison then, and say that
advanced checks like those in antispam applications are what's really needed to
filter ham?
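As a rough illustration of the rule-based scoring described above, here is a minimal Python sketch. It is not my actual procmail setup: the rule table, the score values and the sample message are all made up for the example.

```python
import re
from email import message_from_string

# Hypothetical rules: (header name, regex, score delta). Values are illustrative only.
RULES = [
    ("List-Id", re.compile(r"full-disclosure|bugtraq", re.I), 1.0),
    ("Subject", re.compile(r"\bxss\b|cross[- ]site scripting", re.I), -5.0),
]

def score_message(raw):
    """Sum the deltas of every rule whose regex matches the named header."""
    msg = message_from_string(raw)
    total = 0.0
    for header, pattern, delta in RULES:
        value = msg.get(header, "")  # a missing header counts as an empty string
        if pattern.search(value):
            total += delta
    return total

# Made-up message on a made-up list host.
SAMPLE = (
    "List-Id: <full-disclosure.lists.example.org>\n"
    "Subject: [Full-disclosure] XSS in some webapp\n"
    "\n"
    "body\n"
)

if __name__ == "__main__":
    print(score_message(SAMPLE))
```

The weakness is visible right away: every new kind of noise needs a new hand-written regex, which is exactly why keyword rules stopped being enough for spam.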
But what is good ham in the first place? I'd define it as
interesting emails, and an interesting email is:
- 1. about a topic you care about
- 2. from a sender whose high quality as a poster is known to you
- 3. well formed and, for example, not containing swearwords or other silly writing styles
- 4. not similar to another email in the same thread or to a very recent one
Classification brings in the concept of tags, so we need something that supports email tagging, ideally with the tagging done automatically (Bayesian techniques) and with manual learning and tag refinement available. And while we're at it, let's face the uselessness of separate mailboxes for separate lists: as should be obvious by now, when reading an email it's the content that matters, and where it comes from is unimportant. At this point someone would be tempted to mention Gmail, and indeed it's a very good attempt at doing what I'm describing, but the tagging, called "labelling" in Gmail, is manual and not very flexible.
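To make "automatic tagging with manual learning" a bit more concrete, here is a minimal sketch of a multinomial naive Bayes tagger with add-one smoothing. The class name `TagClassifier`, the tokenizer and the training sentences are hypothetical; a real setup would train on whole messages and probably assign more than one tag per email.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

class TagClassifier:
    """Tiny multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # tag -> word -> count
        self.doc_counts = Counter()              # tag -> number of training messages
        self.vocabulary = set()

    def learn(self, text, tag):
        """Manual learning step: record one message as an example of `tag`."""
        words = tokenize(text)
        self.word_counts[tag].update(words)
        self.doc_counts[tag] += 1
        self.vocabulary.update(words)

    def classify(self, text):
        """Return the most probable tag, or None if nothing was learned yet."""
        words = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        best_tag, best_logp = None, -math.inf
        for tag, docs in self.doc_counts.items():
            counts = self.word_counts[tag]
            denom = sum(counts.values()) + len(self.vocabulary)
            logp = math.log(docs / total_docs)  # prior of the tag
            for word in words:
                logp += math.log((counts[word] + 1) / denom)
            if logp > best_logp:
                best_tag, best_logp = tag, logp
        return best_tag
```

Tag refinement then amounts to calling `learn()` again whenever the classifier gets a message wrong.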
Item 4 is also very important, and there's no real implementation of it available these days. The closest thing around, and it's far from common, is the following procmail recipe:

    :0 Wh: msgid.lock
    | formail -D 65536 .msgid.cache

    :0 a
    dup/

That eliminates strictly duplicated emails, but come on, how many times do all the emails in a thread say more or less the same thing? Those could easily be considered duplicates as well. Once again we could use some Bayesian filter to measure how much messages differ and decide whether one is unique or not.
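One way to picture "more or less the same thing" is a similarity measure between message bodies. The sketch below is not a Bayesian filter: it swaps in a simpler technique, Jaccard similarity over three-word shingles, just to show the idea. The function names and the 0.6 threshold are assumptions picked for the example.

```python
import re

def shingles(text, size=3):
    """Return the set of overlapping word n-grams (shingles) in text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < size:
        # Too short for a full shingle: treat the whole text as one.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b| (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def is_near_duplicate(body, earlier_bodies, threshold=0.6):
    """True if body is at least `threshold` similar to any earlier body."""
    current = shingles(body)
    return any(jaccard(current, shingles(old)) >= threshold
               for old in earlier_bodies)
```

In a real thread the quoted text of replies would inflate the similarity, so lines starting with ">" would probably have to be stripped before comparing.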
Looking for a solution and for info to implement my own, I concluded that SpamAssassin already has all, or almost all, of what I need: the words added to the X-Spam header could be my tags, and I could easily get it to add other custom headers if needed. Plus I'd get training and whitelisting for free, and the infrastructure is already in place for antispam purposes.
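As a sketch of how the words in the X-Spam header could become tags, the snippet below pulls the rule names out of the `tests=` field of an `X-Spam-Status` header. It assumes the usual layout of that header (a comma-separated `tests=` list, possibly folded across lines, and `tests=none` when nothing matched); the function name and the sample message are made up.

```python
import re
from email import message_from_string

def spamassassin_tags(raw):
    """Return the rule names listed in tests= of X-Spam-Status, as a list."""
    msg = message_from_string(raw)
    status = msg.get("X-Spam-Status", "")
    # Rule names are separated by commas; a folded header may put a
    # newline and whitespace after a comma, which \s* absorbs.
    match = re.search(r"tests=(\w+(?:,\s*\w+)*)", status)
    if not match:
        return []
    names = re.split(r",\s*", match.group(1))
    return [name for name in names if name != "none"]

# Made-up message with a folded X-Spam-Status header.
SAMPLE = (
    "X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00,\n"
    "\tHTML_MESSAGE autolearn=ham\n"
    "Subject: test\n"
    "\n"
    "body\n"
)

if __name__ == "__main__":
    print(spamassassin_tags(SAMPLE))
```

With custom rules named after topics, the same list would carry the topic tags alongside the stock antispam ones.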
Once you've got tags and scores applied to your email messages, the rest becomes a matter of taste and imagination. One could think of a tag-based email reader in which mailboxes are tags or combinations thereof. Personally, to simplify things and avoid patching MUAs, I'm thinking about keeping the same email directory structure I've got now (a dir per List-Id) and creating additional ones with the list of tags, sorted by name, appended, after throwing away strict matches on tags I don't care about.
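The directory naming could look something like the sketch below. The separators (`.` and `+`), the `IGNORED_TAGS` set and the function name are assumptions made for the example, not a layout I have settled on.

```python
import re

# Hypothetical set of tags to throw away before naming the mailbox.
IGNORED_TAGS = {"HTML_MESSAGE"}

def mailbox_for(list_id, tags, ignored=IGNORED_TAGS):
    """Return the per-list directory name, with the kept tags appended.

    The result is 'listid' when no tag survives, else 'listid.TAG1+TAG2'
    with the tags sorted by name.
    """
    # A List-Id header may carry a description before the <identifier>.
    match = re.search(r"<([^>]+)>", list_id)
    ident = match.group(1) if match else list_id.strip()
    base = re.sub(r"[^A-Za-z0-9._-]", "_", ident)  # keep the name filesystem-safe
    kept = sorted(set(tags) - set(ignored))
    return base if not kept else base + "." + "+".join(kept)
```

For example, a message from a hypothetical `<some-list.example.org>` tagged `XSS` and `ADVISORY` would land in `some-list.example.org.ADVISORY+XSS`, while an untagged one stays in the plain per-list directory.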
That's it. Hopefully I can get SA to do all of that and solve my too-much-ham problem once and for all.