I've got a question that isn't so SuSE-ish, but of interest just the same. I'm starting to use the Courier-imap server rather than the uw-imap server and I have a question about batch processing of spam.

Typically what I do is sort email into folders to allow for corrections to the bayesian filters in case they got any of it backwards. I then run a crontab job to make all the corrections and then delete the old emails.

Under mbox format this was easy: I could just do something like:

    sa-learn --ham < mbox_file && > mbox_file

Under cyrus I tried something similar where I would read each file in separately and then delete it. That fried my imap server for a few days. I'm more than a little reluctant to try it like that again.

So, I'm asking, how would you go about reading in lots of emails for processing (spamassassin, bogofilter, razor) and then deleting them?

Originally I thought I could take all the mail and store it into an mbox format, which would be very effective for batch processing, but courier-imap won't have any of that.
On Mon, Feb 09, 2004 at 11:45:29PM -0500 or thereabouts, Tom Allison wrote:
I've got a question that isn't so SuSE-ish, but of interest just the same.
I'm starting to use the Courier-imap server rather than the uw-imap server and I have a question about batch processing of spam.
Typically what I do is sort email into folders to allow for corrections to the bayesian filters in case they got any of it backwards. I then run a crontab job to make all the corrections and then delete the old emails.
Under mbox format this was easy: I could just do something like:
    sa-learn --ham < mbox_file && > mbox_file
Under cyrus I tried something similar where I would read each file in separately and then delete it. That fried my imap server for a few days.
So, I'm asking, how would you go about reading in lots of emails for processing (spamassassin, bogofilter, razor) and then deleting them?
Originally I thought I could take all the mail and store it into an mbox format, which would be very effective for batch processing, but courier-imap won't have any of that.
Courier only supports Maildir format, as you may have gathered...

There is a way to do what you want using the SpamAssassin daemon (spamd), which is called from either maildrop or procmail using spamc in a recipe. This is done on the IMAP server side, and has the added benefit of sorting your email into its various folders. The spam can then be identified in the headers and either passed on to your users, who can filter it in their MUAs, or moved to its own folder for review, or just deleted.

-- 
Gary
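As a rough illustration of "identified in the headers": the sketch below assumes spamd is running, spamc is on the PATH, and SpamAssassin is left at its usual behaviour of adding an "X-Spam-Flag: YES" header to mail it classifies as spam. The function name has_spam_flag is made up for this example and is not part of any tool mentioned above.

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit 0) if the message on stdin carries
# SpamAssassin's spam flag in its header block.
has_spam_flag() {
    # sed prints up to the first blank line (the end of the headers) and
    # quits, so a matching line inside the body is never inspected.
    sed '/^$/q' | grep -qi '^X-Spam-Flag: *YES'
}

# Example use (assumes spamd/spamc are set up; "message.eml" is a placeholder):
#   spamc < message.eml | has_spam_flag && echo "tagged as spam"
```

In a real delivery setup the same header test would normally live in the procmail or maildrop recipe itself, after the message has been piped through spamc.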
Gary wrote:
So, I'm asking, how would you go about reading in lots of emails for processing (spamassassin, bogofilter, razor) and then deleting them?
Originally I thought I could take all the mail and store it into an mbox format, which would be very effective for batch processing, but courier-imap won't have any of that.
Courier only supports Maildir format, as you may have gathered...
Thanks for the response. I think I should have provided more details the first time around.

What I had working under uw-imap and mbox files was a process where mail was tested under spamassassin and bogofilter and then filed accordingly. However, these tests are not 100% accurate, and sometimes spam gets through the filters and ham lands in the spam folder. To teach the bayesian filters easily, I simply moved the misfiled email into folders I designated for corrections: ham2spam and spam2ham. I would then run a cron job to read the mbox files into sa-learn and bogofilter to re-learn the bayesian statistics. At the end of this, I would delete the email ( '> mail_file' ) and everything was done.

But I'm not sure how this could be done using the maildir format.

    for F in INBOX/*; do
        sa-learn --ham < "$F" && rm "$F"
    done

might work, but when I tried this under Cyrus-imap I messed up the folders so badly they didn't really recover. I'm not sure what else might be tried.
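A minimal sketch of that per-file loop for a Maildir folder, under a few assumptions: message files live in the folder's cur/ and new/ subdirectories (tmp/ is left alone), the learner reads one message on stdin as in the sa-learn calls above, and the folder paths in the usage comments are illustrative Courier-style names, not taken from anyone's actual setup. The function name learn_maildir is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: feed each message file in a Maildir folder to a
# learner command, deleting a file only after the learner succeeded on it.
# Usage: learn_maildir <maildir-folder> <learner> [learner args...]
learn_maildir() {
    dir=$1
    shift
    rc=0
    for sub in cur new; do
        for f in "$dir/$sub"/*; do
            # Skip the literal pattern left by an empty directory,
            # and anything that is not a regular file.
            [ -f "$f" ] || continue
            # Run the learner with the message on stdin; remove the
            # message only on success, otherwise remember the failure.
            "$@" < "$f" && rm -- "$f" || rc=1
        done
    done
    return "$rc"
}

# Example cron-job body (folder paths are assumptions):
#   learn_maildir "$HOME/Maildir/.spam2ham" sa-learn --ham
#   learn_maildir "$HOME/Maildir/.ham2spam" sa-learn --spam
```

Because a file is deleted only when the learner exits successfully, a failed run leaves the message in place for the next cron pass rather than silently dropping it.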
Quoting Tom Allison
I've got a question that isn't so SuSE-ish, but of interest just the same.
I'm starting to use the Courier-imap server rather than the uw-imap server and I have a question about batch processing of spam.
Typically what I do is sort email into folders to allow for corrections to the bayesian filters in case they got any of it backwards. I then run a crontab job to make all the corrections and then delete the old emails.
Under mbox format this was easy: I could just do something like:
    sa-learn --ham < mbox_file && > mbox_file
Under cyrus I tried something similar where I would read each file in separately and then delete it. That fried my imap server for a few days.
I'm more than a little reluctant to try it like that again.
So, I'm asking, how would you go about reading in lots of emails for processing (spamassassin, bogofilter, razor) and then deleting them?
Originally I thought I could take all the mail and store it into an mbox format, which would be very effective for batch processing, but courier-imap won't have any of that.
I am doing just what you describe under Courier-IMAP: putting false positives in IMAP folders and having a cron job process and delete them.

Cyrus, IIRC, uses an index for faster lookup. It may not like it when a file is still in the index and has been deleted. This is just a guess, I have never used Cyrus. I have used Courier-IMAP for years and it has no lasting problem with other programs deleting messages in the Maildir.

Note: if I pull up the message list in SquirrelMail, which goes through IMAP, delete one of the messages with Mutt, which reads the message files directly, and then try to read that message in SquirrelMail, it will produce some kind of error message, but it recovers fine.

HTH,
Jeffrey
Quoting Tom Allison:
I've got a question that isn't so SuSE-ish, but of interest just the same.
I'm starting to use the Courier-imap server rather than the uw-imap server and I have a question about batch processing of spam.
I am doing just what you describe under Courier-IMAP, putting false positives in IMAP folders and having a cron job process and delete them. Cyrus, IIRC, uses an index for faster lookup. It may not like it when a file is still in the index and has been deleted. This is just a guess, I have never used Cyrus. I have used Courier-IMAP for years and it has no lasting problem with other programs deleting messages in the Maildir. Note: if I pull up the message list in SquirrelMail that goes thru the IMAP, delete one of the messages with Mutt that reads the message files directly, and then try to read that message in SquirrelMail, it will produce some kind of error message, but it recovers fine.
I would say your guesses here are very accurate. I'm using scripts now to do batch email processing for spam filtering and reporting.

BTW, if you can afford it, I would encourage you to use the razor functions to report spam back to the razor network. The more people contribute to this, the more effective it becomes at blocking spam.
On Tuesday 17 February 2004 03:45, tallison@tacocat.net wrote:
BTW, if you can afford it, I would encourage you to use the razor functions if possible to report spam back to the razor network. The more people that contribute to this, the more effective it becomes at blocking spam.
But can you make a case that razor does anything of value?

My observation is that razor misses way more spam than spamassassin (even spamassassin configured WITHOUT razor). This has led me to the conclusion that razor is something of a leech on the fine work of Bayes analysis and spamassassin's other catches.

Originally Vipul explicitly asked that no automated submissions of spam be fed to razor. He backed off of that for some reason, and now people are feeding razor with all the spam caught by Bayes and spamassassin. Yet in spite of this, razor catches only about 15% of the spam.

You are far better off training your Bayes filters than spending any effort training razor, because after two years of training it's still far behind.

-- 
John Andersen
participants (5)
- Gary
- Jeffrey L. Taylor
- John Andersen
- tallison@tacocat.net
- Tom Allison