Re: [SLE] [9.0] How can i tell if Spamassassin is learning?

20 Aug 2004

      Rikard wrote regarding 'Re: [SLE] [9.0] How can i tell if Spamassassin is learning?' on Fri, Aug 20 at 11:12:
...
On Friday 20 August 2004 17.23, Danny Sauer wrote:
...
Rikard wrote regarding '[SLE] [9.0] How can i tell if Spamassassin is 
learning?' on Fri, Aug 20 at 07:03:
...
Hi all!
How can I determine if SA is actually learning via sa-learn?
I get a message that it processed xx files, but it keeps missing the
same types of mails I have fed it some 10 times...
It only catches approx. 10-20% of the spam I am receiving.
I have a bayes database and its contents change after a sa-learn,
but it still fails to recognize spam.
The Bayesian filter is only part of the weighted score a message gets.  Do
you have long reports enabled?  If not, turn those on and see if the
probability that a message is spam according to the Bayes DB goes up.  You
may also look at the spam score in the headers.  If you're getting a lot
of spam that's scored 4.9, you might move your threshold down to 4 instead
of leaving it at 5...
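To judge whether lowering the threshold would help, it can be useful to tally
the scores of the spam that got through.  The following is only a sketch: the
function names are made up for illustration, and it assumes each message is a
separate file (as in a Maildir) carrying an X-Spam-Status header, in which
older releases wrote "hits=" and newer ones "score=".

```shell
# Print the score from the X-Spam-Status header of each message file given.
# Accepts both spellings: "score=" (newer releases) and "hits=" (older ones).
spam_scores() {
    for msg in "$@"; do
        awk '
            /^$/ { exit }                        # blank line ends the headers
            /^X-Spam-Status:/ {
                for (i = 1; i <= NF; i++)
                    if ($i ~ /^(score|hits)=/) {
                        sub(/^[a-z]+=/, "", $i)  # keep only the number
                        print $i
                        exit
                    }
            }' "$msg"
    done
}

# Count how many scores on stdin are at or above the threshold given as $1.
count_at_least() {
    awk -v t="$1" '$1 + 0 >= t + 0 { n++ } END { print n + 0 }'
}
```

Something like `spam_scores ~/Mail/missed_spam/cur/* | count_at_least 4` would
then show how many of the missed messages a threshold of 4 would have caught.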
Note that the Bayes DB needs to learn from spam *and* ham to work well.
If you haven't trained it with roughly equal amounts of ham and spam,
it's not going to work well.  Also, if it hasn't seen on the order of a
few thousand messages of each kind, it's not going to be working to its
full potential.  It takes time and lots of experience for it to learn,
much like most things. :)
I know that doesn't directly answer your question, but maybe it helps
nonetheless.  If sa-learn says it processed all of those messages and
doesn't throw an error, then it worked.  It will alert you if it doesn't
work.
--Danny
How do I enable "long reports", and where can I read those reports?
In /etc/mail/spamassassin/local.cf, set report_safe to 1 or 2 (1 is more
reasonable, probably), and add a line like
add_header all Report _REPORT_
to that file.  Then you'll get an additional header in every message,
spam or not, reporting on all of the tests applied to it.
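As for where to read the report: SpamAssassin prefixes headers added with
add_header with "X-Spam-", so the line above should produce an X-Spam-Report
header, and the Bayes verdict shows up there as a rule whose name starts with
BAYES_.  A small sketch for pulling that header out of a single message file
follows; the function name is hypothetical, and it assumes one message per
file, as in a Maildir.

```shell
# Print the X-Spam-Report header of one message file, including its folded
# continuation lines (header lines that start with a space or tab).
show_spam_report() {
    awk '
        /^$/ { exit }                               # blank line ends the headers
        /^X-Spam-Report:/ { show = 1; print; next }
        show && /^[ \t]/  { print; next }           # continuation of the report
        { show = 0 }                                # any other header ends it
    ' "$1"
}
```

Running it on the same message before and after a round of sa-learn is one
way to see whether the BAYES_ rule that fires has moved.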
...
The missed spams score between 1.5 and almost 5 (my threshold is set to 5).
I keep teaching SA about once a week.
perldoc Mail::SpamAssassin::Conf is good reading.  You have to have at
least 200 ham and 200 spam messages in bayes before it's even used.  You
may want to make sure you're at that point... :)
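One way to check the counts is "sa-learn --dump magic", which prints the
database's bookkeeping values, including nspam and nham (the number of spam
and ham messages learned so far).  The helper below is a sketch with a
made-up name; it assumes the usual dump layout in which the value is the
third column and the label is the last word on the line.

```shell
# Read `sa-learn --dump magic` output on stdin, print the learned message
# counts, and succeed only if both reach the minimum ($1, default 200).
bayes_ready() {
    awk -v min="${1:-200}" '
        /non-token data:/ && $NF == "nspam" { nspam = $3 }
        /non-token data:/ && $NF == "nham"  { nham  = $3 }
        END {
            printf "nspam=%d nham=%d\n", nspam, nham
            exit !(nspam >= min && nham >= min)   # exit status 0 means ready
        }'
}
```

Used as `sa-learn --dump magic | bayes_ready`, it also gives a direct answer
to the original question: if nspam and nham grow after each sa-learn run,
the database is learning.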
...
I move all missed spam manually to a specific mailfolder and run sa-learn 
manually:
#> sa-learn --spam /home/rikjoh/Mail/missed_spam/cur/
Learned from 710 message(s) (2148 message(s) examined).
I also run "sa-learn --ham" on a couple of folders
(which brings me to a script question: How can I make sa-learn scan all "ham"
folders automatically? There are 103 of them scattered all over my ~/Mail
folder...
(e.g. Mail/.Computer related.directory/QNX/cur,
Mail/.Computer related.directory/.Linux.directory/SuSE/cur etc. etc.))
Just list them all.  IIRC, sa-learn will accept multiple directories as
arguments.  Just do

sa-learn --ham \
	/path/to/dir1/cur \
	/path/to/dir2/cur \
	/path/to/dir3/cur

Or, if they happen to have a common structure, you can just do something
like:
	find /path/to/hamfolders -type d -name cur -exec sa-learn --ham {} +
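
Folder names like "Mail/.Computer related.directory/QNX/cur" contain spaces,
and an unquoted command substitution would split them into separate
arguments.  Letting find hand the directory names to sa-learn itself avoids
that.  Here is a minimal sketch: learn_ham_tree and the SA_LEARN override are
hypothetical names, added only so the command can be swapped out for a dry
run.

```shell
# Pass every Maildir "cur" directory below $1 to sa-learn as ham.
# `-exec ... {} +` gives each directory as one argument, spaces included.
# SA_LEARN is an illustrative override, e.g. SA_LEARN=echo for a dry run.
learn_ham_tree() {
    find "$1" -type d -name cur -exec "${SA_LEARN:-sa-learn}" --ham {} +
}
```

`SA_LEARN=echo learn_ham_tree ~/Mail` only prints the command line that
would be run; dropping the override runs sa-learn for real.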

Probably easier, though, would be to copy your good messages to another
folder, run sa-learn --ham weekly or so, and then clear that folder out.
SpamAssassin will automatically learn from messages that score low enough
or high enough, so there's little reason to train the Bayesian filter a
lot unless you get lots of false positives or false negatives.

--Danny