[9.0] How can i tell if Spamassassin is learning?


Rikard Johnels

20 Aug 2004

12:03

Hi all!

How can I determine whether SA is actually learning via sa-learn? I get a message that it processed xx files, but it keeps missing the same types of mail that I have fed it some 10 times. It only catches approximately 10-20% of the spam I receive. I have a Bayes database, and its contents change after an sa-learn run, but it still fails to recognize spam.

--
/Rikard

Rikard Johnels
email: rikjoh@norweb.se
Web: http://www.rikjoh.com
Mob: +46 735 05 51 01
Public PGP fingerprint: 15 28 DF 78 67 98 B2 16 1F D3 FD C5 59 D4 B6 78 46 1C EE 56


Danny Sauer

20 Aug

15:23


Rikard wrote regarding '[SLE] [9.0] How can i tell if Spamassassin is learning?' on Fri, Aug 20 at 07:03:

...

Hi all!

How can i determine if SA actually is learning via sa-learn? I get a message that it processed xx files but it keeps missing out on the same types of mails i have fed it some 10 times... It only catches approx 10-20% of the spam i am receiving. I have a bayes database and the contents in it changes after a sa-learn, but it still fails to recognize spam.

The Bayesian filter is only part of the weighted score that a message receives. Do you have long reports enabled? If not, turn those on and see whether the probability that a message is spam according to the Bayes DB goes up. You may also look at the spam score in the headers. If you're getting a lot of spam that's scored 4.9, you might move your threshold down to 4 instead of leaving it at 5...

Note that the Bayes DB needs to learn from spam *and* ham to work well. If you haven't trained it with roughly equal amounts of ham and spam, it's not going to work well. Also, if it hasn't seen on the order of a few thousand messages of each kind, it's not going to be working to its full potential. It takes time and lots of experience for it to learn, much like most things. :)

I know that doesn't directly answer your question, but maybe it helps nonetheless. If sa-learn says it processed all of those messages and doesn't throw an error, then it worked. It will alert you if it doesn't work.

--Danny

Rikard Johnels

16:11


On Friday 20 August 2004 17.23, Danny Sauer wrote:

...

Rikard wrote regarding '[SLE] [9.0] How can i tell if Spamassassin is learning?' on Fri, Aug 20 at 07:03:

...
Hi all!

How can i determine if SA actually is learning via sa-learn? I get a message that it processed xx files but it keeps missing out on the same types of mails i have fed it some 10 times... It only catches approx 10-20% of the spam i am receiving. I have a bayes database and the contents in it changes after a sa-learn, but it still fails to recognize spam.

The bayesian filter is only part of the weighted score a spam sees. Do you have long reports enabled? If not, turn those on and see if the probability that a message is spam according to the bayes DB goes up. You may also look at the spam score in the headers. If you're getting a lot of spam that's scored 4.9, you might move your threshold down to 4 instead of leaving it at 5...

Note that the Bayes DB needs to learn from spam *and* ham to work well. If you haven't trained it with roughly equal amounts of ham and spam, it's not going to work well. Also, if it hasn't seen on the order of a few thousand of each message, it's not going to be working to its full potential. It takes time and lots of experience for it to learn, much like most things. :)

I know that doesn't directly answer your question, but maybe it helps none the less. If sa-learn says it processed all of those messages and doesn't throw an error, then it worked. It will alert you if it doesn't work.

--Danny

How do I enable "long reports", and where can I read those reports?

The missed spams score between 1.5 and almost 5 (my threshold is set to 5).

I teach SA about once a week. I move all missed spam manually to a specific mail folder and run sa-learn manually:

#> sa-learn --spam /home/rikjoh/Mail/missed_spam/cur/
Learned from 710 message(s) (2148 message(s) examined).

I also run "sa-learn --ham" on a couple of folders, which brings me to a script question: how can I make sa-learn scan all "ham" folders automatically? There are 103 of them scattered under my ~/Mail folder (e.g. Mail/.Computer related.directory/QNX/cur, Mail/.Computer related.directory/.Linux.directory/SuSE/cur, etc.).

--
/Rikard

Danny Sauer

19:38


Rikard wrote regarding 'Re: [SLE] [9.0] How can i tell if Spamassassin is learning?' on Fri, Aug 20 at 11:12:

...

On Friday 20 August 2004 17.23, Danny Sauer wrote:

...
Rikard wrote regarding '[SLE] [9.0] How can i tell if Spamassassin is learning?' on Fri, Aug 20 at 07:03:

...
Hi all!

How can i determine if SA actually is learning via sa-learn? I get a message that it processed xx files but it keeps missing out on the same types of mails i have fed it some 10 times... It only catches approx 10-20% of the spam i am receiving. I have a bayes database and the contents in it changes after a sa-learn, but it still fails to recognize spam.

The bayesian filter is only part of the weighted score a spam sees. Do you have long reports enabled? If not, turn those on and see if the probability that a message is spam according to the bayes DB goes up. You may also look at the spam score in the headers. If you're getting a lot of spam that's scored 4.9, you might move your threshold down to 4 instead of leaving it at 5...

Note that the Bayes DB needs to learn from spam *and* ham to work well. If you haven't trained it with roughly equal amounts of ham and spam, it's not going to work well. Also, if it hasn't seen on the order of a few thousand of each message, it's not going to be working to its full potential. It takes time and lots of experience for it to learn, much like most things. :)

I know that doesn't directly answer your question, but maybe it helps none the less. If sa-learn says it processed all of those messages and doesn't throw an error, then it worked. It will alert you if it doesn't work.

--Danny

How do i enable "long reports", And where can i read those reports?

In /etc/mail/spamassassin/local.cf, set report_safe to 1 or 2 (1 is probably more reasonable), and add a line like

add_header all Report _REPORT_

to that file. Then you'll get an additional header in every message, spam or not, reporting on all of the tests applied to it.

...

The missed spams vary between 1.5 to almost 5 (my threshold is set to 5) I keep teaching SA about once a week.

perldoc Mail::SpamAssassin::Conf is good reading. You have to have at least 200 ham and 200 spam messages in bayes before it's even used. You may want to make sure you're at that point... :)
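One way to check whether the database has reached that point is to read the learned-message counts out of it. The sketch below assumes that your sa-learn supports "--dump magic" and that its output contains lines ending in "nspam" and "nham" with the count in the third column; verify both against your installed version. The helper names bayes_counts and bayes_ready are made up for illustration and are not part of SpamAssassin.

```shell
# Print "<nspam> <nham>" from `sa-learn --dump magic` output read on stdin.
# Assumption: the count is the third whitespace-separated field of the
# lines whose last field is "nspam" or "nham".
bayes_counts() {
    awk '
        $NF == "nspam" { nspam = $3 }
        $NF == "nham"  { nham  = $3 }
        END { print nspam + 0, nham + 0 }
    '
}

# Succeed only when both counts have reached 200, the minimum
# mentioned in this thread.
bayes_ready() {
    set -- $(bayes_counts)
    [ "$1" -ge 200 ] && [ "$2" -ge 200 ]
}
```

Typical use would be "sa-learn --dump magic | bayes_counts", run as the same user and against the same bayes_path that spamd uses; otherwise you may be inspecting a different database from the one that scores your mail.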

...

I move all missed spam manually to a specific mailfolder and run sa-learn manually:

#> sa-learn --spam /home/rikjoh/Mail/missed_spam/cur/ Learned from 710 message(s) (2148 message(s) examined).

I also run "sa-learn --ham" on a couple of folders (which brings me to a script question: How can i make sa-learn scan all "ham" folders automaticly? There are 103 of them scattered all under my ~/Mail folder... (eg. Mail/.Computer related.directory/QNX/cur, Mail/.Computer related.directory/.Linux.directory/SuSE/cur etc. etc.))

Just list them all. IIRC, sa-learn will accept multiple directories as arguments. Just do

sa-learn --ham \
  /path/to/dir1/cur \
  /path/to/dir2/cur \
  /path/to/dir3/cur

Or, if they happen to have a common structure, you can just do something like:

sa-learn --ham `find /path/to/hamfolders -type d -name cur`

Probably easier, though, would be to copy your good messages to another folder, run sa-learn --ham weekly or so, and then clear that folder out. It'll automatically learn from messages that score low enough and high enough, so there's little reason to train the bayesian filter a lot unless you get lots of false positives or false negatives.

--Danny
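One caveat with the backtick form: the shell splits find's output on whitespace, so a folder such as "Mail/.Computer related.directory/QNX/cur" would arrive at sa-learn as two broken arguments. A minimal sketch that avoids this, assuming a find and xargs that support -print0 and -0 (GNU and BSD versions do); learn_ham_dirs is a hypothetical helper name:

```shell
# Pass every directory named "cur" under a root to a learner command.
# NUL-separated names survive spaces in folder names.
learn_ham_dirs() {
    root=$1
    shift
    # Any remaining arguments are the command to run instead of the
    # default, which is handy for a dry run with printf.
    if [ "$#" -eq 0 ]; then
        set -- sa-learn --ham
    fi
    find "$root" -type d -name cur -print0 | xargs -0 "$@"
}
```

A dry run such as learn_ham_dirs "$HOME/Mail" printf '%s\n' lists the directories that would be learned from; learn_ham_dirs "$HOME/Mail" then does the real thing. Note that this picks up every "cur" directory under the root, including a spam folder if it lives there, so point it at a tree that holds only ham.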

Jim Sabatke

16:26


Rikard Johnels wrote:

...

Hi all!

How can i determine if SA actually is learning via sa-learn? I get a message that it processed xx files but it keeps missing out on the same types of mails i have fed it some 10 times... It only catches approx 10-20% of the spam i am receiving. I have a bayes database and the contents in it changes after a sa-learn, but it still fails to recognize spam.

I'm sure you can get some fine help here, but you are more likely to find useful answers on the spamassassin list. It is very active and responsive.

Good luck!

--
Jim Sabatke
Hire Me!! - See my resume at http://my.execpc.com/~jsabatke

Do not meddle in the affairs of Dragons, for you are crunchy and good with ketchup.

NOTE: Please do not email me any attachments with Microsoft extensions. They are deleted on my ISP's server before I ever see them, and no bounce message is sent.

Steve King

18:33


On Friday 20 August 2004 13:03, Rikard Johnels wrote:

...

Hi all!

How can i determine if SA actually is learning via sa-learn? I get a message that it processed xx files but it keeps missing out on the same types of mails i have fed it some 10 times... It only catches approx 10-20% of the spam i am receiving. I have a bayes database and the contents in it changes after a sa-learn, but it still fails to recognize spam.

I've been using spamassassin for some time. It has always been pretty good at trapping spam and at not incorrectly trapping good mail. But the "learning" part of it seemed to kick in only after several months' use. I think I read somewhere that it waits until it has analysed quite a lot of mail and then starts to act on its learning. When it does act on its learning, you'll see things like "BAYES_99 BODY: Bayesian spam probability is 99 to 100% [score: 1.0000]" in your trapped email.

Steve
Dundee, UK

Rikard Johnels

22:16


On Friday 20 August 2004 20.33, Steve King wrote:

...

On Friday 20 August 2004 13:03, Rikard Johnels wrote:

...
Hi all!

How can i determine if SA actually is learning via sa-learn? I get a message that it processed xx files but it keeps missing out on the same types of mails i have fed it some 10 times... It only catches approx 10-20% of the spam i am receiving. I have a bayes database and the contents in it changes after a sa-learn, but it still fails to recognize spam.

I've been using spamassassin for sometime. It has always been pretty good at trapping spam and at not incorrectly trapping good mail. But the "learning" part of it seemed to kick in after several months' use. I think I read somewhere that it waits until it has analysed quite a lot of mail and then starts to act on its learning. When it does act on its learning, you'll see things like "BAYES_99 BODY: Bayesian spam probability is 99 to 100% [score: 1.0000]" in your trapped email.

Steve
Dundee, UK

Weird! All of a sudden there is no BAYES_99 check. I HAD it before. spamd is running:

5408 ? S 0:04 /usr/bin/spamd -d

Hmm... And what's more, I have set auto_learn 1, and yet:

X-Spam-Status: No, hits=3.1 required=5.0 tests=NO_REAL_NAME,
  UNWANTED_LANGUAGE_BODY autolearn=no version=2.61

(autolearn=0)

--
/Rikard

Carlos E. R.

23:34


The Saturday 2004-08-21 at 00:16 +0200, Rikard Johnels wrote:

...

and yet: X-Spam-Status: No, hits=3.1 required=5.0 tests=NO_REAL_NAME, UNWANTED_LANGUAGE_BODY autolearn=no version=2.61

(autolearn=0)

Autolearn triggers only above or below certain score thresholds, that is, when SA is already practically sure the message is spam or ham. A score of 3.1 is too ambiguous for that purpose.

--
Cheers,
Carlos Robinson

Rikard Johnels

21 Aug

12:16


On Saturday 21 August 2004 01.34, Carlos E. R. wrote:

...

The Saturday 2004-08-21 at 00:16 +0200, Rikard Johnels wrote:

...
and yet: X-Spam-Status: No, hits=3.1 required=5.0 tests=NO_REAL_NAME, UNWANTED_LANGUAGE_BODY autolearn=no version=2.61

(autolearn=0)

Autolearn triggers only above and below certain values, that is, when it is already absolutely sure it is spam or ham. 3.1 is doubtful for that purpose.

-- Cheers, Carlos Robinson

I was referring to the "autolearn=no"

--
/Rikard

Carlos E. R.

22:06


The Saturday 2004-08-21 at 14:16 +0200, Rikard Johnels wrote:

...

...
Autolearn triggers only above and below certain values, that is, when it is already absolutely sure it is spam or ham. 3.1 is doubtful for that purpose.

I was referring to the "autolearn=no"

Me too. It is correct. See this header of an email that was not detected as spam:

X-Spam-Status: No, hits=3.2 required=5.0 tests=BAYES_80,BIZ_TLD autolearn=no version=2.63

and then see the header from your own email:

X-Spam-Status: No, hits=-4.9 required=5.0 tests=AWL,BAYES_00 autolearn=ham version=2.63

Yours triggers the autolearn as 'ham'. Now see this one, which is clearly detected as spam:

X-Spam-Status: Yes, hits=13.8 required=5.0 tests=BAYES_99,DATE_SPAMWARE_Y2K,
  FORGED_MUA_OUTLOOK,MISSING_MIMEOLE autolearn=no version=2.63

See? It doesn't trigger. Now, one that does trigger:

X-Spam-Status: Yes, hits=34.1 required=5.0 tests=BANG_EXERCISE,BANG_GUARANTEE,
  BAYES_99,CLICK_BELOW_CAPS,DATE_SPAMWARE_Y2K,FORGED_MUA_OUTLOOK,
  FORGED_OUTLOOK_TAGS,FROM_ENDS_IN_NUMS,GUARANTEED_STUFF,HTML_60_70,
  HTML_FONTCOLOR_BLUE,HTML_FONTCOLOR_RED,HTML_FONT_BIG,
  HTML_IMAGE_ONLY_10,HTML_LINK_CLICK_CAPS,HTML_LINK_CLICK_HERE,
  HTML_MESSAGE,IMPOTENCE,MIME_HTML_NO_CHARSET,MIME_HTML_ONLY,
  MIME_HTML_ONLY_MULTI,MISSING_MIMEOLE,MONEY_BACK,NO_COST,PENIS_ENLARGE,
  PENIS_ENLARGE2 autolearn=spam version=2.63

Does this clarify my statement from yesterday?

--
Cheers,
Carlos Robinson
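To see how often autolearn actually fires on your own mail, you can tally the autolearn= value across a mail folder. This is only a sketch: it assumes one message per file (Maildir style) and that the X-Spam-Status header looks like the examples in this thread. autolearn_tally is a made-up helper name. It reads only the header block of each file, so quoted headers in a message body (as in this very thread) are not counted.

```shell
# Count autolearn=no / autolearn=ham / autolearn=spam over all message
# files under a directory, looking at message headers only.
autolearn_tally() {
    find "$1" -type f -exec awk '
        FNR == 1 { in_header = 1 }              # new file: headers start
        in_header && /^$/ { in_header = 0 }     # blank line ends the headers
        in_header && match($0, /autolearn=[a-z]+/) {
            print substr($0, RSTART, RLENGTH)
            in_header = 0                       # at most one value per message
        }
    ' {} + | sort | uniq -c
}
```

Running autolearn_tally ~/Mail/missed_spam/cur prints one line per distinct value with its count in front. A folder where nearly everything says autolearn=no is consistent with the explanation above: those messages scored in the uncertain middle range.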

Rikard Johnels

23 Aug

11:04


On Sunday 22 August 2004 00.06, Carlos E. R. wrote:

...


Carlos's mail gives:

X-Spam-Status: No, hits=1.8 tagged_above=-20.0 required=5.0 tests=BAYES_44,
  RCVD_IN_NJABL_DUL, RCVD_IN_SORBS_DUL, UPPERCASE_25_50

A caught spam (10.4 points, 5.0 required) gives:

X-Spam-Status: Yes, hits=10.4 required=5.0 tests=CLICK_BELOW,
  FROM_ENDS_IN_NUMS,FROM_OFFERS,HTML_60_70,HTML_IMAGE_ONLY_12,
  HTML_IMAGE_RATIO_06,HTML_LINK_CLICK_HERE,HTML_MESSAGE,
  HTML_TAG_BALANCE_HTML,NO_OBLIGATION,URI_OFFERS autolearn=no version=2.61

So it seems my spamd isn't 20-20 :( How can I check it further?

--
/Rikard

Carlos E. R.

22:51


The Monday 2004-08-23 at 13:04 +0200, Rikard Johnels wrote:

...

Carlos mail gives: X-Spam-Status: No, hits=1.8 tagged_above=-20.0 required=5.0 tests=BAYES_44, RCVD_IN_NJABL_DUL, RCVD_IN_SORBS_DUL, UPPERCASE_25_50

On my system, that one gives:

X-Spam-Status: No, hits=-4.8 required=5.0 tests=AWL,BAYES_00,UPPERCASE_25_50 autolearn=no version=2.63

The differences are:

1) You have enabled some "online" tests with real-time blacklists (I think).
2) Your Bayes filter is not trained very well, as it is giving my mail a spam probability of about 44 percent, and I'm not a spammer ;-)

You should retrain your Bayesian filter with mail that you know is spam. For that, I simply delete the '.spamassassin/bayes*' files and then run sa-learn again; in my case, I do:

time nice sa-learn --showdots --spam --mbox Mail/file/z_spam_unrecog && \
time nice sa-learn --rebuild

You'd have to change the mailbox location, at least, of course.

--
Cheers,
Carlos Robinson

Rikard Johnels

24 Aug

04:18


On Tuesday 24 August 2004 00.51, Carlos E. R. wrote:

...

The Monday 2004-08-23 at 13:04 +0200, Rikard Johnels wrote:

...
Carlos mail gives: X-Spam-Status: No, hits=1.8 tagged_above=-20.0 required=5.0 tests=BAYES_44, RCVD_IN_NJABL_DUL, RCVD_IN_SORBS_DUL, UPPERCASE_25_50

On my system, that one gives:

X-Spam-Status: No, hits=-4.8 required=5.0 tests=AWL,BAYES_00,UPPERCASE_25_50 autolearn=no version=2.63

The differences are: 1) You have enabled some "online" tests with real-time blacklists (I think). 2) Your Bayes filter is not trained very well, as it is giving it a 44 percentage, and I'm not a spammer ;-)

You should retrain your Bayesian filter with mail that you know is spam. For that, I simply delete the '.spamassassin/bayes*' files, and then I run sa-learn again; in my case, I do:

And some 700 ham.

time nice sa-learn --showdots --spam --mbox Mail/file/z_spam_unrecog && \
time nice sa-learn --rebuild

You'd have to change the mailbox location, at least, of course.

-- Cheers, Carlos Robinson

This is the config I use.

# SpamAssassin config file for version 2.5x
# generated by http://www.yrex.com/spam/spamconfig.php (version 1.01)
required_hits 5.0
rewrite_subject 0
subject_tag *****SPAM*****
report_safe 1
use_terse_report 0
use_bayes 1
auto_learn 1
bayes_path /home/bayesdatabase/bayes
skip_rbl_checks 0
use_razor2 1
use_dcc 1
use_pyzor 1
ok_languages en sv
ok_locales en
add_header all Report _REPORT_

I ran about 1000 spam and 700 ham through sa-learn in total.

I'll reset the database today and see if anything changes. But I am still worried by the fact that Bayesian checks are missing so often...

--
/Rikard

Carlos E. R.

11:23


The Tuesday 2004-08-24 at 06:18 +0200, Rikard Johnels wrote:

...

This is the config i use.

# SpamAssassin config file for version 2.5x # generated by http://www.yrex.com/spam/spamconfig.php (version 1.01) required_hits 5.0 rewrite_subject 0 subject_tag *****SPAM***** report_safe 1 use_terse_report 0 use_bayes 1 auto_learn 1 bayes_path /home/bayesdatabase/bayes skip_rbl_checks 0 use_razor2 1 use_dcc 1 use_pyzor 1 ok_languages en sv ok_locales en add_header all Report _REPORT_

My configuration file (/home/cer/.spamassassin/user_prefs) is almost empty; I use the global defaults (/etc/mail/spamassassin/local.cf). I only keep 'whitelist' statements there. And my '/etc/mail/spamassassin/local.cf' only has:

use_terse_report 1
report_safe 1

System configuration is in /usr/share/spamassassin/*, I think. I have not touched it.

...

I ran about 1000 spam and 700 ham thru sa-learn in total

I'll reset the database today, and see if anything changes. But i am still worried by the fact that bayesian checks are missing so often...

I don't know. It depends on who/what is triggering the spamassassin filter. How are you doing it? A procmail recipe, amavisd-new, what?

--
Cheers,
Carlos Robinson
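While narrowing that down, it may help to put a number on how often the Bayes test is missing. The sketch below lists the message files whose X-Spam-Status header names no BAYES_nn test; it assumes one message per file and a header shaped like the examples in this thread, and both function names are hypothetical. Messages with no X-Spam-Status header at all are listed too, since they carry no Bayes result either.

```shell
# Succeed if the X-Spam-Status header of message file $1 names a BAYES_ test.
has_bayes_test() {
    awk '
        /^$/ { exit }                                        # end of headers
        /^X-Spam-Status:/ { collecting = 1; status = $0; next }
        collecting && /^[ \t]/ { status = status $0; next }  # folded header line
        collecting { collecting = 0 }
        END { exit ((status ~ /BAYES_[0-9]+/) ? 0 : 1) }
    ' "$1"
}

# Print every message file under directory $1 that lacks a BAYES_ test.
list_without_bayes() {
    find "$1" -type f | while IFS= read -r msg; do
        has_bayes_test "$msg" || printf '%s\n' "$msg"
    done
}
```

Comparing the result for mail delivered through spamd with the result for a message scanned by hand as your own user could hint at whether the two are reading the same Bayes database.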
