[9.0] How can I tell if SpamAssassin is learning?
Hi all!
How can I determine whether SA is actually learning via sa-learn? I get a message that it processed xx files, but it keeps missing the same types of mail I have fed it some 10 times... It only catches approx. 10-20% of the spam I am receiving. I have a Bayes database and its contents change after an sa-learn run, but it still fails to recognize spam.
--
/Rikard
Rikard Johnels
email : rikjoh@norweb.se
Web : http://www.rikjoh.com
Mob : +46 735 05 51 01
Public PGP fingerprint: < 15 28 DF 78 67 98 B2 16 1F D3 FD C5 59 D4 B6 78 46 1C EE 56 >
Rikard wrote regarding '[SLE] [9.0] How can i tell if Spamassassin is learning?' on Fri, Aug 20 at 07:03:
Hi all!
How can i determine if SA actually is learning via sa-learn? I get a message that it processed xx files but it keeps missing out on the same types of mails i have fed it some 10 times... It only catches approx 10-20% of the spam i am receiving. I have a bayes database and the contents in it changes after a sa-learn, but it still fails to recognize spam.
The Bayesian filter is only part of the weighted score a message receives. Do you have long reports enabled? If not, turn those on and see whether the probability that a message is spam according to the Bayes DB goes up. You may also look at the spam score in the headers. If you're getting a lot of spam that's scored 4.9, you might move your threshold down to 4 instead of leaving it at 5...
Note that the Bayes DB needs to learn from spam *and* ham to work well. If you haven't trained it with roughly equal amounts of ham and spam, it's not going to work well. Also, if it hasn't seen on the order of a few thousand messages of each kind, it's not going to be working to its full potential. It takes time and lots of experience for it to learn, much like most things. :)
I know that doesn't directly answer your question, but maybe it helps nonetheless. If sa-learn says it processed all of those messages and doesn't throw an error, then it worked. It will alert you if it doesn't work.
--Danny
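One way to see whether a training run actually added anything is to compare the database's message counters before and after running sa-learn. The sketch below assumes that the installed sa-learn supports `--dump magic` and that the counters appear in the third column of the lines ending in "non-token data: nspam" and "non-token data: nham". The helper name bayes_counts and that column layout are assumptions, so check them against your own output first.

```shell
# bayes_counts: hypothetical helper, not part of SpamAssassin.
# Reads the text printed by `sa-learn --dump magic` on stdin and prints
# "<nspam> <nham>". Assumes the counter sits in the third column.
bayes_counts() {
    awk '
        /non-token data: nspam$/ { spam = $3 }
        /non-token data: nham$/  { ham = $3 }
        END { print spam + 0, ham + 0 }
    '
}

# Usage (run once before and once after training, then compare):
#   sa-learn --dump magic | bayes_counts
```

If both numbers stay the same after a run that reported "Learned from N message(s)", the run probably wrote to a different database than the one being inspected.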
On Friday 20 August 2004 17.23, Danny Sauer wrote:
Rikard wrote regarding '[SLE] [9.0] How can i tell if Spamassassin is learning?' on Fri, Aug 20 at 07:03:
Hi all!
How can i determine if SA actually is learning via sa-learn? I get a message that it processed xx files but it keeps missing out on the same types of mails i have fed it some 10 times... It only catches approx 10-20% of the spam i am receiving. I have a bayes database and the contents in it changes after a sa-learn, but it still fails to recognize spam.
The Bayesian filter is only part of the weighted score a message receives. Do you have long reports enabled? If not, turn those on and see whether the probability that a message is spam according to the Bayes DB goes up. You may also look at the spam score in the headers. If you're getting a lot of spam that's scored 4.9, you might move your threshold down to 4 instead of leaving it at 5...
Note that the Bayes DB needs to learn from spam *and* ham to work well. If you haven't trained it with roughly equal amounts of ham and spam, it's not going to work well. Also, if it hasn't seen on the order of a few thousand messages of each kind, it's not going to be working to its full potential. It takes time and lots of experience for it to learn, much like most things. :)
I know that doesn't directly answer your question, but maybe it helps nonetheless. If sa-learn says it processed all of those messages and doesn't throw an error, then it worked. It will alert you if it doesn't work.
--Danny
How do I enable "long reports", and where can I read those reports?
The missed spams score between 1.5 and almost 5 (my threshold is set to 5).
I keep teaching SA about once a week. I move all missed spam manually to a specific mail folder and run sa-learn manually:
#> sa-learn --spam /home/rikjoh/Mail/missed_spam/cur/
Learned from 710 message(s) (2148 message(s) examined).
I also run "sa-learn --ham" on a couple of folders, which brings me to a script question: how can I make sa-learn scan all "ham" folders automatically? There are 103 of them scattered under my ~/Mail folder (e.g. Mail/.Computer related.directory/QNX/cur, Mail/.Computer related.directory/.Linux.directory/SuSE/cur etc.).
--
/Rikard
Rikard wrote regarding 'Re: [SLE] [9.0] How can i tell if Spamassassin is learning?' on Fri, Aug 20 at 11:12:
How do i enable "long reports", And where can i read those reports?
In /etc/mail/spamassassin/local.cf, set report_safe to 1 or 2 (1 is more reasonable, probably), and add a line like
add_header all Report _REPORT_
to that file. Then you'll get an additional header in all of your spams and non-spams reporting on all of the tests applied to the message, whether it's marked as spam or not.
The missed spams vary between 1.5 to almost 5 (my threshold is set to 5) I keep teaching SA about once a week.
perldoc Mail::SpamAssassin::Conf is good reading. You have to have at least 200 ham and 200 spam messages in Bayes before it's even used. You may want to make sure you're at that point... :)
I move all missed spam manually to a specific mailfolder and run sa-learn manually:
#> sa-learn --spam /home/rikjoh/Mail/missed_spam/cur/
Learned from 710 message(s) (2148 message(s) examined).
I also run "sa-learn --ham" on a couple of folders (which brings me to a script question: How can i make sa-learn scan all "ham" folders automaticly? There are 103 of them scattered all under my ~/Mail folder... (eg. Mail/.Computer related.directory/QNX/cur, Mail/.Computer related.directory/.Linux.directory/SuSE/cur etc. etc.))
Just list them all. IIRC, sa-learn will accept multiple directories as arguments. Just do
sa-learn --ham \
  /path/to/dir1/cur \
  /path/to/dir2/cur \
  /path/to/dir3/cur
Or, if they happen to have a common structure, you can just do something like:
sa-learn --ham `find /path/to/hamfolders -type d -name cur`
Probably easier, though, would be to copy your good messages to another folder, run sa-learn --ham weekly or so, and then clear that folder out. It'll automatically learn from messages that score low enough and high enough, so there's little reason to train the Bayesian filter a lot unless you get lots of false positives or false negatives.
--Danny
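The backtick form above splits on whitespace, so folder names such as "Mail/.Computer related.directory" would be passed to sa-learn in pieces. A minimal sketch of a whitespace-safe variant follows. The helper name list_ham_dirs and the idea of excluding the spam folder by name are assumptions for illustration, and it relies on `find -print0` and `xargs -0` being available, as they are with the GNU tools.

```shell
# list_ham_dirs: hypothetical helper, not part of SpamAssassin.
# Prints every Maildir "cur" directory below a mail root, NUL-separated,
# skipping anything inside the folder that holds missed spam.
list_ham_dirs() {
    mail_root=$1      # e.g. "$HOME/Mail"
    spam_folder=$2    # e.g. "missed_spam"
    find "$mail_root" -type d -name cur ! -path "*/$spam_folder/*" -print0
}

# Usage: hand the directories to sa-learn without word splitting.
# -r (GNU xargs) skips the run if no directory was found.
#   list_ham_dirs "$HOME/Mail" missed_spam | xargs -0 -r sa-learn --ham
```

Excluding the spam folder matters here because learning the same messages as both ham and spam would undo the earlier training.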
Rikard Johnels wrote:
Hi all!
How can i determine if SA actually is learning via sa-learn? I get a message that it processed xx files but it keeps missing out on the same types of mails i have fed it some 10 times... It only catches approx 10-20% of the spam i am receiving. I have a bayes database and the contents in it changes after a sa-learn, but it still fails to recognize spam.
I'm sure you can get some fine help here, but you are more likely to find useful answers on the spamassassin list. It is very active and responsive. Good luck!
--
Jim Sabatke
Hire Me!! - See my resume at http://my.execpc.com/~jsabatke
Do not meddle in the affairs of Dragons, for you are crunchy and good with ketchup.
NOTE: Please do not email me any attachments with Microsoft extensions. They are deleted on my ISP's server before I ever see them, and no bounce message is sent.
On Friday 20 August 2004 13:03, Rikard Johnels wrote:
Hi all!
How can i determine if SA actually is learning via sa-learn? I get a message that it processed xx files but it keeps missing out on the same types of mails i have fed it some 10 times... It only catches approx 10-20% of the spam i am receiving. I have a bayes database and the contents in it changes after a sa-learn, but it still fails to recognize spam.
I've been using SpamAssassin for some time. It has always been pretty good at trapping spam and at not incorrectly trapping good mail. But the "learning" part of it seemed to kick in only after several months' use. I think I read somewhere that it waits until it has analysed quite a lot of mail and then starts to act on its learning. When it does act on its learning, you'll see things like "BAYES_99 BODY: Bayesian spam probability is 99 to 100% [score: 1.0000]" in your trapped email.
Steve
Dundee, UK
On Friday 20 August 2004 20.33, Steve King wrote:
On Friday 20 August 2004 13:03, Rikard Johnels wrote:
Hi all!
How can i determine if SA actually is learning via sa-learn? I get a message that it processed xx files but it keeps missing out on the same types of mails i have fed it some 10 times... It only catches approx 10-20% of the spam i am receiving. I have a bayes database and the contents in it changes after a sa-learn, but it still fails to recognize spam.
I've been using spamassassin for sometime. It has always been pretty good at trapping spam and at not incorrectly trapping good mail. But the "learning" part of it seemed to kick in after several months' use. I think I read somewhere that it waits until it has analysed quite a lot of mail and then starts to act on its learning. When it does act on its learning, you'll see things like "BAYES_99 BODY: Bayesian spam probability is 99 to 100% [score: 1.0000]" in your trapped email.
Steve
Dundee, UK
Weird...! All of a sudden there is no BAYES_99 check. I HAD it before. spamd is running:
5408 ? S 0:04 /usr/bin/spamd -d
Hmm... And what's more, I have set auto_learn 1 and yet:
X-Spam-Status: No, hits=3.1 required=5.0 tests=NO_REAL_NAME,
  UNWANTED_LANGUAGE_BODY autolearn=no version=2.61
(autolearn=0)
--
/Rikard
The Saturday 2004-08-21 at 00:16 +0200, Rikard Johnels wrote:
and yet: X-Spam-Status: No, hits=3.1 required=5.0 tests=NO_REAL_NAME, UNWANTED_LANGUAGE_BODY autolearn=no version=2.61
(autolearn=0)
Autolearn triggers only above and below certain values, that is, when it is already absolutely sure it is spam or ham. 3.1 is doubtful for that purpose. -- Cheers, Carlos Robinson
On Saturday 21 August 2004 01.34, Carlos E. R. wrote:
The Saturday 2004-08-21 at 00:16 +0200, Rikard Johnels wrote:
and yet: X-Spam-Status: No, hits=3.1 required=5.0 tests=NO_REAL_NAME, UNWANTED_LANGUAGE_BODY autolearn=no version=2.61
(autolearn=0)
Autolearn triggers only above and below certain values, that is, when it is already absolutely sure it is spam or ham. 3.1 is doubtful for that purpose.
-- Cheers, Carlos Robinson
I was referring to the "autolearn=no"
--
/Rikard
The Saturday 2004-08-21 at 14:16 +0200, Rikard Johnels wrote:
Autolearn triggers only above and below certain values, that is, when it is already absolutely sure it is spam or ham. 3.1 is doubtful for that purpose.
I was referring to the "autolearn=no"
Me too. It is correct. See this header of an email that was not detected as spam:
X-Spam-Status: No, hits=3.2 required=5.0 tests=BAYES_80,BIZ_TLD autolearn=no version=2.63
and then see the header from your own email:
X-Spam-Status: No, hits=-4.9 required=5.0 tests=AWL,BAYES_00 autolearn=ham version=2.63
Yours triggers the autolearn as 'ham'. Now see this one that is clearly detected as spam:
X-Spam-Status: Yes, hits=13.8 required=5.0 tests=BAYES_99,DATE_SPAMWARE_Y2K,
  FORGED_MUA_OUTLOOK,MISSING_MIMEOLE autolearn=no version=2.63
See? It doesn't trigger. Now, one that does trigger:
X-Spam-Status: Yes, hits=34.1 required=5.0 tests=BANG_EXERCISE,BANG_GUARANTEE,
  BAYES_99,CLICK_BELOW_CAPS,DATE_SPAMWARE_Y2K,FORGED_MUA_OUTLOOK,
  FORGED_OUTLOOK_TAGS,FROM_ENDS_IN_NUMS,GUARANTEED_STUFF,HTML_60_70,
  HTML_FONTCOLOR_BLUE,HTML_FONTCOLOR_RED,HTML_FONT_BIG,
  HTML_IMAGE_ONLY_10,HTML_LINK_CLICK_CAPS,HTML_LINK_CLICK_HERE,
  HTML_MESSAGE,IMPOTENCE,MIME_HTML_NO_CHARSET,MIME_HTML_ONLY,
  MIME_HTML_ONLY_MULTI,MISSING_MIMEOLE,MONEY_BACK,NO_COST,PENIS_ENLARGE,
  PENIS_ENLARGE2 autolearn=spam version=2.63
Does this clarify my statement from yesterday?
--
Cheers, Carlos Robinson
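Comparing headers like these by eye gets tedious, so the three fields discussed here (the score, the Bayes bucket, and the autolearn result) can be pulled out with standard tools. This is only a sketch: spam_status_summary is a hypothetical helper name, and it assumes the header uses the hits=, BAYES_nn and autolearn= spellings seen in this thread.

```shell
# spam_status_summary: hypothetical helper, not part of SpamAssassin.
# Reads one X-Spam-Status header on stdin (folded continuation lines
# are fine) and prints the score, the Bayes rule and the autolearn value.
spam_status_summary() {
    header=$(cat)
    hits=$(printf '%s\n' "$header" | sed -n 's/.*hits=\([-0-9.]*\).*/\1/p')
    bayes=$(printf '%s\n' "$header" | grep -o 'BAYES_[0-9]*' | head -n 1)
    autolearn=$(printf '%s\n' "$header" | sed -n 's/.*autolearn=\([a-z]*\).*/\1/p')
    # "none" means no BAYES_ rule fired; "unset" means no autolearn field.
    printf 'hits=%s bayes=%s autolearn=%s\n' \
        "${hits:-unknown}" "${bayes:-none}" "${autolearn:-unset}"
}

# Usage:
#   grep -A3 '^X-Spam-Status:' some_message | spam_status_summary
```

A result of bayes=none on most messages would suggest that the Bayes rules are not running at all, which is a different problem from a poorly trained database.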
On Sunday 22 August 2004 00.06, Carlos E. R. wrote:
X-Spam-Status: No, hits=1.8 tagged_above=-20.0 required=5.0 tests=BAYES_44, RCVD_IN_NJABL_DUL, RCVD_IN_SORBS_DUL, UPPERCASE_25_50
Carlos's mail gives:
X-Spam-Status: No, hits=1.8 tagged_above=-20.0 required=5.0 tests=BAYES_44,
  RCVD_IN_NJABL_DUL, RCVD_IN_SORBS_DUL, UPPERCASE_25_50
A caught spam (10.4 points, 5.0 required) gives:
X-Spam-Status: Yes, hits=10.4 required=5.0 tests=CLICK_BELOW,
  FROM_ENDS_IN_NUMS,FROM_OFFERS,HTML_60_70,HTML_IMAGE_ONLY_12,
  HTML_IMAGE_RATIO_06,HTML_LINK_CLICK_HERE,HTML_MESSAGE,
  HTML_TAG_BALANCE_HTML,NO_OBLIGATION,URI_OFFERS autolearn=no version=2.61
So it seems my spamd isn't 20-20 :( How can I check it further?
--
/Rikard
The Monday 2004-08-23 at 13:04 +0200, Rikard Johnels wrote:
Carlos mail gives: X-Spam-Status: No, hits=1.8 tagged_above=-20.0 required=5.0 tests=BAYES_44, RCVD_IN_NJABL_DUL, RCVD_IN_SORBS_DUL, UPPERCASE_25_50
On my system, that one gives:
X-Spam-Status: No, hits=-4.8 required=5.0 tests=AWL,BAYES_00,UPPERCASE_25_50 autolearn=no version=2.63
The differences are:
1) You have enabled some "online" tests with real-time blacklists (I think).
2) Your Bayes filter is not trained very well, as it is giving my mail a 44 percent spam probability, and I'm not a spammer ;-)
You should retrain your Bayesian filter with mail that you know is spam. For that, I simply delete the '.spamassassin/bayes*' files, and then I run sa-learn again; in my case, I do:
time nice sa-learn --showdots --spam --mbox Mail/file/z_spam_unrecog && \
time nice sa-learn --rebuild
You'd have to change the mailbox location, at least, of course.
--
Cheers, Carlos Robinson
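Deleting the database files is irreversible, so a more cautious variant of the reset described above is to move them aside first. The sketch below does only that. The helper name archive_bayes_db and the backup location are assumptions, and the file prefix depends on the bayes_path setting in use ('.spamassassin/bayes' is the location mentioned in this message).

```shell
# archive_bayes_db: hypothetical helper, not part of SpamAssassin.
# Moves every file whose name starts with the Bayes prefix into a backup
# directory. The backup directory must not itself start with that prefix.
archive_bayes_db() {
    db_prefix=$1      # e.g. "$HOME/.spamassassin/bayes"
    backup_dir=$2     # e.g. "$HOME/sa-backup" (hypothetical location)
    mkdir -p "$backup_dir" || return 1
    for f in "$db_prefix"*; do
        [ -e "$f" ] || continue          # the glob matched nothing
        mv "$f" "$backup_dir"/ || return 1
    done
}

# Usage, followed by retraining on both kinds of mail:
#   archive_bayes_db "$HOME/.spamassassin/bayes" "$HOME/sa-backup"
#   sa-learn --showdots --spam --mbox Mail/file/z_spam_unrecog
#   sa-learn --showdots --ham /path/to/ham/cur    # hypothetical path
#   sa-learn --rebuild
```

Retraining with ham as well as spam follows the earlier advice in this thread that the filter needs both kinds of mail to work well.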
On Tuesday 24 August 2004 00.51, Carlos E. R. wrote:
The Monday 2004-08-23 at 13:04 +0200, Rikard Johnels wrote:
Carlos mail gives: X-Spam-Status: No, hits=1.8 tagged_above=-20.0 required=5.0 tests=BAYES_44, RCVD_IN_NJABL_DUL, RCVD_IN_SORBS_DUL, UPPERCASE_25_50
On my system, that one gives:
X-Spam-Status: No, hits=-4.8 required=5.0 tests=AWL,BAYES_00,UPPERCASE_25_50 autolearn=no version=2.63
The differences are:
1) You have enabled some "online" tests with real-time blacklists (I think).
2) Your Bayes filter is not trained very well, as it is giving my mail a 44 percent spam probability, and I'm not a spammer ;-)
You should retrain your Bayesian filter with mail that you know is spam. For that, I simply delete the '.spamassassin/bayes*' files, and then I run sa-learn again; in my case, I do:
And some 700 ham.
time nice sa-learn --showdots --spam --mbox Mail/file/z_spam_unrecog && \
time nice sa-learn --rebuild
You'd have to change the mailbox location, at least, of course.
-- Cheers, Carlos Robinson
This is the config I use.
# SpamAssassin config file for version 2.5x
# generated by http://www.yrex.com/spam/spamconfig.php (version 1.01)
required_hits 5.0
rewrite_subject 0
subject_tag *****SPAM*****
report_safe 1
use_terse_report 0
use_bayes 1
auto_learn 1
bayes_path /home/bayesdatabase/bayes
skip_rbl_checks 0
use_razor2 1
use_dcc 1
use_pyzor 1
ok_languages en sv
ok_locales en
add_header all Report _REPORT_
I ran about 1000 spam and 700 ham through sa-learn in total.
I'll reset the database today and see if anything changes. But I am still worried by the fact that the Bayesian checks are missing so often...
--
/Rikard
The Tuesday 2004-08-24 at 06:18 +0200, Rikard Johnels wrote:
This is the config i use.
# SpamAssassin config file for version 2.5x
# generated by http://www.yrex.com/spam/spamconfig.php (version 1.01)
required_hits 5.0
rewrite_subject 0
subject_tag *****SPAM*****
report_safe 1
use_terse_report 0
use_bayes 1
auto_learn 1
bayes_path /home/bayesdatabase/bayes
skip_rbl_checks 0
use_razor2 1
use_dcc 1
use_pyzor 1
ok_languages en sv
ok_locales en
add_header all Report _REPORT_
My configuration file (/home/cer/.spamassassin/user_prefs) is almost empty; I use the global defaults (/etc/mail/spamassassin/local.cf). I only keep 'whitelist' statements there. And my '/etc/mail/spamassassin/local.cf' only has:
use_terse_report 1
report_safe 1
System configuration is in /usr/share/spamassassin/*, I think. I have not touched it.
I ran about 1000 spam and 700 ham thru sa-learn in total
I'll reset the database today, and see if anything changes. But i am still worried by the fact that bayesian checks are missing so often...
I don't know. It depends on who/what is triggering the spamassassin filter. How are you doing it? A procmail recipe, amavis-new, what? -- Cheers, Carlos Robinson
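To put a number on how often the Bayesian checks are missing, the headers of already-delivered mail can be counted. The following is a rough sketch under stated assumptions: count_bayes_hits is a hypothetical helper, the folder is a Maildir-style directory with one message per file, and any BAYES_nn name in the header block counts as evidence that the Bayes rules ran for that message.

```shell
# count_bayes_hits: hypothetical helper, not part of SpamAssassin.
# Counts the messages in a directory whose headers mention a BAYES_ rule.
count_bayes_hits() {
    dir=$1
    total=0
    with_bayes=0
    for msg in "$dir"/*; do
        [ -f "$msg" ] || continue
        total=$((total + 1))
        # Only inspect the header block (everything before the first
        # blank line), so quoted headers in the body are not counted.
        if awk '/^$/ { exit } { print }' "$msg" | grep -q 'BAYES_[0-9]'; then
            with_bayes=$((with_bayes + 1))
        fi
    done
    printf '%s of %s messages carry a BAYES_ test\n' "$with_bayes" "$total"
}

# Usage:
#   count_bayes_hits /home/rikjoh/Mail/missed_spam/cur
```

If the count is low across all folders, the way the filter is invoked (the question asked above) is worth checking before retraining again.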
participants (5)
- Carlos E. R.
- Danny Sauer
- Jim Sabatke
- Rikard Johnels
- Steve King