Re: [SLE] Spam

3 Feb 2006

      -----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

The Friday 2006-02-03 at 12:33 +0100, Per Jessen wrote:
...
Carlos E. R. wrote:
...
(I forgot to say that many of those false positives are from
newsletters).
Same here.  I'm in the process of building bayes-style filters that are
meant for recognising just newsletters.  That way I'll be able to add
perhaps a couple of points, stopping a newsletter from ending up as a
false positive.
Ah... I simply use "whitelist_from" in the file .spamassassin/user_prefs.
It is quicker to set up for a limited number of senders. The snag is that
mail forging the address of one of those newsletters would get through too.
...
...
DNS_FROM_RFC_WHOIS 0      0.879  0     1.447  Envelope sender in
whois.rfc-ignorant.org
I don't use rfc-ignorant other than as an indicator of a possibly dodgy
server.  Given the number of poorly configured mail-servers, using
rfc-ignorant is a very aggressive step, IMHO.
In mine too. Unfortunately, those tests are active by default in the
SpamAssassin configuration that SuSE ships (and that we users run).

Also, I think that quite a few of those tests are redundant: if one RBL
says that an IP or a domain is bad, some others will say the same. But
that, I think, doesn't necessarily mean that the email's spamminess is
higher. Those scores should not be arithmetically added; some other type
of algorithm should be used. I don't know which, but something like:

only A says it's bad  --> X points
only B says it's bad  --> Y points
A and B say it's bad  --> W points.

where W should be perhaps the average or the maximum of (X, Y), but not 
the sum.
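
A minimal sketch of that idea in Python, purely as an illustration: the
list names and scores are made up, and combine_rbl_scores is a
hypothetical helper, not something SpamAssassin provides. It scores a
group of overlapping blacklist hits by their maximum (or average) instead
of their sum.

```python
from statistics import mean

# Hypothetical per-list scores; names and values are invented for illustration.
RBL_SCORES = {"RBL_A": 1.5, "RBL_B": 2.0, "RBL_C": 1.0}


def combine_rbl_scores(hits, scores=RBL_SCORES, mode="max"):
    """Score a set of overlapping blacklist hits as one group.

    mode="sum" is plain additive scoring; "max" and "avg" are the
    alternatives suggested above.
    """
    values = [scores[name] for name in hits]
    if not values:
        return 0.0
    if mode == "sum":
        return sum(values)
    if mode == "avg":
        return mean(values)
    if mode == "max":
        return max(values)
    raise ValueError(f"unknown mode: {mode}")
```

With hits on RBL_A and RBL_B, the sum gives 3.5 points while the maximum
gives 2.0, so a second list that merely agrees with the first adds nothing.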

An IP being listed on a dozen blacklists could mean that they all reached
the same conclusion, or that they all copy each other's data. I rather
think it means that the IP or domain is very probably bad, but it doesn't
mean that the probability of being spam is 500%.
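
To put rough numbers on that, a small Python sketch under stated
assumptions: the per-list probabilities are invented, and
combined_probability is a hypothetical helper. If the lists were fully
independent, their evidence could be combined as 1 - prod(1 - p); if they
merely copy each other, a dozen listings weigh no more than the strongest
single one. Either way the result never goes above 100%.

```python
from math import prod


def combined_probability(probs, independent=True):
    """Combine per-list estimates that a listed IP is bad.

    independent=True  -> noisy-OR, 1 - prod(1 - p), assumes unrelated lists
    independent=False -> lists copy each other, so only the strongest counts
    """
    if not probs:
        return 0.0
    if independent:
        return 1.0 - prod(1.0 - p for p in probs)
    return max(probs)
```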

IMO, of course :-)
...
...
Even lower. SuSE must be using very altered values. And a badly
trained Bayesian database: mine scores that same email at 5%, not 95%.
Bayes is a double-edged sword - you've got to be very particular about
what you record as spam/ham.  Especially if you're not just training
your bayes filters for purely personal use.  And you've got to be
careful with cleaning up the database too.
Very true.

I doubt the usefulness of site-wide Bayesian databases. Also, I 
disabled autolearn for the same reason.

- -- 
Cheers,
       Carlos Robinson
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.0 (GNU/Linux)
Comment: Made with pgp4pine 1.76

iD8DBQFD43LftTMYHG2NR9URAparAJ9srkz/xHpnMYZtfHX0js2Ko14DPwCfeC/I
EEXgtHXrpvFZ6ha049moFtc=
=YRIe
-----END PGP SIGNATURE-----