On Wednesday 18 February 2004 04:18, Tom Allison wrote:
Despite its (lack of) effectiveness, I think that the idea is potentially valuable. At this point you should probably say, "Then why don't you subscribe to the razor mailing list?" And I probably will for a spell.
I've been there for two or three years (on the razor list). But the issue of effectiveness is largely ignored. Regardless of how it is trained, Razor's basic design allows it to catch only already known spam, whereas other filters can estimate the spamminess of never-before-seen spam with surprising effectiveness.
-- _____________________________________ John Andersen
The Wednesday 2004-02-18 at 10:56 -0900, John Andersen wrote:
I've been there for two or three years (on the razor list). But the issue of effectiveness is largely ignored.
Regardless of how it is trained, Razor's basic design allows it to catch only already known spam, whereas other filters can estimate the spamminess of never-before-seen spam with surprising effectiveness.
That poses a question. I'm getting a kind of spam (it started a few months ago) that is hard for spamassassin to catch: a link to an image (which I assume is the "message") and a more or less long paragraph full of random text, aimed at rendering Bayesian filters useless. How can we best filter those out? I suppose something like "razor" should work, at least for those who do not get them first. But I have never tried razor.
-- Cheers, Carlos Robinson
On Wednesday 18 February 2004 14:44, Carlos E. R. wrote:
The Wednesday 2004-02-18 at 10:56 -0900, John Andersen wrote:
I've been there for two or three years (on the razor list). But the issue of effectiveness is largely ignored.
Regardless of how it is trained, Razor's basic design allows it to catch only already known spam, whereas other filters can estimate the spamminess of never-before-seen spam with surprising effectiveness.
That poses a question. I'm getting a kind of spam (it started a few months ago) that is hard for spamassassin to catch: a link to an image (which I assume is the "message") and a more or less long paragraph full of random text, aimed at rendering Bayesian filters useless.
How can we best filter those out? I suppose something like "razor" should work, at least for those who do not get them first.
But I have never tried razor.
-- Cheers, Carlos Robinson
I read mail using Kmail. Anything that gets thru spamassassin but IS spam I manually move to a folder I created called missedspam. Then every midnight a cronjob runs sa-learn against that folder and then deletes the contents. That trains the bayes filters and they are getting pretty good at spotting those.
-- _____________________________________ John Andersen
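A minimal sketch of the nightly job described above, written in Python so that cron can call it. It assumes the missedspam folder is a flat directory with one message per file; the function names and the injectable `runner` parameter are hypothetical and are not part of SpamAssassin. The only real command used is `sa-learn --spam <folder>`.

```python
import subprocess
from pathlib import Path


def build_learn_command(folder, sa_learn="sa-learn"):
    """Return the sa-learn command line that learns every message in folder as spam."""
    return [sa_learn, "--spam", str(folder)]


def learn_and_empty(folder, runner=subprocess.run):
    """Feed the folder to sa-learn, then delete the messages that were learned.

    Assumption: one message per file, no subdirectories. `runner` exists only
    so the external call can be replaced in a dry run or a test.
    Returns the number of messages removed.
    """
    folder = Path(folder)
    messages = [p for p in folder.iterdir() if p.is_file()]
    if not messages:
        return 0  # nothing to learn, so do not start sa-learn at all
    # check=True makes a failing sa-learn raise before anything is deleted
    runner(build_learn_command(folder), check=True)
    for message in messages:
        message.unlink()
    return len(messages)
```

A script that calls `learn_and_empty()` on the real folder path could then be scheduled from cron at midnight. Deleting only after sa-learn succeeds means that a failed run leaves the messages in place for the next night.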
The Wednesday 2004-02-18 at 20:30 -0900, John Andersen wrote:
Anything that gets thru spamassassin but IS spam I manually move to a folder I created called missedspam
Then every midnight a cronjob runs sa-learn against that folder and then deletes the contents. That trains the bayes filters and they are getting pretty good at spotting those.
I know that, and I do that; but it is useless for this kind of spam, which is designed to fool Bayesian filters. This is one of them, see how they look:

|Banned CD Government don't want me to sell it. See Now &
|
|[ads.jpg] pilgrim prefer pueblo italian route exacerbate athens retrieve
|font canoe abate biotic armament cancellate cia bandpass cavemen anthem
|disembowel judas decibel shoji attire macdougall gotham luminous turnip
|swamp baghdad pomade alteration aye abode abode phrasemake impertinent
|ironic unruly pater interviewee automorphism ouagadougou phlox mae spurn
|future prolific beard godfrey handle allegra michel revile hence aurora
|vertices ascent halfback arcadia chalk mcnulty caiman fictive breast
|barbara defiant armpit censorious bizet madeira beatrice hypodermic snack
|mink

And the random text goes on a while more. You can not train a filter on random text! If we do, it will match randomly or not at all :-(

The spam message (payload) is contained in the link "ads.jpg". The rest is there in order to fool the filters. See what spamassassin says of that one:

|X-Spam-Status: No, hits=3.7 required=5.0 tests=BAYES_56,HTML_MESSAGE,
|        MY_BANNED_CD,SUSPICIOUS_RECIPS autolearn=no version=2.61

Note that "MY_BANNED_CD" is a rule I added myself; without it the score would be much lower.

So, I ask again: how can we filter this kind of spam with random text?
-- Cheers, Carlos Robinson
Quoting Carlos E. R.
And the random text goes on a while more. You can not train a filter on random text! If we do, it will match randomly or not at all :-(
I am running an experiment with spamprobe. It is a Bayesian filter that looks at single words (as do most) and pairs of words. It is doing a little better than SpamAssassin at catching these random word messages because the frequencies of both the single words and the word pairs are slightly different for random and meaningful messages.
Jeffrey
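The idea of scoring both single words and pairs of adjacent words can be sketched as follows. This is only an illustration of the token set such a filter might look at; it is not spamprobe's actual tokenizer, and the function name and the regular expression are assumptions made for the example.

```python
import re


def word_and_pair_tokens(text):
    """Return single-word tokens followed by adjacent word-pair tokens.

    Assumption: a word is a run of letters, digits or apostrophes,
    compared case-insensitively.
    """
    words = re.findall(r"[a-z0-9']+", text.lower())
    pairs = [f"{first} {second}" for first, second in zip(words, words[1:])]
    return words + pairs
```

In ordinary mail, pairs such as "of the" or "I am" recur constantly, while a dictionary dump produces pairs like "pomade alteration" that almost never repeat, which is one plausible reason the pair statistics help.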
The Thursday 2004-02-19 at 18:01 -0600, Jeffrey L. Taylor wrote:
I am running an experiment with spamprobe. It is a Bayesian filter that looks at single words (as do most) and pairs of words. It is doing a little better than SpamAssassin at catching these random word messages because the frequencies of both the single words and the word pairs are slightly different for random and meaningful messages.
Ah, that sounds promising. Perhaps I'll wait till the spamassassin people catch up [...] I see they are at 2.63, and I have 2.61: I'll upgrade tomorrow, I think. Mmm, see this:

|The Bayesian database support in Spamassassin tries to identify spam by
|looking at what are called tokens; short phrases that are commonly found
|in spam or ham. If I've handed 100 messages to sa-learn that have the
|phrase penis enlargement and told it that those are all spam, when the
|101st message comes in with the phrase penis enlargement, the Bayesian
|code is pretty sure that the new message is spam and raises the spam
|score of that message.

So it is looking at phrases, not words.
-- Cheers, Carlos Robinson
On Thursday 19 February 2004 16:45, Carlos E. R. wrote:
|The Bayesian database support in Spamassassin tries to identify spam by
|looking at what are called tokens; short phrases that are commonly found
|in spam or ham. If I've handed 100 messages to sa-learn that have the
|phrase penis enlargement and told it that those are all spam, when the
|101st message comes in with the phrase penis enlargement, the Bayesian
|code is pretty sure that the new message is spam and raises the spam
|score of that message.
So it is looking at phrases, not words.
And perhaps that's why it DOES DO a good job with the random word messages. They contain virtually no "if and or the a it is am are" etc. Therefore it's pretty plain that these random word messages are spam.
-- _____________________________________ John Andersen
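The observation about missing small words can be made concrete with a rough measure: the share of a message's words that are common function words. This is a sketch of the observation only, not of how SpamAssassin's Bayes code works; the word list is the one quoted above and the function name is hypothetical.

```python
import re

# The small words named in the message above; a real list would be longer.
FUNCTION_WORDS = frozenset({"if", "and", "or", "the", "a", "it", "is", "am", "are"})


def function_word_ratio(text):
    """Return the fraction of words in text that are common function words."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    return sum(word in FUNCTION_WORDS for word in words) / len(words)
```

Applied to the first line of the pasted sample ("pilgrim prefer pueblo italian route exacerbate athens retrieve") the ratio is zero, whereas ordinary English prose normally contains many such words. A trained Bayesian filter picks up the same signal indirectly, through the tokens it has or has not seen in ham.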
On Thursday 19 February 2004 13:37, Carlos E. R. wrote:
And the random text goes on a while more. You can not train a filter on random text! If we do, it will match randomly or not at all :-(
I wonder about that... I see more and more of these going to trash and no false positives. I haven't had a false positive in months. I've been sending these bayes poison messages thru sa-learn and at the same time half expecting false positives to start increasing, yet all I see is more effective detection of these random text files and still zero false positives. The thing about random text is that it will likely include words never used in my normal email, such as "phrasemake".
-- _____________________________________ John Andersen
The Thursday 2004-02-19 at 17:52 -0900, John Andersen wrote:
On Thursday 19 February 2004 13:37, Carlos E. R. wrote:
And the random text goes on a while more. You can not train a filter on random text! If we do, it will match randomly or not at all :-(
I wonder about that... I see more and more of these going to trash and no false positives. I haven't had a false positive in months.
I've been sending these bayes poison messages thru sa-learn and at the same time half expecting false positives to start increasing, yet all I see is more effective detection of these random text files and still zero false positives.
Ah, well, then I'll persevere with training sa-learn. I get a number of them, not flagged as spam. Those containing normal text plus random text are detected, but those with only random text (and a link) are not - yet?
The thing about random text is that it will likely include words never used in my normal email, such as "phrasemake".
Gosh, I thought you had made it up, till I found "phrasemake" in the text I pasted -- you looked hard at it, eh? :-o X-) But you can not count on that too much, if several languages are involved (as in my case).
-- Cheers, Carlos Robinson
Carlos E. R. wrote:
The Wednesday 2004-02-18 at 20:30 -0900, John Andersen wrote:
Anything that gets thru spamassassin but IS spam I manually move to a folder I created called missedspam
Then every midnight a cronjob runs sa-learn against that folder and then deletes the contents. That trains the bayes filters and they are getting pretty good at spotting those.
I know that, and I do that; but it is useless for this kind of spam, which is designed to fool Bayesian filters. This is one of them, see how they look:
bogofilter catches these examples you sent as spam. I'm finding it a little tricky to set up, but it's showing itself to be very efficient. Even with blocks of words like that, you have to realize that it's still evident that it is spam. After all, these spam messages do not contain small words or ordinary words. They also do not contain words that I would use or receive in email. They are simply dictionary dumps. What you end up with is the assumption that everything from the dictionary is spam unless it is in the much smaller list of known good words, like suse, linux, rpm for this list.
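The "dictionary is spam unless it is a known good word" effect can be sketched with a toy word-frequency model. This is not bogofilter's algorithm: the class name is hypothetical, and the message score is a plain average of per-word probabilities, chosen for readability instead of the statistical combining that real filters use.

```python
from collections import Counter


class TinyWordFilter:
    """Toy per-word spam estimate trained on labelled word lists."""

    def __init__(self):
        self.spam_counts = Counter()
        self.ham_counts = Counter()
        self.spam_messages = 0
        self.ham_messages = 0

    def train(self, words, is_spam):
        """Count each distinct word once per message."""
        if is_spam:
            self.spam_counts.update(set(words))
            self.spam_messages += 1
        else:
            self.ham_counts.update(set(words))
            self.ham_messages += 1

    def word_probability(self, word, unknown=0.5):
        """Estimate how spammy a word is; never-seen words get `unknown`."""
        spam_rate = self.spam_counts[word] / self.spam_messages if self.spam_messages else 0.0
        ham_rate = self.ham_counts[word] / self.ham_messages if self.ham_messages else 0.0
        if spam_rate + ham_rate == 0:
            return unknown
        return spam_rate / (spam_rate + ham_rate)

    def score(self, words):
        """Average the per-word estimates (a simplification, see the lead-in)."""
        probabilities = [self.word_probability(word) for word in set(words)]
        return sum(probabilities) / len(probabilities) if probabilities else 0.5
```

After training on a few dictionary dumps as spam and on list traffic as ham, words such as "phrasemake" sit near 1.0 and words such as "suse" or "rpm" sit near 0.0, so a new dump that reuses even some earlier dictionary words already scores above the neutral 0.5.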
Carlos E. R. wrote:
The Wednesday 2004-02-18 at 10:56 -0900, John Andersen wrote:
I've been there for two or three years (on the razor list). But the issue of effectiveness is largely ignored.
Regardless of how it is trained, Razor's basic design allows it to catch only already known spam, whereas other filters can estimate the spamminess of never-before-seen spam with surprising effectiveness.
That poses a question. I'm getting a kind of spam (it started a few months ago) that is hard for spamassassin to catch: a link to an image (which I assume is the "message") and a more or less long paragraph full of random text, aimed at rendering Bayesian filters useless.
How can we best filter those out? I suppose something like "razor" should work, at least for those who do not get them first.
But I have never tried razor.
bogofilter may be an option for you. One advantage of bogofilter is that it decodes MIME-encoded email messages. I don't believe that spamassassin will do this. It also uses the links in HTML tags as tokens for determining spam. I've played with it for a few days and it seems excellent, but I think I hosed things up a bit this morning with a new install. My word database might not be valid anymore.
participants (4)
- Carlos E. R.
- Jeffrey L. Taylor
- John Andersen
- Tom Allison