Re: [SLE] OT - SA defeating thesauri

31 Dec 2003

On Wed, 31 Dec 2003 16:32:26 +0000
Dylan wrote:
...
On Wednesday 31 December 2003 14:51 pm, Nick Selby wrote:
<SNIP>
...
I guess that they're doing this to increase the message size with
un-spamlike words to decrease the ratio of spam-like words to
non-spam-like words? Does this sound right?
Yes, that sounds quite plausible. What they are also doing is
skewing the ratio of content-to-function words, in a grammatical
sense. The ratio is relatively constant for a given language (for
English approx. 25-35% function words, like it, is, that, ...), so a
list of purely content words would likely be easy to identify, since
it would have a 0% score of function words.
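
As a rough illustration of the function-word idea, here is a minimal
Python sketch. The word list, the 15% threshold and the 20-word
minimum are assumptions made up for this example, not values from
SpamAssassin or any other filter; a real check would need a fuller
list and some tuning.

```python
import re

# Hypothetical, deliberately tiny list of English function words.
# A real check would use a much fuller list.
FUNCTION_WORDS = frozenset({
    "a", "an", "the", "it", "is", "are", "was", "that", "this",
    "of", "to", "in", "on", "at", "for", "and", "or", "but",
    "not", "with", "as", "by", "be", "i", "you", "we", "they",
})


def split_words(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def function_word_ratio(text):
    """Return the fraction of words that are function words (0.0 if empty)."""
    words = split_words(text)
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in FUNCTION_WORDS)
    return hits / len(words)


def looks_like_word_salad(text, min_ratio=0.15, min_words=20):
    """Flag text that is long enough but has almost no function words.

    min_ratio and min_words are assumed values, chosen well below the
    25-35% range quoted above to leave room for terse but genuine mail.
    """
    words = split_words(text)
    return len(words) >= min_words and function_word_ratio(text) < min_ratio
```

A block of thesaurus padding scores close to 0% and gets flagged,
while ordinary sentences land well above the threshold.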
Have a look at this for more advanced Bayesian filters; it takes
neighboring words into account too.

Smarter tokenization systems may help as well.
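
One simple way to take neighboring words into account is to feed the
filter pairs of adjacent words as well as single words. The sketch
below is only an illustration of that idea; the function name is made
up here, and it is not necessarily how the filter mentioned above
works.

```python
import re


def tokenize_with_neighbors(text):
    """Return single-word tokens followed by adjacent word pairs.

    Illustrative only: the pairs let a Bayesian filter score a word
    together with its neighbor, so randomly ordered padding words
    produce pairs that were rarely or never seen in legitimate mail.
    """
    words = re.findall(r"[a-z0-9']+", text.lower())
    pairs = [f"{a} {b}" for a, b in zip(words, words[1:])]
    return words + pairs
```

For example, tokenize_with_neighbors("Free money now") yields the
tokens free, money, now, "free money" and "money now".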


Ivan Sergio Borgonovo