On Wed, 31 Dec 2003 16:32:26 +0000
Dylan
On Wednesday 31 December 2003 14:51 pm, Nick Selby wrote: <SNIP>
I guess that they're doing this to increase the message size with un-spamlike words to decrease the ratio of spam-like words to non-spam-like words? Does this sound right?
Yes, that sounds quite plausible. They are also skewing the ratio of content words to function words, in the grammatical sense. That ratio is relatively constant for a given language (for English, approx 25-35% function words, like it, is, that, ...), so a list of purely content words would likely be easy to identify, since it would score 0% function words.
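To make the function-word idea concrete, here is a rough Python sketch. The word list, the 10% threshold, the minimum token count, and the function names are all my own assumptions for illustration. A real filter would want a fuller function-word list and a threshold tuned on actual mail.

```python
import re

# Tiny illustrative set of English function words (an assumption, far from complete).
FUNCTION_WORDS = frozenset(
    "a an the it is that this of to in on and or but for with as at by "
    "was were be been are not he she they we you i".split()
)

def tokenize(text):
    # Lowercase and keep runs of letters and apostrophes as word tokens.
    return re.findall(r"[a-z']+", text.lower())

def function_word_ratio(text):
    # Fraction of tokens that are function words; 0.0 for empty input.
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in FUNCTION_WORDS)
    return hits / len(tokens)

def looks_like_word_salad(text, threshold=0.10, min_tokens=20):
    # Flag text that is long enough to judge and has almost no function words.
    # threshold and min_tokens are hypothetical values, not tuned on real data.
    return len(tokenize(text)) >= min_tokens and function_word_ratio(text) < threshold
```

A block of random dictionary words pasted under a spam message would come out near 0.0 here, while ordinary English prose should land far above the threshold.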
Take a look at this for more advanced Bayesian filters; it takes neighborhood into account too. Smarter tokenization systems may help as well.
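One simple way to give a Bayesian filter some notion of neighborhood is to feed it adjacent word pairs alongside the single words. The sketch below is only my guess at the general idea, not necessarily what the filter referred to above does. The function name and the tokenizing regex are assumptions.

```python
import re

def neighborhood_tokens(text):
    # Single lowercase words, followed by each pair of adjacent words as one token.
    words = re.findall(r"[a-z0-9']+", text.lower())
    pairs = [f"{a} {b}" for a, b in zip(words, words[1:])]
    return words + pairs
```

With pair tokens, a run of unrelated dictionary words yields mostly pairs the filter has never seen, so the padding contributes less evidence than it would as single words.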