Bayesian filter query
Mike Green
09/09/2004 4:28pm
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi,

I'd like a bit of clarification on the way that the Bayesian
filtering works in InScribe. I've read the help and understand
that the SPAM list is updated when initiating re-scanning and
that this requires that all spam be left in the spam folder.
Fine. It sounds, however, as if the same applies to HAM, i.e.
the HAM wdb is built up from scratch each time the word list
update function is run. Is this correct?

If this IS the case, and it seems like it from a couple of
experiments I've done, then this seems to be a bit of an issue
in that I personally tend to delete most mail pretty much
straight away (maybe that's abnormal or aberrant behaviour?!).
This isn't to say it's spam; it just means that I don't want to
retain it for any reason. The net result is that the HAM file is
very small and rather prone to frequent, ongoing change.

I don't claim to understand Bayesian filtering all that well,
but doesn't it rely on checking for positive words as well as
recognising 'negative', spam-like patterns? If so then my
failure to keep much HAM will weaken the filter - but is that a
significant problem?

Just seeking clarification here really!

Thanks,

Mike

-----BEGIN PGP SIGNATURE-----
Version: PGP 8.0.3 - not licensed for commercial use: www.pgp.com

iQA/AwUBQUCEsPVnEmFoUW40EQJLKwCgwiezoOTSv1z0l8KMYyNBMjWw/U8AnjJc
9+6gPajUhVlSlPZVv85NZVr1
=D8lU
-----END PGP SIGNATURE-----
fReT
09/09/2004 9:55pm
You are right: the ham must remain for the filter to work correctly. I'm working on an incremental version of the algorithm so that you can delete mail and still retain the word counts.
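For readers wondering what an incremental version might look like, here is a minimal sketch. It is not InScribe's code: the class name `IncrementalWordCounts`, its fields, and the tokenizer are all assumptions made for illustration. The idea is only that running totals are updated once per message, so the message itself no longer has to stay in a folder.

```python
import re
from collections import Counter


class IncrementalWordCounts:
    """Hypothetical store that keeps running totals instead of rescanning folders."""

    def __init__(self):
        self.spam_words = Counter()   # word -> number of spam messages containing it
        self.ham_words = Counter()    # word -> number of ham messages containing it
        self.spam_messages = 0
        self.ham_messages = 0

    @staticmethod
    def tokenize(text):
        # Assumed tokenizer: lower-cased words, each counted at most once per message.
        return set(re.findall(r"[a-z0-9$'-]+", text.lower()))

    def train(self, text, is_spam):
        # Fold one message into the totals; the message can be deleted afterwards.
        words = self.tokenize(text)
        if is_spam:
            self.spam_words.update(words)
            self.spam_messages += 1
        else:
            self.ham_words.update(words)
            self.ham_messages += 1


counts = IncrementalWordCounts()
counts.train("Meeting moved to Friday", is_spam=False)
counts.train("Cheap pills, buy now", is_spam=True)
# Both messages could now be deleted; the totals above are unaffected.
```

In a real mail client the totals would presumably be written to disk after each update, which this sketch leaves out.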
Mike Green
09/09/2004 10:08pm
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

So does the ham have a significant effect on the effectiveness
of the filter? i.e. if there isn't much of it, will the filter,
once trained with lots of spam, skew towards false positives
more than it would if I retained my mail rather than deleting
just about everything? (My ham db is mostly meaningless strings
anyway, since most of my mail is encrypted and that's what the
ham db is picking up.)

Excellent program by the way. Far more elegant than most mail
programs and the fact that I can install it on a USB drive
along with encryption software is a very major plus point. This
query is largely out of interest, not something I'd regard as
important. Now the other one, regarding attributed quoting,
would be a really nice-to-have; it's the one thing 'The Bat!'
does which could, but probably won't, lure me towards its use.
I'll keep my fingers crossed that it'll be easy to program :-)

Thanks for the response.

Mike

-----BEGIN PGP SIGNATURE-----
Version: PGP 8.0.3 - not licensed for commercial use: www.pgp.com

iQA/AwUBQUDUYPVnEmFoUW40EQIIRwCgvIy893OoHYohmbDN+bsxs3H8LbEAn2xs
ofdpu+Jue1iupYH96Cv6xvud
=TuE8
-----END PGP SIGNATURE-----
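On the question of whether a small ham set skews the filter, a rough worked example may help. This uses a simplified per-word estimate in the style commonly described for Bayesian spam filters; it is an assumption, not InScribe's actual formula, and the function name `spam_probability` and all the counts are made up for illustration. It compares the same word under a tiny ham set and a larger one.

```python
def spam_probability(spam_hits, spam_total, ham_hits, ham_total):
    """Simplified per-word spam estimate (an assumed formula, not InScribe's)."""
    spam_rate = spam_hits / spam_total if spam_total else 0.0
    ham_rate = ham_hits / ham_total if ham_total else 0.0
    if spam_rate + ham_rate == 0:
        return 0.5  # word never seen: no evidence either way
    p = spam_rate / (spam_rate + ham_rate)
    # Clamp so a single word can never be treated as absolute proof.
    return min(0.99, max(0.01, p))


# Hypothetical word seen in 20 of 1000 spam messages.
# Only 5 ham messages were kept, and none of them contain the word.
small_ham = spam_probability(20, 1000, 0, 5)

# Same spam counts, but 500 ham messages were kept and 50 contain the word.
large_ham = spam_probability(20, 1000, 50, 500)
```

With almost no ham, the word has no ham evidence to offset it and lands at the clamped maximum; with more ham retained, the same word scores well below one half. Under these assumptions, a sparse ham set does push ordinary words towards looking like spam.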
Justin Heiner
17/09/2004 7:54am
I basically use the ham folder as a whitelist. Move all the e-mails that are wrongly marked as spam into ham, and from then on all e-mails from that sender are correctly placed.

Probably ham's intended usage, I'm guessing :)