So I was looking at PopFile’s statistics page.
| Bucket | Classification Count | False Positives | False Negatives |
This is really interesting. Of my entire email volume of 38,596 messages (and this doesn’t include Gmail!), only 3,978 were spam, so about 10% of my incoming mail is spam. But the word counts are more interesting:
Notice that the distinct word counts for Personal and Spam differ by only 200?
But the best part:
99.47% accuracy, so a 0.53% error rate. Not too bad. I use PopFile, which classifies messages with a Bayesian method. See the link above to download and configure it.
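
If you're curious what "Bayesian" means in practice, here is a rough sketch of the general idea: a tiny naive Bayes classifier that counts words per bucket and picks the bucket with the highest probability. This is not PopFile's actual code. The class name, method names, and sample messages are all made up for illustration, and the whitespace tokenizer and add-one smoothing are simplifying assumptions of mine.

```python
import math
from collections import Counter, defaultdict


class TinyBayes:
    """Hypothetical toy classifier; not how PopFile is implemented."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # bucket -> word -> count
        self.message_counts = Counter()          # bucket -> messages trained

    def train(self, bucket, text):
        # Assumption: lowercase + whitespace split is a good enough tokenizer.
        self.message_counts[bucket] += 1
        self.word_counts[bucket].update(text.lower().split())

    def classify(self, text):
        words = text.lower().split()
        # Distinct words seen across all buckets, used for smoothing.
        vocabulary = set()
        for counts in self.word_counts.values():
            vocabulary.update(counts)
        total_messages = sum(self.message_counts.values())

        best_bucket, best_score = None, -math.inf
        for bucket, trained in self.message_counts.items():
            # Start from the log prior: how common this bucket is overall.
            score = math.log(trained / total_messages)
            bucket_total = sum(self.word_counts[bucket].values())
            for word in words:
                # Add-one smoothing so unseen words don't zero out the score.
                count = self.word_counts[bucket][word] + 1
                score += math.log(count / (bucket_total + len(vocabulary)))
            if score > best_score:
                best_bucket, best_score = bucket, score
        return best_bucket


clf = TinyBayes()
clf.train("spam", "cheap pills buy now")
clf.train("spam", "win money now")
clf.train("personal", "lunch tomorrow with mom")
clf.train("personal", "meeting notes attached")

print(clf.classify("buy cheap pills"))
print(clf.classify("lunch meeting tomorrow"))
```

The more mail you feed each bucket, the better the word counts reflect what your spam and your personal mail actually look like, which is presumably why the numbers improve as you correct mistakes over time.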