These are the ramblings of Matthijs Kooijman, concerning the software he hacks on, hobbies he has and occasionally his personal life.
Most content on this site is licensed under the WTFPL, version 2 (details).
Questions? Praise? Blame? Feel free to contact me.
My old blog (pre-2006) is also still available.
See also my Mastodon page.
Sun | Mon | Tue | Wed | Thu | Fri | Sat |
---|---|---|---|---|---|---|
1 | 2 | 3 | 4 | 5 | ||
6 | 7 | 8 | 9 | 10 | 11 | 12 |
13 | 14 | 15 | 16 | 17 | 18 | 19 |
20 | 21 | 22 | 23 | 24 | 25 | 26 |
27 | 28 | 29 | 30 |
(...), Arduino, AVR, BaRef, Blosxom, Book, Busy, C++, Charity, Debian, Electronics, Examination, Firefox, Flash, Framework, FreeBSD, Gnome, Hardware, Inter-Actief, IRC, JTAG, LARP, Layout, Linux, Madness, Mail, Math, MS-1013, Mutt, Nerd, Notebook, Optimization, Personal, Plugins, Protocol, QEMU, Random, Rant, Repair, S270, Sailing, Samba, Sanquin, Script, Sleep, Software, SSH, Study, Supermicro, Symbols, Tika, Travel, Trivia, USB, Windows, Work, X201, Xanthe, XBee
On my mailserver, I'm using Spamassassin with a Bayes filter to detect spam. Such a filter needs to be trained with samples of spam and ham (non-spam) messages to let it learn what spam and ham looks like, but it also needs to be retrained when the spam or ham changes over time. I have some automatic training set up, but since a while I've seen the bayes filter being completely wrong (showing a confident ham score for something that is very clearly spam), so I decided to retrain the filter from scratch, using the spam and ham messages I collected over the last time (I don't really throw away any e-mail).
Since training with all my e-mail is not productive (more than 5,000
messages aren't really helpful AFAIU, and training with old messages is
not representative for current messages), I decided to just take all of
my e-mail and take the last 2,000 spam and ham messages and train with
that. My spam is neatly collected in 2 mailboxes (Spam for obvious spam
and ProbablySpam for messages that need an occasional review to find
false positives), but my ham is sorted out in dozens of different
mailboxes. Hence, I needed some find
magic to get a list of the most
recent spam and ham messages. So, I built these commands:
# find Spam ProbablySpam -type f \( -path '*/cur/*' -o -path '*/new/*' \) -printf "%T@ %p\n"
| sort -n | cut -d' ' -f 2 | tail -n 2000 > spam
# find . -type d \( -path ./Spam -o -path ./ProbablySpam -o -path ./Bulk -o -path ./Sent \) -prune -o \
-type f \( -path '*/cur/*' -o -path '*/new/*' \) -printf "%T@ %p\n" \
| sort -n | cut -d' ' -f 2 | tail -n 2000 > ham
# sa-learn --progress --spam -f spam
# sa-learn --progress --ham -f ham
After retraining with recent spam, the results were a lot better, so I'm not longer spending time every day deleting a couple dozens spam e-mails :-D