Spam Filtering

by
in pipedot on (#2WNE)
Recently, Soylent News discussed adding more labels to the moderation system. Although opinions on "Disagree" and "Factually Incorrect" may still be varied, nearly everyone supported the addition of a "Spam" label.

For Pipedot, we've gone ahead and added the latter. Moderating a comment as "Spam" will decrease its score by one and flag it for further review by an editor. This way, normal users can greatly help the editors identify junk comments.

Once an editor marks a comment as spam, the message will be "hidden" one step deeper than the normal "Hide Threshold" slider setting. However, comments are never deleted. If you want to continue to see all comments, including the spam, click the "Show Junk Comments" checkbox on your profile settings page. Similar to the current blue (new) and gray (seen) rendering, the title bar of junk comments will be colored red to easily differentiate them from the good stuff.
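
As a rough illustration of the flow described above, here is a minimal Python sketch. It is not Pipedot's actual code: the `Comment` class, the function names, and the field names are all hypothetical, and it assumes one particular reading of "one step deeper", namely that a confirmed junk comment stays hidden at every slider setting unless "Show Junk Comments" is checked.

```python
from dataclasses import dataclass


@dataclass
class Comment:
    # All fields are hypothetical names chosen for this sketch.
    score: int = 1
    flagged_for_review: bool = False  # set when a user moderates as "Spam"
    junk: bool = False                # set when an editor confirms the spam


def moderate_as_spam(comment: Comment) -> None:
    """A normal user's "Spam" moderation: score drops by one, editors get a flag."""
    comment.score -= 1
    comment.flagged_for_review = True


def editor_confirm_spam(comment: Comment) -> None:
    """An editor marks the comment as junk. Nothing is deleted."""
    comment.junk = True


def is_visible(comment: Comment, hide_threshold: int, show_junk: bool = False) -> bool:
    """Assumed rule: junk is hidden regardless of the slider unless show_junk is set;
    everything else follows the normal score-versus-threshold comparison."""
    if comment.junk and not show_junk:
        return False
    return comment.score >= hide_threshold
```

With this reading, a reader who leaves the box unchecked never sees confirmed spam, while a reader who checks it gets junk comments back under the usual threshold rule.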

Re: Editor Question (Score: 1)

by zocalo@pipedot.org on 2015-01-06 09:22 (#2WP0)

The sample posted earlier was the only one I'd ever seen, so I was quite surprised about the scale of the problem. Having it spammed into old threads would explain that though, which is possibly one reason why Slashdot archives older discussions. You're right about the pain of having stuff dropped into a submission queue though, and simply blocking common spam terms such as "viagra" is obviously going to give many false positives on a site that might discuss them, and will probably have them used in humorous comments elsewhere.

Getting back to the regexps, it's hard to say what (if anything) would work for Pipedot without a good overview of the crap being submitted, but one general technique that does seem like it would work well for typical forum spam (including your example) is to trigger off excessive use of certain punctuation marks, particularly in subjects - commas and hyphens seem well liked by many forum spammers; the one in your example put four in there. Ideally you'd probably also want to have a requirement that multiple rules match before a post goes into the moderation queue, or even a basic scoring system like SpamAssassin et al use, but based on the comments above that's probably overkill - at least at present. Ultimately though it's still an arms race, and the spammers will adapt as soon as they realise they are being blocked; sometimes you just have to go for the easy stuff and accept that the rest might need manual handling later.
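
To make the punctuation idea concrete, here is a small Python sketch of a rule-plus-score check. Everything in it is an assumption for illustration: the rule names, the weights, the body-density rule, and the queue threshold are made up, and the cutoff of four marks in the subject is borrowed from the example spam mentioned above rather than from any measured data.

```python
import re

# Commas and hyphens, the marks singled out above.
MARKS = re.compile(r"[,-]")


def subject_marks(subject: str, body: str) -> bool:
    """Hypothetical rule: four or more commas/hyphens in the subject line."""
    return len(MARKS.findall(subject)) >= 4


def body_mark_density(subject: str, body: str) -> bool:
    """Hypothetical rule: more than one mark for every two words in the body."""
    words = body.split()
    return bool(words) and len(MARKS.findall(body)) / len(words) > 0.5


# (rule, weight) pairs; the weights are arbitrary illustrative values.
RULES = [(subject_marks, 2.0), (body_mark_density, 1.5)]

# Set above any single weight, so at least two rules must match.
QUEUE_THRESHOLD = 3.0


def spam_score(subject: str, body: str) -> float:
    """Sum the weights of every rule that matches the post."""
    return sum(weight for rule, weight in RULES if rule(subject, body))


def needs_review(subject: str, body: str) -> bool:
    """Send the post to the moderation queue once the score reaches the threshold."""
    return spam_score(subject, body) >= QUEUE_THRESHOLD
```

Because no single weight reaches the threshold, one matching rule alone never queues a post, which is the "multiple rules must match" requirement in its simplest form.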