The hard part of comment moderation isn't catching spam. Modern classifiers handle the easy 95%. The hard part is what to do with the borderline cases without grinding your moderators into the ground or turning every approved comment into a debate.
What follows is the playbook we'd want a new site operator to adopt on day one, drawn from running classification at volume across blogs, comment platforms, and forum-style products. Some of it is mechanical (queue routing, threshold choice). Some of it is operational (how moderators stay sharp, how to audit the classifier). Both halves matter: getting only one right is not enough, because the weak half undermines the strong one.
The three-bucket pattern
Bucket every new comment into one of three statuses on submission:
- Publish. Score below your low threshold. Goes live immediately. Reader sees it; the author sees it on the next refresh; nothing else happens.
- Queue. Score between the low and high thresholds. Visible to the author (so they don't think their comment was eaten), invisible to the public, surfaced in the moderator's review queue.
- Reject. Score above your high threshold. Saved as rejected (don't delete), invisible to everyone including the author.
The reason to use a calibrated probability rather than a binary classifier here is that the buckets are policy choices, and you'll want to tune them differently per surface. A developer forum where every comment matters runs conservative thresholds (both the low and the high threshold set higher, so fewer legitimate comments are held or rejected). A marketplace review section where moderator time is the bottleneck runs aggressive thresholds (both set lower, so more is rejected outright). Same model, two product configurations.
Sensible starting defaults: 0.50 for the low threshold, 0.85 for the high threshold. With those numbers, expect a calibrated classifier to send roughly 80% of comments straight to publish, 15% to queue, and 5% to reject, though the exact split depends on how much spam your site attracts. Your queue load is the 15%.
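The routing itself is small enough to sketch. The following is a minimal illustration in Python, not a reference implementation: the `Status`, `Thresholds`, and `route` names are made up for this example, the per-surface numbers in `SURFACES` are placeholders, and sending a score that sits exactly on a threshold to the queue is an assumption (the text above only says "below", "between", and "above").

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    PUBLISH = "publish"
    QUEUE = "queue"
    REJECT = "reject"


@dataclass(frozen=True)
class Thresholds:
    low: float = 0.50   # starting default from the text
    high: float = 0.85  # starting default from the text

    def __post_init__(self) -> None:
        if not 0.0 <= self.low <= self.high <= 1.0:
            raise ValueError("expected 0 <= low <= high <= 1")


def route(score: float, t: Thresholds = Thresholds()) -> Status:
    """Map a calibrated spam probability to one of the three buckets."""
    if score < t.low:
        return Status.PUBLISH
    if score > t.high:
        return Status.REJECT
    # Assumption: a score exactly on either threshold goes to the queue.
    return Status.QUEUE


# Hypothetical per-surface settings: same model, different policy.
SURFACES = {
    "dev_forum": Thresholds(low=0.60, high=0.92),
    "marketplace_reviews": Thresholds(low=0.40, high=0.75),
}
```

Keeping the thresholds in a small immutable object, separate from the routing function, is what lets one model serve several surfaces with different policies.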
Trusted-author allowlist
Before the classifier runs, check whether the author is on a per-site allowlist. Authors get there by passing a configurable bar: N approved comments without ever being rejected, account age over X days, an explicit moderator promotion. For allowed authors, skip classification on submission and route them straight to publish.
Two reasons this matters: it saves real money on classification volume (regulars usually generate the bulk of comments), and it keeps the moderator queue focused on actual borderline cases instead of well-known authors who happen to write tersely. The risk — an allowed author turning bad — is mitigated by re-running classification on edit, which we'll get to.
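As a rough sketch, the allowlist check might look like the Python below. The text leaves open how the criteria combine, so this example assumes a moderator promotion is sufficient on its own and that otherwise all of the automatic criteria must hold. The class names, field names, and the default values for N and X are placeholders, not recommendations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AllowlistPolicy:
    min_approved: int = 20          # "N" in the text; placeholder value
    min_account_age_days: int = 90  # "X" in the text; placeholder value


@dataclass(frozen=True)
class AuthorStats:
    approved_count: int
    rejected_count: int
    account_age_days: int
    promoted_by_moderator: bool = False


def is_trusted(author: AuthorStats,
               policy: AllowlistPolicy = AllowlistPolicy()) -> bool:
    """Return True if the author may skip submit-time classification."""
    # Assumption: an explicit moderator promotion overrides the other checks.
    if author.promoted_by_moderator:
        return True
    return (
        author.rejected_count == 0  # "without ever being rejected"
        and author.approved_count >= policy.min_approved
        and author.account_age_days > policy.min_account_age_days
    )
```

Because a single rejection makes `is_trusted` return False under these assumptions, an author who turns bad drops off the allowlist the first time a moderator rejects one of their comments.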
Edge cases that bite
Edits
Re-classify on every edit, no exceptions. The "post clean, edit in spam" pattern is one of the oldest tricks around, and it exists specifically to beat platforms that classify only on the initial post. The cost of re-classifying on edit is one API call per edit; for most sites this is a rounding error on the volume bill.
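To make the "no exceptions" part concrete, here is a small sketch in Python. It assumes the classifier is available as a plain callable that returns a spam probability; `on_submit`, `on_edit`, and `status_for` are hypothetical names, and the thresholds are the starting defaults from earlier.

```python
from typing import Callable

Classifier = Callable[[str], float]  # text -> spam probability in [0, 1]


def status_for(score: float, low: float = 0.50, high: float = 0.85) -> str:
    """Three-bucket routing with the starting-default thresholds."""
    if score < low:
        return "publish"
    if score > high:
        return "reject"
    return "queue"


def on_submit(text: str, author_trusted: bool, classify: Classifier) -> str:
    # Trusted authors skip classification on the initial post.
    if author_trusted:
        return "publish"
    return status_for(classify(text))


def on_edit(new_text: str, author_trusted: bool, classify: Classifier) -> str:
    # author_trusted is deliberately ignored: every edit is re-classified,
    # which closes the "post clean, edit in spam" gap for everyone.
    return status_for(classify(new_text))
```

With a stand-in classifier that flags a known spam phrase, a trusted author's initial post goes straight to publish, while an edit that introduces the same phrase is still scored and rejected.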
Multilingual content
Most production classifiers are trained primarily on English. Non-English comments tend to score artificially high because the model encounters tokens it can't place. Two options: pre-detect language and skip classification for languages the model wasn't trained on (route to queue instead), or raise the thresholds for non-English comments. The first is more defensible; the second is a one-line config change.
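A sketch of the first option in Python follows. Language detection is left as a caller-supplied function because the right detector depends on your stack; `detect_language`, `classify`, and `SUPPORTED_LANGUAGES` are all hypothetical names, and the assumption that only English is supported is just for illustration.

```python
from typing import Callable

# Assumption: the set of languages the classifier was actually trained on.
SUPPORTED_LANGUAGES = frozenset({"en"})


def route_with_language(
    text: str,
    detect_language: Callable[[str], str],  # returns a language code
    classify: Callable[[str], float],       # returns a spam probability
    low: float = 0.50,
    high: float = 0.85,
) -> str:
    """Send unsupported languages to the queue without scoring them."""
    if detect_language(text) not in SUPPORTED_LANGUAGES:
        # The score would be unreliable, so let a human decide.
        return "queue"
    score = classify(text)
    if score < low:
        return "publish"
    if score > high:
        return "reject"
    return "queue"
```

Skipping the classifier call for unsupported languages also avoids paying for a score you would not trust anyway.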
Quoted spam
A genuine reply that quotes a spam parent comment will inherit the spam signal — the classifier sees the quoted text and reasonably concludes the whole thing is suspect. Two clean fixes: strip block quotes before classifying, or classify only the new lines added in the reply. The stripped-quote approach is one regex; the new-lines approach requires a diff against the parent and is fiddly enough that it's usually not worth it.
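The stripped-quote approach could look like the Python sketch below. It assumes quotes arrive as Markdown-style lines beginning with `>`; if your editor stores quotes as HTML or some other markup, the pattern would need to change. The function name is made up for this example.

```python
import re

# Matches a whole line whose first non-blank character is ">",
# including its trailing newline if there is one.
_QUOTE_LINE = re.compile(r"^[ \t]*>.*(?:\n|$)", re.MULTILINE)


def strip_block_quotes(text: str) -> str:
    """Remove quoted lines so only the reply's own words are classified."""
    return _QUOTE_LINE.sub("", text).strip()


reply = (
    "> cheap pills here\n"
    "> click now\n"
    "That parent comment is obviously spam.\n"
)
print(strip_block_quotes(reply))  # prints: That parent comment is obviously spam.
```

A reply that consists of nothing but a quote comes back as an empty string, which is worth handling explicitly (for example by routing it to the queue) rather than sending empty text to the classifier.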
First-comment bias
Brand-new accounts at threshold-adjacent scores deserve more scrutiny than ten-year-old accounts. Stack account age and posting velocity onto the score as a weighted adjustment, not as a hard gate — gating on account age will exclude legitimate first-time commenters who are often the highest-quality contributors.
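One way to express "weighted adjustment, not a hard gate" is to nudge the score upward by small, capped amounts. The Python sketch below is illustrative only: the weights, the 30-day newness window, and the posts-per-hour cap are invented parameters that would need tuning against real data.

```python
def adjusted_score(
    score: float,
    account_age_days: float,
    posts_last_hour: int,
    age_weight: float = 0.05,               # assumed maximum bump for a brand-new account
    velocity_weight: float = 0.05,          # assumed maximum bump for rapid posting
    new_account_window_days: float = 30.0,  # assumed window over which newness fades
    velocity_cap: float = 10.0,             # assumed posts/hour that earns the full bump
) -> float:
    """Nudge a spam probability upward for new, fast-posting accounts."""
    # 1.0 for an account created today, fading linearly to 0.0 at the window.
    newness = max(0.0, 1.0 - account_age_days / new_account_window_days)
    # 0.0 for no recent posts, rising linearly to 1.0 at the cap.
    velocity = min(1.0, posts_last_hour / velocity_cap)
    bumped = score + age_weight * newness + velocity_weight * velocity
    # Keep the result a valid probability.
    return min(1.0, max(0.0, bumped))
```

With these placeholder weights, a brand-new account scoring 0.48 is pushed just over a 0.50 low threshold and lands in the queue, while the same comment from a ten-year-old account publishes immediately. A clearly clean first comment (say 0.10) still publishes, which is the point of adjusting rather than gating.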
Queue UX traps
The traps that destroy moderator throughput look small in isolation:
- No keyboard dispatch. If approving and rejecting requires aiming a mouse, you've cut moderator throughput by an order of magnitude. Single-key dispatch (J/K to navigate, A to approve, R to reject) is the floor.
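- Sort order matters. Sort the queue so the most ambiguous comments come first, for example by distance from the middle of the queue band, rather than by raw descending probability, which puts the most obviously spammy items on top. Moderators are sharpest in the first twenty minutes of a session; spend their attention on the borderline cases, not the obvious ones.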
- Show the score, not just the binary. A comment at 0.52 and a comment at 0.84 are both "queued" but want different reviews. Surface the probability as a badge (we use red/amber/grey).
- Bulk-approve by author. Once a moderator has approved one comment from an author in a session, give them a one-click "approve all from this author" for the queue. You're trusting human judgment to short-circuit machine confidence.
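The dispatch, badge, and bulk-approve ideas above can be sketched as a small in-memory model in Python. This is a toy, not a UI: the class and method names are hypothetical, J is assumed to mean "next" and K "previous", and the badge cut points (0.60 and 0.75) are placeholders within the default queue band.

```python
from dataclasses import dataclass


@dataclass
class QueuedComment:
    comment_id: int
    author: str
    score: float
    decision: str = "pending"  # "pending", "approved", or "rejected"


def badge(score: float, amber_from: float = 0.60, red_from: float = 0.75) -> str:
    """Map a queued comment's score to a badge colour (assumed cut points)."""
    if score >= red_from:
        return "red"
    if score >= amber_from:
        return "amber"
    return "grey"


class ReviewSession:
    def __init__(self, queue: list[QueuedComment]) -> None:
        if not queue:
            raise ValueError("queue must not be empty")
        self.queue = queue
        self.cursor = 0
        # Single-key dispatch table; keys are matched case-insensitively.
        self._keys = {
            "j": self.next, "k": self.previous,
            "a": self.approve, "r": self.reject,
        }

    def press(self, key: str) -> None:
        handler = self._keys.get(key.lower())
        if handler is not None:  # unknown keys are ignored
            handler()

    def next(self) -> None:
        self.cursor = min(self.cursor + 1, len(self.queue) - 1)

    def previous(self) -> None:
        self.cursor = max(self.cursor - 1, 0)

    def approve(self) -> None:
        self.queue[self.cursor].decision = "approved"

    def reject(self) -> None:
        self.queue[self.cursor].decision = "rejected"

    def approve_all_from_current_author(self) -> int:
        """Approve every still-pending comment by the current author."""
        author = self.queue[self.cursor].author
        changed = 0
        for comment in self.queue:
            if comment.author == author and comment.decision == "pending":
                comment.decision = "approved"
                changed += 1
        return changed
```

Bulk approval only touches comments that are still pending, so it never overrides a decision the moderator has already made in the session.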
Auditing the classifier
A spam classifier is a moving target. Spam techniques evolve, your audience evolves, and the classifier vendor ships new model versions. Three audit habits keep you out of trouble:
- Sample false positives weekly. Pull a handful of high-score comments at random and have a human re-judge them. A false-positive rate that is trending up is the earliest signal that thresholds want raising.
- Don't delete rejected comments. Soft-delete with the score and timestamp persisted. The audit sample needs the actual text. So does the support reply when a commenter writes in saying their post was eaten.
- Track score histograms over time. Plot the score distribution per week. A sudden bimodal distribution where there used to be a long tail usually means the classifier got an update; a sudden right-shift usually means a spam wave. Either is worth knowing before a moderator notices through the queue.
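Both the weekly sample and the histogram need very little code. The Python sketch below assumes score records are available as `(date, score)` pairs and that rejected comments can be loaded into a list; the function names, the ten-bin layout, and the sample size are illustrative choices.

```python
import random
from collections import Counter, defaultdict
from datetime import date
from typing import Iterable, Sequence


def weekly_histograms(
    records: Iterable[tuple[date, float]], bins: int = 10
) -> dict[tuple[int, int], Counter]:
    """Group scores by ISO (year, week) and count them per score bin."""
    out: dict[tuple[int, int], Counter] = defaultdict(Counter)
    for day, score in records:
        iso = day.isocalendar()  # (ISO year, ISO week, ISO weekday)
        # A score of exactly 1.0 is folded into the top bin.
        bin_index = min(int(score * bins), bins - 1)
        out[(iso[0], iso[1])][bin_index] += 1
    return dict(out)


def audit_sample(rejected: Sequence, k: int = 20, seed=None) -> list:
    """Pick up to k rejected comments at random for human re-judging."""
    rng = random.Random(seed)  # pass a seed to reproduce a given week's sample
    pool = list(rejected)
    return rng.sample(pool, min(k, len(pool)))
```

Comparing one week's `Counter` with the previous week's is enough to spot the bimodal or right-shifted shapes described above, even before anything is plotted.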
What you don't need
A few moderation patterns that get recommended a lot but consistently underperform calibrated classification: pure regex-based blocklists (every blocklist becomes a maintenance graveyard), pure "first comment moderated forever" gates (kills first-time-commenter quality), honeypot fields as a primary defence (good as a layer, weak alone — see the honeypot post), and reCAPTCHA on the comment form (genuine UX tax for marginal benefit on a problem that classification handles server-side).
The shorter version
Three buckets, two thresholds, an allowlist for trusted authors, re-classify on edit, and a moderator queue UX that respects single-keystroke dispatch. Add a weekly false-positive audit and a score histogram to keep the classifier honest. Most of the difficulty in comment moderation is operational, not technical — the model is the smaller half of the problem.