Comment moderation that doesn't burn your community

The hard part of comment moderation isn't catching spam. Modern classifiers handle the easy 95%. The hard part is what to do with the borderline cases without grinding your moderators into the ground or turning every approved comment into a debate.

What follows is the playbook we'd want a new site operator to adopt on day one, drawn from running classification at volume across blogs, comment platforms, and forum-style products. Some of it is mechanical (queue routing, threshold choice). Some of it is operational (how moderators stay sharp, how to audit the classifier). Both halves matter: get only one right and the neglected half will undo it.

The three-bucket pattern

Bucket every new comment into one of three statuses on submission:

Publish. Score below your low threshold. Goes live immediately. Reader sees it; the author sees it on the next refresh; nothing else happens.
Queue. Score between the low and high thresholds. Visible to the author (so they don't think their comment was eaten), invisible to the public, surfaced in the moderator's review queue.
Reject. Score above your high threshold. Saved as rejected (don't delete), invisible to everyone including the author.

The reason to use a calibrated probability rather than a binary verdict here is that the buckets are policy choices, and you'll want to tune them differently per surface. A developer forum where every comment matters runs conservative thresholds (both the low and the high threshold raised). A marketplace review section where moderator time is the bottleneck runs aggressive thresholds (both lowered). Same model, two product configurations.

Sensible starting defaults: 0.50 for the low threshold, 0.85 for the high threshold. As a rough guide, expect those numbers to send about 80% of comments straight to publish, 15% to queue, and 5% to reject; the exact split depends on your traffic. Your queue load is the 15%.
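The routing rule above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the names `Status`, `Thresholds`, `bucket`, and `SURFACES` are made up for the example, and the per-surface numbers other than 0.50 and 0.85 are placeholders chosen only to show the direction of the adjustment.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    PUBLISH = "publish"
    QUEUE = "queue"
    REJECT = "reject"


@dataclass(frozen=True)
class Thresholds:
    low: float = 0.50   # scores below this publish immediately
    high: float = 0.85  # scores above this are rejected

    def __post_init__(self) -> None:
        if not 0.0 <= self.low <= self.high <= 1.0:
            raise ValueError("need 0 <= low <= high <= 1")


def bucket(score: float, t: Thresholds = Thresholds()) -> Status:
    """Map a calibrated spam probability to one of the three statuses."""
    if score < t.low:
        return Status.PUBLISH
    if score > t.high:
        return Status.REJECT
    return Status.QUEUE  # a score exactly on a threshold goes to a human


# Hypothetical per-surface policy: same model, different thresholds.
SURFACES = {
    "default": Thresholds(),
    "dev_forum": Thresholds(low=0.60, high=0.95),            # conservative
    "marketplace_reviews": Thresholds(low=0.40, high=0.75),  # aggressive
}
```

Keeping the thresholds in configuration rather than in the routing code means a surface can be retuned without a deploy.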

Trusted-author allowlist

Before the classifier runs, check whether the author is on a per-site allowlist. Authors get there by passing a configurable bar: N approved comments without ever being rejected, account age over X days, an explicit moderator promotion. Bypass classification entirely for allowed authors and route them straight to publish.

Two reasons this matters: it saves real money on classification volume (regulars usually generate the bulk of comments), and it keeps the moderator queue focused on actual borderline cases instead of well-known authors who happen to write tersely. The risk, an allowlisted author turning bad, is partly mitigated by re-running classification on edit, which we'll get to.
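A sketch of the allowlist check, under one reading of the bar described above: a moderator promotion is treated as an override, and otherwise an author needs both the approval count and the account age, with any rejection disqualifying them. That combination is an assumption, as are the names `Author`, `AllowlistPolicy`, `is_trusted`, and the default values standing in for N and X.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Author:
    created_at: datetime
    approved_count: int = 0
    rejected_count: int = 0
    promoted_by_moderator: bool = False


@dataclass(frozen=True)
class AllowlistPolicy:
    min_approved: int = 10  # "N" in the text; the value is illustrative
    min_age_days: int = 30  # "X" in the text; the value is illustrative


def is_trusted(author: Author, policy: AllowlistPolicy, now: datetime) -> bool:
    """Return True if the author may skip classification on this site."""
    if author.promoted_by_moderator:
        return True  # explicit promotion overrides the automatic bar
    if author.rejected_count > 0:
        return False  # "without ever being rejected"
    old_enough = now - author.created_at > timedelta(days=policy.min_age_days)
    return old_enough and author.approved_count >= policy.min_approved
```

The check runs before the classifier, so a trusted author costs no classification call on a new comment.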

Edge cases that bite

Edits

Re-classify on every edit, no exceptions. The "post clean, edit in spam" pattern is one of the oldest tricks around, and it works precisely because some platforms classify only the initial post. The cost of re-classifying is one API call per edit, which for most sites is a rounding error on the volume bill.
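In code, the rule amounts to never carrying a status across an edit. The sketch below assumes a comment is a plain dict and that `classify` is whatever callable wraps your classifier and returns a spam probability; both are stand-ins, not a real API.

```python
from typing import Callable, Dict

Classifier = Callable[[str], float]  # hypothetical: text -> spam probability


def handle_edit(
    comment: Dict,
    new_body: str,
    classify: Classifier,
    low: float = 0.50,
    high: float = 0.85,
) -> Dict:
    """Re-score an edited comment and recompute its status from scratch."""
    score = classify(new_body)  # one classifier call per edit
    if score < low:
        status = "publish"
    elif score > high:
        status = "reject"
    else:
        status = "queue"
    # Return a new record; the previous score and status are not reused.
    return {**comment, "body": new_body, "score": score, "status": status}
```

A previously published comment can therefore drop back to queue or reject the moment its edited text scores badly.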

Multilingual content

Most production classifiers are trained primarily on English. Non-English comments tend to score artificially high because the model encounters tokens it can't place. Two options: pre-detect the language and skip classification for languages the model wasn't trained on (route those comments to queue instead), or raise the thresholds for non-English comments. The first is more defensible; the second is one config line.
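The first option might look like the sketch below. Language detection is passed in as a callable because the choice of detector is up to you; `detect_language`, `classify`, and `SUPPORTED_LANGUAGES` are all hypothetical names, and the supported set is an assumption you would replace with whatever your classifier vendor documents.

```python
from typing import Callable, Optional, Tuple

# Assumption: language codes the classifier is known to handle well.
SUPPORTED_LANGUAGES = frozenset({"en"})


def route(
    body: str,
    detect_language: Callable[[str], str],
    classify: Callable[[str], float],
    low: float = 0.50,
    high: float = 0.85,
) -> Tuple[str, Optional[float]]:
    """Return (status, score); score is None when classification is skipped."""
    if detect_language(body) not in SUPPORTED_LANGUAGES:
        return "queue", None  # a human decides; no misleading score is stored
    score = classify(body)
    if score < low:
        return "publish", score
    if score > high:
        return "reject", score
    return "queue", score
```

Storing `None` rather than a score for skipped comments keeps unsupported languages out of any later score statistics.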

Quoted spam

A genuine reply that quotes a spam parent comment will inherit the spam signal — the classifier sees the quoted text and reasonably concludes the whole thing is suspect. Two clean fixes: strip block quotes before classifying, or classify only the new lines added in the reply. The stripped-quote approach is one regex; the new-lines approach requires a diff against the parent and is fiddly enough that it's usually not worth it.
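A sketch of the stripped-quote approach, assuming replies use Markdown-style quoting where each quoted line starts with `>`. If your platform stores quotes as HTML or in some other form, the pattern will differ; `strip_quotes` is an illustrative name.

```python
import re

# A quoted line: up to three spaces of indentation, then ">", through the
# end of the line (including its newline, if there is one).
_QUOTE_LINE = re.compile(r"^[ ]{0,3}>.*(?:\n|$)", re.MULTILINE)


def strip_quotes(body: str) -> str:
    """Drop quoted lines so only the reply's own text is classified."""
    return _QUOTE_LINE.sub("", body).strip()
```

For example, a reply that quotes two lines of spam and then adds "Please flag the parent, it's spam." is reduced to that last sentence before scoring. A reply that consists only of quoted text comes back empty, which is worth handling explicitly rather than sending an empty string to the classifier.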

First-comment bias

Brand-new accounts at threshold-adjacent scores deserve more scrutiny than ten-year-old accounts. Stack account age and posting velocity onto the score as a weighted adjustment, not as a hard gate — gating on account age will exclude legitimate first-time commenters who are often the highest-quality contributors.
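One way to express the weighted adjustment is an additive bump that fades as the account ages. Everything numeric here is an assumption for illustration: the weights, the 30-day decay constant, and the 10-comments-per-hour saturation point would all need tuning against your own data.

```python
import math


def adjusted_score(
    score: float,
    account_age_days: float,
    comments_last_hour: int,
    age_weight: float = 0.10,
    velocity_weight: float = 0.05,
) -> float:
    """Nudge a spam score upward for new or fast-posting accounts."""
    # 1.0 for a brand-new account, decaying towards 0 (30-day time constant).
    newness = math.exp(-account_age_days / 30.0)
    # Posting velocity, saturating at 1.0 once it reaches 10 comments/hour.
    velocity = min(comments_last_hour / 10.0, 1.0)
    bump = age_weight * newness + velocity_weight * velocity
    return min(1.0, score + bump)
```

Because the bump is capped at 0.15 with these weights, a new account with a clearly clean comment still publishes; only scores already near a threshold change bucket. Note that the adjusted value is a routing score and no longer a calibrated probability.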

Queue UX traps

The traps that destroy moderator throughput look small in isolation:

No keyboard dispatch. If approving and rejecting requires aiming a mouse, you've cut moderator throughput by an order of magnitude. Single-key dispatch (J/K to navigate, A to approve, R to reject) is the floor.
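Sort order matters. Sort the queue so the most ambiguous comments come first, meaning the scores closest to 0.5, where the classifier is least sure. Sorting by descending probability does the opposite and leads with near-certain spam. Moderators are sharpest in the first twenty minutes of a session; spend their attention on the borderline cases, not the obvious ones.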
Show the score, not just the binary. A comment at 0.52 and a comment at 0.84 are both "queued" but want different reviews. Surface the probability as a badge (we use red/amber/grey).
Bulk-approve by author. Once a moderator has approved one comment from an author in a session, give them a one-click "approve all from this author" for the queue. You're trusting human judgment to short-circuit machine confidence.
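Two of the items above are small enough to sketch: the score badge and the bulk approve. The colour cut-offs are an assumption (the queue band split into thirds), since the text only names the colours, and queue entries are assumed to be dicts with an `author_id` key.

```python
from typing import Dict, List, Tuple


def badge(score: float, low: float = 0.50, high: float = 0.85) -> str:
    """Pick a badge colour for a queued comment from its score."""
    third = (high - low) / 3
    if score >= high - third:
        return "red"    # upper third of the queue band
    if score >= low + third:
        return "amber"  # middle third
    return "grey"       # lower third


def approve_all_from(
    queue: List[Dict], author_id: str
) -> Tuple[List[Dict], List[Dict]]:
    """Split the queue into (approved, remaining) for one author."""
    approved = [
        {**c, "status": "publish"} for c in queue if c["author_id"] == author_id
    ]
    remaining = [c for c in queue if c["author_id"] != author_id]
    return approved, remaining
```

With the default thresholds, a comment at 0.52 gets a grey badge and one at 0.84 gets a red one, which is the distinction the badge exists to show.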

Auditing the classifier

A spam classifier is a moving target. Spam techniques evolve, your audience evolves, and the classifier vendor ships new model versions. Three audit habits keep you out of trouble:

Sample false positives weekly. Pull a handful of rejected and high-score comments at random and have a human re-judge them. A false-positive rate that trends up means legitimate comments are being caught, and it is the earliest signal that thresholds want raising.
Don't delete rejected comments. Soft-delete with the score and timestamp persisted. The audit sample needs the actual text. So does the support reply when a commenter writes in saying their post was eaten.
Track score histograms over time. Plot the score distribution per week. A sudden bimodal distribution where there used to be a long tail usually means the classifier got an update; a sudden right-shift usually means a spam wave. Either is worth knowing before a moderator notices through the queue.
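The sampling and histogram habits need very little code. The sketch below uses only the standard library; the function names, the sample size of 25, and the ten-bin histogram are illustrative choices, not recommendations from the text.

```python
import random
from collections import Counter
from typing import List, Optional, Sequence


def weekly_sample(
    rejected: Sequence[dict], k: int = 25, seed: Optional[int] = None
) -> List[dict]:
    """Draw up to k rejected comments at random for human re-judging."""
    rng = random.Random(seed)
    return rng.sample(list(rejected), min(k, len(rejected)))


def false_positive_rate(judged_legitimate: Sequence[bool]) -> float:
    """Share of sampled rejections that a human judged to be legitimate."""
    if not judged_legitimate:
        return 0.0
    return sum(judged_legitimate) / len(judged_legitimate)


def score_histogram(scores: Sequence[float], bins: int = 10) -> List[int]:
    """Count scores per equal-width bin over [0, 1]; 1.0 joins the top bin."""
    counts = Counter(min(int(s * bins), bins - 1) for s in scores)
    return [counts.get(i, 0) for i in range(bins)]
```

Storing one histogram per week and comparing it with the previous week's is enough to spot a shift in the score distribution; plotting is optional.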

What you don't need

A few moderation patterns that get recommended a lot but consistently underperform calibrated classification: pure regex-based blocklists (every blocklist becomes a maintenance graveyard), pure "first comment moderated forever" gates (kills first-time-commenter quality), honeypot fields as a primary defence (good as a layer, weak alone — see the honeypot post), and reCAPTCHA on the comment form (genuine UX tax for marginal benefit on a problem that classification handles server-side).

The shorter version

Three buckets, two thresholds, an allowlist for trusted authors, re-classify on edit, and a moderator queue UX that respects single-keystroke dispatch. Add a weekly false-positive audit and a score histogram to keep the classifier honest. Most of the difficulty in comment moderation is operational, not technical — the model is the smaller half of the problem.