In this work, we design a method for blog comment spam detection using the assumption that spam is any kind of uninformative content. To measure the “informativeness ” of a set of blog comments, we construct a language and tokenization independent metric which we call content complexity, providing a normalized answer to the informal question “how much information does this text contain? ” We leverage this metric to create a small set of features well-adjusted to comment spam detection by computing the content complexity over groupings of messages sharing the same author, the same sender IP, the same included links, etc. We evaluate our method against an exact set of tens of millions of comments collected over a four months period and containing a variety of websites, including blogs and news sites. The data was provided to us with an initial spam labeling from an industry competitive source. Nevertheless the initial spam labeling had unknown performance characteristics. To train a logistic regression on this dataset using our features, we derive a simple mislabeling tolerant logistic regression algorithm based on expectationmaximization, which we show generally outperforms the plain version in precision-recall space. By using a parsimonious hand-labeling strategy, we show that our method can operate at an arbitrary high precision level, and that it significantly dominates, both in terms of precision and recall, the original labeling, despite being trained on it alone. The content complexity metric, the use of a noise-tolerant logistic regression and the evaluation methodology are thus the three central contributions with this work
To submit an update or takedown request for this paper, please submit an Update/Correction/Removal Request.