Robust detection of comment spam using entropy rate

By Alex Kantchelian, Sadia Afroz, Justin Ma, Anthony D. Joseph, Ling Huang and J. D. Tygar

Abstract

In this work, we design a method for blog comment spam detection using the assumption that spam is any kind of uninformative content. To measure the "informativeness" of a set of blog comments, we construct a language- and tokenization-independent metric which we call content complexity, providing a normalized answer to the informal question "how much information does this text contain?" We leverage this metric to create a small set of features well-adjusted to comment spam detection by computing the content complexity over groupings of messages sharing the same author, the same sender IP, the same included links, etc. We evaluate our method against a dataset of tens of millions of comments collected over a four-month period and spanning a variety of websites, including blogs and news sites. The data was provided to us with an initial spam labeling from a competitive industry source; nevertheless, this initial labeling had unknown performance characteristics. To train a logistic regression on this dataset using our features, we derive a simple mislabeling-tolerant logistic regression algorithm based on expectation-maximization, which we show generally outperforms the plain version in precision-recall space. By using a parsimonious hand-labeling strategy, we show that our method can operate at an arbitrarily high precision level, and that it significantly dominates the original labeling in terms of both precision and recall, despite being trained on it alone. The content complexity metric, the use of a noise-tolerant logistic regression, and the evaluation methodology are thus the three central contributions of this work.
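The two ideas in the abstract can be illustrated with short sketches. First, a minimal sketch of a compression-based proxy for the content complexity of a group of comments: comments are grouped by a shared attribute (author, sender IP, included link), the group's text is concatenated, and redundancy is measured by how well it compresses. This is only an illustration of the idea, not the authors' metric; the function names, the use of zlib, and the example data are assumptions.

    import zlib
    from collections import defaultdict

    def compression_ratio(text: str) -> float:
        """Compressed size over raw size; lower values indicate more redundant,
        less informative text (a crude proxy for low entropy rate)."""
        raw = text.encode("utf-8")
        if not raw:
            return 1.0
        return len(zlib.compress(raw, 9)) / len(raw)

    def group_complexity(comments, key):
        """Group comments by a shared attribute and score each group by how
        well its concatenated text compresses."""
        groups = defaultdict(list)
        for c in comments:
            groups[c[key]].append(c["text"])
        return {k: compression_ratio("\n".join(texts)) for k, texts in groups.items()}

    # Hypothetical example: a repetitive author compresses far better than a genuine one.
    comments = [
        {"author": "spammer1", "text": "Buy cheap watches now!!!"},
        {"author": "spammer1", "text": "Buy cheap watches now!!!"},
        {"author": "alice", "text": "The second argument in the post seems weaker than the first."},
    ]
    print(group_complexity(comments, "author"))

Second, a minimal sketch of an expectation-maximization scheme for training a logistic regression on labels with unknown noise, in the spirit of the mislabeling-tolerant training described above. It treats the clean label as a latent variable with class-conditional flip rates; the parameterization, initial noise rates, and use of scikit-learn are assumptions, not the paper's exact algorithm.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def em_noisy_logreg(X, y_noisy, n_iter=20, rho0=0.1, rho1=0.1):
        """EM for logistic regression under class-conditional label noise.
        rho0 = P(noisy label 1 | true label 0), rho1 = P(noisy label 0 | true label 1)."""
        n = len(y_noisy)
        q = y_noisy.astype(float).copy()       # posterior P(true label = 1), initialized at the noisy labels
        clf = LogisticRegression(max_iter=1000)
        for _ in range(n_iter):
            # M-step: weighted fit on duplicated samples carrying fractional labels
            Xd = np.vstack([X, X])
            yd = np.concatenate([np.ones(n), np.zeros(n)])
            wd = np.concatenate([q, 1.0 - q])
            clf.fit(Xd, yd, sample_weight=wd + 1e-12)
            # update the estimated flip rates from the current posteriors
            rho0 = np.sum((1 - q) * y_noisy) / max(np.sum(1 - q), 1e-12)
            rho1 = np.sum(q * (1 - y_noisy)) / max(np.sum(q), 1e-12)
            # E-step: posterior over the latent clean label given model and noise rates
            p1 = clf.predict_proba(X)[:, 1]
            like1 = np.where(y_noisy == 1, 1 - rho1, rho1)
            like0 = np.where(y_noisy == 1, rho0, 1 - rho0)
            q = like1 * p1 / (like1 * p1 + like0 * (1 - p1) + 1e-12)
        return clf, q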

Topics: I.2.6 [Artificial Intelligence]: Learning; H.3.3 [Information Storage and Retrieval]
Year: 2012
OAI identifier: oai:CiteSeerX.psu:10.1.1.352.3044
Provided by: CiteSeerX