The Best Answers? Think Twice: Online Detection of Commercial Campaigns in the CQA Forums
In an emerging trend, more and more Internet users search for information
from Community Question and Answer (CQA) websites, as interactive communication
in such websites provides users with a rare feeling of trust. More often than
not, end users look for instant help when they browse the CQA websites for the
best answers. Hence, it is imperative that they should be warned of any
potential commercial campaigns hidden behind the answers. However, existing
research focuses more on the quality of answers and does not meet the above
need. In this paper, we develop a system that automatically analyzes the hidden
patterns of commercial spam and raises alarms instantaneously to end users
whenever a potential commercial campaign is detected. Our detection method
integrates semantic analysis and posters' track records and utilizes the
special features of CQA websites largely different from those in other types of
forums such as microblogs or news reports. Our system is adaptive and
accommodates new evidence uncovered by the detection algorithms over time.
Validated with real-world trace data from a popular Chinese CQA website over a
period of three months, our system shows great potential towards adaptive
online detection of CQA spam.
Comment: 9 pages, 10 figures
Deep Learning for User Comment Moderation
Experimenting with a new dataset of 1.6M user comments from a Greek news
portal and existing datasets of English Wikipedia comments, we show that an RNN
outperforms the previous state of the art in moderation. A deep,
classification-specific attention mechanism further improves the overall
performance of the RNN. We also compare against a CNN and a word-list baseline,
considering both fully automatic and semi-automatic moderation.
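To make the terms in this abstract concrete, here is a minimal sketch of what a word-list baseline and a semi-automatic triage step might look like. It is not the authors' implementation. The word list, the function names, and the thresholds are all hypothetical, and reading "semi-automatic" as "route uncertain comments to a human" is an assumption, not something the abstract states.

```python
import re

# Hypothetical word list; a real baseline would use a curated lexicon.
BLOCKLIST = {"idiot", "stupid", "moron"}

TOKEN_RE = re.compile(r"\w+")


def wordlist_flag(comment, blocklist=BLOCKLIST):
    """Return True if any lower-cased token of the comment is in the blocklist."""
    return any(tok.lower() in blocklist for tok in TOKEN_RE.findall(comment))


def triage(score, low=0.2, high=0.8):
    """Assumed semi-automatic policy: decide confident cases automatically
    and send the uncertain middle band to a human moderator.

    `score` is a classifier's estimated probability that the comment
    should be rejected; `low` and `high` are illustrative thresholds.
    """
    if score >= high:
        return "reject"
    if score <= low:
        return "accept"
    return "human"


if __name__ == "__main__":
    print(wordlist_flag("You are an IDIOT!"))
    print(triage(0.5))
```

In a fully automatic setting the middle band would disappear and every comment would be accepted or rejected by a single threshold.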
BlogForever D2.6: Data Extraction Methodology
This report outlines an inquiry into the area of web data extraction, conducted within the context of blog preservation. The report reviews theoretical advances and practical developments for implementing data extraction. The inquiry is extended through an experiment that demonstrates the effectiveness and feasibility of implementing some of the suggested approaches. More specifically, the report discusses an approach based on unsupervised machine learning that employs the RSS feeds and HTML representations of blogs. It outlines the possibilities of extracting semantics available in blogs and demonstrates the benefits of exploiting available standards such as microformats and microdata. The report proceeds to propose a methodology for extracting and processing blog data to further inform the design and development of the BlogForever platform.
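As an illustration of why RSS feeds are a useful starting point for blog data extraction, the sketch below reads the structured fields of each post from an RSS 2.0 feed with Python's standard library. It covers only the feed-reading step, not the unsupervised learning or HTML matching described in the report. The sample feed, the function name `extract_posts`, and the chosen fields are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed; a real pipeline would download the blog's RSS URL.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example blog</title>
    <item>
      <title>First post</title>
      <link>http://example.org/1</link>
      <pubDate>Mon, 01 Jan 2018 00:00:00 GMT</pubDate>
      <description>Hello</description>
    </item>
  </channel>
</rss>"""

FIELDS = ("title", "link", "pubDate", "description")


def extract_posts(rss_text):
    """Return one dict per <item>, with missing fields as empty strings."""
    root = ET.fromstring(rss_text)
    posts = []
    for item in root.iter("item"):
        posts.append({f: (item.findtext(f) or "").strip() for f in FIELDS})
    return posts


if __name__ == "__main__":
    for post in extract_posts(SAMPLE_RSS):
        print(post["title"], post["link"])
```

Values obtained this way (title, date, link) could then serve as known examples to locate the same content in the post's HTML page, which is one plausible way to combine the two representations.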
Addressing the new generation of spam (Spam 2.0) through Web usage models
New Internet collaborative media introduce new ways of communicating that are not immune to abuse. A fake eye-catching profile on a social networking website, a promotional review, a response to a thread in an online forum with unsolicited content, or a manipulated wiki page are examples of the new generation of spam on the web, referred to as Web 2.0 Spam or Spam 2.0. Spam 2.0 is defined as the propagation of unsolicited, anonymous, mass content to infiltrate legitimate Web 2.0 applications. The current literature does not address Spam 2.0 in depth, and the outcomes of efforts to date are inadequate. The aim of this research is to formalise a definition for Spam 2.0 and provide Spam 2.0 filtering solutions. Early detection, extendibility, robustness and adaptability are key factors in the design of the proposed method. This dissertation provides a comprehensive survey of the state-of-the-art web spam and Spam 2.0 filtering methods to highlight the unresolved issues and open problems, while at the same time effectively capturing the knowledge in the domain of spam filtering. This dissertation proposes three solutions in the area of Spam 2.0 filtering: (1) characterising and profiling Spam 2.0, (2) an Early-Detection based Spam 2.0 Filtering (EDSF) approach, and (3) an On-the-Fly Spam 2.0 Filtering (OFSF) approach. All the proposed solutions are tested against real-world datasets and their performance is compared with that of existing Spam 2.0 filtering methods. This work has coined the term “Spam 2.0”, provided insight into the nature of Spam 2.0, and proposed filtering mechanisms to address this new and rapidly evolving problem.