The Best Answers? Think Twice: Online Detection of Commercial Campaigns in the CQA Forums
In an emerging trend, more and more Internet users search for information
from Community Question and Answer (CQA) websites, as interactive communication
in such websites provides users with a rare feeling of trust. More often than
not, end users look for instant help when they browse the CQA websites for the
best answers. Hence, it is imperative that they should be warned of any
potential commercial campaigns hidden behind the answers. However, existing
research focuses more on the quality of answers and does not meet the above
need. In this paper, we develop a system that automatically analyzes the hidden
patterns of commercial spam and raises alarms instantaneously to end users
whenever a potential commercial campaign is detected. Our detection method
integrates semantic analysis with posters' track records and exploits features
specific to CQA websites that differ markedly from those of other types of
forums, such as microblogs or news reports. Our system is adaptive and
accommodates new evidence uncovered by the detection algorithms over time.
Validated with real-world trace data from a popular Chinese CQA website over a
period of three months, our system shows great potential towards adaptive
online detection of CQA spam.
Comment: 9 pages, 10 figures
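The abstract above does not give the detection algorithm, so the following is only a minimal sketch of the general idea it describes: combine a content signal with a poster's track record, and feed newly flagged answers back in as evidence. Everything here is an assumption made for illustration: the class name `CampaignDetector`, the weights, the threshold, and the use of token-set Jaccard overlap as a crude stand-in for semantic analysis.

```python
from dataclasses import dataclass, field


def jaccard(a: set, b: set) -> float:
    """Token-set overlap in [0, 1]; a crude stand-in for semantic similarity."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


@dataclass
class CampaignDetector:
    """Hypothetical detector; weights and threshold are illustrative only."""
    threshold: float = 0.5
    text_weight: float = 0.6
    known_spam: list = field(default_factory=list)  # token sets of flagged answers
    history: dict = field(default_factory=dict)     # poster -> (flagged, total)

    def score(self, poster: str, answer: str) -> float:
        tokens = set(answer.lower().split())
        # Content signal: closest match to any previously flagged answer.
        text_score = max((jaccard(tokens, s) for s in self.known_spam), default=0.0)
        # Track-record signal: fraction of this poster's answers flagged so far.
        flagged, total = self.history.get(poster, (0, 0))
        record_score = flagged / total if total else 0.0
        return self.text_weight * text_score + (1 - self.text_weight) * record_score

    def observe(self, poster: str, answer: str) -> bool:
        """Score an incoming answer, then update the evidence (the adaptive part)."""
        is_spam = self.score(poster, answer) >= self.threshold
        flagged, total = self.history.get(poster, (0, 0))
        self.history[poster] = (flagged + int(is_spam), total + 1)
        if is_spam:
            self.known_spam.append(set(answer.lower().split()))
        return is_spam
```

In this sketch, calling `observe` on each new answer both raises the alarm and updates the stored evidence, so later answers from the same poster or with similar wording score higher.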
Quality-biased ranking of short texts in microblogging services
Meeting: 5th International Joint Conference on Natural Language Processing, Chiang Mai, Thailand, November 8 - 13, 2011
The abundance of user-generated content
comes at a price: the quality of content may range from very high to very low.
We propose a regression approach that incorporates various features to
recommend short-text documents from Twitter, with a bias toward quality. The
approach is built on top of a linear regression model that includes a
regularization factor inspired by the content conformity hypothesis: documents
similar in content may have similar quality. We test the system on the
Edinburgh Twitter corpus. Experimental results show that the regularization
factor inspired by the hypothesis can improve ranking performance and that
using unlabeled data can also improve ranking performance. Comparative results
show that our method outperforms several baseline systems. We also perform a
systematic feature analysis and find that content quality features are
dominant in short-text ranking.
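The abstract does not state the exact objective, so the following is a minimal sketch of one common way to encode a "similar content, similar quality" prior in linear regression. It assumes a squared-error loss on labeled documents plus a penalty on prediction differences between content-similar pairs, which may include unlabeled documents. The function names, the similarity dictionary `sim`, and the hyperparameters `lam`, `mu`, `lr`, and `steps` are all hypothetical.

```python
def dot(w, x):
    """Linear score w . x."""
    return sum(wi * xi for wi, xi in zip(w, x))


def fit(X, y, sim, lam=0.1, mu=0.01, lr=0.01, steps=2000):
    """Gradient descent on an assumed objective:

        sum_i (w.x_i - y_i)^2                      (labeled rows only)
      + lam * sum_{(i,j)} s_ij (w.x_i - w.x_j)^2   (conformity penalty)
      + mu * ||w||^2                               (ridge term)

    X:   feature vectors, labeled rows first, then unlabeled rows.
    y:   quality labels for the first len(y) rows of X.
    sim: {(i, j): similarity} for content-similar pairs of rows.
    """
    d = len(X[0])
    w = [0.0] * d
    for _ in range(steps):
        grad = [2 * mu * wk for wk in w]
        # Squared-error term on labeled documents.
        for i, yi in enumerate(y):
            err = dot(w, X[i]) - yi
            for k in range(d):
                grad[k] += 2 * err * X[i][k]
        # Conformity term: similar documents should get similar scores.
        for (i, j), s in sim.items():
            diff = dot(w, X[i]) - dot(w, X[j])
            for k in range(d):
                grad[k] += 2 * lam * s * diff * (X[i][k] - X[j][k])
        w = [wk - lr * gk for wk, gk in zip(w, grad)]
    return w


# Toy data: two labeled documents and one unlabeled document that is
# content-similar to the first.
X = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
y = [1.0, 0.0]
sim = {(0, 2): 1.0}

w = fit(X, y, sim)
scores = [dot(w, x) for x in X]
ranking = sorted(range(len(X)), key=lambda i: scores[i], reverse=True)
```

Because the conformity term needs no labels, unlabeled documents can contribute to training in this formulation, which is one plausible reading of the abstract's remark about unlabeled data.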