Identifying Expert Reviews in the Crowd: Linking Curated and Noisy Domains
Over the past decade, a vast number of online consumer reviews have established a
significant presence on the Internet. These reviews play a vital role in raising consumer
awareness about products and deeply influence consumers' decision-making.
On one hand, websites like Amazon and Yelp provide huge collections of crowd-sourced
reviews, written by consumers with first-hand experience of the product. Many
researchers question the credibility and bias of these reviews, and these concerns,
coupled with the sheer volume of reviews per product, can make it tiring to form a
perspective on a product. On the other hand, websites like Wirecutter and Thesweetsetup
provide highly curated, hand-crafted, detailed guides on products across various
categories. Although these reviews are unbiased expert opinions, they require rigorous
reporting, interviewing, and testing by journalists, scientists, and researchers, which
makes them hard to scale.
Our aim is to study the possible correlations between crowd-sourced, noisy-domain
reviews and curated reviews. We take into account meta-features of reviews,
context-based textual features of reviews, and word-embedding-based features of words
from reviews. In addition, we identify "good reviews", defined as those noisy-domain
reviews that align with the curated ones, and use them to propose a general-purpose,
extremely streamlined recommender that can provide value to the general public without
any personalized input. This research contributes towards identifying unbiased
crowd-sourced reviews that align with curated reviews across different categories of
products, thereby linking the curated and noisy domains, and towards understanding the
intricacies of good product reviews across different categories.
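A minimal sketch of the alignment idea, assuming a pre-trained word-vector lookup (here a plain dict of numpy arrays) and a hypothetical similarity threshold; the embedding source, the threshold value, and all function names are illustrative assumptions, not the abstract's actual pipeline:

```python
# Illustrative sketch: flag crowd-sourced reviews that align with a curated
# guide by comparing averaged word embeddings. The embedding source, the
# 0.8 threshold, and the function names are assumptions for illustration.
import numpy as np

def embed(text, word_vectors):
    """Average the vectors of tokens found in the embedding vocabulary."""
    vecs = [word_vectors[t] for t in text.lower().split() if t in word_vectors]
    return np.mean(vecs, axis=0) if vecs else None

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def good_reviews(crowd_reviews, curated_text, word_vectors, threshold=0.8):
    """Return the crowd reviews whose embedding aligns with the curated guide."""
    curated_vec = embed(curated_text, word_vectors)
    return [r for r in crowd_reviews
            if (v := embed(r, word_vectors)) is not None
            and cosine(v, curated_vec) >= threshold]
```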
Profiling user activities with minimal traffic traces
Understanding user behavior is essential to personalize and enrich a user's
online experience. While there are significant benefits to be accrued from the
pursuit of personalized services based on a fine-grained behavioral analysis,
care must be taken to address user privacy concerns. In this paper, we consider
the use of web traces with truncated URLs, where each URL is trimmed to contain
only the web domain. While such truncation removes fine-grained sensitive
information, it also strips the data of many features that are crucial to the
profiling of user activity. We show how to overcome this severe handicap and
filter the URLs that represent a user activity out of the noisy network traffic
trace (which includes advertisements, spam, analytics, and web scripts) with
high accuracy. Such activity profiling with truncated URLs enables network
operators to provide personalized services while mitigating privacy concerns,
since only truncated traffic traces need to be stored and shared.
In order to offset the accuracy loss due to truncation, our statistical
methodology leverages specialized features extracted from a group of
consecutive URLs that represent a micro user action like a web click or a chat
reply, which we call a burst. These bursts, in turn, are detected by a novel
algorithm based on the observed characteristics of the inter-arrival time of
HTTP records. We present an extensive experimental evaluation on a real
dataset of mobile web traces, consisting of more than 130 million records that
represent the browsing activities of 10,000 users over a period of 30 days.
Our results show that the proposed methodology achieves around 90% accuracy in
segregating URLs representing user activities from non-representative URLs.
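A minimal sketch of the burst notion described above, assuming records arrive as (timestamp, truncated URL) pairs and using an illustrative one-second gap cutoff rather than the paper's statistically derived detection rule:

```python
# Hypothetical sketch of burst detection: group consecutive HTTP records into
# a burst while the inter-arrival gap stays below a cutoff. The 1-second
# cutoff and the (timestamp, domain) record format are assumptions; the paper
# derives its rule from observed inter-arrival-time characteristics.
def detect_bursts(records, max_gap=1.0):
    """records: list of (timestamp_seconds, truncated_url) sorted by time."""
    bursts, current, last_ts = [], [], None
    for ts, domain in records:
        if last_ts is not None and ts - last_ts > max_gap:
            bursts.append(current)  # gap too large: close the current burst
            current = []
        current.append((ts, domain))
        last_ts = ts
    if current:
        bursts.append(current)
    return bursts

trace = [(0.00, "news.example.com"), (0.12, "cdn.example.net"),
         (0.30, "ads.tracker.io"), (5.40, "chat.example.org")]
print([[d for _, d in b] for b in detect_bursts(trace)])
# -> [['news.example.com', 'cdn.example.net', 'ads.tracker.io'],
#     ['chat.example.org']]
```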
Leveraging online user feedback to improve statistical machine translation
In this article we present a three-step methodology for dynamically improving a statistical machine translation (SMT) system by incorporating human feedback in the form of free edits on the system translations. We target feedback provided by casual users, which is typically error-prone. Thus, we first propose a filtering step to automatically identify the better user-edited translations and discard the useless ones. A second step produces a pivot-based alignment between source and user-edited sentences, focusing on the errors made by the system. Finally, a third step produces a new translation model and combines it linearly with the one from the original system. We perform a thorough evaluation on a real-world dataset collected from the Reverso.net translation service and show that every step in our methodology contributes significantly to improving a general-purpose SMT system. Interestingly, the quality improvement is due not only to the increase in lexical coverage, but also to better lexical selection, reordering, and morphology. Finally, we show the robustness of the methodology by applying it to a different scenario, in which the new examples come from an automatically Web-crawled parallel corpus. Using exactly the same architecture and models again provides a significant improvement in the translation quality of a general-purpose baseline SMT system.
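The third step's linear model combination can be sketched as follows; the 0.9/0.1 interpolation weights and the dict-of-dicts phrase-table format are illustrative assumptions, not the system's actual data structures:

```python
# Hypothetical sketch of linear model combination: interpolate phrase
# translation probabilities from the baseline model and the feedback-derived
# model. Weights and the phrase-table representation are assumptions.
def combine_models(baseline, feedback, weight=0.9):
    """baseline, feedback: {source_phrase: {target_phrase: probability}}."""
    combined = {}
    for src in set(baseline) | set(feedback):
        base_t = baseline.get(src, {})
        feed_t = feedback.get(src, {})
        combined[src] = {
            tgt: weight * base_t.get(tgt, 0.0)
                 + (1 - weight) * feed_t.get(tgt, 0.0)
            for tgt in set(base_t) | set(feed_t)
        }
    return combined
```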
Analysing Errors of Open Information Extraction Systems
We report results on benchmarking Open Information Extraction (OIE) systems
using RelVis, a toolkit built for that purpose. Our comprehensive benchmark
contains three data sets from the news domain and one data set from Wikipedia,
with a total of 4,522 labeled sentences and 11,243 binary or n-ary OIE
relations. On these data sets we compare the performance of four popular OIE
systems: ClausIE, OpenIE 4.2, Stanford OpenIE, and PredPatt. In addition, we
evaluate the impact of five common error classes on a subset of 749 n-ary
tuples. Our in-depth analysis reveals important research directions for the
next generation of OIE systems.
Comment: Accepted at the workshop Building Linguistically Generalizable NLP Systems at EMNLP 2017
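For intuition, a minimal sketch of benchmark-style tuple scoring; exact tuple matching is an illustrative simplification, as toolkits like RelVis typically apply more lenient containment- or overlap-based matching:

```python
# Hypothetical sketch: score predicted OIE tuples against gold tuples and
# report precision/recall. Exact set matching is an assumption made for
# illustration; real benchmarks usually allow partial argument overlap.
def score(predicted, gold):
    """predicted, gold: sets of (arg1, relation, arg2, ...) tuples."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

pred = {("Marie Curie", "won", "the Nobel Prize"),
        ("Marie Curie", "was born in", "Warsaw")}
gold = {("Marie Curie", "won", "the Nobel Prize"),
        ("Marie Curie", "discovered", "radium")}
print(score(pred, gold))  # -> (0.5, 0.5)
```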
Social Fingerprinting: detection of spambot groups through DNA-inspired behavioral modeling
Spambot detection in online social networks is a long-lasting challenge
involving the study and design of detection techniques capable of efficiently
identifying ever-evolving spammers. Recently, a new wave of social spambots has
emerged, with advanced human-like characteristics that allow them to go
undetected even by current state-of-the-art algorithms. In this paper, we show
that efficient spambot detection can be achieved via an in-depth analysis of
their collective behaviors, exploiting the digital DNA technique for modeling
the behaviors of social network users. Inspired by its biological counterpart,
the digital DNA representation encodes the behavioral lifetime of a digital
account as a sequence of characters. We then define a similarity measure for
such digital DNA sequences, and build upon digital DNA and the similarity
between groups of users to characterize both genuine accounts and spambots.
Leveraging this characterization, we design the Social Fingerprinting
technique, which is able to discriminate between spambots and genuine accounts
in both a supervised and an unsupervised fashion. Finally, we evaluate the
effectiveness of Social Fingerprinting and compare it with three
state-of-the-art detection algorithms. Among the peculiarities of our approach
are the possibility of applying off-the-shelf DNA analysis techniques to study
online users' behaviors and its efficient reliance on a limited number of
lightweight account characteristics.
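A minimal sketch of the digital DNA idea, assuming a three-letter behavioral alphabet (A = tweet, C = reply, T = retweet) and a longest-common-substring similarity normalized by the shorter sequence; both choices are illustrative rather than the paper's exact definitions:

```python
# Illustrative sketch of digital DNA: encode each account's timeline as a
# string over a small behavioral alphabet, then compare two accounts by the
# length of their longest common substring. Alphabet and normalization are
# assumptions made for illustration.
def encode_timeline(actions):
    alphabet = {"tweet": "A", "reply": "C", "retweet": "T"}
    return "".join(alphabet[a] for a in actions)

def longest_common_substring(s1, s2):
    """Classic O(len(s1) * len(s2)) dynamic program."""
    best = 0
    table = [[0] * (len(s2) + 1) for _ in range(len(s1) + 1)]
    for i in range(1, len(s1) + 1):
        for j in range(1, len(s2) + 1):
            if s1[i - 1] == s2[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
                best = max(best, table[i][j])
    return best

def similarity(seq1, seq2):
    """Normalize the common-substring length by the shorter sequence."""
    return longest_common_substring(seq1, seq2) / min(len(seq1), len(seq2))

bot1 = encode_timeline(["tweet", "retweet", "retweet", "tweet"] * 5)
bot2 = encode_timeline(["tweet", "retweet", "retweet", "tweet"] * 5)
human = encode_timeline(["tweet", "reply", "tweet", "retweet", "reply"] * 4)
print(similarity(bot1, bot2), similarity(bot1, human))
```

Accounts driven by the same automation produce near-identical behavioral strings, so spambot groups show anomalously long common substrings compared with groups of genuine users.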