Classifying sentiment in microblogs: is brevity an advantage?
Microblogs as a new textual domain offer a unique proposition for sentiment analysis. Their short document length suggests any sentiment they contain is compact and explicit. However, this short length coupled with their noisy nature can pose difficulties for standard machine learning document representations. In this work we examine the hypothesis that it is easier to classify the sentiment in these short form documents than in longer form documents. Surprisingly, we find classifying sentiment in microblogs easier than in blogs and make a number of observations pertaining to the challenge of supervised learning for sentiment analysis in microblogs.
A Vertical PRF Architecture for Microblog Search
In microblog retrieval, query expansion can be essential to obtain good
search results due to the short size of queries and posts. Since information in
microblogs is highly dynamic, an up-to-date index coupled with pseudo-relevance
feedback (PRF) with an external corpus has a higher chance of retrieving more
relevant documents and improving ranking. In this paper, we focus on the
research question: how can we reduce the query expansion computational cost
while maintaining the same retrieval precision as standard PRF? Therefore, we
propose to accelerate the query expansion step of pseudo-relevance feedback.
The hypothesis is that using an expansion corpus organized into verticals for
expanding the query will lead to a more efficient query expansion process and
improved retrieval effectiveness. Thus, the proposed query expansion method
uses a distributed search architecture and resource selection algorithms to
provide an efficient query expansion process. Experiments on the TREC Microblog
datasets show that the proposed approach can match or outperform standard PRF
in MAP and NDCG@30, with a computational cost that is three orders of magnitude
lower. Comment: To appear in ICTIR 201
FooTweets: a bilingual parallel corpus of World Cup tweets
The way information spreads through society has changed significantly over the past decade with the advent of online social networking.
Twitter, one of the most widely used social networking websites, is known as the real-time, public microblogging network where news
breaks first. Most users love it for its iconic 140-character limitation and unfiltered feed that shows them news and opinions in the
form of tweets. Tweets are usually multilingual in nature and of varying quality. However, machine translation (MT) of Twitter data
is a challenging task, mainly for two reasons: (i) tweets are informal in nature (i.e., they violate linguistic norms), and
(ii) parallel resources for Twitter data are scarce on the Internet. In this paper, we develop FooTweets, a first parallel corpus of
tweets for the English–German language pair. We extract 4,000 English tweets from the FIFA 2014 World Cup and manually translate them
into German with a special focus on the informal nature of the tweets. In addition, we annotate all the tweets with sentiment scores
between 0 and 1, depending upon the degree of sentiment associated with them. This data has recently been used to build sentiment
translation engines, and an extensive evaluation revealed that such a resource is very useful in machine translation of user-generated
content.
Verifying baselines for crisis event information classification on Twitter
Social media are rich information sources during and in the aftermath of crisis events such as earthquakes and terrorist attacks. Despite myriad challenges, with the right tools, significant insight can be gained which can assist emergency responders and related applications. However, most extant approaches are incomparable, using bespoke definitions, models, datasets and even evaluation metrics. Furthermore, it is rare that code, trained models, or exhaustive parametrisation details are made openly available. Thus, even confirmation of self-reported performance is problematic; authoritatively determining the state of the art (SOTA) is essentially impossible. Consequently, to begin addressing such endemic ambiguity, this paper seeks to make 3 contributions: 1) the replication and results confirmation of a leading (and generalisable) technique; 2) testing straightforward modifications of the technique likely to improve performance; and 3) the extension of the technique to a novel and complementary type of crisis-relevant information to demonstrate its generalisability.