Building a Test Collection for Significant-Event Detection in Arabic Tweets
With the increasing popularity of microblogging services like Twitter, researchers discovered a rich medium for tackling real-life problems like event detection. However, event detection in Twitter is often obstructed by the lack of public evaluation mechanisms such as test collections (a set of tweets, labels, and queries to measure the effectiveness of an information retrieval system). The problem is more evident when non-English languages, e.g., Arabic, are concerned. With the recent surge of significant events in the Arab world, news agencies and decision makers rely on Twitter's microblogging service to obtain recent information on events. In this thesis, we address the problem of building a test collection of Arabic tweets (named EveTAR) for the task of event detection.
To build EveTAR, we first adopted an adequate definition of an event, which is a significant occurrence that takes place at a certain time. An occurrence is significant if there are news articles about it. We collected Arabic tweets using Twitter's streaming API. Then, we identified a set of events from the Arabic data collection using Wikipedia's current events portal. Corresponding tweets were extracted by querying the Arabic data collection with a set of manually-constructed queries. To obtain relevance judgments for those tweets, we leveraged CrowdFlower's crowdsourcing platform.
Over a period of 4 weeks, we crawled over 590M tweets, from which we identified 66 events that cover 8 different categories and gathered more than 134k relevance judgments. Each event contains an average of 779 relevant tweets. Over all events, we got an average Kappa of 0.6, which is a substantially acceptable value. EveTAR was used to evaluate three state-of-the-art event detection algorithms. The best performing algorithms achieved 0.60 in F1 measure and 0.80 in both precision and recall. We plan to make our test collection available for research, including event descriptions, manually-crafted queries to extract potentially-relevant tweets, and all judgments per tweet. EveTAR is the first Arabic test collection built from scratch for the task of event detection. Additionally, we show in our experiments that it supports other tasks like ad-hoc search.
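The average Kappa reported above is a chance-corrected agreement score. The abstract does not say which variant was used, so the following is only a minimal sketch of Cohen's kappa for two annotators; the function name and the toy labels are hypothetical, and with more than two crowd judges per tweet a multi-rater statistic such as Fleiss' kappa would be the more usual choice.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators (Cohen's kappa)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each annotator labelled at random
    # according to their own label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Toy relevance judgments (1 = relevant, 0 = not relevant); illustrative only.
annotator_1 = [1, 1, 1, 0, 0, 0, 1, 0, 1, 1]
annotator_2 = [1, 1, 0, 0, 0, 1, 1, 0, 1, 1]
kappa = cohens_kappa(annotator_1, annotator_2)
```

On these toy labels the two annotators agree on 8 of 10 items, chance agreement is 0.52, and kappa is roughly 0.58.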
Modelling Social Media Popularity of News Articles Using Headline Text
The way we formulate headlines matters -- this is the central tenet of this thesis.
Headlines play a key role in attracting and engaging online audiences. With the increasing usage of mobile apps and social media to consume news, headlines are the most prominent -- and often the only -- part of the news article visible to readers. Earlier studies examined how readers' preferences and their social network influence which headlines are clicked or shared on social media. However, there is limited research on the impact of the headline text on social media popularity.
To address this research gap we pose the following question: how can a headline be formulated so that it reaches as many readers as possible on social media? To answer this question we adopt an experimental approach to model and predict the popularity of news articles on social media using headlines. First, we develop computational methods for the automatic extraction of two types of headline characteristics. The first type is news values: Prominence, Sentiment, Magnitude, Proximity, Surprise, and Uniqueness. The second type is linguistic style: Brevity, Simplicity, Unambiguity, Punctuation, Nouns, Verbs, and Adverbs. We then investigate the impact of these features on popularity using social media popularity on Twitter and Facebook, and perceived popularity obtained from a crowdsourced survey. Finally, using these features and headline metadata we build prediction models for global and country-specific social media popularity. For the country-specific prediction model we augment several news values features with country relatedness information using knowledge graphs.
Our research established that computational methods can be reliably used to characterise headlines in terms of news values and linguistic style features, and that most of these features significantly correlate with social media popularity and, to a lesser extent, with perceived popularity. Our prediction model for global social media popularity outperformed state-of-the-art baselines, showing that headline wording has an effect on social media popularity. With the country-specific prediction model we showed that adding data from knowledge graphs improved the feature implementations.
These findings indicate that formulating a headline in a certain way can lead to wider readership engagement. Furthermore, our methods can be applied to other types of digital content similar to headlines, such as titles for blog posts or videos. More broadly, our results signify the importance of content analysis for popularity prediction.
Detecting New, Informative Propositions in Social Media
The ever-growing quantity of online text produced makes it increasingly challenging to find new important or useful information. This is especially so when topics of potential interest are not known a priori, such as in “breaking news stories”. This thesis examines techniques for detecting the emergence of new, interesting information in Social Media. It sets the investigation in the context of a hypothetical knowledge discovery and acquisition system, and addresses two objectives. The first objective is the detection of new topics. The second is the filtering of non-informative text from Social Media. A rolling time-slicing approach is proposed for discovery, in which daily frequencies of nouns, named entities, and multiword expressions are compared to their expected daily frequencies, as estimated from previous days using a Poisson model. Trending features in Social Media, those showing a significant surge in use, are potentially interesting. Features that have not shown a similar recent surge in News are selected as indicative of new information. It is demonstrated that surges in nouns and news entities can be detected that predict corresponding surges in mainstream news. Co-occurring trending features are used to create clusters of potentially topic-related documents. Those formed from co-occurrences of named entities are shown to be the most topically coherent.
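The Poisson comparison described above can be illustrated with a minimal sketch. It is not the thesis's implementation: the function names, the significance threshold `alpha`, the rate floor `min_rate`, and the toy counts are all assumptions made for illustration. Each feature's expected daily rate is taken as the mean of its counts on previous days, and a feature is flagged as trending when today's count would be very unlikely under a Poisson distribution with that rate.

```python
import math


def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam), computed from the exact pmf."""
    if k <= 0:
        return 1.0
    # Accumulate P(X = i) for i < k, then take the complement.
    term = math.exp(-lam)  # P(X = 0)
    cdf = term
    for i in range(1, k):
        term *= lam / i
        cdf += term
    return max(0.0, 1.0 - cdf)


def trending_features(history, today, alpha=0.001, min_rate=0.5):
    """Return {feature: p-value} for features surging today.

    history: dict mapping feature -> list of daily counts on previous days
    today:   dict mapping feature -> count on the current day
    alpha, min_rate: hypothetical tuning parameters, not from the source
    """
    surges = {}
    for feature, count in today.items():
        past = history.get(feature, [])
        mean_rate = sum(past) / len(past) if past else 0.0
        # Floor the rate so previously unseen features do not get rate zero.
        lam = max(mean_rate, min_rate)
        p_value = poisson_sf(count, lam)
        if p_value < alpha:
            surges[feature] = p_value
    return surges


# Toy daily counts; illustrative only.
history = {"earthquake": [2, 3, 1, 2], "weather": [10, 12, 9, 11]}
today = {"earthquake": 25, "weather": 12}
surging = trending_features(history, today)
```

Here "earthquake" is flagged because 25 mentions is far above its expected rate of 2 per day, while "weather" at 12 mentions is within normal variation of its rate of 10.5. The direct summation is adequate for small rates; for very large rates a numerically robust library routine would be preferable.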
Machine learning based filtering models are proposed for finding informative text in Social Media. News/Non-News and Dialogue Act models are explored using the News-annotated Redites corpus of Twitter messages. A simple 5-act Dialogue scheme, used to annotate a small sample thereof, is presented. For both News/Non-News and Informative/Non-Informative classification tasks, using non-lexical message features produces more discriminative and robust classification models than using message terms alone. The combination of all investigated features yields the most accurate models.
Real-time event detection using Twitter
Twitter has become the social network of news and journalism. Monitoring what is said on Twitter is a frequent task for anyone who requires timely access to information: journalists, traders, and the emergency services have all invested heavily in monitoring Twitter in recent years. Given this, there is a need to develop systems that can automatically monitor Twitter to detect real-world events as they happen, and alert users to novel events. However, this is not an easy task due to the noise and volume of data that is produced from social media streams such as Twitter. Although a range of approaches have been developed, many are unevaluated, cannot scale past low volume streams, or can only detect specific types of event.
In this thesis, we develop novel approaches to event detection, and enable the evaluation and comparison of event detection approaches by creating a large-scale test collection called Events 2012, containing 120 million tweets and with relevance judgements for over 500 events. We use existing event detection approaches and Wikipedia to generate candidate events, then use crowdsourcing to gather annotations.
We propose a novel entity-based, real-time, event detection approach that we evaluate using the Events 2012 collection, and show that it outperforms existing state-of-the-art approaches to event detection whilst also being scalable. We examine and compare automated and crowdsourced evaluation methodologies for the evaluation of event detection.
Finally, we propose a Newsworthiness score that is learned in real-time from heuristically labelled data. The score is able to accurately classify individual tweets as newsworthy or noise in real-time. We adapt the score for use as a feature for event detection, and find that it can easily be used to filter out noisy clusters and improve existing event detection techniques.
We conclude with a summary of our research findings and answers to our research questions. We discuss some of the difficulties that remain to be solved in event detection on Twitter and propose some possible future directions for research into real-time event detection on Twitter.
Semantic Sentiment Analysis of Microblogs
Microblogs and social media platforms are now considered among the most popular forms of online communication. Through a platform like Twitter, much information reflecting people's opinions and attitudes is published and shared among users on a daily basis. This has recently brought great opportunities to companies interested in tracking and monitoring the reputation of their brands and businesses, and to policy makers and politicians to support their assessment of public opinions about their policies or political issues.
A wide range of approaches to sentiment analysis on Twitter, and other similar microblogging platforms, have been recently built. Most of these approaches rely mainly on the presence of affect words or syntactic structures that explicitly and unambiguously reflect sentiment (e.g., "great", "terrible"). However, these approaches are semantically weak, that is, they do not account for the semantics of words when detecting their sentiment in text. This is problematic since the sentiment of words, in many cases, is associated with their semantics, either along the context they occur within (e.g., "great" is negative in the context "pain") or the conceptual meaning associated with the words (e.g., "Ebola" is negative when its associated semantic concept is "Virus").
This thesis investigates the role of words' semantics in sentiment analysis of microblogs, aiming mainly at addressing the above problem. In particular, Twitter is used as a case study of microblogging platforms to investigate whether capturing the sentiment of words with respect to their semantics leads to more accurate sentiment analysis models on Twitter. To this end, several approaches are proposed in this thesis for extracting and incorporating two types of word semantics for sentiment analysis: contextual semantics (i.e., semantics captured from words' co-occurrences) and conceptual semantics (i.e., semantics extracted from external knowledge sources).
Experiments are conducted with both types of semantics by assessing their impact in three popular sentiment analysis tasks on Twitter: entity-level sentiment analysis, tweet-level sentiment analysis, and context-sensitive sentiment lexicon adaptation. Evaluation under each sentiment analysis task includes several sentiment lexicons and up to 9 Twitter datasets of different characteristics, as well as comparison against several state-of-the-art sentiment analysis approaches widely used in the literature.
The findings from this body of work demonstrate the value of using semantics in sentiment analysis on Twitter. The proposed approaches, which consider words' semantics for sentiment analysis at both entity and tweet levels, surpass non-semantic approaches in most datasets.
Cartoons as interdiscourse : a quali-quantitative analysis of social representations based on collective imagination in cartoons produced after the Charlie Hebdo attack
The attacks against Charlie Hebdo in Paris at the beginning of 2015 urged many cartoonists – most professionals but some laymen as well – to create cartoons as a reaction to this tragedy. The main goal of this article is to show how traumatic events like this one can converge in a rather limited set of metaphors, ranging from easily recognizable topoi to rather vague interdiscourses that circulate in contemporary societies. To do so, we analyzed 450 cartoons that were produced as a reaction to the Charlie Hebdo attacks, and took a quali-quantitative approach that draws both on discourse analysis and semiotics. In this paper, we identified eight main themes and analyzed the five that are anchored in collective imagination (the pen against the sword, the journalist as a modern hero, etc.). Then, we studied the cartoons at figurative, narrative and thematic levels using Greimas’ model of the semiotic square. This paper shows the ways in which these cartoons build upon a memory-based network of events from the recent past (particularly 9/11), and more generally on a collective imagination which can be linked to Western values.
Bioinformatics
This book is divided into different research areas relevant to Bioinformatics, such as biological networks, next generation sequencing, high performance computing, molecular modeling, structural bioinformatics, and intelligent data analysis. Each book section introduces the basic concepts and then explains their application to problems of great relevance, so both novice and expert readers can benefit from the information and research works presented here.