The Early Bird Catches The Term: Combining Twitter and News Data For Event Detection and Situational Awareness
Twitter updates now represent an enormous stream of information originating
from a wide variety of formal and informal sources, much of which is relevant
to real-world events. In this paper we adapt existing bio-surveillance
algorithms to detect localised spikes in Twitter activity corresponding to real
events with a high level of confidence. We then develop a methodology to
automatically summarise these events, both by providing the tweets which fully
describe the event and by linking to highly relevant news articles. We apply
our methods to outbreaks of illness and events strongly affecting sentiment. In
both case studies we are able to detect events verifiable by third party
sources and produce high-quality summaries.
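
The abstract does not say which bio-surveillance algorithm is adapted, so the following is only a minimal sketch of the general idea of spike detection against a moving baseline. The function name `detect_spikes`, the 7-step window, the threshold of 3 standard deviations, and the sample counts are all assumptions made for illustration, not the authors' method.

```python
from statistics import mean, stdev

def detect_spikes(counts, window=7, threshold=3.0):
    """Return indices whose count exceeds the moving baseline.

    A point is flagged when it lies more than `threshold` sample
    standard deviations above the mean of the previous `window` points.
    """
    spikes = []
    for i in range(window, len(counts)):
        baseline = counts[i - window:i]
        mu = mean(baseline)
        sigma = stdev(baseline)
        # Guard against a flat baseline, where the deviation is zero.
        if sigma == 0:
            sigma = 1.0
        if (counts[i] - mu) / sigma > threshold:
            spikes.append(i)
    return spikes

# Hypothetical daily counts of tweets mentioning a term in one location.
daily_counts = [10, 12, 11, 9, 10, 13, 11, 12, 48, 11]
spikes = detect_spikes(daily_counts)
print(spikes)
```

With these made-up counts only the day with 48 tweets is flagged. A real system would also need to handle weekly seasonality and the inflated baseline that follows a spike.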
An Email Attachment is Worth a Thousand Words, or Is It?
There is an extensive body of research on Social Network Analysis (SNA) based
on email archives. The network used in the analysis is generally extracted
either by capturing the email communication in From, To, Cc and Bcc email
header fields or by the entities contained in the email message. In the latter
case, the entities could be, for instance, the bag of words, URLs, names,
phone numbers, etc. It could also include the textual content of attachments,
for instance Microsoft Word documents, Excel spreadsheets, or Adobe PDFs. The nodes
in this network represent users and entities. The edges represent communication
between users and relations to the entities. We suggest taking a different
approach to the network extraction and use attachments shared between users as
the edges. The motivation for this is two-fold. First, attachments represent
an "intimacy" manifestation of the relation's strength. Second, the
statistical analysis of private email archives that we collected and of the Enron
email corpus shows that attachments contribute on average around 80-90% to
the archive's disk-space usage, which means that most of the data is presently
ignored in the SNA of email archives. Consequently, we hypothesize that this
approach might provide more insight into the social structure of the email
archive. We extract the communication and shared attachments networks from
the Enron email corpus. We further analyze degree, betweenness, closeness, and
eigenvector centrality measures in both networks and review the differences and
what can be learned from them. We use a nearest neighbor algorithm to generate
similarity groups for five Enron employees. The groups are consistent with
Enron's organizational chart, which validates our approach.
Comment: 12 pages, 4 figures, 7 tables, IML'17, Liverpool, U
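
The idea of using shared attachments as edges can be sketched roughly as follows. This is an illustrative toy, not the authors' pipeline: the `attachment_users` mapping, the user names, and both helper functions are hypothetical, and only degree centrality (one of the four measures named in the abstract) is shown.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy data: attachment identifier -> users who sent or received it.
attachment_users = {
    "report.doc": {"alice", "bob", "carol"},
    "budget.xls": {"alice", "bob"},
    "memo.pdf": {"carol", "dave"},
}

def build_attachment_network(attachment_users):
    """Return edge weights: number of attachments shared by each user pair."""
    weights = defaultdict(int)
    for users in attachment_users.values():
        # Every pair of users who touched the same attachment gets an edge.
        for u, v in combinations(sorted(users), 2):
            weights[(u, v)] += 1
    return dict(weights)

def degree_centrality(weights):
    """Normalized degree: distinct neighbours divided by (n - 1)."""
    neighbours = defaultdict(set)
    for u, v in weights:
        neighbours[u].add(v)
        neighbours[v].add(u)
    n = len(neighbours)
    return {user: len(adj) / (n - 1) for user, adj in neighbours.items()}

edges = build_attachment_network(attachment_users)
centrality = degree_centrality(edges)
print(edges)
print(centrality)
```

In this toy network "carol" is connected to every other user, so her normalized degree is the highest. The edge weight (number of shared attachments) is kept separately and could serve as a proxy for relation strength.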
EveTAR: Building a Large-Scale Multi-Task Test Collection over Arabic Tweets
This article introduces a new language-independent approach for creating a
large-scale high-quality test collection of tweets that supports multiple
information retrieval (IR) tasks without running a shared-task campaign. The
adopted approach (demonstrated over Arabic tweets) designs the collection
around significant (i.e., popular) events, which enables the development of
topics that represent frequent information needs of Twitter users for which
rich content exists. That inherently facilitates the support of multiple tasks
that generally revolve around events, namely event detection, ad-hoc search,
timeline generation, and real-time summarization. The key highlights of the
approach include diversifying the judgment pool via interactive search and
multiple manually-crafted queries per topic, collecting high-quality
annotations via crowd-workers for relevancy and in-house annotators for
novelty, filtering out low-agreement topics and inaccessible tweets, and
providing multiple subsets of the collection for better availability. Applying
our methodology to Arabic tweets resulted in EveTAR, the first
freely-available tweet test collection for multiple IR tasks. EveTAR includes a
crawl of 355M Arabic tweets and covers 50 significant events for which about
62K tweets were judged with substantial average inter-annotator agreement
(Kappa value of 0.71). We demonstrate the usability of EveTAR by evaluating
existing algorithms in the respective tasks. Results indicate that the new
collection can support reliable ranking of IR systems that is comparable to
similar TREC collections, while providing strong baseline results for future
studies over Arabic tweets.
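
The abstract reports a Kappa value of 0.71 but does not state which variant was computed. As a hedged illustration of how chance-corrected agreement works, here is a minimal sketch of Cohen's kappa for two annotators. The choice of Cohen's variant, the function name `cohen_kappa`, and the relevance labels are assumptions for illustration and are not taken from EveTAR.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two annotators."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and equal in length")
    n = len(labels_a)
    # Observed agreement: fraction of items labelled identically.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each annotator's label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)

# Hypothetical relevance judgments (1 = relevant, 0 = not relevant).
annotator_a = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
annotator_b = [1, 1, 0, 1, 1, 0, 1, 0, 0, 0]
kappa = cohen_kappa(annotator_a, annotator_b)
print(round(kappa, 2))
```

Here the annotators agree on 8 of 10 items while chance agreement is 0.5, giving a kappa of about 0.6. With more than two annotators per item, a multi-rater statistic such as Fleiss' kappa would be the usual choice.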