The Early Bird Catches The Term: Combining Twitter and News Data For Event Detection and Situational Awareness
Twitter updates now represent an enormous stream of information originating
from a wide variety of formal and informal sources, much of which is relevant
to real-world events. In this paper we adapt existing bio-surveillance
algorithms to detect localised spikes in Twitter activity corresponding to real
events with a high level of confidence. We then develop a methodology to
automatically summarise these events, both by providing the tweets which fully
describe the event and by linking to highly relevant news articles. We apply
our methods to outbreaks of illness and events strongly affecting sentiment. In
both case studies we are able to detect events verifiable by third party
sources and produce high-quality summaries.
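The abstract does not say which bio-surveillance algorithm is adapted, so the following is only a minimal sketch of the general idea of flagging a spike against a recent baseline. It swaps in a plain trailing-window z-score, which is not claimed to be the authors' method. The function name `detect_spikes`, the `window` and `threshold` parameters, and the sample counts are all hypothetical.

```python
from statistics import mean, stdev


def detect_spikes(counts, window=7, threshold=3.0):
    """Return indices whose count exceeds the trailing-window mean
    by more than `threshold` standard deviations.

    Illustrative stand-in only: a simple z-score rule, not the
    bio-surveillance algorithm used in the paper.
    """
    spikes = []
    for i in range(window, len(counts)):
        baseline = counts[i - window:i]
        mu = mean(baseline)
        # Floor the deviation at 1.0 so a flat baseline cannot
        # produce a division by zero or an inflated score.
        sigma = max(stdev(baseline), 1.0)
        if (counts[i] - mu) / sigma > threshold:
            spikes.append(i)
    return spikes


# Hypothetical daily tweet counts for one keyword in one location.
daily_counts = [10, 12, 11, 9, 10, 13, 11, 12, 48, 11]
print(detect_spikes(daily_counts))
```

Comparing against a trailing window rather than a global mean keeps the rule sensitive to localised bursts, which is the property the abstract emphasises.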
The POLUSA Dataset: 0.9M Political News Articles Balanced by Time and Outlet Popularity
News articles covering policy issues are an essential source of information
in the social sciences and are also frequently used for other use cases, e.g.,
to train NLP language models. To derive meaningful insights from the analysis
of news, large datasets are required that represent real-world distributions,
e.g., with respect to the included outlets' popularity, topics, or
time. Information on the political leanings of media publishers is often
needed, e.g., to study differences in news reporting across the political
spectrum, which is one of the prime use cases in the social sciences when
studying media bias and related societal issues. Concerning these requirements,
existing datasets have major flaws, resulting in redundant and cumbersome
effort in the research community for dataset creation. To fill this gap, we
present POLUSA, a dataset that represents the online media landscape as
perceived by an average US news consumer. The dataset contains 0.9M articles
covering policy topics published between Jan. 2017 and Aug. 2019 by 18 news
outlets representing the political spectrum. Each outlet is labeled by its
political leaning, which we derive using a systematic aggregation of eight data
sources. The news dataset is balanced with respect to publication date and
outlet popularity. POLUSA enables studying a variety of subjects, e.g., media
effects and political partisanship. Due to its size, the dataset supports
data-intensive deep learning methods.
Comment: 2 pages, 1 table
Pyramid: Enhancing Selectivity in Big Data Protection with Count Featurization
Protecting vast quantities of data poses a daunting challenge for the growing
number of organizations that collect, stockpile, and monetize it. The ability
to distinguish data that is actually needed from data collected "just in case"
would help these organizations to limit the latter's exposure to attack. A
natural approach might be to monitor data use and retain only the working-set
of in-use data in accessible storage; unused data can be evicted to a highly
protected store. However, many of today's big data applications rely on machine
learning (ML) workloads that are periodically retrained by accessing, and thus
exposing to attack, the entire data store. Training set minimization methods,
such as count featurization, are often used to limit the data needed to train
ML workloads to improve performance or scalability. We present Pyramid, a
limited-exposure data management system that builds upon count featurization to
enhance data protection. As such, Pyramid uniquely introduces both the idea and
proof-of-concept for leveraging training set minimization methods to instill
rigor and selectivity into big data management. We integrated Pyramid into
Spark Velox, a framework for ML-based targeting and personalization. We
evaluate it on three applications and show that Pyramid approaches
state-of-the-art models while training on less than 1% of the raw data.
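As a rough illustration of count featurization in general (not of Pyramid's implementation, which the abstract does not detail), the sketch below replaces a categorical value with the number of times it was seen with each binary label, plus the resulting positive rate. A model can then train on these compact count tables instead of the full raw history. The names `build_count_table` and `featurize`, and the `user`/`click` fields, are hypothetical.

```python
from collections import defaultdict


def build_count_table(rows, feature, label):
    """Map each feature value to [negative_count, positive_count].

    Assumes `label` is a binary 0/1 field in every row.
    """
    table = defaultdict(lambda: [0, 0])
    for row in rows:
        table[row[feature]][row[label]] += 1
    return table


def featurize(value, table):
    """Replace a raw categorical value with its counts and positive rate."""
    # .get avoids creating an entry for values never seen in training.
    neg, pos = table.get(value, (0, 0))
    total = neg + pos
    rate = pos / total if total else 0.0
    return [neg, pos, rate]


# Hypothetical click log.
rows = [
    {"user": "a", "click": 1},
    {"user": "a", "click": 0},
    {"user": "a", "click": 1},
    {"user": "b", "click": 0},
]
table = build_count_table(rows, "user", "click")
print(featurize("a", table))
print(featurize("unseen", table))
```

Once the table is built, the raw rows are no longer needed to compute features, which is what makes this kind of training set minimization relevant to limiting data exposure.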
Topic-dependent sentiment analysis of financial blogs
While most work on sentiment analysis in the financial domain has focused on content from traditional finance news, in this work we concentrate on a more subjective source of information: blogs. We aim to automatically determine the sentiment of financial bloggers towards companies and their stocks. To do this we develop a corpus of financial blogs, annotated with polarity of sentiment with respect to a number of companies. We conduct an analysis of the annotated corpus, from which we show there is a significant level of topic shift within this collection, and also illustrate the difficulty that human annotators have when annotating certain sentiment categories. To deal with the problem of topic shift within blog articles, we propose text extraction techniques to create topic-specific sub-documents, which we use to train a sentiment classifier. We show that such approaches provide a substantial improvement over full document classification and that word-based approaches perform better than sentence-based or paragraph-based approaches.
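One way to picture a word-based topic-specific sub-document is to keep only the words within a fixed distance of each mention of the target company. The sketch below does this under that assumption; the abstract does not specify the extraction rule, so the function `word_window_subdocument`, the `window` parameter, and the example text are hypothetical.

```python
def word_window_subdocument(text, topic_terms, window=5):
    """Keep only tokens within `window` words of any topic term.

    Assumed extraction rule for illustration; whitespace tokenization
    and simple punctuation stripping stand in for real preprocessing.
    """
    tokens = text.split()
    terms = {t.lower() for t in topic_terms}
    keep = set()
    for i, tok in enumerate(tokens):
        if tok.strip(".,;:!?\"'()").lower() in terms:
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            keep.update(range(lo, hi))
    # Preserve the original word order in the extracted sub-document.
    return " ".join(tokens[i] for i in sorted(keep))


post = (
    "Markets were quiet today. Acme shares surged after strong earnings. "
    "Meanwhile Globex announced layoffs and its stock fell sharply."
)
print(word_window_subdocument(post, ["Acme"], window=3))
```

The extracted text would then be passed to the sentiment classifier in place of the full post, so that sentiment about other companies in the same article does not leak into the label.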