Accurate Local Estimation of Geo-Coordinates for Social Media Posts
Associating geo-coordinates with the content of social media posts can
enhance many existing applications and services and enable a host of new ones.
Unfortunately, a majority of social media posts are not tagged with
geo-coordinates. Even when location data is available, it may be inaccurate,
very broad or sometimes fictitious. Contemporary location estimation approaches
based on analyzing the content of these posts can identify only broad areas
such as a city, which limits their usefulness. To address these shortcomings,
this paper proposes a methodology to narrowly estimate the geo-coordinates of
social media posts with high accuracy. The methodology relies solely on the
content of these posts and prior knowledge of the wide geographical region from
where the posts originate. An ensemble of language models, smoothed over
non-overlapping sub-regions of the wider region, lies at the heart of the
methodology. Experimental evaluation using a corpus of over half a million
tweets from New York City shows that the approach, on average, estimates
locations of tweets to within just 2.15 km of their actual positions.
Comment: In Proceedings of the 26th International Conference on Software
Engineering and Knowledge Engineering, pp. 642-647, 201
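The grid-of-language-models idea above can be sketched as follows: partition the wide region into cells, train one additively smoothed unigram model per cell, and assign a post to the cell giving it the highest likelihood. The class name, whitespace tokenization, and smoothing constant are illustrative assumptions, not the paper's implementation.

```python
import math
from collections import Counter, defaultdict

class GridLocationEstimator:
    """One smoothed unigram language model per non-overlapping grid cell
    (sketch; names and smoothing scheme are hypothetical)."""

    def __init__(self, alpha=0.5):
        self.alpha = alpha                        # additive-smoothing constant
        self.cell_counts = defaultdict(Counter)   # cell_id -> term counts
        self.vocab = set()

    def train(self, posts):
        # posts: iterable of (text, cell_id) pairs with known coordinates
        for text, cell in posts:
            tokens = text.lower().split()
            self.cell_counts[cell].update(tokens)
            self.vocab.update(tokens)

    def log_likelihood(self, tokens, cell):
        # additively smoothed log P(tokens | cell's language model)
        counts = self.cell_counts[cell]
        total = sum(counts.values())
        v = len(self.vocab)
        return sum(
            math.log((counts[t] + self.alpha) / (total + self.alpha * v))
            for t in tokens
        )

    def predict(self, text):
        # return the cell whose model best explains the post
        tokens = text.lower().split()
        return max(self.cell_counts,
                   key=lambda c: self.log_likelihood(tokens, c))
```

A post mentioning terms characteristic of one sub-region (say, a neighborhood landmark) is pulled toward that cell's model even when no explicit place name is tagged.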
Bank Networks from Text: Interrelations, Centrality and Determinants
In the wake of the still ongoing global financial crisis, bank
interdependencies have come into focus in trying to assess linkages among banks
and systemic risk. To date, such analysis has largely been based on numerical
data. By contrast, this study attempts to gain further insight into bank
interconnections by tapping into financial discourse. We present a
text-to-network process, which has its basis in co-occurrences of bank names
and can be analyzed quantitatively and visualized. To quantify bank importance,
we propose an information centrality measure to rank and assess trends of bank
centrality in discussion. For qualitative assessment of bank networks, we put
forward a visual, interactive interface for better illustrating network
structures. We illustrate the text-based approach on European Large and Complex
Banking Groups (LCBGs) during the ongoing financial crisis by quantifying bank
interrelations and centrality from discussion in 3M news articles, spanning
2007Q1 to 2014Q3.
Comment: Quantitative Finance, forthcoming in 201
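The text-to-network step can be sketched as a co-occurrence count over articles; weighted degree is used below merely as a simple stand-in for the paper's information centrality measure, and the substring matching rule is an illustrative assumption.

```python
from collections import defaultdict
from itertools import combinations

def cooccurrence_network(articles, bank_names):
    """Build an undirected weighted graph: an edge links two banks each
    time both names appear in the same article (sketch, not the paper's
    full text-to-network pipeline)."""
    edges = defaultdict(int)
    for text in articles:
        present = sorted(b for b in bank_names if b.lower() in text.lower())
        for a, b in combinations(present, 2):
            edges[(a, b)] += 1          # co-mention strengthens the edge
    return edges

def weighted_degree(edges):
    """Weighted-degree centrality: a simple stand-in for the paper's
    information centrality measure."""
    deg = defaultdict(int)
    for (a, b), w in edges.items():
        deg[a] += w
        deg[b] += w
    return dict(deg)
```

Tracking these centrality scores quarter by quarter gives the trend analysis the abstract describes; a visualization layer would sit on top of the same edge list.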
Curriculum Guidelines for Undergraduate Programs in Data Science
The Park City Math Institute (PCMI) 2016 Summer Undergraduate Faculty Program
met for the purpose of composing guidelines for undergraduate programs in Data
Science. The group consisted of 25 undergraduate faculty from a variety of
institutions in the U.S., primarily from the disciplines of mathematics,
statistics and computer science. These guidelines are meant to provide some
structure for institutions planning for or revising a major in Data Science.
Dating Texts without Explicit Temporal Cues
This paper tackles temporal resolution of documents, such as determining when
a document is about or when it was written, based only on its text. We apply
techniques from information retrieval that predict dates via language models
over a discretized timeline. Unlike most previous works, we rely solely
on temporal cues implicit in the text. We consider both document-likelihood and
divergence based techniques and several smoothing methods for both of them. Our
best model predicts the mid-point of individuals' lives with a median error of
22 and a mean error of 36 years for Wikipedia biographies from 3800 B.C. to the present
day. We also show that this approach works well when training on such
biographies and predicting dates both for non-biographical Wikipedia pages
about specific years (500 B.C. to 2010 A.D.) and for publication dates of short
stories (1798 to 2008). Together, our work shows that, even in absence of
temporal extraction resources, it is possible to achieve remarkable temporal
locality across a diverse set of texts.
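A minimal sketch of the divergence-based variant mentioned above: pick the period whose smoothed language model minimizes KL divergence from the document's model. Additive smoothing and whitespace tokenization are illustrative choices, not the paper's exact setup.

```python
import math
from collections import Counter

def unigram_lm(tokens, vocab, alpha=0.5):
    # additively smoothed unigram distribution over a shared vocabulary
    c = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {w: (c[w] + alpha) / total for w in vocab}

def kl_divergence(p, q):
    # KL(p || q); smoothing guarantees q[w] > 0 for every w
    return sum(p[w] * math.log(p[w] / q[w]) for w in p if p[w] > 0)

def date_document(doc_tokens, period_corpora):
    """Predict the discretized time period whose language model is
    closest (in KL divergence) to the document's own model.
    period_corpora: {period_label: list of training tokens}."""
    vocab = set(doc_tokens)
    for toks in period_corpora.values():
        vocab.update(toks)
    doc_lm = unigram_lm(doc_tokens, vocab)
    return min(period_corpora,
               key=lambda p: kl_divergence(doc_lm,
                                           unigram_lm(period_corpora[p], vocab)))
```

The document-likelihood variant is the same loop with the score replaced by the log-probability of the document under each period's model.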
A joint regression modeling framework for analyzing bivariate binary data in R
We discuss some of the features of the R add-on package GJRM which implements a flexible joint modeling framework for fitting a number of multivariate response regression models under various sampling schemes. In particular, we focus on the case in which the user wishes to fit bivariate binary regression models in the presence of several forms of selection bias. The framework allows for Gaussian and non-Gaussian dependencies through the use of copulae, and for the association and mean parameters to depend on flexible functions of covariates. We describe some of the methodological details underpinning the bivariate binary models implemented in the package and illustrate them by fitting interpretable models of different complexity on three data sets.
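The simplest member of this model family, a bivariate probit (a Gaussian copula with probit margins), can be sketched numerically with SciPy. This illustrates the joint-probability structure only, not GJRM's estimation machinery; the function name is hypothetical.

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

def bivariate_probit_probs(eta1, eta2, rho):
    """Joint outcome probabilities for two binary responses under a
    Gaussian copula with probit margins. eta1, eta2 are the linear
    predictors; rho is the latent correlation (the association
    parameter that GJRM lets depend on covariates)."""
    cov = [[1.0, rho], [rho, 1.0]]
    # P(Y1 = 1, Y2 = 1) = bivariate normal CDF at (eta1, eta2)
    p11 = multivariate_normal.cdf([eta1, eta2], mean=[0.0, 0.0], cov=cov)
    p1 = norm.cdf(eta1)   # marginal P(Y1 = 1)
    p2 = norm.cdf(eta2)   # marginal P(Y2 = 1)
    return {"11": p11,
            "10": p1 - p11,
            "01": p2 - p11,
            "00": 1.0 - p1 - p2 + p11}
```

Selection-bias corrections in this setting work because rho links the selection equation to the outcome equation: a nonzero rho shifts the joint cell probabilities away from the product of the margins.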
Knowledge-based Query Expansion in Real-Time Microblog Search
Since the length of microblog texts, such as tweets, is strictly limited to
140 characters, traditional Information Retrieval techniques suffer severely
from the vocabulary mismatch problem and cannot yield good performance in the
context of the microblogosphere. To address this critical challenge, in this paper,
we propose a new language modeling approach for microblog retrieval by
inferring various types of context information. In particular, we expand the
query using knowledge terms derived from Freebase so that the expanded one can
better reflect users' search intent. Besides, in order to further satisfy
users' real-time information need, we incorporate temporal evidences into the
expansion method, which can boost recent tweets in the retrieval results with
respect to a given topic. Experimental results on two official TREC Twitter
corpora demonstrate the significant superiority of our approach over baseline
methods.
Comment: 9 pages, 9 figures
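The two ideas above, knowledge-based expansion plus a temporal boost, can be sketched as follows. The knowledge table and the exponential decay are illustrative stand-ins for Freebase-derived terms and the paper's temporal evidence; all names and weights are assumptions.

```python
# Hypothetical stand-in for knowledge terms mined from Freebase.
KNOWLEDGE_TERMS = {
    "nba": ["basketball", "playoffs", "lakers"],
}

def expand_query(query_tokens, knowledge=KNOWLEDGE_TERMS, weight=0.4):
    """Return the expanded query as {term: weight}: original terms keep
    weight 1.0, knowledge-derived terms get a smaller weight."""
    expanded = {t: 1.0 for t in query_tokens}
    for t in query_tokens:
        for e in knowledge.get(t, []):
            expanded.setdefault(e, weight)
    return expanded

def score_tweet(tweet_tokens, expanded, tweet_age_hours, half_life=24.0):
    """Lexical match score damped by an exponential recency decay, so
    recent tweets are boosted (a simple form of temporal evidence, not
    the paper's exact formulation)."""
    match = sum(w for t, w in expanded.items() if t in tweet_tokens)
    decay = 0.5 ** (tweet_age_hours / half_life)
    return match * decay
```

With this scoring, a day-old tweet matching the same expansion terms scores half as much as a fresh one, which is the real-time preference the abstract describes.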
The Importance of Being Clustered: Uncluttering the Trends of Statistics from 1970 to 2015
In this paper we retrace the recent history of statistics by analyzing all
the papers published in five prestigious statistical journals since 1970,
namely: Annals of Statistics, Biometrika, Journal of the American Statistical
Association, Journal of the Royal Statistical Society, series B and Statistical
Science. The aim is to construct a kind of "taxonomy" of the statistical papers
by organizing and clustering them into main themes. In this sense, being
identified in a cluster means being important enough to be uncluttered in the
vast and interconnected world of statistical research. Since the main
statistical research topics are naturally born, evolve, or die over time, we
also develop a dynamic clustering strategy, where a group in one time period is
allowed to migrate or to merge into different groups in the following one.
Results show that statistics is a very dynamic and evolving science, stimulated
by the rise of new research questions and types of data.
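The migrate-or-merge idea can be sketched by linking clusters across adjacent time periods through the overlap of their characteristic terms. The Jaccard measure and the threshold below are assumptions for illustration, not the paper's algorithm.

```python
def jaccard(a, b):
    # overlap of two term sets, in [0, 1]
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def link_clusters(period_t, period_t1, threshold=0.3):
    """Match each cluster in one period to clusters in the next by term
    overlap: one match suggests migration, several suggest a merge or
    split, none suggests the theme is dying out.
    period_t, period_t1: {cluster_name: characteristic terms}."""
    links = {}
    for name, terms in period_t.items():
        links[name] = [n2 for n2, t2 in period_t1.items()
                       if jaccard(terms, t2) >= threshold]
    return links
```

Chaining these links across successive periods yields the kind of lifeline of a research theme (birth, evolution, death) that the abstract describes.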
A Survey of Location Prediction on Twitter
Locations, e.g., countries, states, cities, and point-of-interests, are
central to news, emergency events, and people's daily lives. Automatic
identification of locations associated with or mentioned in documents has been
explored for decades. As one of the most popular online social network
platforms, Twitter has attracted a large number of users who send millions of
tweets on a daily basis. Due to the worldwide coverage of its users and
real-time freshness of tweets, location prediction on Twitter has gained
significant attention in recent years. Research efforts are spent on dealing
with new challenges and opportunities brought by the noisy, short, and
context-rich nature of tweets. In this survey, we aim at offering an overall
picture of location prediction on Twitter. Specifically, we concentrate on the
prediction of user home locations, tweet locations, and mentioned locations. We
first define the three tasks and review the evaluation metrics. By summarizing
Twitter network, tweet content, and tweet context as potential inputs, we then
structurally highlight how the problems depend on these inputs. Each dependency
is illustrated by a comprehensive review of the corresponding strategies
adopted in state-of-the-art approaches. In addition, we also briefly review two
related problems, i.e., semantic location prediction and point-of-interest
recommendation. Finally, we list future research directions.
Comment: Accepted to TKDE. 30 pages, 1 figure