Crowdsourcing Dialect Characterization through Twitter
We perform a large-scale analysis of diatopic language variation using
geotagged microblogging datasets. By collecting all Twitter messages written in
Spanish over more than two years, we build a corpus from which a carefully
selected list of concepts allows us to characterize Spanish varieties on a
global scale. A cluster analysis reveals well-defined macroregions sharing
common lexical properties. Remarkably, we find that the Spanish language splits
into two superdialects: an urban speech used across major American and Spanish
cities, and a diverse form that encompasses rural areas and small towns. The
latter can be further clustered into smaller varieties with a stronger regional
character.
Comment: 10 pages, 5 figures
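The cluster-analysis step can be illustrated with a toy k-means over lexical-frequency vectors. This is a sketch only: the region names and frequency values below are invented, and the paper's corpus and clustering procedure differ.

```python
# Toy sketch (not the paper's method): cluster regions by the relative
# frequency with which they use competing lexical variants.
# Region names and frequency vectors are invented for illustration.
regions = {
    "Madrid":       [0.9, 0.1, 0.8],   # urban-variant heavy
    "Buenos Aires": [0.8, 0.2, 0.7],
    "rural_ES":     [0.2, 0.9, 0.1],   # rural-variant heavy
    "rural_MX":     [0.1, 0.8, 0.2],
}

def kmeans(points, centroids, iters=10):
    """Plain k-means with fixed initial centroids (deterministic)."""
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        # Recompute each centroid as the mean of its assigned points.
        centroids = [
            [sum(dim) / len(cl) for dim in zip(*cl)] if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return centroids

pts = list(regions.values())
cents = kmeans(pts, [pts[0], pts[2]])  # seed with one urban, one rural point

def cluster_of(p):
    dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in cents]
    return dists.index(min(dists))

assignment = {name: cluster_of(vec) for name, vec in regions.items()}
```

With these toy vectors, the two urban regions land in one cluster and the two rural regions in the other, mirroring the urban/rural superdialect split the abstract describes.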
Exploiting Text and Network Context for Geolocation of Social Media Users
Research on automatically geolocating social media users has conventionally
been based on either the text content of a given user's posts or the user's
social network, with very little crossover between the two and no benchmarking
of the two approaches over comparable datasets. We bring the two threads of
research together, first proposing a text-based method based on adaptive
grids, followed by a hybrid network- and text-based method. Evaluating over
three Twitter datasets, we show that the empirical difference between text- and
network-based methods is not great, and that hybridisation of the two is
superior to the component methods, especially in contexts where the user graph
is not well connected. We achieve state-of-the-art results on all three
datasets.
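The adaptive-grid idea can be sketched as a recursive subdivision of lat/lon cells until no cell holds more than a fixed number of training users; each leaf cell then serves as a class label for a text classifier. This is a minimal illustration, not the paper's implementation, and the point data and `max_users` threshold are invented.

```python
# Minimal adaptive-grid sketch (illustrative, not the paper's code):
# recursively quarter a (lat, lon) cell until each leaf holds at most
# `max_users` training points. Cells are half-open, so points sitting
# exactly on the outer max edge are ignored in this sketch.
def build_grid(points, bounds, max_users=2):
    """Return leaf cells as (bounds, points) pairs.
    bounds = (lat_min, lat_max, lon_min, lon_max)."""
    lat0, lat1, lon0, lon1 = bounds
    if len(points) <= max_users:
        return [(bounds, points)]
    mlat, mlon = (lat0 + lat1) / 2, (lon0 + lon1) / 2
    leaves = []
    for b in [(lat0, mlat, lon0, mlon), (lat0, mlat, mlon, lon1),
              (mlat, lat1, lon0, mlon), (mlat, lat1, mlon, lon1)]:
        sub = [p for p in points
               if b[0] <= p[0] < b[1] and b[2] <= p[1] < b[3]]
        if sub:
            leaves += build_grid(sub, b, max_users)
    return leaves

# Invented training points: a dense cluster plus two sparse outliers.
pts = [(10, 10), (11, 11), (12, 12), (13, 13), (50, 100), (80, 170)]
leaves = build_grid(pts, (0, 90, 0, 180), max_users=2)
```

The dense cluster is split into small cells while the sparse region stays as one large cell, which is the point of an adaptive grid: label granularity tracks data density.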
Jumping Finite Automata for Tweet Comprehension
Every day, over one billion social media text messages are generated worldwide, providing abundant information that can improve people's lives through evidence-based decision making. Twitter is rich in such data, but there are a number of technical challenges in comprehending tweets, including the ambiguity of the language used in tweets, which is exacerbated in under-resourced languages. This paper presents an approach based on Jumping Finite Automata for automatic comprehension of tweets. We construct a WordNet for the language of Kenya (WoLK) based on an analysis of tweet structure, formalize the space of tweet variation, and abstract that space as a finite automaton. In addition, we present a software tool called the Automata-Aided Tweet Comprehension (ATC) tool, which takes raw tweets as input, preprocesses them, recognises the syntax, and extracts semantic information with an 86% success rate.
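A jumping finite automaton differs from an ordinary DFA in that it may consume the input symbols in any order, which loosely models the scrambled token order seen in informal tweets. The toy acceptance check below is illustrative only (it is not the paper's ATC tool); the transition table is an invented example.

```python
from collections import Counter

# Toy jumping-finite-automaton acceptance check (illustrative only):
# accept the input if SOME permutation of its symbols is accepted by
# the underlying DFA. We search over (state, multiset-of-remaining).
def jfa_accepts(delta, start, finals, word):
    """delta: dict mapping (state, symbol) -> next state."""
    def key(counter):
        return tuple(sorted(counter.items()))
    seen = set()
    stack = [(start, Counter(word))]
    while stack:
        state, remaining = stack.pop()
        if not remaining:                  # all symbols consumed
            if state in finals:
                return True
            continue
        k = (state, key(remaining))
        if k in seen:
            continue
        seen.add(k)
        for sym in list(remaining):        # "jump" to any remaining symbol
            nxt = delta.get((state, sym))
            if nxt is not None:
                r = remaining.copy()
                r[sym] -= 1
                if r[sym] == 0:
                    del r[sym]
                stack.append((nxt, r))
    return False

# Invented example: the DFA accepts exactly "ab" (q0 -a-> q1 -b-> q2),
# but the JFA also accepts the scrambled order "ba".
delta = {("q0", "a"): "q1", ("q1", "b"): "q2"}
```

`jfa_accepts(delta, "q0", {"q2"}, "ba")` holds even though a plain DFA would reject that order, which is the property that makes JFAs attractive for free-word-order text.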
Geotagging One Hundred Million Twitter Accounts with Total Variation Minimization
Geographically annotated social media is extremely valuable for modern
information retrieval. However, when researchers can only access
publicly-visible data, one quickly finds that social media users rarely publish
location information. In this work, we provide a method which can geolocate the
overwhelming majority of active Twitter users, independent of their location
sharing preferences, using only publicly-visible Twitter data.
Our method infers an unknown user's location by examining their friends'
locations. We frame the geotagging problem as an optimization over a social
network with a total variation-based objective and provide a scalable and
distributed algorithm for its solution. Furthermore, we show how a robust
estimate of the geographic dispersion of each user's ego network can be used as
a per-user accuracy measure which is effective at removing outlying errors.
Leave-many-out evaluation shows that our method is able to infer location for
101,846,236 Twitter users at a median error of 6.38 km, allowing us to geotag
over 80% of public tweets.
Comment: 9 pages, 8 figures, accepted to IEEE BigData 2014. Compton, Ryan,
David Jurgens, and David Allen. "Geotagging one hundred million twitter
accounts with total variation minimization." Big Data (Big Data), 2014 IEEE
International Conference on. IEEE, 2014.
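The friend-based inference idea can be sketched as an iterated neighbor median: repeatedly set each unknown user's location to the coordinate-wise median of their friends' current estimates. This is only an analogy to the paper's approach; the actual method solves a total-variation objective with a distributed algorithm, and the graph and coordinates below are invented.

```python
import statistics

# Simplified sketch of friend-based geolocation (NOT the paper's
# total-variation solver): iterate a coordinate-wise median of each
# unknown user's friends' current location estimates.
def infer_locations(edges, known, iters=10):
    """edges: dict user -> list of friends; known: user -> (lat, lon)."""
    est = dict(known)
    for _ in range(iters):
        for user, friends in edges.items():
            if user in known:
                continue  # never overwrite ground-truth locations
            pts = [est[f] for f in friends if f in est]
            if pts:
                est[user] = (statistics.median(p[0] for p in pts),
                             statistics.median(p[1] for p in pts))
    return est

# Invented example: user "X" is unlabeled; three friends are geotagged.
edges = {"X": ["A", "B", "C"], "A": ["X"], "B": ["X"], "C": ["X"]}
known = {"A": (40.7, -74.0), "B": (41.0, -73.5), "C": (39.9, -75.2)}
est = infer_locations(edges, known)
```

Using a median rather than a mean keeps the estimate robust to a single far-away friend, which echoes the abstract's use of ego-network geographic dispersion as a per-user accuracy signal.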