808 research outputs found
Inferring the Origin Locations of Tweets with Quantitative Confidence
Social Internet content plays an increasingly critical role in many domains,
including public health, disaster management, and politics. However, its
utility is limited by missing geographic information; for example, fewer than
1.6% of Twitter messages (tweets) contain a geotag. We propose a scalable,
content-based approach to estimate the location of tweets using a novel yet
simple variant of Gaussian mixture models. Further, because real-world
applications depend on quantified uncertainty for such estimates, we propose
novel metrics of accuracy, precision, and calibration, and we evaluate our
approach accordingly. Experiments on 13 million global, comprehensively
multi-lingual tweets show that our approach yields reliable, well-calibrated
results competitive with previous computationally intensive methods. We also
show that a relatively small number of training data are required for good
estimates (roughly 30,000 tweets) and models are quite time-invariant
(effective on tweets many weeks newer than the training set). Finally, we show
that toponyms and languages with small geographic footprint provide the most
useful location signals.
Comment: 14 pages, 6 figures. Version 2: Move mathematics to appendix, 2 new
references, various other presentation improvements. Version 3: Various
presentation improvements, accepted at ACM CSCW 201
Determine the User Country of a Tweet
On the widely used messaging platform Twitter, about 2% of tweets contain
the geographical location through exact GPS coordinates (latitude and
longitude). Knowing the location of a tweet is useful for many data analytics
questions. This research examines how to determine a location for
tweets that do not contain GPS coordinates. An accuracy of 82% was achieved
using a Naive Bayes model trained on features such as the user's timezone, the
user's language, and the parsed user location. The classifier performs well on
countries with many active Twitter users, such as the Netherlands and the United Kingdom. An
analysis of errors made by the classifier shows that mistakes were due to
limited information and properties shared between countries, such as a shared
timezone. A feature analysis was performed to assess the effect of
different features. Timezone and parsed user location were the
most informative features.
Comment: CTIT Technical Report, University of Twente
A Survey of Location Prediction on Twitter
Locations, e.g., countries, states, cities, and points of interest, are
central to news, emergency events, and people's daily lives. Automatic
identification of locations associated with or mentioned in documents has been
explored for decades. As one of the most popular online social network
platforms, Twitter has attracted a large number of users who send millions of
tweets on a daily basis. Due to the world-wide coverage of its users and
real-time freshness of tweets, location prediction on Twitter has gained
significant attention in recent years. Research efforts are spent on dealing
with new challenges and opportunities brought by the noisy, short, and
context-rich nature of tweets. In this survey, we aim at offering an overall
picture of location prediction on Twitter. Specifically, we concentrate on the
prediction of user home locations, tweet locations, and mentioned locations. We
first define the three tasks and review the evaluation metrics. By summarizing
Twitter network, tweet content, and tweet context as potential inputs, we then
structurally highlight how the problems depend on these inputs. Each dependency
is illustrated by a comprehensive review of the corresponding strategies
adopted in state-of-the-art approaches. In addition, we also briefly review two
related problems, i.e., semantic location prediction and point-of-interest
recommendation. Finally, we list future research directions.
Comment: Accepted to TKDE. 30 pages, 1 figure
On the Accuracy of Hyper-local Geotagging of Social Media Content
Social media users share billions of items per year, only a small fraction of
which is geotagged. We present a data-driven approach for identifying
non-geotagged content items that can be associated with a hyper-local
geographic area by modeling the location distributions of hyper-local n-grams
that appear in the text. We explore the trade-off between accuracy, precision
and coverage of this method. Further, we explore differences across content
received from multiple platforms and devices, and show, for example, that
content shared via different sources and applications produces significantly
different geographic distributions, and that it is best to model and predict
location for items according to their source. Our findings show the potential
and the bounds of a data-driven approach to geotag short social media texts,
and offer implications for all applications that use data-driven approaches to
locate content.
Comment: 10 pages
Immigrant community integration in world cities
As a consequence of the accelerated globalization process, today major cities
all over the world are characterized by an increasing multiculturalism. The
integration of immigrant communities may be affected by social polarization and
spatial segregation. How are these dynamics evolving over time? To what extent
are the different policies launched to tackle these problems working? These are
critical questions traditionally addressed by studies based on surveys and
census data. Such sources are largely free of spurious biases, but data
collection is intensive and rather expensive. Here, we conduct a
comprehensive study on immigrant integration in 53 world cities by introducing
an innovative approach: an analysis of the spatio-temporal communication
patterns of immigrant and local communities based on language detection in
Twitter and on novel metrics of spatial integration. We quantify the "Power of
Integration" of cities --their capacity to spatially integrate diverse
cultures-- and characterize the relations between different cultures when
acting as hosts or immigrants.
Comment: 13 pages, 5 figures + Appendix
Continuous Representation of Location for Geolocation and Lexical Dialectology using Mixture Density Networks
We propose a method for embedding two-dimensional locations in a continuous
vector space using a neural network-based model incorporating mixtures of
Gaussian distributions, presenting two model variants for text-based
geolocation and lexical dialectology. Evaluated over Twitter data, the proposed
model outperforms conventional regression-based geolocation and provides a
better estimate of uncertainty. We also show the effectiveness of the
representation for predicting words from location in lexical dialectology, and
evaluate it using the DARE dataset.
Comment: Conference on Empirical Methods in Natural Language Processing (EMNLP
2017) September 2017, Copenhagen, Denmark
Mapping auroral activity with Twitter
Twitter is a popular, publicly-accessible, social media service that has proven useful in mapping large-scale events in real-time. In this study, for the first time, the use of Twitter as a measure of auroral activity is investigated. Peaks in the number of aurora-related tweets are found to frequently coincide with geomagnetic disturbances (detection rate of 91%). Additionally, the number of daily aurora-related tweets is found to strongly correlate with several auroral strength proxies (r_avg ≈ 0.7). An examination is made of the bias for location and time of day within Twitter data, and a first order correction of these effects is presented. Overall, the results suggest that Twitter can provide both specific details about an individual aurora and accurate real-time indication of when, and even from where, an aurora is visible.