392 research outputs found
Continuous Representation of Location for Geolocation and Lexical Dialectology using Mixture Density Networks
We propose a method for embedding two-dimensional locations in a continuous
vector space using a neural network-based model incorporating mixtures of
Gaussian distributions, presenting two model variants for text-based
geolocation and lexical dialectology. Evaluated over Twitter data, the proposed
model outperforms conventional regression-based geolocation and provides a
better estimate of uncertainty. We also show the effectiveness of the
representation for predicting words from location in lexical dialectology, and
evaluate it using the DARE dataset.Comment: Conference on Empirical Methods in Natural Language Processing (EMNLP
2017) September 2017, Copenhagen, Denmar
A Survey of Location Prediction on Twitter
Locations, e.g., countries, states, cities, and point-of-interests, are
central to news, emergency events, and people's daily lives. Automatic
identification of locations associated with or mentioned in documents has been
explored for decades. As one of the most popular online social network
platforms, Twitter has attracted a large number of users who send millions of
tweets on daily basis. Due to the world-wide coverage of its users and
real-time freshness of tweets, location prediction on Twitter has gained
significant attention in recent years. Research efforts are spent on dealing
with new challenges and opportunities brought by the noisy, short, and
context-rich nature of tweets. In this survey, we aim at offering an overall
picture of location prediction on Twitter. Specifically, we concentrate on the
prediction of user home locations, tweet locations, and mentioned locations. We
first define the three tasks and review the evaluation metrics. By summarizing
Twitter network, tweet content, and tweet context as potential inputs, we then
structurally highlight how the problems depend on these inputs. Each dependency
is illustrated by a comprehensive review of the corresponding strategies
adopted in state-of-the-art approaches. In addition, we also briefly review two
related problems, i.e., semantic location prediction and point-of-interest
recommendation. Finally, we list future research directions.Comment: Accepted to TKDE. 30 pages, 1 figur
A Hierarchical Location Prediction Neural Network for Twitter User Geolocation
Accurate estimation of user location is important for many online services.
Previous neural network based methods largely ignore the hierarchical structure
among locations. In this paper, we propose a hierarchical location prediction
neural network for Twitter user geolocation. Our model first predicts the home
country for a user, then uses the country result to guide the city-level
prediction. In addition, we employ a character-aware word embedding layer to
overcome the noisy information in tweets. With the feature fusion layer, our
model can accommodate various feature combinations and achieves
state-of-the-art results over three commonly used benchmarks under different
feature settings. It not only improves the prediction accuracy but also greatly
reduces the mean error distance.Comment: Accepted by EMNLP 201
A Transformer-based Framework for POI-level Social Post Geolocation
POI-level geo-information of social posts is critical to many location-based
applications and services. However, the multi-modality, complexity and diverse
nature of social media data and their platforms limit the performance of
inferring such fine-grained locations and their subsequent applications. To
address this issue, we present a transformer-based general framework, which
builds upon pre-trained language models and considers non-textual data, for
social post geolocation at the POI level. To this end, inputs are categorized
to handle different social data, and an optimal combination strategy is
provided for feature representations. Moreover, a uniform representation of
hierarchy is proposed to learn temporal information, and a concatenated version
of encodings is employed to capture feature-wise positions better. Experimental
results on various social datasets demonstrate that three variants of our
proposed framework outperform multiple state-of-art baselines by a large margin
in terms of accuracy and distance error metrics.Comment: Full papers are 12 pages in length plus additional 4 pages for
references (turns to 18 pages in total after submitting to arxiv). One figure
and 5 tables are contained. This paper was submitted to ECIR 2023 for revie
Influence of geographic biases on geolocation prediction in Twitter
Geolocating Twitter users --- the task of identifying their home locations --- serves a wide range of community and business applications such as managing natural crises, journalism, and public health. While users can record their location on their profiles, more than 34% record fake or sarcastic locations. Twitter allows users to GPS locate their content, however, less than 1% of tweets are geotagged. Therefore, inferring user location has been an important field of investigation since 2010. This thesis investigates two of the most important factors which can affect the quality of inferring user location: (i) the influence of tweet-language; and (ii) the effectiveness of the evaluation process. Previous research observed that Twitter users writing in some languages appeared to be easier to locate than those writing in others. They speculated that the geographic coverage of a language (language bias) --- represented by the number of locations where the tweets of a specific language come from --- played an important role in determining location accuracy. So important was this role that accuracy might be largely predictable by considering language alone. In this thesis, I investigate the influence of language bias on the accuracy of geolocating Twitter users. The analysis, using a large corpus of tweets written in thirteen languages and a re-implemented state-of-the-art geolocation model back at the time, provides a new understanding of the reasons behind reported performance disparities between languages. The results show that data imbalance in the distribution of Twitter users over locations (population bias) has a greater impact on accuracy than language bias. A comparison between micro and macro averaging demonstrates that existing evaluation approaches are less appropriate than previously thought. The results suggest both averaging approaches should be used to effectively evaluate geolocation. Many approaches have been proposed for automatically geolocating users; at the same time, various evaluation metrics have been proposed to measure the effectiveness of these approaches, making it challenging to understand which of these metrics is the most suitable for this task. In this thesis, I provide a standardized evaluation framework for geolocation systems. The framework is employed to analyze fifteen Twitter user geolocation models and two baselines in a controlled experimental setting. The models are composed of the re-implemented model and a variation of it, two locally retrained open source models and the results of eleven models submitted to a shared task. Models are evaluated using ten metrics --- out of fourteen employed in previous research --- over four geographic granularities. Rank correlations and thorough statistical analysis are used to assess the effectiveness of these metrics. The results demonstrate that the choice of effectiveness metric can have a substantial impact on the conclusions drawn from a geolocation system experiment, potentially leading experimenters to contradictory results about relative effectiveness. For general evaluations, a range of performance metrics should be reported, to ensure that a complete picture of system effectiveness is conveyed. Although a lot of complex geolocation algorithms have been applied in recent years, a majority class baseline is still competitive at coarse geographic granularity. A suite of statistical analysis tests is proposed, based on the employed metric, to ensure that the results are not coincidental
- …