Geolocating Twitter users---the task of identifying their home
locations---serves a wide range of community and business applications such as
managing natural crises, journalism, and public health. Many approaches have
been proposed for automatically geolocating users based on their tweets; at the
same time, various evaluation metrics have been proposed to measure the
effectiveness of these approaches, making it challenging to understand which of
these metrics is the most suitable for this task. In this paper, we propose a
guide for a standardized evaluation of Twitter user geolocation by analyzing
fifteen models and two baselines in a controlled experimental setting. Models
are evaluated using ten metrics over four geographic granularities. We use rank
correlations to assess the effectiveness of these metrics.
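
As an illustration of the rank-correlation comparison, the following minimal sketch computes Kendall's tau between the model rankings induced by two metrics. The abstract does not say which rank correlation is used, so Kendall's tau is an assumption here, and the metric names and scores are hypothetical values chosen only for illustration.

```python
from itertools import combinations


def kendall_tau(xs, ys):
    """Kendall's tau-a between two equal-length score lists (no tie correction)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least two items")
    concordant = discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        sign = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if sign > 0:
            concordant += 1      # both metrics order this pair of models the same way
        elif sign < 0:
            discordant += 1      # the two metrics disagree on this pair
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical scores for four models under two metrics (illustrative only).
accuracy = [0.52, 0.47, 0.61, 0.39]               # higher is better
median_error_km = [310.0, 250.0, 120.0, 540.0]    # lower is better

# Negate the error so that "higher is better" holds for both lists.
tau = kendall_tau(accuracy, [-e for e in median_error_km])
print(f"Kendall's tau between the two metrics: {tau:.3f}")
```

A tau close to 1 would mean the two metrics rank the models almost identically, while a low or negative tau would mean that conclusions about relative effectiveness depend on which metric is reported.
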
Our results demonstrate that the choice of effectiveness metric can have a
substantial impact on the conclusions drawn from a geolocation system
experiment, potentially leading experimenters to contradictory results about
relative effectiveness. We show that for general evaluations, a range of
performance metrics should be reported, to ensure that a complete picture of
system effectiveness is conveyed. Given the global geographic coverage of this
task, we specifically recommend evaluation at micro versus macro levels to
measure the impact of the bias in the distribution over locations. Although many
complex geolocation algorithms have been applied in recent years, a majority
class baseline is still competitive at coarse geographic granularity. We
propose a suite of statistical analysis tests, based on the employed metric, to
ensure that the results are not coincidental.

Comment: Accepted in the journal ACM Transactions on Social Computing
(TSC). Extended version of the ASONAM 2018 short paper. Please cite the
TSC/ASONAM version and not the arXiv version.