243 research outputs found
On the Accuracy of Hyper-local Geotagging of Social Media Content
Social media users share billions of items per year, only a small fraction of
which is geotagged. We present a data- driven approach for identifying
non-geotagged content items that can be associated with a hyper-local
geographic area by modeling the location distributions of hyper-local n-grams
that appear in the text. We explore the trade-off between accuracy, precision
and coverage of this method. Further, we explore differences across content
received from multiple platforms and devices, and show, for example, that
content shared via different sources and applications produces significantly
different geographic distributions, and that it is best to model and predict
location for items according to their source. Our findings show the potential
and the bounds of a data-driven approach to geotag short social media texts,
and offer implications for all applications that use data-driven approaches to
locate content.Comment: 10 page
Analyzing the Language of Food on Social Media
We investigate the predictive power behind the language of food on social
media. We collect a corpus of over three million food-related posts from
Twitter and demonstrate that many latent population characteristics can be
directly predicted from this data: overweight rate, diabetes rate, political
leaning, and home geographical location of authors. For all tasks, our
language-based models significantly outperform the majority-class baselines.
Performance is further improved with more complex natural language processing,
such as topic modeling. We analyze which textual features have most predictive
power for these datasets, providing insight into the connections between the
language of food, geographic locale, and community characteristics. Lastly, we
design and implement an online system for real-time query and visualization of
the dataset. Visualization tools, such as geo-referenced heatmaps,
semantics-preserving wordclouds and temporal histograms, allow us to discover
more complex, global patterns mirrored in the language of food.Comment: An extended abstract of this paper will appear in IEEE Big Data 201
Exploiting Text and Network Context for Geolocation of Social Media Users
Research on automatically geolocating social media users has conventionally
been based on the text content of posts from a given user or the social network
of the user, with very little crossover between the two, and no bench-marking
of the two approaches over compara- ble datasets. We bring the two threads of
research together in first proposing a text-based method based on adaptive
grids, followed by a hybrid network- and text-based method. Evaluating over
three Twitter datasets, we show that the empirical difference between text- and
network-based methods is not great, and that hybridisation of the two is
superior to the component methods, especially in contexts where the user graph
is not well connected. We achieve state-of-the-art results on all three
datasets
A Coherent Unsupervised Model for Toponym Resolution
Toponym Resolution, the task of assigning a location mention in a document to
a geographic referent (i.e., latitude/longitude), plays a pivotal role in
analyzing location-aware content. However, the ambiguities of natural language
and a huge number of possible interpretations for toponyms constitute
insurmountable hurdles for this task. In this paper, we study the problem of
toponym resolution with no additional information other than a gazetteer and no
training data. We demonstrate that a dearth of large enough annotated data
makes supervised methods less capable of generalizing. Our proposed method
estimates the geographic scope of documents and leverages the connections
between nearby place names as evidence to resolve toponyms. We explore the
interactions between multiple interpretations of mentions and the relationships
between different toponyms in a document to build a model that finds the most
coherent resolution. Our model is evaluated on three news corpora, two from the
literature and one collected and annotated by us; then, we compare our methods
to the state-of-the-art unsupervised and supervised techniques. We also examine
three commercial products including Reuters OpenCalais, Yahoo! YQL Placemaker,
and Google Cloud Natural Language API. The evaluation shows that our method
outperforms the unsupervised technique as well as Reuters OpenCalais and Google
Cloud Natural Language API on all three corpora; also, our method shows a
performance close to that of the state-of-the-art supervised method and
outperforms it when the test data has 40% or more toponyms that are not seen in
the training data.Comment: 9 pages (+1 page reference), WWW '18 Proceedings of the 2018 World
Wide Web Conferenc
A Survey of Location Prediction on Twitter
Locations, e.g., countries, states, cities, and point-of-interests, are
central to news, emergency events, and people's daily lives. Automatic
identification of locations associated with or mentioned in documents has been
explored for decades. As one of the most popular online social network
platforms, Twitter has attracted a large number of users who send millions of
tweets on daily basis. Due to the world-wide coverage of its users and
real-time freshness of tweets, location prediction on Twitter has gained
significant attention in recent years. Research efforts are spent on dealing
with new challenges and opportunities brought by the noisy, short, and
context-rich nature of tweets. In this survey, we aim at offering an overall
picture of location prediction on Twitter. Specifically, we concentrate on the
prediction of user home locations, tweet locations, and mentioned locations. We
first define the three tasks and review the evaluation metrics. By summarizing
Twitter network, tweet content, and tweet context as potential inputs, we then
structurally highlight how the problems depend on these inputs. Each dependency
is illustrated by a comprehensive review of the corresponding strategies
adopted in state-of-the-art approaches. In addition, we also briefly review two
related problems, i.e., semantic location prediction and point-of-interest
recommendation. Finally, we list future research directions.Comment: Accepted to TKDE. 30 pages, 1 figur
A pragmatic guide to geoparsing evaluation
Abstract: Empirical methods in geoparsing have thus far lacked a standard evaluation framework describing the task, metrics and data used to compare state-of-the-art systems. Evaluation is further made inconsistent, even unrepresentative of real world usage by the lack of distinction between the different types of toponyms, which necessitates new guidelines, a consolidation of metrics and a detailed toponym taxonomy with implications for Named Entity Recognition (NER) and beyond. To address these deficiencies, our manuscript introduces a new framework in three parts. (Part 1) Task Definition: clarified via corpus linguistic analysis proposing a fine-grained Pragmatic Taxonomy of Toponyms. (Part 2) Metrics: discussed and reviewed for a rigorous evaluation including recommendations for NER/Geoparsing practitioners. (Part 3) Evaluation data: shared via a new dataset called GeoWebNews to provide test/train examples and enable immediate use of our contributions. In addition to fine-grained Geotagging and Toponym Resolution (Geocoding), this dataset is also suitable for prototyping and evaluating machine learning NLP models
Recommended from our members
Where are you talking about? Advances and Challenges of Geographic Analysis of Text with Application to Disease Monitoring
The Natural Language Processing task we focus on in this thesis is Geoparsing. Geoparsing is the process of extraction and grounding of toponyms (place names). Consider this sentence: "The victims of the Spanish earthquake off the coast of Malaga were of American and Mexican origin." Four toponyms will be extracted (called Geotagging) and grounded to their geographic coordinates (called Toponym Resolution). However, our research goes further than any previous work by showing how to distinguish the literal place(s) of the event (Spain, Malaga) from other linguistic types/uses such as nationalities (Mexican, American), improving downstream task accuracy. We consolidate and extend the Standard Evaluation Framework, discuss key research problems, then present concrete solutions in order to advance each stage of geoparsing. For geotagging, as well as training a SOTA neural Location-NER tagger, we simplify Metonymy Resolution with a novel minimalist feature extraction combined with an LSTM-based classifier, matching SOTA results. For toponym resolution, we deploy the latest deep learning methods to achieve SOTA performance by augmenting neural models with hitherto unused geographic features called Map Vectors. With each research project, we provide high-quality datasets and system prototypes, further building resources in this field. We then show how these geoparsing advances coupled with our proposed Intra-Document Analysis can be used to associate news articles with locations in order to monitor the spread of public health threats. To this end, we evaluate our research contributions with production data from a real-time downstream application to improve geolocation of news events for disease monitoring. The data was made available to us by the Joint Research Centre (JRC), which operates one such system called MediSys that processes incoming news articles in order to monitor threats to public health and make these available to a variety of governmental, business and non-profit organisations. We also discuss steps towards an end-to-end, automated news monitoring system and make actionable recommendations for future work. In summary, the thesis aims are twofold: (1) Generate original geoparsing research aimed at advancing each stage of the pipeline by addressing pertinent challenges with concrete solutions and actionable proposals. (2) Demonstrate how this research can be applied to news event monitoring to increase the efficacy of existing biosurveillance systems, e.g. European Commission’s MediSys.I was generously funded by DREAM CDT, which was funded by NERC of UKRI
"I’m Eating a Sandwich in Glasgow": Modeling locations with tweets
Social media such as Twitter generate large quantities of data about what a person is thinking and doing in a partic- ular location. We leverage this data to build models of locations to improve our understanding of a user’s geographic context. Understanding the user’s geographic context can in turn enable a variety of services that allow us to present information, recommend businesses and services, and place advertisements that are relevant at a hyper-local level.
In this paper we create language models of locations using coordinates extracted from geotagged Twitter data. We model locations at varying levels of granularity, from the zip code to the country level. We measure the accuracy of these models by the degree to which we can predict the location of an individual tweet, and further by the accuracy with which we can predict the location of a user. We find that we can meet the performance of the industry standard tool for pre- dicting both the tweet and the user at the country, state and city levels, and far exceed its performance at the hyper-local level, achieving a three- to ten-fold increase in accuracy at the zip code level
Geolocation Predicting of Tweets Using BERT-Based Models
This research is aimed to solve the tweet/user geolocation prediction task
and provide a flexible methodology for the geotagging of textual big data. The
suggested approach implements neural networks for natural language processing
(NLP) to estimate the location as coordinate pairs (longitude, latitude) and
two-dimensional Gaussian Mixture Models (GMMs). The scope of proposed models
has been finetuned on a Twitter dataset using pretrained Bidirectional Encoder
Representations from Transformers (BERT) as base models. Performance metrics
show a median error of fewer than 30 km on a worldwide-level, and fewer than 15
km on the US-level datasets for the models trained and evaluated on text
features of tweets' content and metadata context.Comment: 27 pages, 6 tables, 7 figure
- …