1,895 research outputs found
On the Accuracy of Hyper-local Geotagging of Social Media Content
Social media users share billions of items per year, only a small fraction of
which is geotagged. We present a data-driven approach for identifying
non-geotagged content items that can be associated with a hyper-local
geographic area by modeling the location distributions of hyper-local n-grams
that appear in the text. We explore the trade-off between accuracy, precision
and coverage of this method. Further, we explore differences across content
received from multiple platforms and devices, and show, for example, that
content shared via different sources and applications produces significantly
different geographic distributions, and that it is best to model and predict
location for items according to their source. Our findings show the potential
and the bounds of a data-driven approach to geotag short social media texts,
and offer implications for all applications that use data-driven approaches to
locate content.
Comment: 10 page
Analyzing the Language of Food on Social Media
We investigate the predictive power behind the language of food on social
media. We collect a corpus of over three million food-related posts from
Twitter and demonstrate that many latent population characteristics can be
directly predicted from this data: overweight rate, diabetes rate, political
leaning, and home geographical location of authors. For all tasks, our
language-based models significantly outperform the majority-class baselines.
Performance is further improved with more complex natural language processing,
such as topic modeling. We analyze which textual features have most predictive
power for these datasets, providing insight into the connections between the
language of food, geographic locale, and community characteristics. Lastly, we
design and implement an online system for real-time query and visualization of
the dataset. Visualization tools, such as geo-referenced heatmaps,
semantics-preserving wordclouds and temporal histograms, allow us to discover
more complex, global patterns mirrored in the language of food.
Comment: An extended abstract of this paper will appear in IEEE Big Data 201
Confounds and Consequences in Geotagged Twitter Data
Twitter is often used in quantitative studies that identify
geographically-preferred topics, writing styles, and entities. These studies
rely on either GPS coordinates attached to individual messages, or on the
user-supplied location field in each profile. In this paper, we compare these
data acquisition techniques and quantify the biases that they introduce; we
also measure their effects on linguistic analysis and text-based geolocation.
GPS-tagging and self-reported locations yield measurably different corpora, and
these linguistic differences are partially attributable to differences in
dataset composition by age and gender. Using a latent variable model to induce
age and gender, we show how these demographic variables interact with geography
to affect language use. We also show that the accuracy of text-based
geolocation varies with population demographics, giving the best results for
men above the age of 40.
Comment: final version for EMNLP 201
Extracting Semantics of Individual Places from Movement Data by Analyzing Temporal Patterns of Visits
Data reflecting movements of people, such as GPS or GSM tracks, can be a source of information about mobility behaviors and activities of people. Such information is required for various kinds of spatial planning in the public and business sectors. Movement data by themselves are semantically poor. Meaningful information can be derived by means of interactive visual analysis performed by a human expert; however, this is only possible for data about a small number of people. We suggest an approach that allows scaling to large datasets reflecting movements of numerous people. It includes extracting stops, clustering them to identify personal places of interest (POIs), and creating temporal signatures of the POIs characterizing the temporal distribution of the stops with respect to the daily and weekly time cycles and the time line. The analyst can give meanings to selected POIs based on their temporal signatures (i.e., classify them as home, work, etc.), and then POIs with similar signatures can be classified automatically. We demonstrate the possibilities for interactive visual semantic analysis using GSM, GPS, and Twitter data as examples. GPS data allow inferring richer semantic information, but temporal signatures alone may be insufficient for interpreting short stops. Twitter data are similar to GSM data but additionally contain message texts, which can help in place interpretation. We plan to develop an intelligent system that learns how to classify personal places and trips while a human analyst visually analyzes and semantically annotates selected subsets of movement data.