28 research outputs found
Recommended from our members
Tracing the German Centennial Flood in the Stream of Tweets: First Lessons Learned
Social microblogging services such as Twitter result in massive streams of georeferenced messages and geolocated status updates. This real-time source of information is invaluable for many application areas, in particular for disaster detection and response scenarios. Consequently, a considerable number of works has dealt with issues of their acquisition, analysis and visualization. Most of these works not only assume an appropriate percentage of georeferenced messages that allows for detecting relevant events for a specific region and time frame, but also that these geolocations are reasonably correct in representing places and times of the underlying spatio-temporal situation. In this paper, we review these two key assumption based on the results of applying a visual analytics approach to a dataset of georeferenced Tweets from Germany over eight months witnessing several large-scale flooding situations throughout the country. Our results con rm the potential of Twitter as a distributed 'social sensor' but at the same time highlight some caveats in interpreting immediate results. To overcome these limits we explore incorporating evidence from other data sources including further social media and mobile phone network metrics to detect, confirm and refine events with respect to location and time. We summarize the lessons learned from our initial analysis by proposing recommendations and outline possible future work directions
On the Accuracy of Hyper-local Geotagging of Social Media Content
Social media users share billions of items per year, only a small fraction of
which is geotagged. We present a data- driven approach for identifying
non-geotagged content items that can be associated with a hyper-local
geographic area by modeling the location distributions of hyper-local n-grams
that appear in the text. We explore the trade-off between accuracy, precision
and coverage of this method. Further, we explore differences across content
received from multiple platforms and devices, and show, for example, that
content shared via different sources and applications produces significantly
different geographic distributions, and that it is best to model and predict
location for items according to their source. Our findings show the potential
and the bounds of a data-driven approach to geotag short social media texts,
and offer implications for all applications that use data-driven approaches to
locate content.Comment: 10 page
Language influences on tweeter geolocation
We investigate the influence of language on the accuracy of geolocating Twitter users. Our analysis, using a large corpus of tweets written in thirteen languages, provides a new understanding of the reasons behind reported performance disparities between languages. The results show that data imbalance has a greater impact on accuracy than geographical coverage. A comparison between micro and macro averaging demonstrates that existing evaluation approaches are less appropriate than previously thought. Our results suggest both averaging approaches should be used to effectively evaluate geolocation
Retrieval and interpretation of textual geolocalized information based on semantic geolocalized relations
This paper describes a method for geolocalized information retrieval from natural language text and its interpretation by assigning them geographic coordinates. A proof-of-concept implementation is discussed, along with geolocalized dictionary stored in PostGIS/PostgreSQL spatial relational database. Discussed research focuses on strongly inflectional Polish language, hence additional complexity had to be taken into account. Presented method has been evaluated with the use of diverse metrics
Inferring the Origin Locations of Tweets with Quantitative Confidence
Social Internet content plays an increasingly critical role in many domains,
including public health, disaster management, and politics. However, its
utility is limited by missing geographic information; for example, fewer than
1.6% of Twitter messages (tweets) contain a geotag. We propose a scalable,
content-based approach to estimate the location of tweets using a novel yet
simple variant of gaussian mixture models. Further, because real-world
applications depend on quantified uncertainty for such estimates, we propose
novel metrics of accuracy, precision, and calibration, and we evaluate our
approach accordingly. Experiments on 13 million global, comprehensively
multi-lingual tweets show that our approach yields reliable, well-calibrated
results competitive with previous computationally intensive methods. We also
show that a relatively small number of training data are required for good
estimates (roughly 30,000 tweets) and models are quite time-invariant
(effective on tweets many weeks newer than the training set). Finally, we show
that toponyms and languages with small geographic footprint provide the most
useful location signals.Comment: 14 pages, 6 figures. Version 2: Move mathematics to appendix, 2 new
references, various other presentation improvements. Version 3: Various
presentation improvements, accepted at ACM CSCW 201
Towards Real-Time, Country-Level Location Classification of Worldwide Tweets
In contrast to much previous work that has focused on location classification
of tweets restricted to a specific country, here we undertake the task in a
broader context by classifying global tweets at the country level, which is so
far unexplored in a real-time scenario. We analyse the extent to which a
tweet's country of origin can be determined by making use of eight
tweet-inherent features for classification. Furthermore, we use two datasets,
collected a year apart from each other, to analyse the extent to which a model
trained from historical tweets can still be leveraged for classification of new
tweets. With classification experiments on all 217 countries in our datasets,
as well as on the top 25 countries, we offer some insights into the best use of
tweet-inherent features for an accurate country-level classification of tweets.
We find that the use of a single feature, such as the use of tweet content
alone -- the most widely used feature in previous work -- leaves much to be
desired. Choosing an appropriate combination of both tweet content and metadata
can actually lead to substantial improvements of between 20\% and 50\%. We
observe that tweet content, the user's self-reported location and the user's
real name, all of which are inherent in a tweet and available in a real-time
scenario, are particularly useful to determine the country of origin. We also
experiment on the applicability of a model trained on historical tweets to
classify new tweets, finding that the choice of a particular combination of
features whose utility does not fade over time can actually lead to comparable
performance, avoiding the need to retrain. However, the difficulty of achieving
accurate classification increases slightly for countries with multiple
commonalities, especially for English and Spanish speaking countries.Comment: Accepted for publication in IEEE Transactions on Knowledge and Data
Engineering (IEEE TKDE