192 research outputs found

    Exploiting Text and Network Context for Geolocation of Social Media Users

    Full text link
    Research on automatically geolocating social media users has conventionally been based on the text content of posts from a given user or the social network of the user, with very little crossover between the two, and no bench-marking of the two approaches over compara- ble datasets. We bring the two threads of research together in first proposing a text-based method based on adaptive grids, followed by a hybrid network- and text-based method. Evaluating over three Twitter datasets, we show that the empirical difference between text- and network-based methods is not great, and that hybridisation of the two is superior to the component methods, especially in contexts where the user graph is not well connected. We achieve state-of-the-art results on all three datasets

    Towards Real-Time, Country-Level Location Classification of Worldwide Tweets

    Get PDF
    In contrast to much previous work that has focused on location classification of tweets restricted to a specific country, here we undertake the task in a broader context by classifying global tweets at the country level, which is so far unexplored in a real-time scenario. We analyse the extent to which a tweet's country of origin can be determined by making use of eight tweet-inherent features for classification. Furthermore, we use two datasets, collected a year apart from each other, to analyse the extent to which a model trained from historical tweets can still be leveraged for classification of new tweets. With classification experiments on all 217 countries in our datasets, as well as on the top 25 countries, we offer some insights into the best use of tweet-inherent features for an accurate country-level classification of tweets. We find that the use of a single feature, such as the use of tweet content alone -- the most widely used feature in previous work -- leaves much to be desired. Choosing an appropriate combination of both tweet content and metadata can actually lead to substantial improvements of between 20\% and 50\%. We observe that tweet content, the user's self-reported location and the user's real name, all of which are inherent in a tweet and available in a real-time scenario, are particularly useful to determine the country of origin. We also experiment on the applicability of a model trained on historical tweets to classify new tweets, finding that the choice of a particular combination of features whose utility does not fade over time can actually lead to comparable performance, avoiding the need to retrain. However, the difficulty of achieving accurate classification increases slightly for countries with multiple commonalities, especially for English and Spanish speaking countries.Comment: Accepted for publication in IEEE Transactions on Knowledge and Data Engineering (IEEE TKDE

    Language influences on tweeter geolocation

    Get PDF
    We investigate the influence of language on the accuracy of geolocating Twitter users. Our analysis, using a large corpus of tweets written in thirteen languages, provides a new understanding of the reasons behind reported performance disparities between languages. The results show that data imbalance has a greater impact on accuracy than geographical coverage. A comparison between micro and macro averaging demonstrates that existing evaluation approaches are less appropriate than previously thought. Our results suggest both averaging approaches should be used to effectively evaluate geolocation

    A Survey of Location Prediction on Twitter

    Full text link
    Locations, e.g., countries, states, cities, and point-of-interests, are central to news, emergency events, and people's daily lives. Automatic identification of locations associated with or mentioned in documents has been explored for decades. As one of the most popular online social network platforms, Twitter has attracted a large number of users who send millions of tweets on daily basis. Due to the world-wide coverage of its users and real-time freshness of tweets, location prediction on Twitter has gained significant attention in recent years. Research efforts are spent on dealing with new challenges and opportunities brought by the noisy, short, and context-rich nature of tweets. In this survey, we aim at offering an overall picture of location prediction on Twitter. Specifically, we concentrate on the prediction of user home locations, tweet locations, and mentioned locations. We first define the three tasks and review the evaluation metrics. By summarizing Twitter network, tweet content, and tweet context as potential inputs, we then structurally highlight how the problems depend on these inputs. Each dependency is illustrated by a comprehensive review of the corresponding strategies adopted in state-of-the-art approaches. In addition, we also briefly review two related problems, i.e., semantic location prediction and point-of-interest recommendation. Finally, we list future research directions.Comment: Accepted to TKDE. 30 pages, 1 figur

    A Survey on Classification of Geolocation of Country from Worldwide Tweets

    Get PDF
    Social media are progressively being utilized as a part of mainstream researchers as a key wellspring of information to help comprehend differing common and social term, and this has prompted the advancement of an extensive variety of computational information mining apparatuses that can remove learning from web-based social networking for both ad-hoc and ongoing examination. The expansion of enthusiasm for utilizing web-based social networking as a hotspot for look into has roused handling the test of consequently geolocating tweets, given the absence of express area data in the lion's share of tweets. As opposed to much past work that has concentrated on area grouping of tweets limited to a particular nation, here we attempt the assignment in a more extensive setting by ordering worldwide tweets at the country level, which is so far unexplored in an ongoing situation. We break down the degree to which a tweet's nation of starting point can be dictated by making utilization of eight tweet-inherent highlights for classification

    #Santiago is not #Chile, or is it? A Model to Normalize Social Media Impact

    Full text link
    Online social networks are known to be demographically biased. Currently there are questions about what degree of representativity of the physical population they have, and how population biases impact user-generated content. In this paper we focus on centralism, a problem affecting Chile. Assuming that local differences exist in a country, in terms of vocabulary, we built a methodology based on the vector space model to find distinctive content from different locations, and use it to create classifiers to predict whether the content of a micro-post is related to a particular location, having in mind a geographically diverse selection of micro-posts. We evaluate them in a case study where we analyze the virtual population of Chile that participated in the Twitter social network during an event of national relevance: the municipal (local governments) elections held in 2012. We observe that the participating virtual population is spatially representative of the physical population, implying that there is centralism in Twitter. Our classifiers out-perform a non geographically-diverse baseline at the regional level, and have the same accuracy at a provincial level. However, our approach makes assumptions that need to be tested in multi-thematic and more general datasets. We leave this for future work.Comment: Accepted in ChileCHI 2013, I Chilean Conference on Human-Computer Interactio

    Influence of geographic biases on geolocation prediction in Twitter

    Get PDF
    Geolocating Twitter users --- the task of identifying their home locations --- serves a wide range of community and business applications such as managing natural crises, journalism, and public health. While users can record their location on their profiles, more than 34% record fake or sarcastic locations. Twitter allows users to GPS locate their content, however, less than 1% of tweets are geotagged. Therefore, inferring user location has been an important field of investigation since 2010. This thesis investigates two of the most important factors which can affect the quality of inferring user location: (i) the influence of tweet-language; and (ii) the effectiveness of the evaluation process. Previous research observed that Twitter users writing in some languages appeared to be easier to locate than those writing in others. They speculated that the geographic coverage of a language (language bias) --- represented by the number of locations where the tweets of a specific language come from --- played an important role in determining location accuracy. So important was this role that accuracy might be largely predictable by considering language alone. In this thesis, I investigate the influence of language bias on the accuracy of geolocating Twitter users. The analysis, using a large corpus of tweets written in thirteen languages and a re-implemented state-of-the-art geolocation model back at the time, provides a new understanding of the reasons behind reported performance disparities between languages. The results show that data imbalance in the distribution of Twitter users over locations (population bias) has a greater impact on accuracy than language bias. A comparison between micro and macro averaging demonstrates that existing evaluation approaches are less appropriate than previously thought. The results suggest both averaging approaches should be used to effectively evaluate geolocation. Many approaches have been proposed for automatically geolocating users; at the same time, various evaluation metrics have been proposed to measure the effectiveness of these approaches, making it challenging to understand which of these metrics is the most suitable for this task. In this thesis, I provide a standardized evaluation framework for geolocation systems. The framework is employed to analyze fifteen Twitter user geolocation models and two baselines in a controlled experimental setting. The models are composed of the re-implemented model and a variation of it, two locally retrained open source models and the results of eleven models submitted to a shared task. Models are evaluated using ten metrics --- out of fourteen employed in previous research --- over four geographic granularities. Rank correlations and thorough statistical analysis are used to assess the effectiveness of these metrics. The results demonstrate that the choice of effectiveness metric can have a substantial impact on the conclusions drawn from a geolocation system experiment, potentially leading experimenters to contradictory results about relative effectiveness. For general evaluations, a range of performance metrics should be reported, to ensure that a complete picture of system effectiveness is conveyed. Although a lot of complex geolocation algorithms have been applied in recent years, a majority class baseline is still competitive at coarse geographic granularity. A suite of statistical analysis tests is proposed, based on the employed metric, to ensure that the results are not coincidental
    • …
    corecore