
    Crowdsourcing Dialect Characterization through Twitter

    We perform a large-scale analysis of diatopic language variation using geotagged microblogging datasets. By collecting all Twitter messages written in Spanish over more than two years, we build a corpus from which a carefully selected list of concepts allows us to characterize Spanish varieties on a global scale. A cluster analysis proves the existence of well-defined macroregions sharing common lexical properties. Remarkably, we find that the Spanish language is split into two superdialects: an urban speech used across major American and Spanish cities, and a diverse form that encompasses rural areas and small towns. The latter can be further clustered into smaller varieties with a stronger regional character. Comment: 10 pages, 5 figures

    Exploiting Text and Network Context for Geolocation of Social Media Users

    Research on automatically geolocating social media users has conventionally been based on the text content of posts from a given user or the social network of the user, with very little crossover between the two, and no benchmarking of the two approaches over comparable datasets. We bring the two threads of research together in first proposing a text-based method based on adaptive grids, followed by a hybrid network- and text-based method. Evaluating over three Twitter datasets, we show that the empirical difference between text- and network-based methods is not great, and that hybridisation of the two is superior to the component methods, especially in contexts where the user graph is not well connected. We achieve state-of-the-art results on all three datasets.

    Jumping Finite Automata for Tweet Comprehension

    Every day, over one billion social media text messages are generated worldwide, providing abundant information that can improve people's lives through evidence-based decision making. Twitter is rich in such data, but comprehending tweets poses a number of technical challenges, including the ambiguity of the language used in tweets, which is exacerbated in under-resourced languages. This paper presents an approach based on Jumping Finite Automata for automatic comprehension of tweets. We construct a WordNet for the language of Kenya (WoLK) based on an analysis of tweet structure, formalize the space of tweet variation, and abstract that space as a finite automaton. In addition, we present a software tool called the Automata-Aided Tweet Comprehension (ATC) tool, which takes raw tweets as input, preprocesses them, recognises their syntax, and extracts semantic information with an 86% success rate.

    Geotagging One Hundred Million Twitter Accounts with Total Variation Minimization

    Geographically annotated social media is extremely valuable for modern information retrieval. However, researchers who can only access publicly visible data quickly find that social media users rarely publish location information. In this work, we provide a method which can geolocate the overwhelming majority of active Twitter users, independent of their location-sharing preferences, using only publicly visible Twitter data. Our method infers an unknown user's location by examining their friends' locations. We frame the geotagging problem as an optimization over a social network with a total variation-based objective and provide a scalable and distributed algorithm for its solution. Furthermore, we show how a robust estimate of the geographic dispersion of each user's ego network can be used as a per-user accuracy measure which is effective at removing outlying errors. Leave-many-out evaluation shows that our method is able to infer location for 101,846,236 Twitter users at a median error of 6.38 km, allowing us to geotag over 80% of public tweets. Comment: 9 pages, 8 figures, accepted to IEEE BigData 2014. Compton, Ryan, David Jurgens, and David Allen. "Geotagging one hundred million twitter accounts with total variation minimization." Big Data (Big Data), 2014 IEEE International Conference on. IEEE, 201
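
The idea of inferring a user's location from friends' locations, and of using the spread of those locations as a per-user confidence measure, can be illustrated with a small sketch. This is not the authors' total variation algorithm: it assumes a simple medoid (the friend location with the smallest total distance to the others) as the estimate and the median distance to friends as the dispersion. The function names `haversine_km`, `estimate_location`, and `dispersion_km` are hypothetical.

```python
import math
import statistics

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    dlat, dlon = lat2 - lat1, lon2 - lon1
    h = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))

def estimate_location(friend_locations):
    """Assumed stand-in estimator: the friend location minimising the summed
    distance to all friend locations (a medoid). Returns None if no friends."""
    if not friend_locations:
        return None
    return min(
        friend_locations,
        key=lambda c: sum(haversine_km(c, p) for p in friend_locations),
    )

def dispersion_km(estimate, friend_locations):
    """Median distance from the estimate to the friends; a large value
    suggests the estimate is unreliable and could be filtered out."""
    return statistics.median(haversine_km(estimate, p) for p in friend_locations)

# Two friends in Los Angeles and one in New York (illustrative coordinates).
friends = [(34.05, -118.24), (34.10, -118.30), (40.71, -74.01)]
guess = estimate_location(friends)
spread = dispersion_km(guess, friends)
```

With these illustrative coordinates the estimate falls on one of the two Los Angeles points, and the single distant friend barely affects the median-based dispersion, which is the sense in which such a measure is robust to outliers.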