1,540 research outputs found

    Towards Real-Time, Country-Level Location Classification of Worldwide Tweets

    In contrast to much previous work that has focused on location classification of tweets restricted to a specific country, here we undertake the task in a broader context by classifying global tweets at the country level, which is so far unexplored in a real-time scenario. We analyse the extent to which a tweet's country of origin can be determined by making use of eight tweet-inherent features for classification. Furthermore, we use two datasets, collected a year apart from each other, to analyse the extent to which a model trained from historical tweets can still be leveraged for classification of new tweets. With classification experiments on all 217 countries in our datasets, as well as on the top 25 countries, we offer some insights into the best use of tweet-inherent features for an accurate country-level classification of tweets. We find that the use of a single feature, such as the use of tweet content alone (the most widely used feature in previous work), leaves much to be desired. Choosing an appropriate combination of both tweet content and metadata can actually lead to substantial improvements of between 20% and 50%. We observe that tweet content, the user's self-reported location and the user's real name, all of which are inherent in a tweet and available in a real-time scenario, are particularly useful to determine the country of origin. We also experiment on the applicability of a model trained on historical tweets to classify new tweets, finding that the choice of a particular combination of features whose utility does not fade over time can actually lead to comparable performance, avoiding the need to retrain. However, the difficulty of achieving accurate classification increases slightly for countries with multiple commonalities, especially for English- and Spanish-speaking countries. Comment: Accepted for publication in IEEE Transactions on Knowledge and Data Engineering (IEEE TKDE).

    A method for estimating individual socioeconomic status of Twitter users

    The rise of social media has opened countless opportunities to explore social science questions with new data and methods. However, research on socioeconomic inequality remains constrained by limited individual-level socioeconomic status (SES) measures in digital trace data. Following Bourdieu, we argue that the commercial and entertainment accounts Twitter users follow reflect their economic and cultural capital. Adapting a political science method for inferring political ideology, we use correspondence analysis to estimate the SES of 3,482,652 Twitter users who follow the accounts of 339 brands in the United States. We validate our estimates with data from the Facebook Marketing application programming interface, self-reported job titles on users’ Twitter profiles, and a small survey sample. The results show reasonable correlations with the standard proxies for SES, alongside much weaker or nonsignificant correlations with other demographic variables. The proposed method opens new opportunities for innovative social research on inequality on Twitter and similar online platforms

    Predicting Twitter user socioeconomic attributes with network and language information

    Inferring socioeconomic attributes of social media users such as occupation and income is an important problem in computational social science. Automated inference of such characteristics has applications in personalised recommender systems, targeted computational advertising and online political campaigning. While previous work has shown that language features can reliably predict socioeconomic attributes on Twitter, employing information coming from users' social networks has not yet been explored for such complex user characteristics. In this paper, we describe a method for predicting the occupational class and the income of Twitter users given information extracted from their extended networks by learning a low-dimensional vector representation of users, i.e. graph embeddings. We use this representation to train predictive models for occupational class and income. Results on two publicly available datasets show that our method consistently outperforms the state-of-the-art methods in both tasks. We also obtain further significant improvements when we combine graph embeddings with textual features, demonstrating that social network and language information are complementary

    Linking Local Weather To Climate Change: One Year Of Twitter In The US

    There is a high level of scientific consensus on climate change. Nevertheless, for climate change research to have any practical value and to develop public support for climate policies, the research results must find their way to the general public. That is why it is important to understand how the public perception of climate change forms. Over recent decades there have been a number of studies on the factors affecting the level of public concern about climate change. Two major groups of factors are hypothesized to have the biggest influence on the level of public concern about climate change: extreme weather events and mass media coverage of the topic. Local studies confirm that the weather events experienced by people in certain locations might be related to climate change. In 1998 James Hansen hypothesized that variations in two weather parameters, namely temperature and precipitation, that exceed one standard deviation should be noticeable to people and result in an increase in the level of public concern about the phenomenon. Nevertheless, no previous studies were able to test this hypothesis and demonstrate that people truly use information about local weather to make assumptions about climate change. Other studies on public perception of climate change are generally based on agenda-setting theory, which states that the level of public concern about an issue reflects the extent and prominence of media coverage of the topic. Previous studies on how public perception of climate change forms are mainly based on surveys, which is an active approach to collecting social data. With the development of social media, however, passive surveying of public perceptions of climate change has become possible. In this thesis the change in climate change microblogging intensity on Twitter was used as a proxy for change in the level of concern about the issue.
The objectives of the study were to use Twitter, currently the most popular microblogging platform, as a source of public salience data to test whether changes in weather parameters and in media coverage result in changes in the level of public concern about climate change. For this purpose, multiple linear regression and multi-model inference statistical techniques were used on three geographical levels of data aggregation. The results clearly show that changes in weather parameters have a significant effect on the level of public concern about climate change on the national, regional and local scales. Mass media coverage of the topic was also positively associated with the level of public concern on the national level. The study demonstrated that social media data provide unprecedented opportunities for public opinion research.
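
The one-standard-deviation threshold mentioned in the abstract above can be illustrated with a minimal sketch. This is not the thesis's code: the function name `flag_anomalies`, the `threshold` parameter, and the sample temperatures are hypothetical, and the sketch assumes deviations are measured from the mean of the same series using the sample standard deviation.

```python
from statistics import mean, stdev


def flag_anomalies(values, threshold=1.0):
    """Return indices of observations whose absolute deviation from the
    series mean exceeds `threshold` sample standard deviations.

    Hypothetical helper for illustration only; the baseline (mean of the
    same series) is an assumption, not taken from the abstract.
    """
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation, needs >= 2 values
    if sigma == 0:
        # A constant series has no variation, so nothing can be flagged.
        return []
    return [i for i, v in enumerate(values) if abs(v - mu) > threshold * sigma]


# Hypothetical weekly mean temperatures in degrees Celsius.
temps = [10.0, 10.0, 10.0, 10.0, 20.0]
anomalous_weeks = flag_anomalies(temps)
```

Periods flagged this way could then serve as candidate predictors in a regression on tweet volume, which is the kind of analysis the abstract describes.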

    Computational socioeconomics

    Uncovering the structure of socioeconomic systems and timely estimation of socioeconomic status are significant for economic development. The understanding of socioeconomic processes provides foundations to quantify global economic development, to map regional industrial structure, and to infer individual socioeconomic status. In this review, we present a brief manifesto for a new interdisciplinary research field named Computational Socioeconomics, followed by a detailed introduction to data resources, computational tools, data-driven methods, theoretical models and novel applications at multiple resolutions, including the quantification of global economic inequality and complexity, the mapping of regional industrial structure and urban perception, the estimation of individual socioeconomic status and demographics, and the real-time monitoring of emergent events. This review, together with the pioneering works we have highlighted, will draw increasing interdisciplinary attention and induce a methodological shift in future socioeconomic studies.

    IDENTIFYING A CUSTOMER CENTERED APPROACH FOR URBAN PLANNING: DEFINING A FRAMEWORK AND EVALUATING POTENTIAL IN A LIVABILITY CONTEXT

    In transportation planning, public engagement is an essential requirement for informed decision-making. This is especially true for assessing abstract concepts such as livability, where it is challenging to define objective measures and to obtain input that can be used to gauge performance of communities. This dissertation focuses on advancing a data-driven decision-making approach for the transportation planning domain in the context of livability. First, a conceptual model for a customer-centric framework for transportation planning is designed integrating insight from multiple disciplines (chapter 1), then a data-mining approach to extracting features important for defining customer satisfaction in a livability context is described (chapter 2), and finally an appraisal of the potential of social media review mining for enhancing understanding of livability measures and increasing engagement in the planning process is undertaken (chapter 3). The results of this work also include a sentiment analysis and visualization package for interpreting an automated user-defined translation of qualitative measures of livability. The package evaluates users' satisfaction with neighborhoods through social media and enhances the traditional approaches to defining livability planning measures. This approach has the potential to capitalize on residents' interests in social media outlets and to increase public engagement in the planning process by encouraging users to participate in online neighborhood satisfaction reporting. The results inform future work for deploying a comprehensive approach to planning that draws the marketing structure of transportation network products with residential nodes as the center of the structure.

    Essays in empirical economics

    While poverty rates have declined in recent decades, many people are still trapped in poverty with limited opportunities for better living conditions. Moreover, inequality remains high around the globe. Understanding and addressing poverty and inequality is a complex task because it is multidimensional and involves multiple actors. My dissertation contributes to the literature on poverty reduction and inequality by taking an in-depth look at the three channels of Attanasio and Székely's (1999) asset-based framework and relating them to the three actors identified by McKague, Wheeler, and Karnani (2015). It is my hope that my work will shed light on how to address some of the multidimensional aspects of inequality. In Chapter 1, I explore the human capital dimension of poverty and inequality and the potential role governments can play in addressing inequality. Next, in Chapter 2, my thesis ties into the social capital channel of the asset-based framework and analyzes the influence of civil societies. Finally, Chapter 3 speaks to the physical capital channel of the asset-based model and to the potential responsibility of the private sector in addressing poverty and inequality

    Policy and Place: A Spatial Data Science Framework for Research and Decision-Making

    A major challenge in health-related policy and program evaluation research is attributing underlying causal relationships where complicated processes may exist in natural or quasi-experimental settings. Spatial interaction and heterogeneity between units at individual or group levels can violate both components of the Stable-Unit-Treatment-Value-Assumption (SUTVA) that are core to the counterfactual framework, making treatment effects difficult to assess. New approaches are needed in health studies to develop spatially dynamic causal modeling methods to both derive insights from data that are sensitive to spatial differences and dependencies, and also be able to rely on a more robust, dynamic technical infrastructure needed for decision-making. To address this gap with a focus on causal applications theoretically, methodologically and technologically, I (1) develop a theoretical spatial framework (within single-level panel econometric methodology) that extends existing theories and methods of causal inference, which tend to ignore spatial dynamics; (2) demonstrate how this spatial framework can be applied in empirical research; and (3) implement a new spatial infrastructure framework that integrates and manages the required data for health systems evaluation. The new spatially explicit counterfactual framework considers how spatial effects impact treatment choice, treatment variation, and treatment effects. To illustrate this new methodological framework, I first replicate a classic quasi-experimental study that evaluates the effect of drinking age policy on mortality in the United States from 1970 to 1984, and further extend it with a spatial perspective. In another example, I evaluate food access dynamics in Chicago from 2007 to 2014 by implementing advanced spatial analytics that better account for the complex patterns of food access, and quasi-experimental research design to distill the impact of the Great Recession on the foodscape. 
Inference interpretation is sensitive to both research design framing and underlying processes that drive geographically distributed relationships. Finally, I advance a new Spatial Data Science Infrastructure to integrate and manage data in dynamic, open environments for public health systems research and decision-making. I demonstrate an infrastructure prototype in a final case study, developed in collaboration with health department officials and community organizations. Dissertation/Thesis, Doctoral Dissertation, Geography 201

    BALANCING THE ASSUMPTIONS OF CAUSAL INFERENCE AND NATURAL LANGUAGE PROCESSING

    Drawing conclusions about real-world relationships of cause and effect from data collected without randomization requires making assumptions about the true processes that generate the data we observe. Causal inference typically considers low-dimensional data such as categorical or numerical fields in structured medical records. Yet a restriction to such data excludes natural language texts -- including social media posts or clinical free-text notes -- that can provide a powerful perspective into many aspects of our lives. This thesis explores whether the simplifying assumptions we make in order to model human language and behavior can support the causal conclusions that are necessary to inform decisions in healthcare or public policy. An analysis of millions of documents must rely on automated methods from machine learning and natural language processing, yet trust is essential in many clinical or policy applications. We need to develop causal methods that can reflect the uncertainty of imperfect predictive models to inform robust decision-making. We explore several areas of research in pursuit of these goals. We propose a measurement error approach for incorporating text classifiers into causal analyses and demonstrate the assumption on which it relies. We introduce a framework for generating synthetic text datasets on which causal inference methods can be evaluated, and use it to demonstrate that many existing approaches make assumptions that are likely violated. We then propose a proxy model methodology that provides explanations for uninterpretable black-box models, and close by incorporating it into our measurement error approach to explore the assumptions necessary for an analysis of gender and toxicity on Twitter