
    Correcting Sociodemographic Selection Biases for Population Prediction from Social Media

    Social media is increasingly used for large-scale population predictions, such as estimating community health statistics. However, social media users are not typically a representative sample of the intended population (a "selection bias"). Within the social sciences, such a bias is typically addressed with restratification techniques, where observations are reweighted according to how under- or over-sampled their socio-demographic groups are. Yet, restratification is rarely evaluated for improving prediction. Across four tasks of predicting U.S. county population health statistics from Twitter, we find standard restratification techniques provide no improvement and often degrade prediction accuracies. The core reasons for this seem to be both shrunken estimates (reduced variance of model predicted values) and sparse estimates of each population's socio-demographics. We thus develop and evaluate three methods to address these problems: estimator redistribution to account for shrinking, and adaptive binning and informed smoothing to handle sparse socio-demographic estimates. We show that each of these methods significantly outperforms the standard restratification approaches. Combining approaches, we find substantial improvements over non-restratified models, yielding a 53.0% increase in predictive accuracy (R^2) in the case of surveyed life satisfaction, and a 17.8% average increase across all tasks.
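
    The standard restratification idea described in this abstract, reweighting observations by how under- or over-sampled their group is, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the group labels, the scores, and the population shares below are all hypothetical, and the sketch shows only the baseline reweighting, not the paper's estimator redistribution, adaptive binning, or informed smoothing.

```python
from collections import Counter


def poststratification_weights(sample_groups, population_shares):
    """Weight each observation by (population share / sample share) of its group.

    Groups that are under-sampled relative to the population get weights > 1,
    over-sampled groups get weights < 1.
    """
    counts = Counter(sample_groups)
    n = len(sample_groups)
    return [population_shares[g] / (counts[g] / n) for g in sample_groups]


def weighted_mean(values, weights):
    """Weighted average of values."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Hypothetical toy data: one score per user, plus that user's age group.
groups = ["young", "young", "young", "old"]
scores = [1.0, 1.0, 1.0, 5.0]

# Hypothetical known population shares (e.g. from a census).
population_shares = {"young": 0.5, "old": 0.5}

weights = poststratification_weights(groups, population_shares)
naive = sum(scores) / len(scores)              # ignores the selection bias
restratified = weighted_mean(scores, weights)  # corrects for over-sampled "young"
```

    In this toy sample the "young" group makes up 75% of users but only 50% of the assumed population, so the restratified mean moves toward the under-sampled "old" group's score. With very few users in a group, the sample share becomes unstable, which is the sparsity problem the abstract points to.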

    Do children's expectations about future physical activity predict their physical activity in adulthood?

    BACKGROUND: Much of the population fails to meet recommended physical activity (PA) levels, but there remains considerable individual variation. By understanding drivers of different trajectories, interventions can be better targeted and more effective. One such driver may be a person's physical activity identity (PAI): the extent to which a person perceives PA as central to who they are. METHODS: Using survey information and a unique body of essays written at age 11 from the National Child Development Study (N = 10 500), essays mentioning PA were automatically identified using the machine learning technique support vector classification and PA trajectories were estimated using latent class analysis. Analyses tested the extent to which childhood PAI correlated with activity levels from age 23 through 55 and with trajectories across adulthood. RESULTS: 42.2% of males and 33.5% of females mentioned PA in their essays, describing active and/or passive engagement. Active PAI in childhood was correlated with higher levels of activity for men but not women, and was correlated with consistently active PA trajectories for both genders. Passive PAI was not related to PA for either gender. CONCLUSIONS: This study offers a novel approach for analysing large qualitative datasets to assess identity and behaviours. Findings suggest that as young as 11 years old, the way a young person conceptualizes activity as part of their identity has a lasting association with behaviour. Still, an active identity may require a supportive sociocultural context to manifest in subsequent behaviour.

    Demographic Inference and Representative Population Estimates from Multilingual Social Media Data

    Social media provide access to behavioural data at an unprecedented scale and granularity. However, using these data to understand phenomena in a broader population is difficult due to their non-representativeness and the bias of statistical inference tools towards dominant languages and groups. While demographic attribute inference could be used to mitigate such bias, current techniques are almost entirely monolingual and fail to work in a global environment. We address these challenges by combining multilingual demographic inference with post-stratification to create a more representative population sample. To learn demographic attributes, we create a new multimodal deep neural architecture for joint classification of age, gender, and organization-status of social media users that operates in 32 languages. This method substantially outperforms current state of the art while also reducing algorithmic bias. To correct for sampling biases, we propose fully interpretable multilevel regression methods that estimate inclusion probabilities from inferred joint population counts and ground-truth population counts. In a large experiment over multilingual heterogeneous European regions, we show that our demographic inference and bias correction together allow for more accurate estimates of populations and make a significant step towards representative social sensing in downstream applications with multilingual social media. Comment: 12 pages, 10 figures, Proceedings of the 2019 World Wide Web Conference (WWW '19).

    Intersectional Identities and Machine Learning: Illuminating Language Biases in Twitter Algorithms

    Intersectional analysis of social media data is rare. Social media data is ripe for identity and intersectionality analysis, with wide accessibility and easy-to-parse text data, yet it poses a host of its own methodological challenges regarding the identification of identities. We aggregate Twitter data that was annotated by crowdsourcing for tags of “abusive,” “hateful,” or “spam” language. Using natural language prediction models, we predict the tweeter’s race and gender and investigate whether these tags for abuse, hate, and spam have a meaningful relationship with the gendered and racialized language predictions. Are certain gender and race groups more likely to be predicted if a tweet is labeled as abusive, hateful, or spam? The findings suggest that certain racial and intersectional groups are more likely to be associated with non-normal language identification. Language consistent with white identity is most likely to be considered within the norm, and non-white racial groups are more often linked to hateful, abusive, or spam language.