301 research outputs found
Correcting Sociodemographic Selection Biases for Population Prediction from Social Media
Social media is increasingly used for large-scale population predictions,
such as estimating community health statistics. However, social media users are
not typically a representative sample of the intended population -- a
"selection bias". Within the social sciences, such a bias is typically
addressed with restratification techniques, where observations are reweighted
according to how under- or over-sampled their socio-demographic groups are.
Yet, restratifaction is rarely evaluated for improving prediction. Across four
tasks of predicting U.S. county population health statistics from Twitter, we
find standard restratification techniques provide no improvement and often
degrade prediction accuracies. The core reasons for this seems to be both
shrunken estimates (reduced variance of model predicted values) and sparse
estimates of each population's socio-demographics. We thus develop and evaluate
three methods to address these problems: estimator redistribution to account
for shrinking, and adaptive binning and informed smoothing to handle sparse
socio-demographic estimates. We show that each of these methods significantly
outperforms the standard restratification approaches. Combining approaches, we
find substantial improvements over non-restratified models, yielding a 53.0%
increase in predictive accuracy (R^2) in the case of surveyed life
satisfaction, and a 17.8% average increase across all tasks
Do children's expectations about future physical activity predict their physical activity in adulthood?
BACKGROUND: Much of the population fails to meet recommended physical activity (PA) levels, but there remains considerable individual variation. By understanding drivers of different trajectories, interventions can be better targeted and more effective. One such driver may be a person's physical activity identity (PAI)-the extent to which a person perceives PA as central to who they are. METHODS: Using survey information and a unique body of essays written at age 11 from the National Child Development Study (Nâ=â10Â 500), essays mentioning PA were automatically identified using the machine learning technique support vector classification and PA trajectories were estimated using latent class analysis. Analyses tested the extent to which childhood PAI correlated with activity levels from age 23 through 55 and with trajectories across adulthood. RESULTS: 42.2% of males and 33.5% of females mentioned PA in their essays, describing active and/or passive engagement. Active PAI in childhood was correlated with higher levels of activity for men but not women, and was correlated with consistently active PA trajectories for both genders. Passive PAI was not related to PA for either gender. CONCLUSIONS: This study offers a novel approach for analysing large qualitative datasets to assess identity and behaviours. Findings suggest that at as young as 11âyears old, the way a young person conceptualizes activity as part of their identity has a lasting association with behaviour. Still, an active identity may require a supportive sociocultural context to manifest in subsequent behaviour
Demographic Inference and Representative Population Estimates from Multilingual Social Media Data
Social media provide access to behavioural data at an unprecedented scale and
granularity. However, using these data to understand phenomena in a broader
population is difficult due to their non-representativeness and the bias of
statistical inference tools towards dominant languages and groups. While
demographic attribute inference could be used to mitigate such bias, current
techniques are almost entirely monolingual and fail to work in a global
environment. We address these challenges by combining multilingual demographic
inference with post-stratification to create a more representative population
sample. To learn demographic attributes, we create a new multimodal deep neural
architecture for joint classification of age, gender, and organization-status
of social media users that operates in 32 languages. This method substantially
outperforms current state of the art while also reducing algorithmic bias. To
correct for sampling biases, we propose fully interpretable multilevel
regression methods that estimate inclusion probabilities from inferred joint
population counts and ground-truth population counts. In a large experiment
over multilingual heterogeneous European regions, we show that our demographic
inference and bias correction together allow for more accurate estimates of
populations and make a significant step towards representative social sensing
in downstream applications with multilingual social media.Comment: 12 pages, 10 figures, Proceedings of the 2019 World Wide Web
Conference (WWW '19
Intersectional Identities and Machine Learning: Illuminating Language Biases in Twitter Algorithms
Intersectional analysis of social media data is rare. Social media data is ripe for identity and intersectionality analysis with wide accessibility and easy to parse text data yet provides a host of its own methodological challenges regarding the identification of identities. We aggregate Twitter data that was annotated by crowdsourcing for tags of âabusive,â âhateful,â or âspamâ language. Using natural language prediction models, we predict the tweeterâs race and gender and investigate whether these tags for abuse, hate, and spam have a meaningful relationship with the gendered and racialized language predictions. Are certain gender and race groups more likely to be predicted if a tweet is labeled as abusive, hateful, or spam? The findings suggest that certain racial and intersectional groups are more likely to be associated with non-normal language identification. Language consistent with white identity is most likely to be considered within the norm and non-white racial groups are more often linked to hateful, abusive, or spam language
- âŠ