301 research outputs found

Correcting Sociodemographic Selection Biases for Population Prediction from Social Media

Author: Ahmed Farhan
Giorgi Salvatore
Gupta Keshav
Lynn Veronica
Matz Sandra
Schwartz H. Andrew
Ungar Lyle
Publication date: 23/07/2021

Social media is increasingly used for large-scale population predictions, such as estimating community health statistics. However, social media users are not typically a representative sample of the intended population, a "selection bias". Within the social sciences, such a bias is typically addressed with restratification techniques, where observations are reweighted according to how under- or over-sampled their socio-demographic groups are. Yet restratification is rarely evaluated for improving prediction. Across four tasks of predicting U.S. county population health statistics from Twitter, we find standard restratification techniques provide no improvement and often degrade prediction accuracies. The core reasons for this seem to be both shrunken estimates (reduced variance of model-predicted values) and sparse estimates of each population's socio-demographics. We thus develop and evaluate three methods to address these problems: estimator redistribution to account for shrinking, and adaptive binning and informed smoothing to handle sparse socio-demographic estimates. We show that each of these methods significantly outperforms the standard restratification approaches. Combining approaches, we find substantial improvements over non-restratified models, yielding a 53.0% increase in predictive accuracy (R^2) in the case of surveyed life satisfaction, and a 17.8% average increase across all tasks.
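The reweighting idea described in this abstract can be illustrated with a minimal sketch. It shows only the standard restratification step (weight = population share / sample share), not the paper's estimator redistribution, adaptive binning, or informed smoothing. The function name `restratify`, the age groups, and all numbers are hypothetical and chosen for illustration.

```python
from collections import Counter


def restratify(users, population_shares):
    """Weighted mean of user scores under simple restratification.

    users: list of (group, score) pairs from the sample (hypothetical data).
    population_shares: dict mapping group -> share of the target population.
    Each user is weighted by population share / sample share of their group.
    """
    counts = Counter(group for group, _ in users)
    n = len(users)
    # Over-sampled groups get weights below 1, under-sampled groups above 1.
    weights = {g: population_shares[g] / (counts[g] / n) for g in counts}
    total_weight = sum(weights[g] for g, _ in users)
    return sum(weights[g] * score for g, score in users) / total_weight


# Illustrative sample: young users are over-represented (3 of 4 users),
# while the assumed target population is split evenly between the groups.
users = [("18-29", 0.8), ("18-29", 0.6), ("18-29", 0.7), ("65+", 0.2)]
shares = {"18-29": 0.5, "65+": 0.5}

unweighted = sum(score for _, score in users) / len(users)
estimate = restratify(users, shares)
```

With these made-up inputs the reweighted estimate moves toward the under-sampled group's mean. With only one user in the "65+" group, that single score carries half the total weight, which is the kind of sparsity problem the abstract mentions.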

arXiv.org e-Print Archive

PubMed Central

Association for the Advancement of Artificial Intelligence: AAAI Publications

Do children's expectations about future physical activity predict their physical activity in adulthood?

Author: Carpentieri JD
Goodman A
Gupta N
Kern ML
Pongiglione B
Schwartz HA
Publication date: 01/01/2020

BACKGROUND: Much of the population fails to meet recommended physical activity (PA) levels, but there remains considerable individual variation. By understanding drivers of different trajectories, interventions can be better targeted and more effective. One such driver may be a person's physical activity identity (PAI), the extent to which a person perceives PA as central to who they are. METHODS: Using survey information and a unique body of essays written at age 11 from the National Child Development Study (N = 10 500), essays mentioning PA were automatically identified using the machine learning technique support vector classification, and PA trajectories were estimated using latent class analysis. Analyses tested the extent to which childhood PAI correlated with activity levels from age 23 through 55 and with trajectories across adulthood. RESULTS: 42.2% of males and 33.5% of females mentioned PA in their essays, describing active and/or passive engagement. Active PAI in childhood was correlated with higher levels of activity for men but not women, and was correlated with consistently active PA trajectories for both genders. Passive PAI was not related to PA for either gender. CONCLUSIONS: This study offers a novel approach for analysing large qualitative datasets to assess identity and behaviours. Findings suggest that as early as 11 years old, the way a young person conceptualizes activity as part of their identity has a lasting association with behaviour. Still, an active identity may require a supportive sociocultural context to manifest in subsequent behaviour.

Archivio istituzionale della Ricerca - Bocconi

UCL Discovery

Demographic Inference and Representative Population Estimates from Multilingual Social Media Data

Author: Alzahrani Sultan
Bergsma Shane
Bethlehem Jelke G
Buolamwini Joy
Chen Xin
Ciot Morgane
Compton Ryan
Goot Rob
Goswami Sumit
Hecht Brent
Huang Gao
Jung Soon-Gyo
Kim Yoon
McCorriston James
Mislove Alan
Nguyen Dong
Rosenthal Sara
Sap Maarten
Schler Jonathan
Zamal Faiyaz Al
Zhang Jinxue
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2019

Social media provide access to behavioural data at an unprecedented scale and granularity. However, using these data to understand phenomena in a broader population is difficult due to their non-representativeness and the bias of statistical inference tools towards dominant languages and groups. While demographic attribute inference could be used to mitigate such bias, current techniques are almost entirely monolingual and fail to work in a global environment. We address these challenges by combining multilingual demographic inference with post-stratification to create a more representative population sample. To learn demographic attributes, we create a new multimodal deep neural architecture for joint classification of age, gender, and organization-status of social media users that operates in 32 languages. This method substantially outperforms the current state of the art while also reducing algorithmic bias. To correct for sampling biases, we propose fully interpretable multilevel regression methods that estimate inclusion probabilities from inferred joint population counts and ground-truth population counts. In a large experiment over multilingual heterogeneous European regions, we show that our demographic inference and bias correction together allow for more accurate estimates of populations and make a significant step towards representative social sensing in downstream applications with multilingual social media. Comment: 12 pages, 10 figures, Proceedings of the 2019 World Wide Web Conference (WWW '19)
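The notion of an inclusion probability in this abstract can be made concrete with a small sketch. It uses a plain per-cell ratio (inferred users / census population) in place of the multilevel regression the abstract describes, so it illustrates the idea rather than the authors' method. The function names, demographic cells, and counts are all hypothetical.

```python
def inclusion_probabilities(inferred_counts, census_counts):
    """Per-cell ratio estimate of the chance that a person is in the sample.

    inferred_counts: dict cell -> number of users inferred to be in that cell.
    census_counts: dict cell -> ground-truth population of that cell.
    Cells without a positive census count are skipped.
    """
    return {
        cell: inferred_counts[cell] / census_counts[cell]
        for cell in inferred_counts
        if census_counts.get(cell, 0) > 0
    }


def estimate_population(observed_counts, probabilities):
    """Scale observed users up by the inverse inclusion probability per cell."""
    return sum(
        count / probabilities[cell] for cell, count in observed_counts.items()
    )


# Hypothetical calibration region with known census counts.
inferred = {("female", "<30"): 200, ("male", "<30"): 100}
census = {("female", "<30"): 1000, ("male", "<30"): 2000}
probs = inclusion_probabilities(inferred, census)

# Hypothetical second region where only the inferred user counts are known.
observed = {("female", "<30"): 50, ("male", "<30"): 30}
population_estimate = estimate_population(observed, probs)
```

A ratio per cell becomes unstable when cells are small, which is one reason a regression model that pools information across cells may be preferred.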

arXiv.org e-Print Archive

Crossref

Oxford University Research Archive

Universaar

Acronym

Intersectional Identities and Machine Learning: Illuminating Language Biases in Twitter Algorithms

Author: Fitzsimons Aidan
Publication venue: 'HICSS Conference Office'
Publication date: 03/01/2022

Intersectional analysis of social media data is rare. Social media data is ripe for identity and intersectionality analysis, offering wide accessibility and easy-to-parse text, yet it poses its own methodological challenges regarding the identification of identities. We aggregate Twitter data that was annotated by crowdsourcing for tags of “abusive,” “hateful,” or “spam” language. Using natural language prediction models, we predict the tweeter’s race and gender and investigate whether these tags for abuse, hate, and spam have a meaningful relationship with the gendered and racialized language predictions. Are certain gender and race groups more likely to be predicted if a tweet is labeled as abusive, hateful, or spam? The findings suggest that certain racial and intersectional groups are more likely to be associated with non-normal language identification. Language consistent with white identity is most likely to be considered within the norm, and non-white racial groups are more often linked to hateful, abusive, or spam language.

ScholarSpace at University of Hawai'i at Manoa

AIS Electronic Library (AISeL)

TA-COS 2018 : 2nd Workshop on Text Analytics for Cybersecurity and Online Safety : Proceedings

Author: De Pauw Guy
Desmet Bart
Lefever Els
Publication venue: European Language Resources Association (ELRA)
Publication date: 01/01/2018

Ghent University Academic Bibliography