Correcting Sociodemographic Selection Biases for Population Prediction from Social Media

Author: Ahmed Farhan
Giorgi Salvatore
Gupta Keshav
Lynn Veronica
Matz Sandra
Schwartz H. Andrew
Ungar Lyle
Publication date: 23/07/2021

Social media is increasingly used for large-scale population predictions, such as estimating community health statistics. However, social media users are not typically a representative sample of the intended population, a "selection bias". Within the social sciences, such a bias is typically addressed with restratification techniques, where observations are reweighted according to how under- or over-sampled their socio-demographic groups are. Yet, restratification is rarely evaluated for improving prediction. Across four tasks of predicting U.S. county population health statistics from Twitter, we find standard restratification techniques provide no improvement and often degrade prediction accuracies. The core reasons for this seem to be both shrunken estimates (reduced variance of model predicted values) and sparse estimates of each population's socio-demographics. We thus develop and evaluate three methods to address these problems: estimator redistribution to account for shrinking, and adaptive binning and informed smoothing to handle sparse socio-demographic estimates. We show that each of these methods significantly outperforms the standard restratification approaches. Combining approaches, we find substantial improvements over non-restratified models, yielding a 53.0% increase in predictive accuracy (R^2) in the case of surveyed life satisfaction, and a 17.8% average increase across all tasks.
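The reweighting idea mentioned in the abstract above (observations weighted by how under- or over-sampled their group is) can be sketched as follows. This is a minimal illustration of plain post-stratification only, not the authors' estimator redistribution, adaptive binning, or informed smoothing methods. The function names, group labels, scores, and population shares are all hypothetical.

```python
from collections import Counter

def poststratification_weights(sample_groups, population_shares):
    """Return one weight per observation: population share / sample share of its group.

    Hypothetical helper for illustration; groups that are under-sampled
    relative to the population get weights above 1.
    """
    n = len(sample_groups)
    counts = Counter(sample_groups)
    sample_shares = {group: count / n for group, count in counts.items()}
    return [population_shares[g] / sample_shares[g] for g in sample_groups]

def weighted_mean(values, weights):
    """Weighted average of values."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Toy data (assumed): three "young" users and one "old" user,
# while the target population is split evenly between the groups.
groups = ["young", "young", "young", "old"]
scores = [6.0, 6.0, 6.0, 8.0]
population = {"young": 0.5, "old": 0.5}

weights = poststratification_weights(groups, population)
unweighted = sum(scores) / len(scores)
estimate = weighted_mean(scores, weights)
print(unweighted, estimate)
```

In this toy case the single "old" observation is up-weighted, so the weighted estimate moves from the raw sample mean toward the value an evenly split population would give. With sparse groups such weights can become very large, which is one of the problems the abstract attributes to standard restratification.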

arXiv.org e-Print Archive

PubMed Central

Association for the Advancement of Artificial Intelligence: AAAI Publications

A fairness assessment of mobility-based COVID-19 case prediction models

Author: Erfani Abdolmajid
Frias-Martinez Vanessa
Publication date: 07/10/2022

In light of the outbreak of COVID-19, analyzing and measuring human mobility has become increasingly important. A wide range of studies have explored spatiotemporal trends, examined associations with other variables, evaluated non-pharmacologic interventions (NPIs), and predicted or simulated COVID-19 spread using mobility data. Despite the benefits of publicly available mobility data, a key question remains unanswered: are models using mobility data performing equitably across demographic groups? We hypothesize that bias in the mobility data used to train the predictive models might lead to unfairly less accurate predictions for certain demographic groups. To test our hypothesis, we applied two mobility-based COVID infection prediction models at the county level in the United States using SafeGraph data, and correlated model performance with sociodemographic traits. Findings revealed that there is a systematic bias in model performance toward certain demographic characteristics. Specifically, the models tend to favor large, highly educated, wealthy, young, urban, and non-black-dominated counties. We hypothesize that the mobility data currently used by many predictive models tends to capture less information about older, poorer, non-white, and less educated regions, which in turn negatively impacts the accuracy of the COVID-19 prediction in these regions. Ultimately, this study points to the need for improved data collection and sampling approaches that allow for an accurate representation of the mobility patterns across demographic groups. Comment: 24 pages, 4 figures, 2 Table

arXiv.org e-Print Archive

Michigan Technological University

Aggregate administrative data to adjust selection bias in estimates from nonprobability samples

Author: Cabrera Alvarez Pablo
Publication date: 01/01/2021

Thesis by compendium of publications. In recent years, the concurrence of two phenomena has revitalised the methodological debate about inference from nonprobability samples. On the one hand, probability samples increasingly suffer from nonresponse and noncoverage errors, increasing survey costs and leading to biased estimates. On the other hand, the emergence and expansion of the Internet have led to an exponential growth in the use of web surveys with samples recruited using nonprobability methods. Inference from nonprobability samples requires an explicit or implicit model that explains the selection mechanism with respect to the target variable. This thesis explores an intersection between the need to reduce selection bias in the estimates from nonprobability samples and the opportunity to explain the selection mechanism emerging from newly available aggregate administrative data. To this end, this thesis encompasses three papers that present statistical simulations and two methodological applications using a set of face-to-face surveys and two web surveys conducted in Spain. The first paper uses statistical simulations to explore the conditions under which aggregated data, as contextual variables and population totals, can reduce or remove selection bias from the estimates. The second paper uses the pre- and post-election surveys of the Centro de Investigaciones Sociológicas (CIS), which combine probability selection methods with quotas, to explore adding sociodemographic and past-vote auxiliary variables to the weighting, as well as using multiple imputation to improve the quality of the estimates. The third paper uses two surveys from an experimental panel of internet users sponsored by the Asociación para la Investigación de los Medios de Comunicación (AIMC, the Association for Media Research) to test the effect of including aggregate administrative data at the municipality level to tackle selection bias and improve the quality of the survey estimates. The results show that aggregate administrative data is insufficient to correct selection bias in survey estimates, especially when used as contextual variables. The results also suggest that the aggregate nature of the data is the main impediment to controlling for selection bias in the estimates.

Gestion del Repositorio Documental de la Universidad de Salamanca

Predictive Biases in Natural Language Processing Models: A Conceptual Framework and Overview

Author: Hovy Dirk
Schwartz H. Andrew
Shah Deven
Publication venue: 'Association for Computational Linguistics (ACL)'
Publication date: 01/01/2020

An increasing number of works in natural language processing have addressed the effect of bias on the predicted outcomes, introducing mitigation techniques that act on different parts of the standard NLP pipeline (data and models). However, these works have been conducted in isolation, without a unifying framework to organize efforts within the field. This leads to repetitive approaches, and puts an undue focus on the effects of bias, rather than on their origins. Research focused on bias symptoms rather than the underlying origins could limit the development of effective countermeasures. In this paper, we propose a unifying conceptualization: the predictive bias framework for NLP. We summarize the NLP literature and propose a general mathematical definition of predictive bias in NLP along with a conceptual framework, differentiating four main origins of biases: label bias, selection bias, model overamplification, and semantic bias. We discuss how past work has countered each bias origin. Our framework serves to guide an introductory overview of predictive bias in NLP, integrating existing work into a single structure and opening avenues for future research. Comment: 9 pages excluding references, 1 figure, 3 pages for appendi

arXiv.org e-Print Archive

Crossref

Archivio istituzionale della Ricerca - Bocconi

Response styles and the quality of survey data: evidence from Guyana

Author: Thomas Troy
Publication venue: Ghent University. Faculty of Arts and Philosophy
Publication date: 01/01/2014

Based on eight chapters, half of which have been published as research papers, this thesis demonstrates the role of cultural factors such as urbanity in response styles in survey data. Although rating scales are a popular way of obtaining opinions in surveys, response styles are generally not controlled for in data analysis. This has important consequences for the accuracy of research results, as evidenced in the author's Guyanese dataset.

Ghent University Academic Bibliography

Leveraging genome-wide data to understand the risk and treatment of common mental disorders

Author: ter Kuile Abigail
Publication date: 01/06/2023

King's Research Portal

Measuring COVID-19 Vaccine Hesitancy: Consistency of Social Media with Surveys

Author: Borga Liyousew
Chen Ninghan
Chen Xihui
d'Ambrosio Conchita
Pang Jun
Vögele Claus
Publication date: 12/10/2022

Open Repository and Bibliography - Luxembourg

Clinical predictors of antipsychotic treatment resistance: Development and internal validation of a prognostic prediction model by the STRATA-G consortium

Author: Agbedjro Deborah
Ajnakina Olesya
Alameda Luis
Andreassen Ole A.
Barnes Thomas R. E.
Berardi Domenico
Camporesi Sara
Cleusix Martine
Conus Philippe
Crespo-Facorro Benedicto
D'Andrea Giuseppe
Demjaha Arsime
Di Forti Marta
Do Kim
Doody Gillian
Eap Chin B.
Ferchiou Aziz
Pardiñas Antonio F.
Smart Sophie E.
Vázquez Bourgon Javier
Publication venue: 'Elsevier BV'
Publication date: 01/01/2022

Introduction: Our aim was, firstly, to identify characteristics at first-episode of psychosis that are associated with later antipsychotic treatment resistance (TR) and, secondly, to develop a parsimonious prediction model for TR. Methods: We combined data from ten prospective, first-episode psychosis cohorts from across Europe and categorised patients as TR or non-treatment resistant (NTR) after a mean follow up of 4.18 years (s.d. = 3.20) for secondary data analysis. We identified a list of potential predictors from clinical and demographic data recorded at first-episode. These potential predictors were entered in two models: a multivariable logistic regression to identify which were independently associated with TR and a penalised logistic regression, which performed variable selection, to produce a parsimonious prediction model. This model was internally validated using a 5-fold, 50-repeat cross-validation optimism-correction. Results: Our sample consisted of N = 2216 participants of which 385 (17 %) developed TR. Younger age of psychosis onset and fewer years in education were independently associated with increased odds of developing TR. The prediction model selected 7 out of 17 variables that, when combined, could quantify the risk of being TR better than chance. These included age of onset, years in education, gender, BMI, relationship status, alcohol use, and positive symptoms. The optimism-corrected area under the curve was 0.59 (accuracy = 64 %, sensitivity = 48 %, and specificity = 76 %). Implications: Our findings show that treatment resistance can be predicted at first-episode of psychosis. 
Pending a model update and external validation, we demonstrate the potential value of prediction models for TR. Funding: This work was supported by a Stratified Medicine Programme grant to JHM from the Medical Research Council (grant number MR/L011794/1 which funded the research and supported S.E.S., D.A., A.F.P, L.K., R.M.M., D.S., J.T.R.W, & J.H.M.); funding from the National Institute for Health Research Biomedical Research Centre at South London and Maudsley National Health Service Foundation Trust and King's College London to D.A. and D.S; and funding from the Collaboration for Leadership in Applied Health Research and Care (CLAHRC) South London at King's College Hospital National Health Service Foundation Trust to S.E.S. The views expressed are those of the author(s) and not necessarily those of the Medical Research Council, National Health Service, the National Institute for Health Research, or the Department of Health. The AESOP (London, UK) cohort was funded by the UK Medical Research Council (Ref: G0500817). The Belfast (UK) cohort was funded by the Research and Development Office of Northern Ireland. The Bologna (Italy) cohort was funded by the European Community's Seventh Framework Program under grant agreement (agreement No. HEALTH-F2-2010–241909, Project EU-GEI). The GAP (London, UK) cohort was funded by the UK National Institute of Health Research (NIHR) Specialist Biomedical Research Centre for Mental Health, South London and Maudsley NHS Mental Health Foundation Trust (SLaM) and the Institute of Psychiatry, Psychology, and Neuroscience at King's College London; Psychiatry Research Trust; Maudsley Charity Research Fund; and the European Community's Seventh Framework Program grant (agreement No. HEALTH-F2-2009-241909, Project EU-GEI). The Lausanne (Switzerland) cohort was funded by the Swiss National Science Foundation (no. 320030_135736/1 to P.C. 
and K.Q.D., no 320030-120686, 324730-144064 and 320030-173211 to C.B.E and P.C., and no 171804 to LA); National Center of Competence in Research (NCCR) “SYNAPSY - The Synaptic Bases of Mental Diseases” from the Swiss National Science Foundation (no 51AU40_125759 to PC and KQD); and Fondation Alamaya (to KQD). The Oslo (Norway) cohort was funded by the Research Council of Norway (#223273/F50, under the Centers of Excellence funding scheme, #300309, #283798) and the South-Eastern Norway Regional Health Authority (#2006233, #2006258, #2011085, #2014102, #2015088 to IM, #2017-112). The Paris (France) cohort was funded by the European Community's Seventh Framework Program grant (agreement No. HEALTH-F2-2010–241909, Project EU-GEI). The Prague (Czech Republic) cohort was funded by the Ministry of Health of the Czech Republic (Grant Number: NU20-04-00393). The Santander (Spain) cohort was funded by the following grants (to B.C.F): Instituto de Salud Carlos III, FIS 00/3095, PI020499, PI050427, PI060507, Plan Nacional de Drogas Research Grant 2005-Orden sco/3246/2004, and SENY Fundatio Research Grant CI 2005-0308007, Fundacion Marques de Valdecilla A/02/07 and API07/011. SAF2016-76046-R and SAF2013-46292-R (MINECO and FEDER). The West London (UK) cohort was funded by The Wellcome Trust (Grant Number: 042025; 052247; 064607).

UCrea

Predictive Model of the Risk of In-Hospital Mortality in Colorectal Cancer Surgery, Based on the Minimum Basic Data Set

Author: García Torrecillas Juan Manuel
Sánchez María José
Publication venue: 'MDPI AG'
Publication date: 01/01/2020

Background: Various models have been proposed to predict mortality rates for hospital patients undergoing colorectal cancer surgery. However, none have been developed in Spain using clinical administrative databases and none are based exclusively on the variables available upon admission. The aim of our study is to detect factors associated with in-hospital mortality in patients undergoing surgery for colorectal cancer and, on this basis, to generate a predictive mortality score. Methods: A population cohort for analysis was obtained as all hospital admissions for colorectal cancer during the period 2008–2014, according to the Spanish Minimum Basic Data Set. The main measure was actual and expected mortality after the application of the considered mathematical model. A logistic regression model and a mortality score were created, and internal validation was performed. Results: 115,841 hospitalization episodes were studied. Of these, 80% were included in the training set. The variables associated with in-hospital mortality were age (OR: 1.06, 95% CI: 1.05–1.06), urgent admission (OR: 4.68, 95% CI: 4.36–5.02), pulmonary disease (OR: 1.43, 95% CI: 1.28–1.60), stroke (OR: 1.87, 95% CI: 1.53–2.29) and renal insufficiency (OR: 7.26, 95% CI: 6.65–7.94). The level of discrimination (area under the curve) was 0.83. Conclusions: This mortality model is the first to be based on administrative clinical databases and hospitalization episodes. The model achieves a moderate–high level of discrimination. Carlos III Institute of Health, Madrid (Spain) under the 2013-2016 National Plan for RDI PI16/01931 ISCIII-General Subdirectorate for Evaluation and Promotion of Research, within the European Regional Development Fund (FEDER

Repositorio Institucional Universidad de Granada