Correcting Sociodemographic Selection Biases for Population Prediction from Social Media
Social media is increasingly used for large-scale population predictions,
such as estimating community health statistics. However, social media users are
not typically a representative sample of the intended population -- a
"selection bias". Within the social sciences, such a bias is typically
addressed with restratification techniques, where observations are reweighted
according to how under- or over-sampled their socio-demographic groups are.
Yet, restratification is rarely evaluated for improving prediction. Across four
tasks of predicting U.S. county population health statistics from Twitter, we
find standard restratification techniques provide no improvement and often
degrade prediction accuracies. The core reasons for this seem to be both
shrunken estimates (reduced variance of model predicted values) and sparse
estimates of each population's socio-demographics. We thus develop and evaluate
three methods to address these problems: estimator redistribution to account
for shrinking, and adaptive binning and informed smoothing to handle sparse
socio-demographic estimates. We show that each of these methods significantly
outperforms the standard restratification approaches. Combining approaches, we
find substantial improvements over non-restratified models, yielding a 53.0%
increase in predictive accuracy (R^2) in the case of surveyed life
satisfaction, and a 17.8% average increase across all tasks.
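The standard restratification idea described in this abstract (reweighting observations by how under- or over-sampled their socio-demographic group is) can be sketched as follows. This is a minimal illustration of plain post-stratification weighting, not the authors' estimator redistribution, adaptive binning, or informed smoothing methods. The function names, group labels, and toy values are all hypothetical.

```python
from collections import Counter


def poststratification_weights(sample_groups, population_shares):
    """Return one weight per observation.

    Each weight is the group's share of the target population divided by
    the group's share of the sample, so under-sampled groups get weights
    above 1 and over-sampled groups get weights below 1.
    """
    n = len(sample_groups)
    counts = Counter(sample_groups)
    return [population_shares[g] / (counts[g] / n) for g in sample_groups]


def weighted_mean(values, weights):
    """Weighted average of values."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Hypothetical toy sample: three "young" users and one "old" user,
# while the target population is assumed to be split 50/50.
groups = ["young", "young", "young", "old"]
scores = [1.0, 1.0, 1.0, 5.0]
shares = {"young": 0.5, "old": 0.5}

weights = poststratification_weights(groups, shares)
unweighted = sum(scores) / len(scores)
restratified = weighted_mean(scores, weights)
```

In this toy case the single "old" observation carries a weight of 2.0, which is the kind of reliance on sparse groups that the abstract identifies as a problem for prediction.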
A fairness assessment of mobility-based COVID-19 case prediction models
In light of the outbreak of COVID-19, analyzing and measuring human mobility
has become increasingly important. A wide range of studies have explored
spatiotemporal trends over time, examined associations with other variables,
evaluated non-pharmacologic interventions (NPIs), and predicted or simulated
COVID-19 spread using mobility data. Despite the benefits of publicly available
mobility data, a key question remains unanswered: are models using mobility
data performing equitably across demographic groups? We hypothesize that bias
in the mobility data used to train the predictive models might lead to unfairly
less accurate predictions for certain demographic groups. To test our
hypothesis, we applied two mobility-based COVID infection prediction models at
the county level in the United States using SafeGraph data, and correlated
model performance with sociodemographic traits. Findings revealed that there is
a systematic bias in model performance toward certain demographic
characteristics. Specifically, the models tend to favor large, highly educated,
wealthy, young, urban, and non-black-dominated counties. We hypothesize that
the mobility data currently used by many predictive models tends to capture
less information about older, poorer, non-white, and less educated regions,
which in turn negatively impacts the accuracy of the COVID-19 prediction in
these regions. Ultimately, this study points to the need for improved data
collection and sampling approaches that allow for an accurate representation of
the mobility patterns across demographic groups.
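The evaluation step described in this abstract, correlating per-county model performance with a sociodemographic trait, might look like the sketch below. It assumes a Pearson correlation, which the abstract does not specify, and the county errors and income values are invented purely for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical data for five counties: absolute prediction error of a
# case-count model, and median household income in thousands of dollars.
abs_error = [12.0, 9.0, 7.0, 4.0, 3.0]
median_income = [35.0, 42.0, 50.0, 61.0, 75.0]

# A negative value would mean the model is less accurate in poorer counties.
r = pearson(abs_error, median_income)
```

A fairness assessment of this kind would repeat the calculation for each trait of interest (age, education, urbanicity, and so on) and check whether the correlations are systematically different from zero.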
Aggregate administrative data to adjust selection bias in estimates from nonprobability samples
Thesis by compendium of publications.
In recent years, the concurrence of two phenomena has revitalised the methodological debate about inference from nonprobability samples. On the one hand, probability samples increasingly suffer from nonresponse and noncoverage errors, increasing survey costs and leading to biased estimates. On the other hand, the emergence and expansion of the Internet have led to an exponential growth in the use of web surveys with samples recruited using nonprobability methods. Inference from nonprobability samples requires an explicit or implicit model that explains the selection mechanism with respect to the target variable.
This thesis explores an intersection between the need to reduce selection bias in the estimates from nonprobability samples and the opportunity to explain the selection mechanism emerging from newly available aggregate administrative data. To this end, this thesis encompasses three papers that present statistical simulations and two methodological applications using a set of face-to-face surveys and two web surveys conducted in Spain. The first paper uses statistical simulations to explore the conditions under which aggregated data as contextual variables and population totals can reduce or remove selection bias from the estimates. The second paper explores adding sociodemographic and past vote auxiliary variables to the weighting, as well as using multiple imputation, to improve the quality of the estimates from the pre- and post-election surveys of the Centro de Investigaciones Sociológicas (CIS), which combine probability selection methods with quotas. The third paper tests the effect of including aggregate administrative data at the municipality level to tackle selection bias and improve the quality of the survey estimates, using two surveys from an experimental panel of internet users sponsored by the Asociación para la Investigación de los Medios de Comunicación (AIMC, Association for Media Research).
The results show that aggregate administrative data is insufficient to correct selection bias in survey estimates, especially when used as contextual variables. The results also suggest that the aggregate nature of the data is the main impediment to controlling for selection bias in the estimates.
Predictive Biases in Natural Language Processing Models: A Conceptual Framework and Overview
An increasing number of works in natural language processing have addressed
the effect of bias on the predicted outcomes, introducing mitigation techniques
that act on different parts of the standard NLP pipeline (data and models).
However, these works have been conducted in isolation, without a unifying
framework to organize efforts within the field. This leads to repetitive
approaches, and puts an undue focus on the effects of bias, rather than on
their origins. Research focused on bias symptoms rather than the underlying
origins could limit the development of effective countermeasures. In this
paper, we propose a unifying conceptualization: the predictive bias framework
for NLP. We summarize the NLP literature and propose a general mathematical
definition of predictive bias in NLP along with a conceptual framework,
differentiating four main origins of biases: label bias, selection bias, model
overamplification, and semantic bias. We discuss how past work has countered
each bias origin. Our framework serves to guide an introductory overview of
predictive bias in NLP, integrating existing work into a single structure and
opening avenues for future research.
Response styles and the quality of survey data: evidence from Guyana
Based on eight chapters, half of which have been published as research papers, this thesis demonstrates the role of cultural factors such as urbanity in response styles of survey data. Although rating scales are a popular way of obtaining opinions in surveys, response styles are generally not controlled for in data analysis. This has important consequences for the accuracy of research results, as evidenced in the author's Guyanese dataset.
Clinical predictors of antipsychotic treatment resistance: Development and internal validation of a prognostic prediction model by the STRATA-G consortium
Introduction
Our aim was to, firstly, identify characteristics at first-episode of psychosis that are associated with later antipsychotic treatment resistance (TR) and, secondly, to develop a parsimonious prediction model for TR.
Methods
We combined data from ten prospective, first-episode psychosis cohorts from across Europe and categorised patients as TR or non-treatment resistant (NTR) after a mean follow up of 4.18 years (s.d. = 3.20) for secondary data analysis. We identified a list of potential predictors from clinical and demographic data recorded at first-episode. These potential predictors were entered in two models: a multivariable logistic regression to identify which were independently associated with TR and a penalised logistic regression, which performed variable selection, to produce a parsimonious prediction model. This model was internally validated using a 5-fold, 50-repeat cross-validation optimism-correction.
Results
Our sample consisted of N = 2216 participants, of whom 385 (17 %) developed TR. Younger age of psychosis onset and fewer years in education were independently associated with increased odds of developing TR. The prediction model selected 7 out of 17 variables that, when combined, could quantify the risk of being TR better than chance. These included age of onset, years in education, gender, BMI, relationship status, alcohol use, and positive symptoms. The optimism-corrected area under the curve was 0.59 (accuracy = 64 %, sensitivity = 48 %, and specificity = 76 %).
Implications
Our findings show that treatment resistance can be predicted at first-episode of psychosis. Pending a model update and external validation, we demonstrate the potential value of prediction models for TR.
Funding: This work was supported by a Stratified Medicine Programme grant to JHM from the Medical Research Council (grant number MR/L011794/1 which funded the research and supported S.E.S., D.A., A.F.P, L.K., R.M.M., D.S., J.T.R.W, & J.H.M.); funding from the National Institute for Health Research Biomedical Research Centre at South London and Maudsley National Health Service Foundation Trust and King's College London to D.A. and D.S; and funding from the Collaboration for Leadership in Applied Health Research and Care (CLAHRC) South London at King's College Hospital National Health Service Foundation Trust to S.E.S. The views expressed are those of the author(s) and not necessarily those of the Medical Research Council, National Health Service, the National Institute for Health Research, or the Department of Health. The AESOP (London, UK) cohort was funded by the UK Medical Research Council (Ref: G0500817). The Belfast (UK) cohort was funded by the Research and Development Office of Northern Ireland. The Bologna (Italy) cohort was funded by the European Community's Seventh Framework Program under grant agreement (agreement No. HEALTH-F2-2010–241909, Project EU-GEI). The GAP (London, UK) cohort was funded by the UK National Institute of Health Research (NIHR) Specialist Biomedical Research Centre for Mental Health, South London and Maudsley NHS Mental Health Foundation Trust (SLaM) and the Institute of Psychiatry, Psychology, and Neuroscience at King's College London; Psychiatry Research Trust; Maudsley Charity Research Fund; and the European Community's Seventh Framework Program grant (agreement No. HEALTH-F2-2009-241909, Project EU-GEI). The Lausanne (Switzerland) cohort was funded by the Swiss National Science Foundation (no. 320030_135736/1 to P.C. and K.Q.D., no 320030-120686, 324730-144064 and 320030-173211 to C.B.E and P.C., and no 171804 to LA); National Center of Competence in Research (NCCR) “SYNAPSY - The Synaptic Bases of Mental Diseases” from the Swiss National Science Foundation (no 51AU40_125759 to PC and KQD); and Fondation Alamaya (to KQD). The Oslo (Norway) cohort was funded by the Research Council of Norway (#223273/F50, under the Centers of Excellence funding scheme, #300309, #283798) and the South-Eastern Norway Regional Health Authority (#2006233, #2006258, #2011085, #2014102, #2015088 to IM, #2017-112). The Paris (France) cohort was funded by European Community's Seventh Framework Program grant (agreement No. HEALTH-F2-2010–241909, Project EU-GEI). The Prague (Czech Republic) cohort was funded by the Ministry of Health of the Czech Republic (Grant Number: NU20-04-00393). The Santander (Spain) cohort was funded by the following grants (to B.C.F): Instituto de Salud Carlos III, FIS 00/3095, PI020499, PI050427, PI060507, Plan Nacional de Drogas Research Grant 2005-Orden sco/3246/2004, and SENY Fundatio Research Grant CI 2005-0308007, Fundacion Marques de Valdecilla A/02/07 and API07/011. SAF2016-76046-R and SAF2013-46292-R (MINECO and FEDER). The West London (UK) cohort was funded by the Wellcome Trust (Grant Number: 042025; 052247; 064607).
Predictive Model of the Risk of In-Hospital Mortality in Colorectal Cancer Surgery, Based on the Minimum Basic Data Set
Background: Various models have been proposed to predict mortality rates for hospital
patients undergoing colorectal cancer surgery. However, none have been developed in Spain using
clinical administrative databases and none are based exclusively on the variables available upon
admission. Our study aims to identify factors associated with in-hospital mortality in patients
undergoing surgery for colorectal cancer and, on this basis, to generate a predictive mortality score.
Methods: The study cohort comprised all hospital admissions for colorectal
cancer during the period 2008–2014 recorded in the Spanish Minimum Basic Data Set. The main
measure was actual and expected mortality after applying the proposed mathematical
model. A logistic regression model and a mortality score were created, and internal validation was
performed. Results: 115,841 hospitalization episodes were studied. Of these, 80% were included
in the training set. The variables associated with in-hospital mortality were age (OR: 1.06, 95% CI:
1.05–1.06), urgent admission (OR: 4.68, 95% CI: 4.36–5.02), pulmonary disease (OR: 1.43, 95% CI:
1.28–1.60), stroke (OR: 1.87, 95% CI: 1.53–2.29) and renal insufficiency (OR: 7.26, 95% CI: 6.65–7.94).
The level of discrimination (area under the curve) was 0.83. Conclusions: This mortality model is
the first to be based on administrative clinical databases and hospitalization episodes. The model
achieves a moderate–high level of discrimination.
Funding: Carlos III Institute of Health, Madrid (Spain), under the 2013-2016 National Plan for RDI, PI16/01931; ISCIII-General Subdirectorate for Evaluation and Promotion of Research, within the European Regional Development Fund (FEDER)
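To illustrate how the odds ratios reported in this abstract combine in a main-effects logistic regression, the sketch below multiplies them on the odds scale (equivalently, adds them on the log-odds scale). Several things here are assumptions rather than facts from the abstract: that the age odds ratio is per additional year, that there are no interaction terms, and the baseline probability, which the abstract does not report. The function names are hypothetical, and this is not the authors' published mortality score.

```python
import math

# Odds ratios as reported in the abstract.
ODDS_RATIOS = {
    "urgent_admission": 4.68,
    "pulmonary_disease": 1.43,
    "stroke": 1.87,
    "renal_insufficiency": 7.26,
}
AGE_ODDS_RATIO = 1.06  # assumed to apply per additional year of age


def combined_odds_ratio(risk_factors, extra_years=0):
    """Combine odds ratios by summing their logarithms (no interactions assumed)."""
    log_or = sum(math.log(ODDS_RATIOS[name]) for name in risk_factors)
    log_or += extra_years * math.log(AGE_ODDS_RATIO)
    return math.exp(log_or)


def probability_from_odds_ratio(baseline_probability, odds_ratio):
    """Apply an odds ratio to a baseline probability and return a probability."""
    baseline_odds = baseline_probability / (1.0 - baseline_probability)
    odds = baseline_odds * odds_ratio
    return odds / (1.0 + odds)


# Urgent admission together with renal insufficiency.
both = combined_odds_ratio(["urgent_admission", "renal_insufficiency"])

# Hypothetical baseline risk of 1%; the true baseline is not given in the abstract.
risk = probability_from_odds_ratio(0.01, both)
```

Working on the log-odds scale mirrors how a logistic model adds coefficients, and converting back to a probability only at the end avoids the common mistake of multiplying probabilities by odds ratios directly.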