A Predictive Approach to Bayesian Nonparametric Survival Analysis
Bayesian nonparametric methods are a popular choice for analysing survival data due to their ability to flexibly model the distribution of survival times. These methods typically employ a nonparametric prior on the survival function that is conjugate with respect to right-censored data. Eliciting these priors, particularly in the presence of covariates, can be challenging, and inference typically relies on computationally intensive Markov chain Monte Carlo schemes. In this paper, we build on recent work that recasts Bayesian inference as assigning a predictive distribution on the unseen values of a population conditional on the observed samples, thus avoiding the need to specify a complex prior. We describe a copula-based predictive update which admits a scalable sequential importance sampling algorithm to perform inference that properly accounts for right-censoring. We provide theoretical justification through an extension of Doob’s consistency theorem and illustrate the method on a number of simulated and real data sets, including an example with covariates. Our approach enables analysts to perform Bayesian nonparametric inference through only the specification of a predictive distribution.
Optimal strategies for learning multi-ancestry polygenic scores vary across traits
Polygenic scores (PGSs) are individual-level measures that aggregate the genome-wide genetic predisposition to a given trait. As PGSs have predominantly been developed using European-ancestry samples, trait prediction using such European-ancestry-derived PGSs is less accurate in non-European-ancestry individuals. Although there has been recent progress in combining multiple PGSs trained on distinct populations, the problem of how to maximize performance given a multiple-ancestry cohort is largely unexplored. Here, we investigate the effect of sample size and ancestry composition on PGS performance for fifteen traits in UK Biobank. For some traits, a PGS estimated using a relatively small African-ancestry training set outperformed, on an African-ancestry test set, a PGS estimated using a much larger European-ancestry-only training set. We observe similar, but not identical, results when considering other minority-ancestry groups within UK Biobank. Our results emphasise the importance of targeted data collection from underrepresented groups in order to address existing disparities in PGS performance.
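At its core, a polygenic score is a weighted sum of an individual's risk-allele counts, with per-variant weights (effect sizes) estimated from a training cohort. A minimal sketch of that computation (the genotypes and effect sizes below are illustrative, not from the paper):

```python
import numpy as np

# Genotype matrix: rows = individuals, columns = variants,
# entries = risk-allele counts (0, 1, or 2).
genotypes = np.array([
    [0, 1, 2, 1],
    [2, 0, 1, 0],
    [1, 1, 0, 2],
])

# Per-variant effect sizes estimated from a training GWAS
# (illustrative values).
effect_sizes = np.array([0.12, -0.05, 0.30, 0.08])

# The polygenic score is the weighted sum over variants,
# giving one score per individual.
pgs = genotypes @ effect_sizes
print(pgs)
```

The ancestry question the paper studies enters through `effect_sizes`: weights estimated in one ancestry group transfer imperfectly to another, which is why the composition of the training set matters.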
Inferring differences between networks using Bayesian exponential random graph models
The goal of many neuroimaging studies is to better understand how the functional connectivity structure of the brain changes with a given phenotype such as age. Functional connectivity can be characterised as a network, with nodes corresponding to brain regions and edges corresponding to statistical dependencies between the respective regional time series of activity. A typical neuroimaging dataset will thus consist of one or more networks for each individual in the study. Most statistical network models, however, were originally proposed to describe a single underlying relational structure such as friendships between individuals or hyperlinks between web pages. As a result, the development of these models has largely been restricted to the single network case. While one could in principle fit a single network model to each individual separately, it is not always straightforward to combine these individual results into a single group result.
In the first half of the thesis, we propose a multilevel framework for populations of networks based on exponential random graph models. By pooling information across the individual networks, this framework provides a principled approach to characterise the relational structure for an entire population. We use the framework to assess group-level variations in functional connectivity, providing a method for the inference of differences in the topological structure between groups of networks. Our motivation stems from the Cam-CAN project, a neuroimaging study on healthy ageing. Using this dataset, we illustrate how our method can be used to detect differences in functional connectivity between a group of young individuals and a group of old individuals.
In the second half of the thesis, we shift our focus to dynamic functional connectivity (dFC). Recent studies have found that using static measures may average over informative fluctuations in functional connectivity. Several methods have been developed to measure dFC in functional magnetic resonance imaging (fMRI) data. However, spurious group differences in measured dFC may be caused by other sources of heterogeneity between people. We use a generic simulation framework for fMRI data to investigate the effect of such heterogeneity on estimates of dFC and find that, despite no differences in true dFC, individual differences in measured dFC can result from other (non-dynamic) features of the data. We then add a natural and novel extension to our multilevel framework by inserting time windows as an intermediate level between time points and subjects. Using magnetoencephalography data from the Cam-CAN study, we apply our method to detect differences in time-varying connectivity between a young group and an old group.
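A common way to measure the dFC described above is sliding-window correlation: compute a connectivity matrix over successive time windows rather than the whole recording. A minimal sketch on synthetic data (window length, step, and data are illustrative; the thesis's multilevel model sits on top of such window-level summaries):

```python
import numpy as np

rng = np.random.default_rng(0)
n_time, n_regions = 200, 3
ts = rng.standard_normal((n_time, n_regions))  # synthetic regional time series

window = 50  # window length in time points (illustrative)
step = 25    # window offset (illustrative)

# One correlation matrix per window: a time-varying picture of connectivity.
dfc = [
    np.corrcoef(ts[start:start + window].T)
    for start in range(0, n_time - window + 1, step)
]

print(len(dfc), dfc[0].shape)  # number of windows, (n_regions, n_regions)
```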
Neural Score Matching for High-Dimensional Causal Inference
Traditional methods for matching in causal inference are impractical for high-dimensional datasets. They suffer from the curse of dimensionality: exact matching and coarsened exact matching find exponentially fewer matches as the input dimension grows, and propensity score matching may match highly unrelated units together. To overcome this problem, we develop theoretical results which motivate the use of neural networks to obtain non-trivial, multivariate balancing scores of a chosen level of coarseness, in contrast to the classical, scalar propensity score. We leverage these balancing scores to perform matching for high-dimensional causal inference and call this procedure neural score matching. We show that our method is competitive against other matching approaches on semi-synthetic high-dimensional datasets, both in terms of treatment effect estimation and reducing imbalance.
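For contrast with the neural balancing scores above, the classical baseline collapses all covariates to a single scalar, the propensity score, and matches on it. A minimal sketch of 1-nearest-neighbour propensity-score matching (the data, logistic coefficients, and score are synthetic; the paper's point is that this scalar can pair very different units):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
x = rng.standard_normal((n, 5))  # covariates

# Treatment assigned with probability given by a logistic model
# (coefficients are illustrative).
beta = np.array([0.8, -0.5, 0.3, 0.0, 0.2])
p = 1.0 / (1.0 + np.exp(-(x @ beta)))  # propensity score
t = rng.random(n) < p                  # treatment indicator

# 1-nearest-neighbour matching on the scalar propensity score:
# each treated unit is paired with the control whose score is closest.
treated_idx = np.where(t)[0]
control_idx = np.where(~t)[0]
matches = {
    i: control_idx[np.argmin(np.abs(p[control_idx] - p[i]))]
    for i in treated_idx
}
print(len(matches), "treated units matched")
```

Two units with very different covariate vectors can share the same scalar score, which is the failure mode the multivariate balancing scores are designed to mitigate.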
Bayesian imputation of COVID-19 positive test counts for nowcasting under reporting lag
Obtaining up to date information on the number of UK COVID-19 regional infections is hampered by the reporting lag in positive test results for people with COVID-19 symptoms. In the UK, for "Pillar 2" swab tests for those showing symptoms, it can take up to five days for results to be collated. We make use of the stability of the under-reporting process over time to motivate a statistical temporal model that infers the final total count given the partial count information as it arrives. We adopt a Bayesian approach that provides for subjective priors on parameters and a hierarchical structure for an underlying latent intensity process for the infection counts. This results in a smoothed time-series representation nowcasting the expected number of daily counts of positive tests with uncertainty bands that can be used to aid decision making. Inference is performed using sequential Monte Carlo.
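The core idea, inferring the final total from partial counts via a stable reporting process, can be illustrated without the full hierarchical model. A minimal sketch (the paper's Bayesian latent-intensity model and sequential Monte Carlo inference are replaced here by a simple plug-in estimate; all numbers are illustrative):

```python
import numpy as np

# Historical data: fraction of the final total reported d days after the
# specimen date (illustrative; in practice estimated from past weeks).
reporting_fraction = np.array([0.30, 0.65, 0.85, 0.95, 1.00])  # days 0..4

# Partial counts observed so far for recent specimen dates
# (most recent day last), and how long each has had to accumulate.
partial_counts = np.array([480, 390, 260, 150, 60])
days_since_specimen = np.array([4, 3, 2, 1, 0])

# Plug-in nowcast: divide each partial count by its expected
# reported fraction to estimate the eventual total.
nowcast = partial_counts / reporting_fraction[days_since_specimen]
print(np.round(nowcast).astype(int))
```

The paper's approach replaces this point estimate with a posterior over the final counts, which is what yields the uncertainty bands for decision making.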
Improving local prevalence estimates of SARS-CoV-2 infections using a causal debiasing framework.
Funders: Oxford University | Jesus College, University of Oxford; Joint Biosecurity Centre
Global and national surveillance of SARS-CoV-2 epidemiology is mostly based on targeted schemes focused on testing individuals with symptoms. These tested groups are often unrepresentative of the wider population and exhibit test positivity rates that are biased upwards compared with the true population prevalence. Such data are routinely used to infer infection prevalence and the effective reproduction number, Rt, which affects public health policy. Here, we describe a causal framework that provides debiased fine-scale spatiotemporal estimates by combining targeted test counts with data from a randomized surveillance study in the United Kingdom called REACT. Our probabilistic model includes a bias parameter that captures the increased probability of an infected individual being tested, relative to a non-infected individual, and transforms observed test counts to debiased estimates of the true underlying local prevalence and Rt. We validated our approach on held-out REACT data over a 7-month period. Furthermore, our local estimates of Rt are indicative of 1-week- and 2-week-ahead changes in SARS-CoV-2-positive case numbers. We also observed increases in estimated local prevalence and Rt that reflect the spread of the Alpha and Delta variants. Our results illustrate how randomized surveys can augment targeted testing to improve statistical accuracy in monitoring the spread of emerging and ongoing infectious disease.
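The role of a bias parameter of this kind can be illustrated with a simplified closed-form calculation: if infected individuals are delta times as likely to be tested as non-infected individuals, the expected test positivity q relates to the true prevalence pi by q = delta*pi / (delta*pi + 1 - pi), which can be inverted to debias the observed rate. A minimal sketch (this collapses the paper's full spatiotemporal probabilistic model to a single adjustment; delta and the counts are illustrative):

```python
def debiased_prevalence(positives, tests, delta):
    """Invert q = delta*pi / (delta*pi + 1 - pi) for pi, where q is the
    observed test positivity and delta is the relative probability of an
    infected (vs non-infected) individual being tested."""
    q = positives / tests
    return q / (delta * (1.0 - q) + q)

# Targeted testing: 300 positives out of 2000 tests (positivity 15%),
# with infected people assumed 25x more likely to be tested.
print(debiased_prevalence(300, 2000, delta=25.0))
```

With delta = 1 (no ascertainment bias) the formula returns the raw positivity unchanged; larger delta shrinks the estimate, reflecting that symptomatic targeting inflates positivity above true prevalence. In the paper, the analogue of delta is learned by anchoring to the randomized REACT survey.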
Interoperability of Statistical Models in Pandemic Preparedness: Principles and Reality
We present "interoperability" as a guiding framework for statistical modelling to assist policy makers asking multiple questions using diverse datasets in the face of an evolving pandemic response. Interoperability provides an important set of principles for future pandemic preparedness, through the joint design and deployment of adaptable systems of statistical models for disease surveillance using probabilistic reasoning. We illustrate this through case studies for inferring spatial-temporal coronavirus disease 2019 (COVID-19) prevalence and reproduction numbers in England.