78 research outputs found
Recommended from our members
Learning from aggregated data
Data aggregation is ubiquitous in modern life. Due to various reasons like privacy, scalability, robustness, etc., ground truth data is often subjected to aggregation before being released to the public, or utilised by researchers and analysts. Learning from aggregated data is a challenging problem that requires significant algorithmic innovation, since naive application of standard techniques to aggregated data is vulnerable to the ecological fallacy. In this work, we explore three different versions of this setting.
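The ecological fallacy mentioned above can be made concrete with a toy example. The sketch below uses made-up numbers, not data or methods from the thesis: inside every group the individual-level relationship between x and y is negative, yet a line fitted to the group means has a positive slope. The helper names `ols_slope` and `make_groups` are purely illustrative.

```python
# Toy illustration of the ecological fallacy (synthetic numbers, not from the thesis).

def ols_slope(xs, ys):
    """Slope of the least-squares line of ys on xs."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    return cov / var

def make_groups():
    """Three groups; inside each group y falls as x rises."""
    groups = []
    for g in range(3):
        xs = [10 * g + d for d in (-1, 0, 1)]
        ys = [10 * g - d for d in (-1, 0, 1)]
        groups.append((xs, ys))
    return groups

groups = make_groups()

# Individual-level view: one slope per group.
within_slopes = [ols_slope(xs, ys) for xs, ys in groups]

# Aggregated view: only the group means are released.
mean_x = [sum(xs) / len(xs) for xs, _ in groups]
mean_y = [sum(ys) / len(ys) for _, ys in groups]
between_slope = ols_slope(mean_x, mean_y)

print("within-group slopes:", within_slopes)
print("slope fitted to group means:", between_slope)
```

A model fitted naively to the released means would therefore predict the wrong sign for individuals, which is the failure that aggregate-aware methods aim to avoid.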
First, we tackle the problem of using generalised linear models when features/covariates are fully observed but the targets are only available as histograms, a common scenario in the healthcare domain where many datasets contain both non-sensitive attributes like age, sex, zip code, etc., as well as privacy-sensitive attributes like healthcare records. We introduce an efficient algorithm that uses alternating data imputation and GLM estimation steps to learn predictive models in this setting.
Next, we look at the problem of learning sparse linear models when both features and targets are in aggregated form, specified as empirical estimates of group-wise means computed over different sub-groups of the population. We show that if the true sub-populations are heterogeneous enough, the optimal sparse parameter can be recovered within an arbitrarily small tolerance even in the presence of noise, provided the empirical estimates are obtained from a sufficiently large number of observations.
Third, we tackle the scenario of predictive modelling with data that is subjected to spatio-temporal aggregation. We show that by formulating the problem in the frequency domain, we can bypass the mathematical and representational challenges that arise due to non-uniform aggregation, misaligned sampling periods and aliasing. We introduce a novel algorithm that uses restricted Fourier transforms to estimate a linear model which, when applied to spatio-temporally aggregated data, has a generalisation error that is provably close to the optimal performance by the best possible linear model that can be learned from the non-aggregated data set.
We then focus our attention on the complementary problem that involves designing aggregation strategies that can allow learning, as well as developing algorithmic techniques that can use only the aggregates to train a model that works on individual samples. We motivate our methods by using the example of Gaussian regression, and subsequently extend our techniques to subsume binary classifiers and generalised linear models. We demonstrate the effectiveness of our techniques with empirical evaluation on data from healthcare and telecommunication.
Finally, we present a concrete example of our methods applied to a real-life practical problem. Specifically, we consider an application in the domain of online advertising where the complexity of bidding strategies requires accurate estimates of the most probable cost-per-click (CPC) incurred by advertisers, but the data used for training these CPC prediction models is only available as aggregated invoices supplied by an ad publisher on a daily or hourly basis. We introduce a novel learning framework that can use aggregates computed at varying levels of granularity for building individual-level predictive models. We generalise our modelling and algorithmic framework to handle data from diverse domains, and extend our techniques to cover arbitrary aggregation paradigms like sliding windows and overlapping/non-uniform aggregation. We show empirical evidence for the efficacy of our techniques with experiments on both synthetic data and real data from the online advertising domain as well as healthcare to demonstrate the wider applicability of our framework.
Electrical and Computer Engineerin
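To illustrate the idea of training an individual-level model from invoice totals, here is a minimal sketch. It is not the framework described in the abstract: it assumes a hypothetical linear CPC model with one per-click feature x, noiseless invoices, and ordinary least squares on the aggregated rows. All names (`aggregate_invoice`, `fit_from_invoices`, `predict_cpc`) and all numbers are invented for illustration.

```python
# Hypothetical sketch: fit CPC = b0 + b1 * x using invoice totals only.
# Summing the per-click model over an invoice gives
#   total_cost = b0 * (number of clicks) + b1 * (sum of x over clicks),
# so the individual-level parameters can be fitted on aggregated rows.

def aggregate_invoice(click_features, b0, b1):
    """Simulate one invoice: (number of clicks, sum of x, total cost)."""
    n = len(click_features)
    sx = sum(click_features)
    total = sum(b0 + b1 * x for x in click_features)
    return n, sx, total

def fit_from_invoices(invoices):
    """Least squares on aggregated rows via the 2x2 normal equations."""
    a11 = sum(n * n for n, sx, t in invoices)
    a12 = sum(n * sx for n, sx, t in invoices)
    a22 = sum(sx * sx for n, sx, t in invoices)
    r1 = sum(n * t for n, sx, t in invoices)
    r2 = sum(sx * t for n, sx, t in invoices)
    det = a11 * a22 - a12 * a12
    if det == 0:
        raise ValueError("invoices are not heterogeneous enough to identify the model")
    b0 = (a22 * r1 - a12 * r2) / det
    b1 = (a11 * r2 - a12 * r1) / det
    return b0, b1

# Three made-up invoices (for example, three days), each listing the feature x per click.
true_b0, true_b1 = 0.5, 0.1
days = [[1, 2, 3], [4], [0, 0]]
invoices = [aggregate_invoice(day, true_b0, true_b1) for day in days]

est_b0, est_b1 = fit_from_invoices(invoices)

def predict_cpc(x):
    """Predicted cost of a single click with feature value x."""
    return est_b0 + est_b1 * x

print("estimated parameters:", est_b0, est_b1)
print("predicted CPC for x = 2:", predict_cpc(2))
```

The fit only works when the invoices differ enough in their click counts and feature sums; if every invoice had the same mix of clicks, the determinant would be zero and the parameters could not be separated.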
Predicting from aggregated data
Aggregated data, meaning data summarized from multiple sources, is commonly used in fields of research including healthcare, web applications, and sensor networks. Aggregation is often employed to handle issues such as privacy, scalability, and reliability. However, accurately predicting individual outcomes from grouped datasets can be very difficult. In this thesis, we designed a new learning method, a Mixture of Experts (MoE) model, focused on individual-level prediction when training variables are aggregated. We used the MoE model, trained and validated on the eICU Collaborative Research patient datasets, to conduct a series of studies. Our results showed that applying grouping functions to the classification of aggregated data across demographic and behavior metrics could remain effective. This technique was verified by comparing two separately trained MoE models that were evaluated on the same datasets. Finally, we estimated non-aggregated datasets from spatio-temporally aggregated records by expressing the problem in the frequency domain, and trained an autoregressive model for predicting future stock prices. This process can be repeated, offering a potential solution to the issue of learning from aggregated data.
Implementation and Application of Genomic Association Methods to Clostridium Difficile Toxicity and Clinical Infection Outcomes
Clostridium difficile is a major cause of healthcare-associated infections in the United States. A C. difficile infection can lead to a range of outcomes including diarrhea, intensive care unit admission, abdominal surgery, or death. Pathogenesis is mediated by the release of toxin from C. difficile cells growing in the intestines. Some patients are more vulnerable to infection, including those with previous antibiotic exposure and advanced age. Host factors can affect the likelihood of infection but also the severity of infection. Additionally, infection severity can be influenced by the genome of the infecting strain(s). Host-pathogen interactions are extremely complex and very little is known about the interplay between host factors and C. difficile genomic variation with respect to infection likelihood and outcomes. With the recent deluge of whole genome sequencing data, the contribution of bacterial genomic variation to infections can be more comprehensively evaluated than ever before. The work described in this dissertation used two different approaches to test for associations between C. difficile genomic variation and clinically relevant phenotypes.
In the first approach we implemented and applied a novel convergence-based bacterial genome-wide association study (bGWAS) algorithm for quantitative traits. We introduce the algorithm using a set of data generated in silico to realistically model bacterial genome variation and phenotypes under various evolutionary regimes. When the algorithm was applied to C. difficile genomic variants and toxin activity our bGWAS identified known toxin regulatory genes associated with toxin activity, supporting the value of our approach. Besides identifying key cis-regulatory variants in the toxin-producing locus, we observed several associations that connect toxin activity to a complex network of trans-regulatory genes. Many highly associated variants occur in flagellar genes and indicate coregulation of toxicity and motility. We propose new variants associated with toxin activity for future functional validation. This study focused on a complex phenotype, toxin activity, within a highly controlled in vitro system.
We next investigated the impact of bacterial genetic variation on human infections. The increased complexity of this human-pathogen interaction justified a different association approach to better understand the independent contribution of bacterial genomic variation to infection. In a set of clinically derived isolates, we tested for the association between variants in trehalose metabolism operons and infection severity while incorporating and controlling for infection severity-modulating patient characteristics. Trehalose utilization variants were recently proposed to modulate C. difficile infections in a mouse model. Interestingly, we observed that this in vivo result did not translate to our clinical cohort as we found no evidence of an association between any of the trehalose utilization variants and patient infection outcomes. Taken together, these results demonstrate the utility of applying multiple approaches for identifying genomic variants associated with clinical outcomes that account for either bacterial population structure or host factors.
PhD, Microbiology & Immunology, University of Michigan, Horace H. Rackham School of Graduate Studies, http://deepblue.lib.umich.edu/bitstream/2027.42/166125/1/katiephd_1.pd
Timely and reliable evaluation of the effects of interventions: a framework for adaptive meta-analysis (FAME)
Most systematic reviews are retrospective and use aggregate data (AD) from publications, meaning they can be unreliable, lag behind therapeutic developments and fail to influence ongoing or new trials. Commonly, the potential influence of unpublished or ongoing trials is overlooked when interpreting results, or when determining the value of updating the meta-analysis or the need to collect individual participant data (IPD). Therefore, we developed a Framework for Adaptive Meta-analysis (FAME) to determine prospectively the earliest opportunity for reliable AD meta-analysis. We illustrate FAME using two systematic reviews in men with metastatic (M1) and non-metastatic (M0) hormone-sensitive prostate cancer (HSPC).
Development and use of methods to estimate chronic disease prevalence in small populations
Introduction
National data on the prevalence of chronic diseases on general practice registers is now available. The aim of this PhD was to develop and validate epidemiological models for the expected prevalence of chronic obstructive pulmonary disease (COPD), coronary heart disease (CHD), stroke, hypertension, overall cardiovascular disease (CVD) and high CVD risk at general practice and small area level, and to explore the extent of undiagnosed disease, factors associated with it, and its impact on population health.
Methods
Multinomial logistic regression models were fitted to pooled Health Survey for England data to derive odds ratios for disease risk factors. These were applied to general practice and small area level population data, split by age, sex, ethnicity, deprivation, rurality and smoking status, to estimate expected disease prevalence at these levels. Validation was carried out using external data, including population-based epidemiological research and case-finding initiatives. Practice-level undiagnosed disease prevalence i.e. expected minus registered disease prevalence, and hospital admission rates for these conditions, were evaluated as outcome indicators of the quality and supply of primary health care services, using ordinary least squares (OLS) regression, geographically-weighted regression (GWR), and other spatial analytic methods.
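The step of applying survey-derived odds ratios to a practice population can be sketched as follows. This is a simplified illustration, not the models from the thesis: it assumes a binary outcome, a baseline odds for the reference stratum, and odds ratios that multiply without interaction terms. The baseline odds, the odds ratio, the stratum counts and the registered prevalence are all invented numbers, and the function names are hypothetical.

```python
# Simplified sketch: expected prevalence from odds ratios applied to population strata.
# Assumptions (not from the source): binary outcome, multiplicative odds ratios,
# no interaction terms, made-up numbers throughout.

def stratum_probability(baseline_odds, odds_ratios, risk_factors):
    """Disease probability for a stratum carrying the given risk factors."""
    odds = baseline_odds
    for factor in risk_factors:
        odds *= odds_ratios[factor]
    return odds / (1.0 + odds)

def expected_prevalence(baseline_odds, odds_ratios, strata):
    """strata is a list of (population_count, list_of_risk_factors)."""
    total_people = sum(count for count, _ in strata)
    if total_people == 0:
        raise ValueError("population is empty")
    expected_cases = sum(
        count * stratum_probability(baseline_odds, odds_ratios, factors)
        for count, factors in strata
    )
    return expected_cases / total_people

# Hypothetical practice: 600 non-smokers and 300 smokers.
baseline_odds = 0.25            # odds in the reference stratum (probability 0.2)
odds_ratios = {"smoker": 2.0}   # illustrative odds ratio
practice_strata = [(600, []), (300, ["smoker"])]

expected = expected_prevalence(baseline_odds, odds_ratios, practice_strata)
registered = 0.20               # illustrative prevalence on the practice register
undiagnosed = expected - registered  # expected minus registered, as defined above

print("expected prevalence:", round(expected, 4))
print("undiagnosed prevalence:", round(undiagnosed, 4))
```

In practice the strata would be split by every modelled risk factor (age, sex, ethnicity, deprivation, rurality and smoking status), but the arithmetic per stratum stays the same.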
Results
Risk factors, odds of disease and expected prevalence were consistent with external data sources. Spatial analysis showed strong evidence of spatial non-stationarity of undiagnosed disease prevalence, with high levels of undiagnosed disease in London and other conurbations, and associations with low supply of primary health care services. Higher hospital admission rates were associated with population deprivation, poorer quality and supply of primary health care services and poorer access to them, and for COPD, with higher levels of undiagnosed disease.
Conclusion
The epidemiologic prevalence models have been implemented in national data sources such as NHS Comparators, the Association of Public Health Observatories website, and a number of national reports. Early experience suggests that they are useful for guiding case-finding at practice level and improving and regulating the quality of primary health care. Comparisons with external data, in particular prevalence of disease detected by general practices, suggest that model predictions are valid.
Practice-level spatial analyses of undiagnosed disease prevalence and hospital admission rates failed to demonstrate superiority of GWR over OLS methods. Disease modellers should be encouraged to collaborate more effectively, and to validate and compare modelling methods using an agreed framework. National leadership is needed to further develop and implement disease models. It is likely that prevalence models will prove to be most useful for identifying undiagnosed diseases with a slow and insidious onset, such as COPD, diabetes and hypertension.
Population Dynamics in the Shadow of the Law: A New Approach to Law in Population Studies
Does the Law influence population dynamics such as fertility and family time use patterns? If so, why, and how does this happen? These are the main questions this dissertation tackles. In doing so, a new holistic approach to the use of Law in population studies is introduced, “Law and Demography”, contributing both theoretical and empirical elements to this scholarly sub-field. A theoretical contribution is made by embedding legal scholarship and theory of legal change into current demographic theory, thereby creating a new analytical space for Law in population studies. An empirical contribution is made by introducing the importance of context into the study of Law and populations. To fully understand the influence of a law, the Law must be considered in its correct topical and spatial context. The topical context is necessary as Law is a patchwork of interlinking edicts that are created and adjusted in relation to each other. Spatial context is crucial, as Law is heavily influenced by its surrounding environment, on the micro (e.g., local regulation), meso (e.g., State Law), and macro levels (e.g., National Constitutions). Three distinct empirical studies each employ a different and unique combination of original legal data and socioeconomic measures. Chapter 1 explores the association between State-Level Family Law in the U.S. and later changes to county-level General Fertility Rates; Chapter 2 interrogates the association between Constitutional Law and later changes to country-level Total Fertility Rates; and Chapter 3 studies the association between grandparents’ visitation rights, and time grandparents spend with grandchildren. A solid foundation of evidence is provided by all three studies to demonstrate that Law is linked to population dynamics, as expected by the theoretical framework introduced, affirming a new role for Law in population dynamics, as set out by the “Law and Demography” agenda.
Causes and consequences of adult sepsis in Blantyre, Malawi
Sepsis, defined as a life-threatening organ dysfunction triggered by infection, carries a high mortality. Recent improvements in outcomes in high-income settings have been driven by prompt antimicrobial therapy and fluid resuscitation, but mortality remains disproportionately high in low-resource settings like the nations of sub-Saharan Africa (sSA). Sepsis therapy here often consists of empiric, prolonged courses of broad-spectrum antimicrobials, especially third generation cephalosporins like ceftriaxone, which may be driving the rise of ceftriaxone-resistant extended-spectrum β-lactamase producing Enterobacteriaceae (ESBL-E).
However, the aetiology of sepsis in sSA is far from clear, and in this thesis I hypothesise that it may be possible to improve outcomes in sepsis whilst reducing selection pressure for ESBL-E, with novel, targeted, antimicrobial strategies tailored to the pathogens that are truly causing sepsis here.
To that end, I present findings from a clinical cohort study of sepsis in Blantyre, Malawi, with two aims: first, a description of the presentation and outcomes of sepsis in Blantyre, with a focus on aetiology and an analysis of the determinants of mortality; and secondly, a description of the gut mucosal carriage of ESBL-E in sepsis survivors (as well as antibiotic unexposed inpatient and community controls) as they pass through the hospital to identify determinants of carriage. An expanded package of diagnostic tests was used to define sepsis aetiology, and serial stool sampling with selective culture for ESBL-E used to define ESBL-E carriage. I use whole-genome sequencing of cultured ESBL E. coli to track bacteria and mobile genetic elements within participants over time, and continuous time Markov models to provide insight into the drivers of carriage.
I find that the majority of participants with sepsis are young, and HIV-infected. Disseminated tuberculosis (TB) dominates as a cause of sepsis, and there is an association of receipt of antituberculous chemotherapy with survival that suggests an expanded role for TB therapy in these very unwell patients may be beneficial. Sepsis mortality seems to have improved compared to historic cohorts, but post 28-day mortality in HIV-infected individuals is significant.
At baseline, gut mucosal ESBL-E carriage is common, with cultured ESBL-E present in the stool of 49% of participants with sepsis on the day of admission. There is a further rapid increase in colonisation prevalence following admission and antibacterial exposure. Associations of baseline colonisation (household crowding and unprotected water sources) suggest both within-household and environmental routes of transmission are important. Genomic analysis suggests unrestricted mixing of ESBL E. coli at multiple spatial levels and rapid turnover within the individual, perhaps suggestive of frequent re-exposure. By using the genetic environment of ESBL genes as a proxy for mobile genetic elements (which are difficult to assemble from short read sequencing) I show that, within individuals, the E. coli strain-mobile genetic element combination is conserved over time whereas the strain or mobile genetic element alone is not; this suggests that the unit of transmission of ESBL genes to study participants is the bacterium, rather than the mobile genetic element.
Longitudinal modelling provides further insight into ESBL-E carriage dynamics: hospitalisation and antibacterial exposure act synergistically to bring about rapid and prolonged carriage driven, in part, by a significant post-antibiotic effect. This effect means that antibacterials act to prolong carriage long after antibacterial exposure stops. In terms of ESBL-E carriage, short courses of antibacterials have a similar effect to longer courses, such that the data generated in this study do not support my hypothesis and it may not be possible to reduce ESBL-E carriage by truncating courses of ceftriaxone. Nevertheless, the post-antibiotic effect deserves further scrutiny to understand the mechanism and as a potential therapeutic target. In addition, the modelling approach suggests cotrimoxazole preventative therapy (CPT) may be a significant driver of long-term ESBL-E carriage, and I suggest that a more nuanced approach to its deployment may be necessary in an era of increasing Gram-negative resistance.
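As a rough illustration of how a continuous-time Markov model describes carriage, the sketch below uses the simplest possible case. It is an assumption for illustration only: two states (not colonised, colonised) with constant acquisition and clearance rates, for which the transition probabilities have a standard closed form. The thesis does not specify its state space or rates here, and the numbers and the function name `two_state_transition_probs` are invented.

```python
import math

# Assumed two-state continuous-time Markov model (illustrative, not the thesis model):
#   state 0 = not colonised, state 1 = colonised
#   acquire_rate: rate of moving 0 -> 1, clear_rate: rate of moving 1 -> 0

def two_state_transition_probs(acquire_rate, clear_rate, t):
    """Return the 2x2 matrix P(t), where P[i][j] = Pr(state j at time t | state i at time 0)."""
    total = acquire_rate + clear_rate
    if total <= 0:
        raise ValueError("at least one rate must be positive")
    long_run_colonised = acquire_rate / total
    long_run_clear = clear_rate / total
    decay = math.exp(-total * t)
    return [
        [long_run_clear + long_run_colonised * decay, long_run_colonised * (1.0 - decay)],
        [long_run_clear * (1.0 - decay), long_run_colonised + long_run_clear * decay],
    ]

# Made-up daily rates; covariates such as antibacterial exposure would, in a fuller
# model, change these rates over time.
acquire_rate = 0.05
clear_rate = 0.01
p_28_days = two_state_transition_probs(acquire_rate, clear_rate, 28.0)

print("Pr(colonised at day 28 | not colonised at day 0):", round(p_28_days[0][1], 3))
print("Pr(still colonised at day 28 | colonised at day 0):", round(p_28_days[1][1], 3))
```

Effects such as hospitalisation or a post-antibiotic effect would enter a model of this kind as covariates that raise the acquisition rate or lower the clearance rate for a period of time.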