4,537 research outputs found

    Empirical Quantification of Predictive Uncertainty Due to Model Discrepancy by Training with an Ensemble of Experimental Designs: An Application to Ion Channel Kinetics

    When using mathematical models to make quantitative predictions for clinical or industrial use, it is important that predictions come with a reliable estimate of their accuracy (uncertainty quantification). Because models of complex biological systems are always large simplifications, model discrepancy arises—models fail to perfectly recapitulate the true data-generating process. This presents a particular challenge for making accurate predictions, and especially for accurately quantifying uncertainty in these predictions. Experimentalists and modellers must choose which experimental procedures (protocols) are used to produce the data used to train models. We propose to characterise uncertainty owing to model discrepancy with an ensemble of parameter sets, each of which results from training to data from a different protocol. The variability in predictions from this ensemble provides an empirical estimate of predictive uncertainty owing to model discrepancy, even for unseen protocols. We use the example of electrophysiology experiments that investigate the properties of hERG potassium channels. Here, ‘information-rich’ protocols allow mathematical models to be trained using numerous short experiments performed on the same cell. In this case, we simulate data with one model and fit it with a different (discrepant) one. For any individual experimental protocol, parameter estimates vary little under repeated samples from the assumed additive independent Gaussian noise model. Yet parameter sets arising from the same model applied to different experiments conflict—highlighting model discrepancy. Our methods will help select more suitable ion channel models for future studies, and will be widely applicable to a range of biological modelling problems.
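    A toy sketch of the ensemble-of-protocols idea (not the paper's hERG model or its protocols, which are far richer): data come from a hypothetical two-exponential "true" process, a deliberately discrepant one-exponential model is fitted, and two "protocols" probe different time windows, so the best-fit rate disagrees across protocols.

```python
# Illustrative model-discrepancy sketch: two protocols, one discrepant model.
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(0)

def true_model(t):
    # hypothetical two-timescale ground truth (stand-in for the richer model)
    return 0.6 * np.exp(-1.0 * t) + 0.4 * np.exp(-0.1 * t)

def fitted_model(t, a, k):
    # deliberately discrepant single-exponential model
    return a * np.exp(-k * t)

# Two "protocols" = two different observation windows
protocols = {"fast": np.linspace(0, 2, 50), "slow": np.linspace(0, 20, 50)}

ensemble = {}
for name, t in protocols.items():
    # additive independent Gaussian noise, as in the assumed noise model
    y = true_model(t) + rng.normal(0, 0.005, t.size)
    (a, k), _ = curve_fit(fitted_model, t, y, p0=(1.0, 0.5))
    ensemble[name] = k

# The spread of the fitted rate across protocols is an empirical
# signal of model discrepancy, well beyond the per-protocol noise level.
spread = abs(ensemble["fast"] - ensemble["slow"])
```

Within either protocol alone, repeated noise realisations would give nearly identical estimates; the conflict only appears when comparing fits across protocols, which is the ensemble's purpose.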

    Bayesian CART models for insurance claims frequency

    The accuracy and interpretability of a (non-life) insurance pricing model are essential qualities to ensure fair and transparent premiums for policy-holders that reflect their risk. In recent years, classification and regression trees (CARTs) and their ensembles have gained popularity in the actuarial literature, since they offer good prediction performance and are relatively easy to interpret. In this paper, we introduce Bayesian CART models for insurance pricing, with a particular focus on claims frequency modelling. In addition to the common Poisson and negative binomial (NB) distributions used for claims frequency, we implement Bayesian CART for the zero-inflated Poisson (ZIP) distribution to address the difficulty arising from imbalanced insurance claims data. To this end, we introduce a general MCMC algorithm using data augmentation methods for posterior tree exploration. We also introduce the deviance information criterion (DIC) for tree model selection. The proposed models are able to identify trees which can better classify the policy-holders into risk groups. Simulations and real insurance data are used to illustrate the applicability of these models.
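    Two ingredients named above can be sketched generically (these are textbook stand-ins, not the paper's tree sampler): the ZIP probability mass function, which inflates the zero count, and the DIC computed from posterior deviance samples.

```python
# Generic ZIP pmf and DIC helpers (illustrative, not the paper's MCMC code).
import math

def zip_pmf(k, lam, pi):
    """P(N=k) under ZIP: an extra zero with prob. pi, else Poisson(lam)."""
    poisson = math.exp(-lam) * lam**k / math.factorial(k)
    return pi * (k == 0) + (1 - pi) * poisson

def dic(deviance_samples, deviance_at_posterior_mean):
    """DIC = mean posterior deviance + effective number of parameters pD."""
    d_bar = sum(deviance_samples) / len(deviance_samples)
    p_d = d_bar - deviance_at_posterior_mean
    return d_bar + p_d

# Zero inflation puts more mass at zero than a plain Poisson of the same rate,
# which is the point for imbalanced claims data with many zero-claim policies.
p0_zip = zip_pmf(0, lam=2.0, pi=0.3)
p0_pois = zip_pmf(0, lam=2.0, pi=0.0)
```

Lower DIC indicates a better trade-off between posterior fit and model complexity, which is how it serves tree model selection.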

    Human gut microbes’ transmission, persistence, and contribution to lactose tolerance

    Human genotypes and their environment interact to produce selectable phenotypes. How microbes of the human gut microbiome interact with their host genotype to shape phenotype is not fully understood. Microbiota that inhabit the human body are environmentally acquired, yet many are passed intergenerationally between related family members, raising the possibility that they could act like genes. Here, I present three studies aimed at better understanding how certain gut microbiota contribute to host phenotypes. In the first study, I assessed mother-to-child transmission in understudied populations. I collected stool samples from 386 mother-infant pairs in Gabon and Vietnam, which are relatively under-studied for microbiome dynamics, and in Germany. Using metagenomic sequencing, I characterized microbial strain diversity. I found that 25-50% of strains detected in mother-infant pairs were shared, and that strain-sharing between unrelated individuals was rare overall. These observations indicate that vertical transmission of microbes is widespread in human populations. Second, to test whether strains acquired during infancy persist into adulthood (similar to human genes), I collected stool from an adolescent previously surveyed for microbiome diversity as an infant. This dataset represents the longest follow-up to date for the persistence of strains seeded in infancy. I observed two strains that had persisted in the gut despite more than 10 years passing, as well as five additional strains shared between the subject and his parents. Taken together, the results of these first two studies suggest that gut microbial strains persist throughout life and transmit between host generations, dynamics more similar to those of the host’s own genome than to those of their environment. Third, I tested whether gut microbes could confer a phenotype (lactose tolerance) on individuals lacking the necessary genotype (lactase persistence).
    I studied 784 women in Gabon, Vietnam and Germany for lactase persistence (genotype) and lactose tolerance (phenotype), and characterized their gut microbiomes through metagenomic sequencing. Despite lacking the lactase-persistence genotype, 13% of participants were lactose tolerant by clinical criteria; I termed this novel phenotype microbially-acquired lactose tolerance (MALT). Those with MALT harbored microbiomes enriched for Bifidobacteria, a known lactose degrader. These results indicate that Bifidobacteria, which are passed intergenerationally, can confer a phenotype previously thought to be under only host genetic control. Taken together, my thesis work lends weight to the concept that specific microbes inhabiting the human gut have the potential to behave as epigenetic factors in evolution.
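    The strain-sharing rate reported above can be illustrated with a minimal set-based sketch (strain identifiers below are invented; the thesis derives strains from metagenomic profiles, which this simplification does not capture):

```python
# Toy strain-sharing rate: fraction of the infant's strains also detected
# in the mother's sample (hypothetical strain names for illustration).
def strain_sharing_rate(infant_strains, mother_strains):
    infant, mother = set(infant_strains), set(mother_strains)
    if not infant:
        return 0.0
    return len(infant & mother) / len(infant)

rate = strain_sharing_rate(
    ["B.longum_t1", "B.breve_t4", "E.coli_t2", "B.adolescentis_t7"],
    ["B.longum_t1", "B.breve_t4", "B.adolescentis_t9"],
)
```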

    The evolution of ectomycorrhizal symbiosis in the Late Cretaceous is a key driver of explosive diversification in Agaricomycetes

    Why are the mushroom-forming fungi so diverse? Diversity born of an encounter with Cretaceous angiosperms (Kyoto University press release, 2023-06-14). Ectomycorrhizal (EcM) symbiosis, a ubiquitous plant–fungus interaction in forests, evolved in parallel in fungi. Why the evolution of EcM fungi did not necessarily increase ecological opportunities for explosive diversification remains unclear. This study aimed to reveal the driving mechanism of evolutionary diversification in the fungal class Agaricomycetes, specifically by testing whether the evolution of EcM symbiosis in the Late Cretaceous increased ecological opportunities. Historical character transitions of trophic state and fruitbody form were estimated based on phylogenies inferred from fragments of 89 single-copy genes. Moreover, five analyses were used to estimate net diversification rates (speciation rate minus extinction rate). The results indicate that the unidirectional evolution of EcM symbiosis occurred 27 times, ranging in date from the Early Triassic to the Early Paleogene. Increased diversification rates appeared to occur intensively at the stems of EcM fungal clades diverging in the Late Cretaceous, coinciding with the rapid diversification of EcM angiosperms. By contrast, the evolution of fruitbody form was not strongly linked with increased diversification rates. These findings suggest that the evolution of EcM symbiosis in the Late Cretaceous, supposedly with coevolving EcM angiosperms, was the key driver of the explosive diversification in Agaricomycetes.
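    The quantity being estimated, net diversification rate r = speciation − extinction, has a simple back-of-envelope form that helps fix intuition (this is the standard pure-birth method-of-moments estimator for a crown group, not one of the five analyses the study actually ran; the clade numbers below are invented):

```python
# Crown-group net diversification under a pure-birth approximation:
# E[n] = 2 * exp(r * t) for a crown group of age t starting from 2 lineages,
# so r = ln(n / 2) / t (the epsilon = 0 method-of-moments estimator).
import math

def net_diversification_crown(n_species, crown_age_myr):
    return math.log(n_species / 2) / crown_age_myr

# Hypothetical clade: 2000 extant species, crown age 80 Myr
# (a Late Cretaceous origin), giving r in species / lineage / Myr.
r = net_diversification_crown(2000, 80.0)
```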

    Spatial epidemiology of a highly transmissible disease in urban neighbourhoods: Using COVID-19 outbreaks in Toronto as a case study

    The emergence of infectious diseases in an urban area involves a complex interaction between the socioecological processes of the neighbourhood and urbanization. As a result, an urban environment can incubate new epidemics and spread diseases more rapidly in densely populated areas than elsewhere. Most recently, the coronavirus disease 2019 (COVID-19) pandemic has brought unprecedented challenges around the world. Toronto, the capital city of Ontario, Canada, has been severely impacted by COVID-19. Understanding the spatiotemporal patterns and the key drivers of those patterns is imperative for designing and implementing an effective public health program to control the spread of the pandemic. This dissertation was designed to contribute to the global research effort on the COVID-19 pandemic through spatial epidemiological studies that enhance our understanding of the disease's epidemiology in a spatial context and help guide public health strategies for controlling the disease. Comprised of three original research manuscripts, this dissertation focuses on the spatial epidemiology of COVID-19 at a neighbourhood scale in Toronto. Each manuscript makes scientific contributions and enhances our knowledge of how interactions between different socioecological processes in the neighbourhood and urbanization can influence the spatial spread and patterns of COVID-19 in Toronto, using novel and advanced methodological approaches. The findings of these analyses are intended to contribute to public health policy that informs neighbourhood-based disease intervention initiatives by public health authorities, local government, and policymakers. The first manuscript analyzes the globally and locally variable socioeconomic drivers of COVID-19 incidence and examines how these relationships vary across different neighbourhoods.
    In the global model, lower levels of education and the percentage of immigrants were found to have a positive association with increased risk for COVID-19. This study provides a methodological framework for identifying local variations in the association between risk for COVID-19 and socioeconomic factors in an urban environment by applying a local multiscale geographically weighted regression (MGWR) modelling approach. The MGWR model improves on the methods used in earlier studies of COVID-19 in identifying local variations of COVID-19 by incorporating a correction factor for the multiple testing problem in geographically weighted regression models. The second manuscript quantifies the associations between COVID-19 cases and urban socioeconomic factors and land surface temperature (LST) at the neighbourhood scale in Toronto. Four spatiotemporal Bayesian hierarchical models with spatial, temporal, and varying space-time interaction terms are compared. The results identified seasonal trends of COVID-19 risk, where the spatiotemporal trends show increasing, decreasing, or stable patterns, and identified area-specific spatial risk for targeted interventions. Educational level and high land surface temperature were shown to have a positive association with the risk for COVID-19. In this study, high spatial and temporal resolution satellite images were used to extract LST, and atmospheric correction methods were applied to these images by adopting a land surface emissivity (LSE) model, which provided high estimation accuracy. The methodological approach of this work will help researchers acquire long time-series LST data at a spatial scale from satellite images, develop methods for atmospheric correction, and create environmental data with high estimation accuracy for disease modelling.
    Applied to policy, the findings of this study can inform the design and implementation of urban planning strategies and programs to control disease risks. The third manuscript developed a novel approach for visualizing the spread of infectious disease outbreaks by incorporating neighbourhood networks and neighbourhood-level time-series data of the disease. The findings of the model provide an understanding of the direction and magnitude of spatial risk for the outbreak and underscore the importance of early intervention to stop the spread of the outbreak. The manuscript also identified hotspots using incidence rate and disease persistence, findings that may help public health planners develop priority-based intervention plans in resource-constrained situations.
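    The core mechanism behind (multiscale) geographically weighted regression can be sketched on synthetic data: at each target location, fit weighted least squares with kernel weights that decay with distance, so coefficients vary across neighbourhoods. This is a minimal GWR illustration (invented locations, covariate, and bandwidth), not the MGWR software or the dissertation's fitted model:

```python
# Minimal hand-rolled GWR: locally weighted least squares per target location.
import numpy as np

rng = np.random.default_rng(1)
n = 200
coords = rng.uniform(0, 10, size=(n, 2))      # neighbourhood centroids (toy)
x = rng.normal(size=n)                        # a socioeconomic covariate (toy)
beta_true = 0.5 + 0.2 * coords[:, 0]          # effect drifts from west to east
y = beta_true * x + rng.normal(0, 0.1, n)     # local incidence response (toy)

def gwr_coefficient(target, coords, x, y, bandwidth=2.0):
    """Local slope at `target` via Gaussian-kernel weighted least squares."""
    d = np.linalg.norm(coords - target, axis=1)
    w = np.exp(-0.5 * (d / bandwidth) ** 2)   # weights decay with distance
    X = np.column_stack([np.ones_like(x), x])
    W = np.diag(w)
    beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return beta[1]

west = gwr_coefficient(np.array([1.0, 5.0]), coords, x, y)
east = gwr_coefficient(np.array([9.0, 5.0]), coords, x, y)
```

MGWR extends this by letting each covariate have its own bandwidth, so different processes can operate at different spatial scales.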

    Developing an fMRI paradigm for studying reinforcement learning with gustatory stimuli

    One of the main challenges for global public health in the modern world is the rising prevalence of obesity. Obtaining a better understanding of the dysregulated feeding behaviour that leads to obesity, by investigating the decision-making and learning processes underlying it, could advance our capabilities in battling the obesity epidemic. Consequently, our aim in this study was to design an experiment that could evaluate these processes. We examined ten healthy participants using a modified version of the "probabilistic selection task" (PST). We used gustatory stimuli as a replacement for monetary rewards to assess the effect of nutritional rewards on learning behaviour. We subsequently analysed the behavioural results with computational modelling and combined this with imaging data simultaneously acquired with a functional magnetic resonance imaging (fMRI) multiband sequence. All participants in this study succeeded in interpreting and interacting with the gustatory stimuli appropriately. Performance on the task was affected by the subjective valuation of the reward: participants whose motivation to drink the reward and liking of its taste decreased during the task had difficulty correctly choosing the more rewarding cues. Computational modelling of the behaviour found that the so-called asymmetric learning model, in which positive and negative reinforcement are weighted differently, best explained the group's behaviour. The acquired fMRI data were suboptimal, and we did not detect the expected neural activity in the reward system, which is central to our scientific question. Thus, our study shows it is possible to implement the PST with gustatory stimuli. However, to evaluate the corresponding neural activity, our fMRI configuration requires improvement.
    An optimised system could be used in further studies to improve our understanding of the neurobiological mechanisms of learning that lead to obesity and to elucidate the role of food as a distinctive reinforcer.
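    The asymmetric learning model reported as the best fit can be sketched as a Rescorla-Wagner value update with separate learning rates for positive and negative prediction errors (the learning rates and the 70%-rewarded cue below are illustrative choices, not the study's fitted parameters):

```python
# Asymmetric Rescorla-Wagner update: gains and losses are weighted differently.
import random

def update_value(v, reward, alpha_pos=0.3, alpha_neg=0.1):
    delta = reward - v                        # reward prediction error
    alpha = alpha_pos if delta >= 0 else alpha_neg
    return v + alpha * delta

random.seed(0)
v = 0.0
for _ in range(200):
    # illustrative PST-like schedule: one cue rewarded on 70% of trials
    reward = 1.0 if random.random() < 0.7 else 0.0
    v = update_value(v, reward)
```

With a larger positive than negative learning rate, the learned value settles above the true reward probability, an optimistic bias that is one behavioural signature such models can capture.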

    Data- and expert-driven variable selection for predictive models in healthcare: towards increased interpretability in underdetermined machine learning problems

    Modern data acquisition techniques in healthcare generate large collections of data from multiple sources, such as novel diagnosis and treatment methodologies. Some concrete examples are electronic healthcare record systems, genomics, and medical images. This leads to situations with often unstructured, high-dimensional heterogeneous patient cohort data where classical statistical methods may not be sufficient for optimal utilization of the data and informed decision-making. Instead, investigating such data structures with modern machine learning techniques promises to improve the understanding of patient health issues and may provide a better platform for informed decision-making by clinicians. Key requirements for this purpose include (a) sufficiently accurate predictions and (b) model interpretability. Achieving both aspects in parallel is difficult, particularly for datasets with few patients, which are common in the healthcare domain. In such cases, machine learning models encounter mathematically underdetermined systems and may overfit easily on the training data. An important approach to overcome this issue is feature selection, i.e., determining a subset of informative features from the original set of features with respect to the target variable. While potentially raising the predictive performance, feature selection fosters model interpretability by identifying a low number of relevant model parameters to better understand the underlying biological processes that lead to health issues. Interpretability requires that feature selection is stable, i.e., small changes in the dataset do not lead to changes in the selected feature set. A concept to address instability is ensemble feature selection, i.e. the process of repeating the feature selection multiple times on subsets of samples of the original dataset and aggregating results in a meta-model. 
    This thesis presents two approaches for ensemble feature selection, which are tailored towards high-dimensional data in healthcare: the Repeated Elastic Net Technique for feature selection (RENT) and the User-Guided Bayesian Framework for feature selection (UBayFS). While RENT is purely data-driven and builds upon elastic net regularized models, UBayFS is a general framework for ensembles with the capability to include expert knowledge in the feature selection process via prior weights and side constraints. A case study modeling the overall survival of cancer patients compares these novel feature selectors and demonstrates their potential in clinical practice. Beyond the selection of single features, UBayFS also allows for selecting whole feature groups (feature blocks) that were acquired from multiple data sources, such as those mentioned above. Importance quantification of such feature blocks plays a key role in tracing information about the target variable back to the acquisition modalities. Such information on feature block importance may lead to positive effects on the use of human, technical, and financial resources if systematically integrated into the planning of patient treatment by excluding the acquisition of non-informative features. Since a generalization of feature importance measures to block importance is not trivial, this thesis also investigates and compares approaches for feature block importance rankings. This thesis demonstrates that high-dimensional datasets from multiple data sources in the medical domain can be successfully tackled by the presented approaches for feature selection.
Experimental evaluations demonstrate favorable predictive performance, stability, and interpretability of results, which carries high potential for better data-driven decision support in clinical practice.
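    The ensemble feature selection concept described above can be sketched generically: repeat a regularized fit on random subsamples and keep only features selected in a large fraction of repetitions. This is a minimal data-driven variant for illustration (synthetic data, L1-regularized linear models), not the actual RENT or UBayFS procedures:

```python
# Generic ensemble feature selection: selection frequency across subsamples.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 100, 50                      # few samples, many features (n < p regime)
X = rng.normal(size=(n, p))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 0.5, n)  # 2 true signals

n_repeats = 30
counts = np.zeros(p)
for _ in range(n_repeats):
    idx = rng.choice(n, size=int(0.8 * n), replace=False)   # random subsample
    model = Lasso(alpha=0.2).fit(X[idx], y[idx])            # sparse fit
    counts += model.coef_ != 0                              # record selections

selection_frequency = counts / n_repeats
# Meta-model: keep features selected in at least 90% of repetitions
stable_features = np.where(selection_frequency >= 0.9)[0]
```

The 90% cutoff operationalizes the stability requirement: features that survive many perturbations of the sample set are the ones a clinician can reasonably interpret.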

    Learning permutation symmetries with gips in R

    The study of hidden structures in data presents challenges in modern statistics and machine learning. We introduce the gips package in R, which identifies permutation subgroup symmetries in Gaussian vectors. gips serves two main purposes: exploratory analysis for discovering hidden permutation symmetries and estimation of the covariance matrix under permutation symmetry. It is competitive with canonical methods of dimensionality reduction while providing a new interpretation of the results. gips implements a novel Bayesian model selection procedure for Gaussian vectors invariant under a permutation subgroup, introduced in Graczyk, Ishi, Kołodziejek, Massam, Annals of Statistics, 50 (3) (2022). (36 pages, 11 figures.)
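    The covariance structure gips works with can be sketched in a few lines (this is the underlying linear-algebra idea, not gips's actual R API or its Bayesian model selection): a covariance matrix invariant under a permutation subgroup is obtained by averaging P S Pᵀ over the group's permutation matrices. The cyclic group on four variables below is an assumed example symmetry.

```python
# Projecting a sample covariance onto the cone of matrices invariant
# under a permutation subgroup, by averaging over the group elements.
import numpy as np

def perm_matrix(perm):
    """Permutation matrix P with (P x)[i] = x[perm[i]]."""
    p = len(perm)
    P = np.zeros((p, p))
    P[np.arange(p), perm] = 1.0
    return P

def symmetrize(S, group):
    """Average P S P^T over a permutation group (list of permutations)."""
    return sum(perm_matrix(g) @ S @ perm_matrix(g).T for g in group) / len(group)

rng = np.random.default_rng(0)
S = np.cov(rng.normal(size=(4, 200)))        # sample covariance, p = 4

# cyclic subgroup of S_4 generated by the shift (1, 2, 3, 0)
cyclic = [[0, 1, 2, 3], [1, 2, 3, 0], [2, 3, 0, 1], [3, 0, 1, 2]]
S_sym = symmetrize(S, cyclic)                # invariant (circulant) estimate
```

Because the group average is itself group-invariant, the estimate has far fewer free parameters than a general covariance matrix, which is the source of the dimensionality reduction the abstract mentions.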