66 research outputs found

    Compositional Datasets and the Nested Dirichlet Distribution

    Compositional data is a type of multivariate data in which each component of a vector lies between 0 and 1 and the components sum to 1. For example, the proportion of time that each of 7 mice spends in each of the four quadrants of a circular water maze is between 0 and 1, and the total proportion of time spent in the maze is 1. If there are two sets of mice, one set of normal mice and one set of cognitively impaired mice, the experiment has a two-sample design. Such data is frequently analyzed incorrectly by comparing the two samples via a t-test (or ANOVA for multiple samples) on one component of the vector at a time. This problem is corrected by analyzing compositional datasets using nested Dirichlet distributions, generalized versions of Dirichlet distributions that allow for positive correlations among components. Specifically, we extend a previous result on two-sample comparisons using Dirichlet distributions and nested Dirichlet distributions to multi-sample comparisons. The performance of the new test in terms of type I error rates and power is established using simulation studies. In addition, to use a nested model, an appropriate tree that describes the relationships between components must first be found. An existing data-driven tree-finding algorithm is improved by adding a step that prunes unnecessary nodes using confidence intervals for the differences between parameters at each level of the tree. The tree-finding algorithm and multi-sample test are demonstrated on two datasets.
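
The point about correlations can be illustrated with a small simulation. The sketch below is not taken from the thesis: the concentration parameters, sample size, and seed are hypothetical. It draws compositions from an ordinary Dirichlet distribution and checks that every row sums to 1 and that all pairwise correlations between components are negative, which is the restriction the nested Dirichlet is described as relaxing.

```python
import numpy as np

# Hypothetical concentration parameters for a 4-part composition
# (e.g. time spent in four quadrants); not values from the thesis.
alpha = np.array([2.0, 3.0, 4.0, 5.0])

rng = np.random.default_rng(0)
samples = rng.dirichlet(alpha, size=5000)  # each row is one composition

# Every composition sums to 1 by construction.
row_sums = samples.sum(axis=1)

# Pairwise correlations between components (columns).
corr = np.corrcoef(samples, rowvar=False)
off_diag = corr[~np.eye(len(alpha), dtype=bool)]

print("max deviation of row sums from 1:", np.abs(row_sums - 1).max())
print("largest off-diagonal correlation:", off_diag.max())
```

Because the components are tied together by the sum-to-one constraint, testing one component at a time ignores this dependence, which is the practice the abstract criticises.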

    Benchmarking of differential abundance methods and development of bioinformatics and statistical tools for metagenomics data analysis

    Microbiome and metagenomics data analysis has been the main theme of my PhD programme. The primary goal of this thesis is to move from the observed limitations of differential abundance analysis tools to a benchmark and a framework against which they can be measured and compared. As a secondary goal, the thesis presents several case studies to emphasise the need for sound exploratory and inferential statistical analysis of metabarcoding data. First, I present two closely related studies in which differential abundance analysis methods play the main role. Differential abundance analysis is the principal approach for detecting differences in microbial community composition between groups of samples, and hence a first step towards understanding microbial community structures and the relationships between microbial compositions and the environment. I start with a benchmarking study in which differential abundance analysis methods, including some from other domains (e.g., RNA-Seq and single-cell RNA-Seq), were applied to a collection of microbiome datasets to evaluate their performance. I then present a software package that I developed from the results of that research. The package, written in R, is currently available on Bioconductor (i.e., an open-source software platform for analysing and visualising biological data). It allows users to replicate the benchmarking of differential abundance analysis methods and evaluate their performance on their own datasets. Second, I highlight the challenges of microbiome data analysis through two case studies on the human microbiome, its composition, and its dynamics in both healthy and disease states. In the first study, healthy volunteers were treated with a probiotic mixture, and changes in the gut microbiome were studied in conjunction with some psychological aspects. Careful data exploration, clustering, and linear mixed-effects regression models unveiled a strong subject-specific effect and the presence of different baseline bacteriotypes, which masked the overall effect of the probiotic treatment. In the second study, I show how disease-related microbial biomarkers for eosinophilic oesophagitis (i.e., a chronic immune-mediated inflammatory disease of the oesophagus that causes dysphagia, food impaction, and oesophageal strictures) were identified from saliva samples. Despite the small sample size, it was possible to train a model that discriminates between cases and controls with good accuracy. While still preliminary, this result is a promising step towards the non-invasive diagnosis of eosinophilic oesophagitis, which is currently possible only through oesophageal biopsy.

    Non-parametric machine learning for biological sequence data

    In the past decade there has been a massive increase in the volume of biological sequence data, driven by massively parallel sequencing technologies. This has enabled data-driven statistical analyses using non-parametric predictive models (including those from machine learning) to complement more traditional, hypothesis-driven approaches. This thesis addresses several challenges that arise when applying non-parametric predictive models to biological sequence data. Some of these challenges arise due to the nature of the biological system of interest. For example, in the study of the human microbiome the phylogenetic relationships between microorganisms are often ignored in statistical analyses. This thesis outlines a novel approach to modelling phylogenetic similarity using string kernels and demonstrates its utility in the two-sample test and host-trait prediction. Other challenges arise from limitations in our understanding of the models themselves. For example, calculating variable importance (a key task in biomedical applications) is not possible for many models. This thesis describes a novel extension of an existing approach to compute importance scores for grouped variables in a Bayesian neural network. It also explores the behaviour of random forest classifiers when applied to microbial datasets, with a focus on the robustness of the biological findings under different modelling assumptions.
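
To give a flavour of what a string kernel computes, here is a minimal sketch of the k-mer spectrum kernel, one common member of that family. The abstract does not say which string kernel the thesis uses, so this choice, along with the function names and the default `k`, is an assumption made purely for illustration.

```python
from collections import Counter


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every overlapping substring of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(a: str, b: str, k: int = 3) -> int:
    """Inner product of the k-mer count vectors of two sequences.

    Illustrative only: the thesis may use a different string kernel.
    """
    ca, cb = kmer_counts(a, k), kmer_counts(b, k)
    # Only k-mers present in both sequences contribute to the sum.
    return sum(ca[kmer] * cb[kmer] for kmer in ca.keys() & cb.keys())


print(spectrum_kernel("ACGTACGT", "ACGTTTGT", k=3))
```

Two sequences that share many k-mers get a large kernel value, so the kernel acts as a similarity score that kernel-based two-sample tests and predictors can use directly.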

    The nature of gut microbiota in early life: origin and impact of pioneer species

    During early childhood, a complex ecosystem of thousands of species of microorganisms develops in our gastrointestinal tract. Disruptions in this development can have lifelong consequences and possibly increase the risk of diseases such as allergies and asthma. In this thesis, the influence of environmental and host factors on the early development of the microbiota was investigated. In addition, the microbiota composition of children in two birth cohorts was studied in relation to the development of allergies and asthma. The oral administration of specific beneficial bacteria, called probiotics, may be a way to manipulate the gut microbiota in a targeted manner to achieve health benefits. The final study examined how administering different types of probiotics, consisting of lactobacilli and bifidobacteria, affects the microbiota and health of premature infants. This thesis highlights that early childhood is a critical period in which targeted manipulation of the gut microbiota is possible in order to promote a healthy future and prevent the development of allergies and asthma.

    Development of early life gut resistome and mobilome across gestational ages and microbiota-modifying treatments

    Background: Gestational age (GA) and the associated level of gastrointestinal tract maturation are major factors driving the initial gut microbiota composition in preterm infants. In addition, compared to term infants, premature infants often receive antibiotics to treat infections and probiotics to restore optimal gut microbiota. How GA, antibiotics, and probiotics modulate the microbiota's core characteristics, the gut resistome and mobilome, remains poorly understood. Methods: We analysed metagenomic data from a longitudinal observational study in six Norwegian neonatal intensive care units to describe the bacterial microbiota of infants of varying GA and receiving different treatments. The cohort consisted of probiotic-supplemented and antibiotic-exposed extremely preterm infants (n = 29), antibiotic-exposed very preterm (n = 25), antibiotic-unexposed very preterm (n = 8), and antibiotic-unexposed full-term (n = 10) infants. Stool samples were collected on days of life 7, 28, 120, and 365, and DNA extraction was followed by shotgun metagenome sequencing and bioinformatic analysis. Findings: The top predictors of microbiota maturation were hospitalisation length and GA. Probiotic administration rendered the gut microbiota and resistome of extremely preterm infants more similar to those of term infants on day 7 and ameliorated GA-driven loss of microbiota interconnectivity and stability. GA, hospitalisation, and both microbiota-modifying treatments (antibiotics and probiotics) contributed to an elevated carriage of mobile genetic elements in preterm infants compared to term controls. Finally, Escherichia coli was associated with the highest number of antibiotic-resistance genes, followed by Klebsiella pneumoniae and Klebsiella aerogenes. Interpretation: Prolonged hospitalisation, antibiotics, and probiotic intervention contribute to dynamic alterations in the resistome and mobilome, gut microbiota characteristics relevant to infection risk.
Funding: Odd-Berg Group, Northern Norway Regional Health Authority.

    Multi-level analysis of the gut-brain axis shows autism spectrum disorder-associated molecular and microbial profiles

    Autism spectrum disorder (ASD) is a neurodevelopmental disorder characterized by heterogeneous cognitive, behavioral and communication impairments. Disruption of the gut-brain axis (GBA) has been implicated in ASD, although with limited reproducibility across studies. In this study, we developed a Bayesian differential ranking algorithm to identify ASD-associated molecular and taxa profiles across 10 cross-sectional microbiome datasets and 15 other datasets, including dietary patterns, metabolomics, cytokine profiles and human brain gene expression profiles. We found a functional architecture along the GBA that correlates with heterogeneity of ASD phenotypes. It is characterized by ASD-associated amino acid, carbohydrate and lipid profiles predominantly encoded by microbial species in the genera Prevotella, Bifidobacterium, Desulfovibrio and Bacteroides, and it correlates with brain gene expression changes, restrictive dietary patterns and pro-inflammatory cytokine profiles. The functional architecture revealed in age-matched and sex-matched cohorts is not present in sibling-matched cohorts. We also show a strong association between temporal changes in microbiome composition and ASD phenotypes. In summary, we propose a framework to leverage multi-omic datasets from well-defined cohorts and investigate how the GBA influences ASD.

    Machine learning approaches in microbiome research: challenges and best practices

    Predictive analysis of microbiome data within a machine learning (ML) workflow presents numerous domain-specific challenges involving preprocessing, feature selection, predictive modeling, performance estimation, model interpretation, and the extraction of biological information from the results. To assist decision-making, we offer a set of recommendations on algorithm selection, pipeline creation and evaluation, stemming from the COST Action ML4Microbiome. We compared the suggested approaches on a multi-cohort shotgun metagenomics dataset of colorectal cancer patients, focusing on their performance in disease diagnosis and biomarker discovery. We demonstrate that the use of compositional transformations and filtering methods as part of data preprocessing does not always improve the predictive performance of a model. In contrast, multivariate feature selection, such as the Statistically Equivalent Signatures algorithm, was effective in reducing the classification error. When validated on a separate test dataset, this algorithm, in combination with random forest modeling, provided the most accurate performance estimates. Lastly, we showed how linear modeling by logistic regression coupled with visualization techniques such as Individual Conditional Expectation (ICE) plots can yield interpretable results and offer biological insights. These findings are significant for clinicians and non-experts alike in translational applications.
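
As an illustration of what a compositional transformation looks like in a preprocessing step, the sketch below implements the centered log-ratio (CLR) transform. The abstract does not name the transformations that were compared, so CLR is an assumed example; the `clr` helper, the pseudocount of 1, and the toy count table are likewise hypothetical.

```python
import numpy as np


def clr(counts, pseudocount=1.0):
    """Centered log-ratio transform applied to each row (sample).

    A pseudocount is added first because log(0) is undefined and
    microbiome count tables usually contain many zeros.
    """
    x = np.asarray(counts, dtype=float) + pseudocount
    log_x = np.log(x)
    # Subtracting the row mean of the logs equals dividing by the
    # geometric mean of the row before taking logs.
    return log_x - log_x.mean(axis=1, keepdims=True)


# Toy table: 2 samples (rows) x 3 taxa (columns); invented values.
counts = np.array([[10, 0, 90],
                   [5, 5, 5]])
transformed = clr(counts)
print(transformed)
```

Each transformed row sums to zero, which removes the dependence on sequencing depth. As the abstract notes, such a step does not always improve predictive performance, so it is worth evaluating with and without it.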