5 research outputs found

    BowSaw: inferring higher-order trait interactions associated with complex biological phenotypes

    Machine learning is helping the interpretation of biological complexity by enabling the inference and classification of cellular, organismal and ecological phenotypes based on large datasets, e.g. from genomic, transcriptomic and metagenomic analyses. A number of available algorithms can help search these datasets to uncover patterns associated with specific traits, including disease-related attributes. While, in many instances, treating an algorithm as a black box is sufficient, it is interesting to pursue an enhanced understanding of how system variables end up contributing to a specific output, as an avenue towards new mechanistic insight. Here we address this challenge through a suite of algorithms, named BowSaw, which takes advantage of the structure of a trained random forest algorithm to identify combinations of variables (“rules”) frequently used for classification. We first apply BowSaw to a simulated dataset, and show that the algorithm can accurately recover the sets of variables used to generate the phenotypes through complex Boolean rules, even under challenging noise levels. We next apply our method to data from the integrative Human Microbiome Project and find previously unreported high-order combinations of microbial taxa putatively associated with Crohn’s disease. By leveraging the structure of trees within a random forest, BowSaw provides a new way of using decision trees to generate testable biological hypotheses. Accepted manuscript.

    Machine learning for microbial ecology: predicting interactions and identifying their putative mechanisms

    Microbial communities are key components of Earth’s ecosystems and they play important roles in human health and industrial processes. These communities and their functions can strongly depend on the diverse interactions between constituent species, posing the question of how such interactions can be predicted, measured and controlled. This challenge is particularly relevant for the many practical applications enabled by the rising field of synthetic microbial ecology, which includes the design of microbiome therapies for human diseases. Advances in sequencing technologies and genomic databases provide valuable datasets and tools for studying inter-microbial interactions, but the capacity to characterize the strength and mechanisms of interactions between species in large consortia is still an unsolved challenge. In this thesis, I show how machine learning methods can be used to help address these questions. The first portion of my thesis work was focused on predicting the outcome of pairwise interactions between microbial species. By integrating genomic information and observed experimental data, I used machine learning algorithms to explore the predictive relationship between single-species traits and inter-species interaction phenotypes. I found that organismal traits (e.g. annotated functions of genomic elements) are sufficient to predict the qualitative outcome of interactions between microbes. I also found that the relative fraction of possible experiments needed to build acceptable models drastically shrinks as the combinatorial space grows. In the second part of my thesis work, I developed an algorithmic method for identifying putative interaction mechanisms by scoring combinations of variables that random forest uses in order to predict interaction outcomes. I applied this method to a study of the human microbiome and identified a previously unreported combination of microbes that are strongly associated with Crohn’s disease. 
    In the last part of my thesis, I utilized a regression approach to first identify and then quantify interactions between microbial species relevant to community function. The work I present in this dissertation provides a general framework for understanding the myriad interactions that occur in natural and synthetic microbial consortia.

    Discovering Higher-order SNP Interactions in High-dimensional Genomic Data

    In this thesis, a multifactor dimensionality reduction-based method built on associative classification is employed to identify higher-order SNP interactions, with the aim of improving understanding of the genetic architecture of complex diseases. Further, this thesis explores the application of deep learning techniques, which provide new clues for interaction analysis. The performance of the deep learning method is maximized by unifying deep neural networks with a random forest to identify reliable interactions in the presence of noise.

    Machine Learning Approaches for the Prioritisation of Cardiovascular Disease Genes Following Genome-wide Association Study

    Genome-wide association studies (GWAS) have revealed thousands of genetic loci, establishing themselves as a valuable method for unravelling the complex biology of many diseases. Even as GWAS have grown in size and improved in study design to detect effects, identifying real causal signals and disentangling them from other highly correlated markers associated by linkage disequilibrium (LD) remains challenging. This has severely limited GWAS findings and brought the method’s value into question. Although thousands of disease susceptibility loci have been reported, causal variants and genes at these loci remain elusive. Post-GWAS analysis aims to dissect the heterogeneity of variant and gene signals. In recent years, machine learning (ML) models have been developed for post-GWAS prioritisation. These models range from logistic regression to more complex ensemble models such as random forests and gradient boosting, as well as deep learning models (i.e., neural networks). When combined with functional validation, these methods have yielded important translational insights, providing a strong evidence-based approach to direct post-GWAS research. However, ML approaches are in their infancy across biological applications, and as they continue to evolve, an evaluation of their robustness for GWAS prioritisation is needed. Here, I investigate the landscape of ML across selected models, input features, bias risk, and output model performance, with a focus on building a prioritisation framework that is applied to blood pressure GWAS results and tested on re-application to blood lipid traits.

    Genome-wide analyses to identify biomarkers of toxicity in the treatment of advanced colorectal cancer

    Background: Chemotherapies administered at normal therapeutic dosages can cause significant side-effects and may result in early treatment discontinuation. Inter-individual variation in toxicity highlights the need for biomarkers to personalise treatment. Inherited genetic variants are increasingly being recognised to cause chemotherapy-induced toxicity. Aim: I sought such biomarkers by conducting genome-wide association studies, together with gene and gene set analyses, for ten toxicities in 1800 patients with advanced colorectal cancer (CRC) treated with oxaliplatin and fluoropyrimidine chemotherapy ± cetuximab. Materials and Methods: Patients were from the MRC COIN and COIN-B trials. 385 received folinic acid, fluorouracil and oxaliplatin (FOLFOX), 360 FOLFOX + cetuximab, 707 capecitabine and oxaliplatin (XELOX) and 348 XELOX + cetuximab. Common and low-frequency single nucleotide polymorphisms (SNPs), genes and gene sets that reached genome-wide or suggestive significance were replicated in independent patient groups, clinical trial cohorts and participants from the UK Biobank and Genomics England. Meta-analyses were also performed to increase power. Results: rs13260246 at 8q21.13 was significantly associated with vomiting in patients treated with XELOX (Odds Ratio [OR]=5.0, 95% Confidence Interval [CI]=3.0-8.3, P=9.8×10⁻¹⁰) but failed independent replication. SNPs at 139 loci had suggestive associations for toxicities and lead SNPs at five were replicated. rs6783836 in ST6GAL1 was associated with hand-foot syndrome (HFS) in patients treated with XELOX (OR=3.1, 95% CI=2.1-4.6, P=4.3×10⁻⁸) and ST6GAL1 was associated with type-2 diabetes (a risk factor for HFS). A low-frequency nonsynonymous variant in the antigen processing 1 signature region was suggestive of an association with sepsis (OR=6.1, 95% CI=3.0-12.8, P=1.2×10⁻⁶). rs4760830 in TRHDE was associated with diarrhoea in patients treated with capecitabine (OR=0.6, 95% CI=0.50-0.72, P=4.8×10⁻⁸).
    In MAGMA gene analyses, MROH5 was significantly associated with neutropenia (P=6.6×10⁻⁷) and was independently replicated.