77 research outputs found

    Efficient Learning and Evaluation of Complex Concepts in Inductive Logic Programming

    No full text
    Inductive Logic Programming (ILP) is a subfield of Machine Learning with foundations in logic programming. In ILP, logic programming, a subset of first-order logic, is used as a uniform representation language for the problem specification and induced theories. ILP has been successfully applied to many real-world problems, especially in the biological domain (e.g. drug design, protein structure prediction), where relational information is of particular importance. The expressiveness of logic programs grants flexibility in specifying the learning task and understandability to the induced theories. However, this flexibility comes at a high computational cost, constraining the applicability of ILP systems. Constructing and evaluating complex concepts remain two of the main issues that prevent ILP systems from tackling many learning problems. These learning problems are interesting both from a research perspective, as they raise the standards for ILP systems, and from an application perspective, where these target concepts naturally occur in many real-world applications. Such complex concepts cannot be constructed or evaluated by parallelizing existing top-down ILP systems or improving the underlying Prolog engine. Novel search strategies and cover algorithms are needed. The main focus of this thesis is on how to efficiently construct and evaluate complex hypotheses in an ILP setting. In order to construct such hypotheses, we investigate two approaches. The first, the Top Directed Hypothesis Derivation framework, implemented in the ILP system TopLog, involves the use of a top theory to constrain the hypothesis space. In the second approach we revisit the bottom-up search strategy of Golem, lifting its restriction on determinate clauses, which had rendered Golem inapplicable to many key areas. These developments led to the bottom-up ILP system ProGolem. A challenge that arises with a bottom-up approach is the coverage computation of long, non-determinate clauses.
Prolog’s SLD-resolution is no longer adequate. We developed a new Prolog-based theta-subsumption engine which is significantly more efficient than SLD-resolution in computing the coverage of such complex clauses. We provide evidence that ProGolem achieves the goal of learning complex concepts by presenting a protein-hexose binding prediction application. The theory ProGolem induced has statistically significantly better predictive accuracy than that of other learners. More importantly, the biological insights ProGolem’s theory provided were judged by domain experts to be relevant and, in some cases, novel.
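The coverage test mentioned above hinges on theta-subsumption: a clause C subsumes a clause D if some substitution theta maps C into a subset of D. A minimal brute-force sketch (not the efficient engine the thesis describes; predicate names and the tuple encoding are invented for illustration):

```python
from itertools import product

# Toy theta-subsumption check. A literal is a tuple (predicate, arg1, ...);
# uppercase strings are variables, lowercase strings are constants.

def is_var(term):
    return isinstance(term, str) and term[:1].isupper()

def subsumes(clause_c, clause_d):
    """True if some substitution theta maps clause_c into a subset of clause_d."""
    vars_c = sorted({a for lit in clause_c for a in lit[1:] if is_var(a)})
    consts = sorted({a for lit in clause_d for a in lit[1:]})
    # Try every assignment of constants to the variables of C (exponential;
    # real engines prune this search aggressively).
    for binding in product(consts, repeat=len(vars_c)):
        theta = dict(zip(vars_c, binding))
        mapped = {(lit[0],) + tuple(theta.get(a, a) for a in lit[1:])
                  for lit in clause_c}
        if mapped <= set(clause_d):
            return True
    return False

# bond(X, Y) subsumes {bond(a, b), atom(a)} via theta = {X -> a, Y -> b}
print(subsumes([("bond", "X", "Y")],
               [("bond", "a", "b"), ("atom", "a")]))
```

The exponential enumeration is exactly why naive coverage computation breaks down on long, non-determinate clauses, motivating a dedicated engine.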

    A NEW ILP SYSTEM FOR MODEL TRANSFORMATION BY EXAMPLES

    Get PDF

    Statistical Relational Learning for Proteomics: Function, Interactions and Evolution

    Get PDF
    In recent years, the field of Statistical Relational Learning (SRL) [1, 2] has produced new, powerful learning methods that are explicitly designed to solve complex problems, such as collective classification, multi-task learning and structured output prediction, while natively handling relational data, noise, and partial information. Statistical-relational methods rely on some first-order logic as a general, expressive formal language to encode both the data instances and the relations or constraints between them. The latter encode background knowledge on the problem domain, and are used to restrict or bias the model search space according to the instructions of domain experts. The new tools developed within SRL make it possible to revisit old computational biology problems in a less ad hoc fashion, and to tackle novel, more complex ones. Motivated by these developments, in this thesis we describe and discuss the application of SRL to three important biological problems, highlighting the advantages, discussing the trade-offs, and pointing out the open problems. In particular, in Chapter 3 we show how to jointly improve the outputs of multiple correlated predictors of protein features by means of a very general probabilistic-logical consistency layer. The logical layer, based on grounding-specific Markov Logic networks [3], enforces a set of weighted first-order rules encoding biologically motivated constraints between the predictions. The refiner then improves the raw predictions so that they least violate the constraints. Contrary to canonical methods for the prediction of protein features, which typically take predicted correlated features as inputs to improve the output post facto, our method can jointly refine all predictions together, with potential gains in overall consistency.
In order to showcase our method, we integrate three stand-alone predictors of correlated features, namely subcellular localization (Loctree [4]), disulfide bonding state (Disulfind [5]), and metal bonding state (MetalDetector [6]), in a way that takes into account their respective strengths and weaknesses. The experimental results show that the refiner can improve the performance of the underlying predictors by removing rule violations. In addition, the proposed method is fully general, and could in principle be applied to an array of heterogeneous predictions without requiring any change to the underlying software. In Chapter 4 we consider the multi-level protein–protein interaction (PPI) prediction problem. In general, PPIs can be seen as a hierarchical process occurring at three related levels: proteins bind by means of specific domains, which in turn form interfaces through patches of residues. Detailed knowledge about which domains and residues are involved in a given interaction has extensive applications to biology, including a better understanding of the binding process and more efficient drug/enzyme design. We cast the prediction problem in terms of multi-task learning, with one task per level (proteins, domains and residues), and propose a machine learning method that collectively infers the binding state of all object pairs, at all levels, concurrently. Our method is based on Semantic Based Regularization (SBR) [7], a flexible and theoretically sound SRL framework that employs first-order logic constraints to tie the learning tasks together. Contrary to most current PPI prediction methods, which neither identify which regions of a protein actually instantiate an interaction nor leverage the hierarchy of predictions, our method resolves the prediction problem down to the residue level, enforces consistent predictions between the hierarchy levels, and fruitfully exploits the hierarchical nature of the problem.
We present numerical results showing that our method substantially outperforms the baseline in several experimental settings, indicating that our multi-level formulation can indeed lead to better predictions. Finally, in Chapter 5 we consider the problem of predicting drug-resistant protein mutations through a combination of Inductive Logic Programming [8, 9] and Statistical Relational Learning. In particular, we focus on viral proteins: viruses are typically characterized by high mutation rates, which allow them to quickly develop drug-resistant mutations. Mining relevant rules from mutation data can be extremely useful for understanding the virus adaptation mechanism and for designing drugs that effectively counter potentially resistant mutants. We propose a simple approach for mutant prediction where the input consists of mutation data with drug-resistance information, either as sets of mutations conferring resistance to a certain drug, or as sets of mutants with information on their susceptibility to the drug. The algorithm learns a set of relational rules characterizing drug resistance, and uses them to generate a set of potentially resistant mutants. Learning a weighted combination of rules makes it possible to attach a resistance score, as predicted by the statistical relational model, to each generated mutant and to select only the highest-scoring ones. Promising results were obtained in generating resistant mutations for both nucleoside and non-nucleoside HIV reverse transcriptase inhibitors. The approach can be generalized quite easily to learning mutants characterized by more complex rules correlating multiple mutations.
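The weighted-rule scoring step described above can be sketched as follows. The rule bodies, weights, and mutation labels here are invented for illustration, not taken from the thesis:

```python
# Hypothetical weighted relational rules for drug resistance.
# Each rule fires if all mutations in its body are present in the mutant;
# a mutant's resistance score is the sum of the weights of fired rules.
rules = [
    (1.5, {"M41L", "T215Y"}),  # invented rule: a pair of mutations together
    (0.8, {"K103N"}),          # invented rule: a single mutation
    (0.4, {"M41L"}),
]

def resistance_score(mutant):
    """Sum the weights of all rules whose body is contained in the mutant."""
    return sum(weight for weight, body in rules if body <= mutant)

# Rank invented candidate mutants and keep the highest-scoring ones.
candidates = [{"M41L", "T215Y"}, {"K103N"}, {"L100I"}]
ranked = sorted(candidates, key=resistance_score, reverse=True)
print([sorted(m) for m in ranked])
```

In the statistical relational setting the weights would be learned from the resistance data rather than fixed by hand, but the scoring and selection logic is the same.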

    Engineering Systems of Anti-Repressors for Next-Generation Transcriptional Programming

    Get PDF
    The ability to control gene expression in more precise, complex, and robust ways is becoming increasingly relevant in biotechnology and medicine. Synthetic biology has sought to accomplish such higher-order gene regulation through the engineering of synthetic gene circuits, whereby a gene’s expression can be controlled via environmental, temporal, or cellular cues. A typical approach to gene regulation is transcriptional control using allosteric transcription factors (TFs). TFs are regulatory proteins that interact with operator DNA elements located in proximity to gene promoters to either compromise or activate transcription. For many TFs, including the ones discussed here, this interaction is modulated by binding to a small-molecule ligand for which the TF evolved natural specificity and a related metabolism. This modulation can occur with two main phenotypes: a TF shows the repressor (X+) phenotype if binding the ligand causes it to dissociate from the DNA, allowing transcription, while a TF shows the anti-repressor (XA) phenotype if binding the ligand causes it to associate with the DNA, preventing transcription. While both functional phenotypes are vital components of regulatory gene networks, anti-repressors are quite rare in nature compared to repressors and thus must be engineered. We first developed a generalized workflow for engineering systems of anti-repressors from bacterial TFs in a family of transcription factors related to the ubiquitous lactose repressor (LacI), the LacI/GalR family. Using this workflow, which is based on re-routing the TF’s allosteric network, we engineered anti-repressors in the fructose repressor (anti-FruR, responsive to fructose-1,6-phosphate) and ribose repressor (anti-RbsR, responsive to D-ribose) scaffolds, to complement XA TFs engineered previously in the LacI scaffold (anti-LacI, responsive to IPTG). Engineered TFs were then conferred with alternate DNA-binding specificities.
To demonstrate their utility in synthetic gene circuits, systems of engineered TFs were then deployed to construct transcriptional programs, achieving all of the NOT-oriented Boolean logical operations – NOT, NOR, NAND, and XNOR – in addition to BUFFER and AND. Notably, our gene circuits built using anti-repressors are far simpler in design and therefore exert a decreased burden on the chassis cells compared to the state of the art, as anti-repressors represent compressed logical operations (gates). Further, we extended this workflow to engineer ligand specificity in addition to regulatory phenotype. Performing the engineering workflow with a fourth member of the LacI/GalR family, the galactose isorepressor (GalS, naturally responsive to D-fucose), we engineered IPTG-responsive repressor and anti-repressor GalS mutants in addition to a D-fucose-responsive anti-GalS TF. These engineered TFs were then used to create BANDPASS and BANDSTOP biological signal-processing filters, themselves compressed compared to the state of the art, and open-loop control systems. These provided facile methods for dynamically turning genes ‘ON’ and ‘OFF’ during continuous growth in real time. This presents a general advance in gene regulation, moving beyond simple inducible promoters. We then demonstrated the capabilities of our engineered TFs to function in combinatorial logic using a layered-logic approach, which currently stands as the state of the art. Using our anti-repressors in layered logic had the advantage of reducing cellular metabolic burden, as we were able to create the fundamental NOT/NOR operations with fewer genetic parts. Additionally, we created more TFs for use in layered-logic approaches to prevent cellular cross-talk and minimize the number of TFs necessary to create these gene circuits. Here we demonstrated the successful deployment of our XA-built NOR gate system to create the BUFFER, NOT, NOR, OR, AND, and NAND gates.
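The two phenotypes map cleanly onto Boolean gates, which is what makes anti-repressors "compressed" logic. A deliberately binary abstraction (real circuits produce graded expression; the mapping below is a sketch, not the thesis's model):

```python
# Boolean abstraction of the two TF phenotypes.
# Input: is the ligand present? Output: is the gene transcribed?

def repressor_gate(ligand):
    """X+ phenotype: ligand releases the TF from DNA -> BUFFER (output = input)."""
    return ligand

def anti_repressor_gate(ligand):
    """XA phenotype: ligand drives the TF onto DNA -> NOT (output = not input)."""
    return not ligand

def nor_gate(a, b):
    """Two anti-repressors on one promoter: either ligand shuts the gene off,
    so the gene is on only when both inputs are absent (NOR in one stage)."""
    return anti_repressor_gate(a) and anti_repressor_gate(b)

# Truth table for the single-promoter NOR.
for a in (False, True):
    for b in (False, True):
        print(int(a), int(b), int(nor_gate(a, b)))
```

With a plain repressor, building NOR requires an extra inverting stage; the anti-repressor collapses that stage into the TF itself, which is the burden reduction the abstract describes.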
The work presented here describes a workflow for engineering (i) allosteric phenotype, (ii) ligand selectivity, and (iii) DNA specificity in allosteric transcription factors. The products of the workflow themselves serve as vital tools for the construction of next-generation synthetic gene circuits and genetic regulatory devices. Further, from the products of the workflow presented here, certain design heuristics can be gleaned, which should better facilitate the design of allosteric TFs in the future, moving toward a semi-rational engineering approach. Additionally, the work presented here outlines a transcriptional programming structure and metrology which can be broadly adapted and scaled for future applications and expansion. Consequently, this thesis presents a means for advanced control of gene expression, with promise to have long-reaching implications in the future. (Ph.D. thesis)

    A systems biology understanding of protein constraints in the metabolism of budding yeasts

    Get PDF
    Fermentation technologies, such as bread making and the production of alcoholic beverages, have been crucial for the development of humanity throughout history. Saccharomyces cerevisiae provides a natural platform for this, due to its ability to transform sugars into ethanol. This and other yeasts are now used for the production of pharmaceuticals, including insulin and artemisinic acid, as well as flavors, fragrances, nutraceuticals, and fuel precursors. In this thesis, different systems biology methods were developed to study the interactions between metabolism, enzymatic capabilities, and regulation of gene expression in budding yeasts. In Paper I, a study of three different yeast species (S. cerevisiae, Yarrowia lipolytica and Kluyveromyces marxianus), exposed to multiple conditions, was carried out to understand their adaptation to environmental stress. Paper II reviews the use of genome-scale metabolic models (GEMs) for the study and directed engineering of diverse yeast species; additionally, 45 GEMs for different yeasts were collected, analyzed, and tested. In Paper III, GECKO 2.0, a toolbox for the integration of enzymatic constraints and proteomics data into GEMs, was developed and used for the reconstruction of enzyme-constrained models (ecGEMs) for three yeast species and model organisms. Proteomics data and ecGEMs were used to further characterize the impact of environmental stress on the metabolism of budding yeasts. In Paper IV, gene engineering targets for increased accumulation of heme in S. cerevisiae cells were predicted with an ecGEM. Predictions were experimentally validated, yielding a 70-fold increase in intracellular heme. The prediction method was systematized and applied to the production of 102 chemicals in S. cerevisiae (Paper V). The results highlighted general principles for systems metabolic engineering and enabled an understanding of the role of protein limitations in bio-based chemical production.
Paper VI presents a hybrid model integrating an enzyme-constrained metabolic network with a gene regulatory model of nutrient-sensing mechanisms in S. cerevisiae. This model improves the prediction of protein expression patterns while providing a rational connection between metabolism and the use of nutrients from the environment. This thesis demonstrates that integrating multiple systems biology approaches is valuable for understanding the connections within cell physiology at different levels, and provides tools for the directed engineering of cells for the benefit of society.
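The core idea behind enzyme-constrained models of this kind is that each flux v is capped by the turnover number and abundance of its enzyme (v ≤ kcat · e), and enzyme abundances compete for a shared protein pool. A toy single-reaction illustration (all numbers invented; real ecGEMs solve this jointly over thousands of reactions):

```python
# Toy enzyme constraint: v <= kcat * e, with mw * e <= pool.
# Numbers are invented for illustration only.
kcat = {"r1": 10.0, "r2": 2.0}   # per-enzyme turnover numbers (1/s)
mw   = {"r1": 50.0, "r2": 25.0}  # enzyme molecular weights (kDa)
pool = 100.0                     # total protein budget (arbitrary units)

def max_flux(reaction):
    """Upper bound on flux if the whole protein pool were spent on this
    one enzyme: from mw * e <= pool, v_max = kcat * pool / mw."""
    return kcat[reaction] * pool / mw[reaction]

print(max_flux("r1"))  # 10 * 100 / 50 = 20.0
print(max_flux("r2"))  # 2 * 100 / 25 = 8.0
```

Even this toy version shows why a low-kcat, heavy enzyme throttles its pathway: the protein pool, not stoichiometry, becomes the binding constraint, which is what makes such models useful for predicting engineering targets.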

    Exploring genomic medicine using integrative biology

    Get PDF
    Thesis (Ph.D.), Harvard-MIT Division of Health Sciences and Technology, 2004. Includes bibliographical references (p. 215-227). Instead of focusing on the cell, or the genotype, or on any single measurement modality, using integrative biology allows us to think holistically and horizontally. A disease like diabetes can lead to myocardial infarction, nephropathy, and neuropathy; to study diabetes in genomic medicine would require reasoning from a disease to all its various complications to the genome and back. I am studying the process of intersecting nearly comprehensive data sets in molecular biology, across three representative modalities (microarrays, RNAi and quantitative trait loci) out of the more than 30 available today. This is difficult because the semantics and context of each experiment performed become more important, necessitating detailed knowledge about the biological domain. I addressed this problem by using all public microarray data from the NIH, unifying 50 million expression measurements with standard gene identifiers and representing the experimental context of each using the Unified Medical Language System, a vocabulary of over 1 million concepts. I created an automated system to join data sets related by experimental context. I evaluated this system by finding genes significantly involved in multiple experiments directly and indirectly related to diabetes and adipogenesis, and found genes known to be involved in these diseases and processes. As a model first step into integrative biology, I then took known quantitative trait loci in the rat involved in glucose metabolism and built an expert system to explain possible biological mechanisms for these genetic data using the modeled genomic data. The system I have created can link diseases from the ICD-9 billing-code level down to the genetic, genomic, and molecular level.
In a sense, this is the first automated system built to study the new field of genomic medicine. (Ph.D. thesis by Atul Janardhan Butte.)
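The context-driven join described above can be sketched as: annotate each dataset with controlled-vocabulary concepts, then intersect the significant genes of all datasets sharing a concept. The dataset identifiers, concepts, and gene lists below are invented for illustration:

```python
# Hypothetical datasets annotated with controlled-vocabulary concepts
# (standing in for UMLS terms) and their significant genes.
datasets = {
    "GSE_A": {"concepts": {"diabetes", "adipogenesis"},
              "genes": {"PPARG", "LEP", "INS"}},
    "GSE_B": {"concepts": {"diabetes"},
              "genes": {"PPARG", "INS", "TNF"}},
    "GSE_C": {"concepts": {"neuropathy"},
              "genes": {"NGF"}},
}

def genes_for_concept(concept):
    """Join datasets by shared experimental context: intersect the
    significant genes of every dataset annotated with the concept."""
    related = [d["genes"] for d in datasets.values()
               if concept in d["concepts"]]
    return set.intersection(*related) if related else set()

print(sorted(genes_for_concept("diabetes")))  # ['INS', 'PPARG']
```

The hard part the thesis addresses is upstream of this join: mapping free-text experiment descriptions onto a shared concept vocabulary so that "related by context" is computable at all.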