386 research outputs found

    Interpretable convolutional neural networks for effective translation initiation site prediction

    Thanks to rapidly evolving sequencing techniques, the amount of genomic data at our disposal is growing increasingly large. Determining the gene structure is a fundamental requirement to effectively interpret gene function and regulation. An important part in that determination process is the identification of translation initiation sites. In this paper, we propose a novel approach for automatic prediction of translation initiation sites, leveraging convolutional neural networks that allow for automatic feature extraction. Our experimental results demonstrate that we are able to improve the state-of-the-art approaches with a decrease of 75.2% in false positive rate and with a decrease of 24.5% in error rate on chosen datasets. Furthermore, an in-depth analysis of the decision-making process used by our predictive model shows that our neural network implicitly learns biologically relevant features from scratch, without any prior knowledge about the problem at hand, such as the Kozak consensus sequence, the influence of stop and start codons in the sequence and the presence of donor splice site patterns. In summary, our findings yield a better understanding of the internal reasoning of a convolutional neural network when applying such a neural network to genomic data
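The idea of a convolutional filter responding to a sequence motif can be illustrated with a minimal sketch. It is not the model from the paper: in a trained network the filter weights are learned from data, whereas here a single filter is set by hand to match the start codon ATG. The names `one_hot`, `conv1d`, `ATG_FILTER` and `candidate_start_sites` are hypothetical and chosen only for this illustration.

```python
# Illustrative sketch only; the filter is hand-set, not learned.
BASES = "ACGT"


def one_hot(seq):
    """Encode a DNA string as rows of 4 values in the order A, C, G, T.

    Characters outside ACGT (for example N) become an all-zero row.
    """
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]


def conv1d(encoded, kernel):
    """Slide one filter over the encoded sequence without padding."""
    width = len(kernel)
    scores = []
    for start in range(len(encoded) - width + 1):
        total = 0.0
        for offset in range(width):
            for channel in range(4):
                total += encoded[start + offset][channel] * kernel[offset][channel]
        scores.append(total)
    return scores


# Hypothetical filter: the one-hot pattern of ATG, so a perfect match scores 3.0.
ATG_FILTER = one_hot("ATG")


def candidate_start_sites(seq):
    """Return the positions where the hand-set filter matches ATG exactly."""
    scores = conv1d(one_hot(seq), ATG_FILTER)
    return [i for i, score in enumerate(scores) if score == 3.0]
```

A real predictor would stack many learned filters and further layers on top of this operation; the sketch only shows why a filter can act as a motif detector.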

    Utilizing Mutations to Evaluate Interpretability of Neural Networks on Genomic Data

    Even though deep neural networks (DNNs) achieve state-of-the-art results for a number of problems involving genomic data, getting DNNs to explain their decision-making process has been a major challenge due to their black-box nature. One way to get DNNs to explain their reasoning for prediction is via attribution methods which are assumed to highlight the parts of the input that contribute to the prediction the most. Given the existence of numerous attribution methods and a lack of quantitative results on the fidelity of those methods, selection of an attribution method for sequence-based tasks has been mostly done qualitatively. In this work, we take a step towards identifying the most faithful attribution method by proposing a computational approach that utilizes point mutations. Providing quantitative results on seven popular attribution methods, we find Layerwise Relevance Propagation (LRP) to be the most appropriate one for translation initiation, with LRP identifying two important biological features for translation: the integrity of Kozak sequence as well as the detrimental effects of premature stop codons.
    Comment: Accepted for publication at the 36th Conference on Neural Information Processing Systems (NeurIPS 2022), Workshop on Learning Meaningful Representations of Life (LMRL)
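A point-mutation scan of the kind this abstract alludes to can be sketched as follows. The abstract does not spell out the authors' procedure, so this is only an assumption-laden illustration: `toy_model` is a hypothetical stand-in for a trained network, and `mutation_effects` is one simple way to measure how much a single-nucleotide substitution at each position changes a model's output. Per-position attribution scores from any method could then be compared against these effects.

```python
# Illustrative sketch only; toy_model is a hypothetical stand-in for a trained DNN.
BASES = "ACGT"
STOP_CODONS = {"TAA", "TAG", "TGA"}


def toy_model(seq, tis=3):
    """Return 1.0 when ATG sits at index `tis` with no in-frame stop codon
    downstream of it, and 0.0 otherwise."""
    if seq[tis:tis + 3] != "ATG":
        return 0.0
    for i in range(tis + 3, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            return 0.0
    return 1.0


def mutation_effects(seq, model):
    """For each position, the largest absolute change in model output
    caused by substituting that single nucleotide."""
    base_score = model(seq)
    effects = []
    for pos, original in enumerate(seq):
        changes = []
        for base in BASES:
            if base == original:
                continue
            mutated = seq[:pos] + base + seq[pos + 1:]
            changes.append(abs(model(mutated) - base_score))
        effects.append(max(changes))
    return effects
```

In this toy setting, the positions with a non-zero effect are the start codon itself and any position where one substitution creates an in-frame stop codon.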

    Learning the Regulatory Code of Gene Expression

    Data-driven machine learning is the method of choice for predicting molecular phenotypes from nucleotide sequence, modeling gene expression events including protein-DNA binding, chromatin states as well as mRNA and protein levels. Deep neural networks automatically learn informative sequence representations and interpreting them enables us to improve our understanding of the regulatory code governing gene expression. Here, we review the latest developments that apply shallow or deep learning to quantify molecular phenotypes and decode the cis-regulatory grammar from prokaryotic and eukaryotic sequencing data. Our approach is to build from the ground up, first focusing on the initiating protein-DNA interactions, then specific coding and non-coding regions, and finally on advances that combine multiple parts of the gene and mRNA regulatory structures, achieving unprecedented performance. We thus provide a quantitative view of gene expression regulation from nucleotide sequence, concluding with an information-centric overview of the central dogma of molecular biology

    Artificial intelligence used in genome analysis studies

    Next Generation Sequencing (NGS) or deep sequencing technology enables parallel reading of multiple individual DNA fragments, thereby enabling the identification of millions of base pairs in several hours. Recent research has clearly shown that machine learning technologies can efficiently analyse large sets of genomic data and help to identify novel gene functions and regulation regions. A deep artificial neural network consists of a group of artificial neurons that mimic the properties of living neurons. These mathematical models, termed Artificial Neural Networks (ANN), can be used to solve artificial intelligence engineering problems in several different technological fields (e.g., biology, genomics, proteomics, and metabolomics). In practical terms, neural networks are non-linear statistical structures that are organized as modelling tools and are used to simulate complex genomic relationships between inputs and outputs. To date, Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNN) have been demonstrated to be the best tools for improving performance in problem solving tasks within the genomic field

    Differential Architecture Search in Deep Learning for DNA Splice Site Classification

    The data explosion caused by unprecedented advancements in the field of genomics is constantly challenging the conventional methods used in the interpretation of the human genome. The demand for robust algorithms over the recent years has brought huge success in the field of Deep Learning (DL) in solving many difficult tasks in image, speech and natural language processing by automating the manual process of architecture design

    Explainable deep learning models for biological sequence classification

    Biological sequences - DNA, RNA and proteins - orchestrate the behavior of all living cells and trying to understand the mechanisms that govern and regulate the interactions among these molecules has motivated biological research for many years. The introduction of experimental protocols that analyze such interactions on a genome- or transcriptome-wide scale has also established the usage of machine learning in our field to make sense of the vast amounts of generated data. Recently, deep learning, a branch of machine learning based on artificial neural networks, and especially convolutional neural networks (CNNs) were shown to deliver promising results for predictive tasks and automated feature extraction. However, the resulting models are often very complex and thus make model application and interpretation hard, but the possibility to interpret which features a model has learned from the data is crucial to understand and to explain new biological mechanisms. This work therefore presents pysster, our open source software library that enables researchers to more easily train, apply and interpret CNNs on biological sequence data. We evaluate and implement different feature interpretation and visualization strategies and show that the flexibility of CNNs allows for the integration of additional data beyond pure sequences to improve the biological feature interpretability. We demonstrate this by building, among others, predictive models for transcription factor and RNA-binding protein binding sites and by supplementing these models with structural information in the form of DNA shape and RNA secondary structure. Features learned by models are then visualized as sequence and structure motifs together with information about motif locations and motif co-occurrence. By further analyzing an artificial data set containing implanted motifs we also illustrate how the hierarchical feature extraction process in a multi-layer deep neural network operates. 
Finally, we present a larger biological application by predicting RNA-binding of proteins for transcripts for which experimental protein-RNA interaction data is not yet available. Here, the comprehensive interpretation options of CNNs made us aware of potential technical bias in the experimental eCLIP data (enhanced crosslinking and immunoprecipitation) that were used as a basis for the models. This allowed for subsequent tuning of the models and data to get more meaningful predictions in practice

    A textural deep neural network architecture for mechanical failure analysis

    Get PDF
    Nowadays, many classification problems are approached with deep learning architectures, and the results are outstanding compared to those obtained with traditional computer vision approaches. However, when it comes to texture, deep learning analysis has not had the same success as for other tasks. Texture is an inherent characteristic of objects and the main descriptor for many applications in the computer vision field; however, due to its stochastic appearance, it is difficult to obtain a mathematical model for it. According to the state of the art, deep learning techniques have some limitations when it comes to learning textural features, and to classify texture using deep neural networks it is essential to integrate them with handcrafted features or to develop an architecture that resembles these features. Solving this problem would contribute to different applications, such as fractographic analysis. To achieve the best performance in any industry, it is important that companies have a failure analysis able to show the causes of flaws, offer applications and solutions, and generate alternatives that allow customers to obtain more efficient components and production. The failure of an industrial element has consequences such as significant economic losses and, in some cases, even human losses. With this analysis it is possible to examine the history of the damaged piece in order to find how and why it failed, to help prevent future failures, and to implement safer conditions. Visual inspection is the basis of every fractographic process in failure analysis and the main tool for fracture classification. This process is usually carried out by personnel who are not experts on the topic and normally do not have the knowledge or experience required for the job, which increases the likelihood of a wrong classification and of negative results in the whole process.
This research focuses on the development of a computer vision system that implements a textural deep learning architecture. Several approaches were considered, including combining deep learning techniques with traditional handcrafted features and developing a new architecture based on the wavelet transform and multiresolution analysis. The algorithm was tested on textural benchmark datasets and on the classification of mechanical fractures in crystalline materials, whose surfaces show characteristic textures and marks depending on the type of failure.
Fundación CEIBA. Doctorad