3,439 research outputs found

    Learning the Regulatory Code of Gene Expression

    Data-driven machine learning is the method of choice for predicting molecular phenotypes from nucleotide sequence, modeling gene expression events including protein-DNA binding, chromatin states as well as mRNA and protein levels. Deep neural networks automatically learn informative sequence representations and interpreting them enables us to improve our understanding of the regulatory code governing gene expression. Here, we review the latest developments that apply shallow or deep learning to quantify molecular phenotypes and decode the cis-regulatory grammar from prokaryotic and eukaryotic sequencing data. Our approach is to build from the ground up, first focusing on the initiating protein-DNA interactions, then specific coding and non-coding regions, and finally on advances that combine multiple parts of the gene and mRNA regulatory structures, achieving unprecedented performance. We thus provide a quantitative view of gene expression regulation from nucleotide sequence, concluding with an information-centric overview of the central dogma of molecular biology

    Unsupervised and Supervised Learning for RNA-protein Interactions and Annotations

    This project analyzed base and amino acid interactions and annotations using unsupervised and supervised learning techniques. For unsupervised learning, neither k-means nor hierarchical clustering separated the data into clear groups that matched the original annotations. For supervised learning, random forest, glmnet, and deep neural networks all produced accurate predictions. However, machine learning is unlikely to replace the original complex program, although it could be used to simplify it.

    DNAffinity: a machine-learning approach to predict DNA binding affinities of transcription factors

    We present a physics-based machine learning approach to predict in vitro transcription factor binding affinities from structural and mechanical DNA properties directly derived from atomistic molecular dynamics simulations. The method is able to predict affinities obtained with techniques as different as uPBM, gcPBM and HT-SELEX with excellent performance, much better than existing algorithms. Due to its nature, the method can be extended to epigenetic variants, mismatches, mutations, or any non-coding nucleobases. When complemented with chromatin structure information, our in vitro trained method also provides good estimates of in vivo binding sites in yeast.

    Inference of RNA decay rate from transcriptional profiling highlights the regulatory programs of Alzheimer's disease.

    The abundance of mRNA is mainly determined by the rates of RNA transcription and decay. Here, we present a method for unbiased estimation of differential mRNA decay rate from RNA-sequencing data by modeling the kinetics of mRNA metabolism. We show that in all primary human tissues tested, and particularly in the central nervous system, many pathways are regulated at the mRNA stability level. We present a parsimonious regulatory model consisting of two RNA-binding proteins and four microRNAs that modulate the mRNA stability landscape of the brain, which suggests a new link between RBFOX proteins and Alzheimer's disease. We show that downregulation of RBFOX1 leads to destabilization of mRNAs encoding for synaptic transmission proteins, which may contribute to the loss of synaptic function in Alzheimer's disease. RBFOX1 downregulation is more likely to occur in older and female individuals, consistent with the association of Alzheimer's disease with age and gender. Summary: mRNA abundance is determined by the rates of transcription and decay. Here, the authors propose a method for estimating the rate of differential mRNA decay from RNA-seq data and model mRNA stability in the brain, suggesting a link between mRNA stability and Alzheimer's disease.
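The statement that abundance is set by transcription and decay can be illustrated with the textbook first-order model dM/dt = alpha - beta * M, whose steady state is M = alpha / beta. The sketch below is only that textbook relation, not the authors' estimator; the function names and the idea of supplying a separately measured transcription change are assumptions made for illustration.

```python
def steady_state_abundance(transcription_rate, decay_rate):
    """Steady state of dM/dt = alpha - beta * M, which is M = alpha / beta."""
    if decay_rate <= 0:
        raise ValueError("decay_rate must be positive")
    return transcription_rate / decay_rate


def log2_decay_change(log2_fc_abundance, log2_fc_transcription):
    """Change in decay rate implied by the steady-state relation.

    From M = alpha / beta it follows that
    log2(M2/M1) = log2(alpha2/alpha1) - log2(beta2/beta1),
    so log2(beta2/beta1) = log2(alpha2/alpha1) - log2(M2/M1).
    """
    return log2_fc_transcription - log2_fc_abundance


# Hypothetical numbers: abundance halves while transcription is unchanged,
# which under this model implies that the decay rate doubled.
print(steady_state_abundance(10.0, 0.5))
print(log2_decay_change(-1.0, 0.0))
```

Under this model, a drop in abundance that is not matched by a drop in transcription is attributed to faster decay, which is the sense in which differential decay can be read off from expression changes.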

    Exploring feature identification and machine learning in predicting protein-protein interactions of disordered proteins

    Intrinsically disordered regions (IDRs) in proteins have been linked to many crucial functions, including mediating protein-protein interactions (PPIs), despite lacking a single invariant three-dimensional structure. This growing recognition has led to an increased demand for computational studies that focus on the amino acid sequences of proteins to identify crucial sequence characteristics in IDRs and their connections to diverse cellular functions. In the first part of this thesis, we put forward two statistical methods to identify sequence features responsible for IDR functions. We introduce a statistical approach for quantifying the periodicity of aromatic residues in the human proteome by modeling their occurrence using a Poisson process. Next, we introduce another statistical analysis of IDR sequences to identify co-occurring amino acid groups in transcription factors (TFs) that co-bind to enhancer elements. In the second part of the thesis, our focus shifts to predicting PPIs using only protein sequences. We present a novel method to address the PPI prediction challenge using IDR sequences. We encountered challenges while developing a PPI prediction model because our task essentially involves making predictions based on pairs of input data. In this regard, we present two distinct machine learning algorithms to address two different types of PPI prediction problems, namely, asymmetric and symmetric problems. For the asymmetric problem, where one of the proteins has already been included in the classifier, we develop a method to predict disordered protein partners of the known proteins in our dataset. On the other hand, for the symmetric problem, we implement another approach to predict entirely novel PPIs. Furthermore, we explore whether IDR amino acid sequences outperform other sequence components, including entire sequences and non-IDR regions, in predicting PPIs.
Our findings led us to the conclusion that disordered regions are particularly valuable in predicting interactions between intrinsically disordered proteins. In summary, this thesis provides insights into dealing with paired datasets when developing machine learning models for PPI prediction and demonstrates how statistical approaches can be used to investigate IDR sequences for feature identification and predict PPIs based on IDR sequences.

    DeepSRE: Identification of sterol responsive elements and nuclear transcription factors Y proximity in human DNA by Convolutional Neural Network analysis

    SREBP1 and SREBP2 are cholesterol sensors able to modulate cholesterol-related gene expression responses. SREBP binding sites are characterized by the presence of multiple target sequences such as SRE, NFY and SP1, which can be arranged differently in different genes, so that it is not easy to identify the binding site on the basis of direct DNA sequence analysis. This paper presents a complete workflow based on a one-dimensional Convolutional Neural Network (CNN) model able to detect putative SREBP binding sites irrespective of the arrangement of target elements. The strategy is based on the recognition of SRE linked (less than 250 bp) to NFY sequences according to chromosomal localization derived from TF Immunoprecipitation (TF ChIP) experiments. The CNN is trained with several 100 bp sequences containing both SRE and NF-Y. Once trained, the model is used to predict the presence of SRE-NFY in the first 500 bp of all the known gene promoters. Finally, genes are grouped according to biological process and the processes enriched in genes containing SRE-NFY in their promoters are analyzed in detail. This workflow allowed the identification of biological processes enriched in SRE-containing genes not directly linked to cholesterol metabolism and of possible novel DNA patterns able to fill in for missing classical SRE sequences.
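As a rough illustration of what a one-dimensional convolution over DNA does, the sketch below one-hot encodes a sequence and slides a single filter across it. It is not the DeepSRE architecture: the filter is set by hand to the CCAAT box that NF-Y is known to recognize, whereas a trained CNN learns its filters from data, and all function names and the example sequence are hypothetical.

```python
BASES = "ACGT"


def one_hot(seq):
    """Encode a DNA string as a list of 4-element rows ordered A, C, G, T."""
    rows = []
    for base in seq.upper():
        row = [0.0] * 4
        if base in BASES:  # unknown bases such as N stay all-zero
            row[BASES.index(base)] = 1.0
        rows.append(row)
    return rows


def conv1d_scan(encoded, kernel):
    """Slide one filter along the sequence (valid mode, stride 1)."""
    width = len(kernel)
    scores = []
    for start in range(len(encoded) - width + 1):
        score = 0.0
        for offset in range(width):
            for channel in range(4):
                score += encoded[start + offset][channel] * kernel[offset][channel]
        scores.append(score)
    return scores


# Hand-set toy filter: an exact-match detector for the CCAAT box.
ccaat_filter = one_hot("CCAAT")
scores = conv1d_scan(one_hot("GGCCAATCT"), ccaat_filter)
best_position = scores.index(max(scores))  # start of the best-matching window
print(scores, best_position)
```

Taking the maximum over positions (max pooling) makes the score independent of where the motif sits, which is one reason convolutional models can cope with target elements arranged differently from gene to gene.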

    Structural Property Prediction

    While many good textbooks are available on Protein Structure, Molecular Simulations, Thermodynamics and Bioinformatics methods in general, there is no good introductory level book for the field of Structural Bioinformatics. This book aims to give an introduction into Structural Bioinformatics, which is where the previous topics meet to explore three dimensional protein structures through computational analysis. We provide an overview of existing computational techniques, to validate, simulate, predict and analyse protein structures. More importantly, it will aim to provide practical knowledge about how and when to use such techniques. We will consider proteins from three major vantage points: Protein structure quantification, Protein structure prediction, and Protein simulation & dynamics. Some structural properties of proteins that are closely linked to their function may be easier (or much faster) to predict from sequence than the complete tertiary structure; for example, secondary structure, surface accessibility, flexibility, disorder, interface regions or hydrophobic patches. Serving as building blocks for the native protein fold, these structural properties also contain important structural and functional information not apparent from the amino acid sequence. Here, we will first give an introduction into the application of machine learning for structural property prediction, and explain the concepts of cross-validation and benchmarking. Next, we will review various methods that incorporate knowledge of these concepts to predict those structural properties, such as secondary structure, surface accessibility, disorder and flexibility, and aggregation. Comment: editorial responsibility: Juami H. M. van Gils, K. Anton Feenstra, Sanne Abeln. This chapter is part of the book "Introduction to Protein Structural Bioinformatics". The Preface arXiv:1801.09442 contains links to all the (published) chapters.

    ENNGene : an Easy Neural Network model building tool for Genomics

    Background: The recent big data revolution in Genomics, coupled with the emergence of Deep Learning as a set of powerful machine learning methods, has shifted the standard practices of machine learning for Genomics. Even though Deep Learning methods such as Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) are becoming widespread in Genomics, developing and training such models is outside the ability of most researchers in the field. Results: Here we present ENNGene—Easy Neural Network model building tool for Genomics. This tool simplifies training of custom CNN or hybrid CNN-RNN models on genomic data via an easy-to-use Graphical User Interface. ENNGene allows multiple input branches, including sequence, evolutionary conservation, and secondary structure, and performs all the necessary preprocessing steps, allowing simple input such as genomic coordinates. The network architecture is selected and fully customized by the user, from the number and types of the layers to each layer's precise set-up. ENNGene then deals with all steps of training and evaluation of the model, exporting valuable metrics such as multi-class ROC and precision-recall curve plots or TensorBoard log files. To facilitate interpretation of the predicted results, we deploy Integrated Gradients, providing the user with a graphical representation of an attribution level of each input position. To showcase the usage of ENNGene, we train multiple models on the RBP24 dataset, quickly reaching the state of the art while improving the performance on more than half of the proteins by including the evolutionary conservation score and tuning the network per protein. Conclusions: As the role of DL in big data analysis in the near future is indisputable, it is important to make it available for a broader range of researchers. 
We believe that an easy-to-use tool such as ENNGene can allow Genomics researchers without a background in Computational Sciences to harness the power of DL to gain better insights into and extract important information from the large amounts of data available in the field.
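The Integrated Gradients attribution mentioned in this abstract can be sketched in a few lines. The code below is a generic numerical approximation on a toy function, not ENNGene's implementation: the toy model, the all-zero baseline, and the finite-difference gradient are assumptions chosen so that the example needs no deep learning library.

```python
def numerical_gradient(f, x, eps=1e-5):
    """Central-difference estimate of the gradient of f at x."""
    grad = []
    for i in range(len(x)):
        up, down = list(x), list(x)
        up[i] += eps
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad


def integrated_gradients(f, x, baseline, steps=50):
    """Average the gradient along the straight path from baseline to x
    (midpoint rule), then scale by (x - baseline) per input."""
    mean_grad = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        grad = numerical_gradient(f, point)
        for i in range(len(x)):
            mean_grad[i] += grad[i] / steps
    return [(xi - b) * g for xi, b, g in zip(x, baseline, mean_grad)]


def toy_model(x):
    """Hypothetical stand-in for a trained network's output."""
    return 2.0 * x[0] + x[1] * x[2]


x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
attributions = integrated_gradients(toy_model, x, baseline)
print(attributions)
```

A useful sanity check is that the attributions sum to the difference between the model output at the input and at the baseline; for sequence models the inputs would be one-hot encoded positions, giving one attribution per position.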