Interpretable Machine Learning: Applications in Biology and Genomics
Machine learning (ML) and deep learning (DL) models impact our daily lives with applications in natural language modeling, image analysis, healthcare, genomics, and bioinformatics. The exponential growth of biological sequence data necessitates accompanying advances in computational methods. Although deep learning is highly effective for detecting and classifying biological sequences, challenges remain in extracting meaningful patterns and information from the learned models. To realize the potential of deep learning in biology, we need to develop strategies for model interpretation to reveal or further clarify biological principles. In this thesis, we first present problems and methods to classify patterns in biological sequence data. Next, we describe a series of techniques we developed to understand the machine learning models and identify meaningful biological patterns. For each problem we created an interpretable, intelligent system without sacrificing performance. To test our approaches for model interpretation, we first focused our analysis on known biological patterns, and then extended the search beyond what is known. This work can be categorized into four different applications: I) The development of bpRNA, a novel annotation tool capable of parsing RNA secondary structures. bpRNA is a richly annotated database that contains over 100,000 structures from seven different sources along with base pairing information. II) The detection of pseudoknots from sequence data alone with a machine learning model, Pseudoknow. As one of the most common RNA structural motifs, pseudoknots are crucial for RNA regulation. Improving the prediction of RNA pseudoknot structure will allow for better understanding of how RNA structure informs regulation and metabolism. III) Classification from gene expression data using stacked denoising autoencoders (SDAE) to distinguish healthy cells from cancerous ones, and to predict post-mortem time-of-death.
These classification methods were developed with the goal of identifying the genes that are most informative for prediction and hence most biologically relevant. Our study suggests that the most influential genes from the dimensionality reduction performed by SDAE were highly predictive of cancerous versus non-cancerous cell type. IV) Interpretation of the rules learned by a deep convolutional neural network to recognize known and previously uncharacterized core promoter sequence motifs from human whole-genome sequences. We proposed and compared new training strategies to identify transcription start sites (TSS), located within core promoters, from biological sequences. The main goal of this application was to develop new strategies to interpret how the convolutional neural network learns biological patterns, and to understand the correlations between and within the convolutional layers. These new techniques could aid in deriving unknown patterns in biology and genomics and are applicable more broadly to other areas of data science.
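To make the pseudoknot notion in the abstract above concrete: a pseudoknot is present when two base pairs (i, j) and (k, l) cross, that is, i < k < j < l. The sketch below is illustrative only and is not the bpRNA or Pseudoknow implementation. It assumes the structure is written in dot-bracket notation (an assumption, the abstract does not name a format), with "()" for nested pairs and "[]" or "{}" for pairs that cross them. The function names `parse_pairs` and `has_pseudoknot` are hypothetical.

```python
from typing import List, Tuple

# Assumed bracket families: each opener maps to its closer.
BRACKETS = {"(": ")", "[": "]", "{": "}"}
CLOSERS = {close: open_ for open_, close in BRACKETS.items()}


def parse_pairs(structure: str) -> List[Tuple[int, int]]:
    """Return sorted, 0-indexed (i, j) base pairs from a dot-bracket string."""
    # One stack per bracket family, so different families may interleave.
    stacks = {open_: [] for open_ in BRACKETS}
    pairs = []
    for pos, char in enumerate(structure):
        if char in BRACKETS:
            stacks[char].append(pos)
        elif char in CLOSERS:
            stack = stacks[CLOSERS[char]]
            if not stack:
                raise ValueError(f"unmatched '{char}' at position {pos}")
            pairs.append((stack.pop(), pos))
        elif char != ".":
            raise ValueError(f"unexpected character '{char}' at position {pos}")
    for open_, stack in stacks.items():
        if stack:
            raise ValueError(f"unmatched '{open_}' at position {stack[-1]}")
    return sorted(pairs)


def has_pseudoknot(pairs: List[Tuple[int, int]]) -> bool:
    """True if any two pairs cross, i.e. i < k < j < l."""
    ordered = sorted(pairs)
    for index, (i, j) in enumerate(ordered):
        for k, l in ordered[index + 1:]:
            if i < k < j < l:
                return True
    return False


if __name__ == "__main__":
    hairpin = parse_pairs("((..))")
    knotted = parse_pairs("((..[[..))..]]")
    print(hairpin, has_pseudoknot(hairpin))
    print(knotted, has_pseudoknot(knotted))
```

The pairwise check is quadratic in the number of base pairs, which is adequate for illustration; a production parser would likely use a faster crossing test.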
A deep recurrent neural network discovers complex biological rules to decipher RNA protein-coding potential
The current deluge of newly identified RNA transcripts presents a singular opportunity for improved assessment of coding potential, a cornerstone of genome annotation, and for machine-driven discovery of biological knowledge. While traditional, feature-based methods for RNA classification are limited by current scientific knowledge, deep learning methods can independently discover complex biological rules in the data de novo. We trained a gated recurrent neural network (RNN) on human messenger RNA (mRNA) and long noncoding RNA (lncRNA) sequences. Our model, mRNA RNN (mRNN), surpasses state-of-the-art methods at predicting protein-coding potential despite being trained with less data and with no prior concept of what features define mRNAs. To understand what mRNN learned, we probed the network and uncovered several context-sensitive codons highly predictive of coding potential. Our results suggest that gated RNNs can learn complex and long-range patterns in full-length human transcripts, making them ideal for performing a wide range of difficult classification tasks and, most importantly, for harvesting new biological insights from the rising flood of sequencing data.
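As a rough illustration of what "gated recurrent" means in the abstract above, the sketch below runs a single gated recurrent unit (GRU) layer over a one-hot encoded transcript and squashes the final hidden state into a score between 0 and 1. It is not the mRNN model: the abstract does not specify the architecture, so the choice of GRU, the hidden size, the omission of bias terms, and the class name `TinyGRU` are all assumptions, and the weights are random rather than trained, so the score carries no biological meaning.

```python
import math
import random
from typing import List

NUCLEOTIDES = "ACGU"


def one_hot(base: str) -> List[float]:
    """Encode one nucleotide; unknown symbols (e.g. N) become all zeros."""
    return [1.0 if base == n else 0.0 for n in NUCLEOTIDES]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix: List[List[float]], vector: List[float]) -> List[float]:
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


class TinyGRU:
    """Hypothetical single-layer GRU with random, untrained weights."""

    def __init__(self, hidden_size: int = 8, seed: int = 0):
        rng = random.Random(seed)

        def rand_matrix(rows: int, cols: int) -> List[List[float]]:
            return [[rng.uniform(-0.5, 0.5) for _ in range(cols)]
                    for _ in range(rows)]

        self.hidden_size = hidden_size
        # One input and one recurrent matrix per gate:
        # z = update gate, r = reset gate, h = candidate state.
        self.W = {g: rand_matrix(hidden_size, len(NUCLEOTIDES)) for g in "zrh"}
        self.U = {g: rand_matrix(hidden_size, hidden_size) for g in "zrh"}
        self.out = [rng.uniform(-0.5, 0.5) for _ in range(hidden_size)]

    def step(self, x: List[float], h: List[float]) -> List[float]:
        z = [sigmoid(a + b) for a, b in
             zip(matvec(self.W["z"], x), matvec(self.U["z"], h))]
        r = [sigmoid(a + b) for a, b in
             zip(matvec(self.W["r"], x), matvec(self.U["r"], h))]
        # The reset gate scales how much past state feeds the candidate.
        reset_h = [ri * hi for ri, hi in zip(r, h)]
        cand = [math.tanh(a + b) for a, b in
                zip(matvec(self.W["h"], x), matvec(self.U["h"], reset_h))]
        # The update gate interpolates between old state and candidate.
        return [(1.0 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]

    def score(self, sequence: str) -> float:
        h = [0.0] * self.hidden_size
        for base in sequence.upper().replace("T", "U"):
            h = self.step(one_hot(base), h)
        return sigmoid(sum(w * hi for w, hi in zip(self.out, h)))


if __name__ == "__main__":
    model = TinyGRU()
    print(model.score("AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG"))
```

Because the update gate can keep a hidden unit nearly unchanged across many positions, a trained network of this kind can in principle carry information over long stretches of a transcript, which is the property the abstract relies on.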