295 research outputs found

    Evaluation of machine learning approaches for prediction of protein coding genes in prokaryotic DNA sequences

    Get PDF
    According to the National Human Genome Research Institute the amount of genomic data generated on a yearly basis is constantly increasing. This rapid growth in genomic data has led to a subsequent surge in the demand for efficient analysis and handling of said data. Gene prediction involves identifying the areas of a DNA sequence that code for proteins, also called protein coding genes. This task falls within the scope of bioinformatics, and there has been surprisingly little development in this field of study, over the past years. Despite there being sufficient state-of-the-art gene prediction tools, there is still room for improvement in terms of efficiency and accuracy. Advances made within the field of gene prediction can, among other things, aid the medical and pharmaceutical industry, as well as impact environmental and anthropological research. Machine learning techniques such as the Random Forest classifiers and Artificial Neural Networks (ANN) have proved successful at the task of gene prediction. In this thesis one deep learning model and two other machine learning models were tested. The first model implemented was the established Random Forest classifier. When it comes to the use of ensemble methods, such as the Random Forest classifier, feature engineering is critical for the success of such models. The exploration of different feature selection and extraction techniques underpinned its relevance. It also showed that feature importance varies greatly among genomes, and revealed possibilities that can be further explored in future work. The second model tested was the ensemble method Extreme Gradient Boosting (XGBoost), which served as a good competitor to the Random Forest classifier. Finally, a Recurrent Neural Network (RNN) was implemented. RNNs are known to be good with handling sequential data, therefore it seemed like a good candidate for gene prediction. The 15 prokaryotic genomes used to train the models were extracted from the NCBI genome database. Each model was tasked with classifying sub-sequences of the genomes, called open reading frames (ORFs), as either protein coding ORFs, or non-coding ORFs. One challenge when preparing these datasets was that the number of protein coding ORFs was very small compared to the number of non-coding ORFs. Another problem encountered in the dataset was that protein coding ORFs in general are longer than non-coding ORFs, which can bias the models to simply classify long ORFs as protein coding, and short ORFs as non-coding. For these reasons, two datasets for each genome were created, taking each imbalance into account. The models were trained, tuned and tested on both datasets for all genomes, and a combination of genomes. The models were evaluated with regard to accuracy, precision and recall. The results show that all three methods have potential and attained somewhat similar performance scores. Despite the fact that both time and data were limited during model development, they still yielded promising results. Considering there are several parameters that have not yet been tuned in all models, many possibilities for further research remain. The fact that a relatively simple RNN architecture performed so well, and has no requirement for feature engineering, shows great promise for further applications in gene prediction, and possibly other fields in bioinformatics.M-D

    Processing hidden Markov models using recurrent neural networks for biological applications

    Get PDF
    Philosophiae Doctor - PhDIn this thesis, we present a novel hybrid architecture by combining the most popular sequence recognition models such as Recurrent Neural Networks (RNNs) and Hidden Markov Models (HMMs). Though sequence recognition problems could be potentially modelled through well trained HMMs, they could not provide a reasonable solution to the complicated recognition problems. In contrast, the ability of RNNs to recognize the complex sequence recognition problems is known to be exceptionally good. It should be noted that in the past, methods for applying HMMs into RNNs have been developed by other researchers. However, to the best of our knowledge, no algorithm for processing HMMs through learning has been given. Taking advantage of the structural similarities of the architectural dynamics of the RNNs and HMMs, in this work we analyze the combination of these two systems into the hybrid architecture. To this end, the main objective of this study is to improve the sequence recognition/classi_cation performance by applying a hybrid neural/symbolic approach. In particular, trained HMMs are used as the initial symbolic domain theory and directly encoded into appropriate RNN architecture, meaning that the prior knowledge is processed through the training of RNNs. Proposed algorithm is then implemented on sample test beds and other real time biological applications

    Automatic intron detection in metagenomes using neural networks.

    Get PDF
    Tato práce se zabývá detekcí intronů v metagenomech hub pomocí hlubokých neuronových sítí. Přesné biologické mechanizmy rozpoznávání a vyřezávání intronů nejsou zatím plně známy a jejich strojová detekce není považovaná za vyřešený problém. Rozpoznávání a vyřezávání intronů z DNA sekvencí je důležité pro identifikaci genů v metagenomech a hledání jejich homologií mezi známými DNA sekvencemi,které jsou dostupné ve veřejných databázích. Rozpoznání genů a nalezení jejich případných homologů umožňuje identifikaci jak již známých tak i nových druhů a jejich taxonomické zařazení. V rámci práce vznikly dva modely neuronových sítí, které detekují začátky a konce intronů, takzvaná donorová a akceptorová místa sestřihu. Detekovaná místa sestřihu jsou následně zkombinována do kandidátních intronů. Překrývající se kandidátní introny jsou poté odstraněny pomocí jednoduchého skórovacího algoritmu. Práce navazuje na existující řešení, které využívá metody podpůrných vektorů (SVM). Výsledné neuronové sítě dosahují lepších výsledků než SVM a to při více než desetinásobně nižším výpočetním čase na zpracování stejně obsáhlého genomu.This work is concerned with the detection of introns in metagenomes with deep neural networks. Exact biological mechanisms of intron recognition and splicing are not fully known yet and their automated detection has remained unresolved. Detection and removal of introns from DNA sequences is important for the identification of genes in metagenomes and for searching for homologs among the known DNA sequences available in public databases. Gene prediction and the discovery of their homologs allows the identification of known and new species and their taxonomic classification. Two neural network models were developed as part of this thesis. The models' aim is the detection of intron starts and ends with the so-called donor and acceptor splice sites. The splice sites are later combined into candidate introns which are further filtered by a simple score-based overlap resolving algorithm. The work relates to an existing solution based on support vector machines (SVM). The resulting neural networks achieve better results than SVM and require more than order of magnitude less computational resources in order to process equally large genome

    On Computable Protein Functions

    Get PDF
    Proteins are biological machines that perform the majority of functions necessary for life. Nature has evolved many different proteins, each of which perform a subset of an organism’s functional repertoire. One aim of biology is to solve the sparse high dimensional problem of annotating all proteins with their true functions. Experimental characterisation remains the gold standard for assigning function, but is a major bottleneck due to resource scarcity. In this thesis, we develop a variety of computational methods to predict protein function, reduce the functional search space for proteins, and guide the design of experimental studies. Our methods take two distinct approaches: protein-centric methods that predict the functions of a given protein, and function-centric methods that predict which proteins perform a given function. We applied our methods to help solve a number of open problems in biology. First, we identified new proteins involved in the progression of Alzheimer’s disease using proteomics data of brains from a fly model of the disease. Second, we predicted novel plastic hydrolase enzymes in a large data set of 1.1 billion protein sequences from metagenomes. Finally, we optimised a neural network method that extracts a small number of informative features from protein networks, which we used to predict functions of fission yeast proteins

    Decoding the Regulatory Genome: Quantitative Analysis of Transcriptional Regulation in Escherichia coli

    Get PDF
    Over the past decades DNA sequencing has become significantly cheaper and faster, which has enabled the accumulation of a huge amount of genomic data. However, much of this genomic data is illegible to us. For noncoding regions of the genome in particular, it is difficult to determine what role is played by specific DNA sequences. Here we focus on regions of DNA that play a role in transcriptional regulation. We develop models and techniques that allow us to discover new regulatory sequences and better understand how DNA sequence determines regulatory output. We start by considering how quantitative models serve as a powerful tool for testing our understanding of biological systems. We apply a statistical mechanical framework that incorporates the Monod-Wyman-Changeux model to analyze the effects of allostery in simple repression, using the lac operon as a test case. By fitting our model to experimental data, we are able to determine the values of the unknown parameter values in our model. We then show that we can use the model to accurately predict the induction responses of an array of simple repression constructs with a variety of repressor copy numbers and repressor binding energies. Next, we consider how the DNA sequence of a promoter region can provide details about how the promoter is regulated. We begin by describing an approach for discovering regulatory architectures for promoters whose regulation has not previously been studied. We focus on six promoters from E. coli including three well-studied promoters (rel, mar, and lac) to serve as test cases. We use the massively parallel reporter assay Sort-Seq to identify transcription factor binding sites with base-pair resolution, determine the regulatory role of each binding site, and infer energy matrices for each binding site. Then, we use DNA affinity chromatography and mass spectrometry to identify each transcription factor. We conclude with an in vivo approach for analyzing the sequence-dependence of transcription factor binding energies. Again using Sort-Seq, we show that we can represent transcription factor binding sites using energy matrices in absolute energy units. We then show that these energy matrices can be used to accurately predict the binding energies of mutated binding sites. We provide several examples of how understanding the relationship between DNA sequence and transcription factor binding provides us with a foundation for addressing additional scientific topics, such as the co-evolution of transcription factors and their binding sites.</p
    corecore