305 research outputs found

    MHCherryPan, a novel model to predict the binding affinity of pan-specific class I HLA-peptide

    The human leukocyte antigen (HLA) system plays an essential role in regulating the human immune system. Accurate prediction of peptide binding to HLA helps identify neoantigens, which can make a substantial difference in immune drug development. HLA is one of the most polymorphic genetic systems in humans, with thousands of allelic variants. Because of this high polymorphism, accurately predicting binding affinity remains difficult. In this thesis, we present a new algorithm that combines a convolutional neural network with long short-term memory to address this problem. Compared with other popular current algorithms, our model achieved state-of-the-art results.

    Deep learning methods for mining genomic sequence patterns

    With the growing availability of large-scale genomic datasets and advanced computational techniques, more and more data-driven computational methods have been developed to analyze genomic data and help solve incompletely understood biological problems. Among them, deep learning methods have been proposed to automatically learn and recognize the functional activity of DNA sequences from genomic data. Techniques for efficiently mining genomic sequence patterns will improve our understanding of gene regulation and thus accelerate progress toward using personal genomes in medicine. This dissertation focuses on the development of deep learning methods for mining genomic sequences. First, we compare the performance of deep learning models and traditional machine learning methods in recognizing various genomic sequence patterns. Through extensive experiments on both simulated data and real genomic sequence data, we demonstrate that an appropriate deep learning model can generally be built to recognize various genomic sequence patterns. Next, we develop deep learning methods to help solve two specific biological problems: (1) inference of the polyadenylation code and (2) tRNA gene detection and functional prediction. Polyadenylation is a pervasive mechanism used by eukaryotes to regulate mRNA transcription, localization, and translation efficiency. Polyadenylation signals in plants are particularly noisy and challenging to decipher. A deep convolutional neural network approach, DeepPolyA, is proposed to predict poly(A) sites from Arabidopsis thaliana genomic sequences. It employs various deep neural network architectures and outperforms competing methods, including classical machine learning algorithms and several popular deep learning models. Transfer RNAs (tRNAs) represent a highly complex class of genes and play a central role in protein translation. tRNAscan-SE remains the de facto tool for identifying tRNA genes encoded in genomes. Despite its popularity and success, tRNAscan-SE is still not powerful enough to separate tRNAs from pseudo-tRNAs, and it can output a significant number of false positives as a result. To address this issue, we propose tRNA-DL, a hybrid approach that combines a convolutional neural network with a recurrent neural network. The proposed method is shown to substantially reduce the false positive rate of the state-of-the-art tRNA prediction tool tRNAscan-SE. Coupled with tRNAscan-SE, tRNA-DL can serve as a useful complementary tool for tRNA annotation. Taken together, the experiments and applications demonstrate the superiority of deep learning in automatic feature generation for characterizing genomic sequence patterns.

    Nitrogenase Iron Protein Classification using CNN Neural Network

    The nitrogenase iron protein (NifH) is extensively used to study nitrogen fixation, the ecologically vital process of reducing atmospheric nitrogen to a bioavailable form. The discovery rate of novel NifH sequences is high, and there is an ongoing need for software tools to mine NifH records from the GenBank repository. Because record annotations contain errors and are therefore unreliable, classifiers based on sequence alone are required. The ARBitrator classifier is highly successful but must be initialized by extensive manual effort. A Deep Learning approach could substantially reduce manual intervention. However, attempts to build a character-based Deep Learning NifH classifier were unsuccessful. We hypothesized that we could generate visual representations of protein sequences and use a Convolutional Neural Network to classify the representations. Here we present the resulting classifier, which has achieved false positive and false negative rates of 0.19% and 0.22%, respectively.
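    The abstract does not state how its false positive and false negative rates were computed. As a minimal sketch, assuming the conventional definitions FP / (FP + TN) and FN / (FN + TP) over binary labels, the two rates could be derived as follows. The function name `error_rates` and the label convention (1 = NifH, 0 = not NifH) are illustrative assumptions, not taken from the thesis.

```python
def error_rates(y_true, y_pred):
    """Return (false_positive_rate, false_negative_rate) for binary labels.

    Assumed convention (hypothetical, not from the thesis):
    1 = positive (e.g. NifH), 0 = negative.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 1:
            fp += 1
        elif truth == 0 and pred == 0:
            tn += 1
        else:  # truth == 1 and pred == 0
            fn += 1

    # Guard against division by zero when a class is absent.
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    return fpr, fnr
```

    Under these definitions, a reported rate of 0.19% would correspond to a returned value of 0.0019.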

    Machine Learning Based Disease Gene Identification and MHC Immune Protein-peptide Binding Prediction

    Machine learning and deep learning methods have been increasingly applied to solve challenging and important bioinformatics problems such as protein structure prediction, disease gene identification, and drug discovery. However, the performance of existing machine learning based predictive models is still not satisfactory. How to exploit the specific properties of bioinformatics data and couple them with the unique capabilities of the learning algorithms remains an open question. In this dissertation, we propose advanced machine learning and deep learning algorithms to address two important problems: mislocation-related cancer gene identification and major histocompatibility complex-peptide binding affinity prediction. Our first contribution proposes a kernel-based logistic regression algorithm for identifying potential mislocation-related genes among known cancer genes. Our algorithm takes protein-protein interaction networks, gene expression data, and subcellular location gene ontology data as input, which is particularly lightweight compared with existing methods. The experimental results demonstrate that our proposed pipeline is capable of identifying mislocation-related cancer genes. Our second contribution addresses the modeling and prediction of human leukocyte antigen (HLA) peptide binding in the human immune system. We present an allele-specific convolutional neural network model with one-hot encoding. Extensive evaluation on the standard IEDB datasets shows that our model performs better than all existing prediction models. To achieve further improvement, we propose a novel pan-specific model for peptide-HLA class I binding affinity prediction, which allows us to exploit all the training samples of different HLA alleles. Our sequence-based pan model is currently the only algorithm not using pseudo sequence encoding, a dominant structure-based encoding method in this area. The benchmark studies show that our method could achieve state-of-the-art performance. Our proposed model could be integrated into existing ensemble methods to improve their overall prediction capabilities on highly diverse MHC alleles. Finally, we present an LSTM-CNN deep learning model with an attention mechanism for predicting peptide-HLA class II binding affinities and binding cores. Our model achieved very good performance and outperformed existing methods on half of the tested alleles. With the help of the attention mechanism, our model could directly output the peptide binding core based on attention weights without any additional post- or preprocessing.
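    The abstract mentions one-hot encoding of peptides as input to a convolutional network but gives no implementation details. A minimal sketch of that encoding, assuming the 20 standard amino acids in alphabetical one-letter order and zero padding up to a fixed length, might look like this. The names `AMINO_ACIDS`, `one_hot_peptide`, and `max_len` are hypothetical and not taken from the dissertation.

```python
# Assumed alphabet: the 20 standard amino acids, one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_peptide(peptide, max_len=15):
    """Encode a peptide as a max_len x 20 matrix of 0/1 values.

    Each residue becomes a row with a single 1 in the column for that
    amino acid; positions beyond the peptide length stay all-zero.
    """
    peptide = peptide.upper()
    if len(peptide) > max_len:
        raise ValueError("peptide is longer than max_len")

    matrix = [[0] * len(AMINO_ACIDS) for _ in range(max_len)]
    for pos, aa in enumerate(peptide):
        if aa not in AA_INDEX:
            raise ValueError(f"unknown amino acid: {aa!r}")
        matrix[pos][AA_INDEX[aa]] = 1
    return matrix
```

    A matrix of this shape can be passed to a 1D convolution over the position axis; how the dissertation's models handle padding and variable peptide lengths is not stated in the abstract.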

    Bioinformatic analysis and deep learning on large-scale human transcriptomic data: studies on aging, Alzheimer’s neurodegeneration and cancer

    The overall aim of the project was the integrative bioinformatic analysis of multiple proteomic and genomic datasets, combined with associated clinical data, to search for biomarkers and causal polygenic modules in complex diseases: mainly cancer of unknown primary origin, in its various types and subtypes, and neurodegenerative (ND) diseases, chiefly Alzheimer's, as well as age-related neurodegeneration. In addition, artificial intelligence techniques were used intensively, specifically deep learning neural networks, for the analysis and prognosis of these diseases.