1,040 research outputs found

    Discovering Patterns from Sequences with Applications to Protein-Protein and Protein-DNA Interaction

    Understanding Protein-Protein and Protein-DNA interaction is of fundamental importance in deciphering gene regulation and other biological processes in living cells. Traditionally, new interaction knowledge is discovered through biochemical experiments that are often labor intensive, expensive and time-consuming. Thus, computational approaches are preferred. Due to the abundance of sequence data available today, sequence-based interaction analysis has become one of the most readily applicable and cost-effective methods. One important problem in sequence-based analysis is to identify the functional regions from a set of sequences within the same family or demonstrating similar biological functions in experiments. The rationale is that throughout evolution the functional regions normally remain conserved (intact), allowing them to be identified as patterns from a set of sequences. However, there are also mutations such as substitutions, insertions and deletions in these functional regions. Existing methods, such as those based on position weight matrices, assume that the functional regions have a fixed width and thus cannot identify functional regions with mutations, particularly those with insertion or deletion mutations. Recently, Aligned Pattern Clustering (APCn) was introduced to identify functional regions as Aligned Pattern Clusters (APCs) by grouping and aligning patterns with variable width. Nevertheless, APCn cannot discover functional regions with substitution, insertion and/or deletion mutations, since their frequencies of occurrence are too low to be considered as patterns. To overcome such an impasse, this thesis proposes a new APC discovery algorithm known as Pattern-Directed Aligned Pattern Clustering (PD-APCn). 
By first discovering seed patterns from the input sequence data, with their sequence positions located and recorded on an address table, PD-APCn can use the seed patterns to direct the incremental extension of functional regions with minor mutations. By grouping the aligned extended patterns, PD-APCn can recruit patterns adaptively and efficiently with variable width without relying on exhaustive optimal search. Experiments on synthetic datasets with different sizes and noise levels showed that PD-APCn can identify the implanted pattern with mutations, outperforming the popular existing motif-finding software MEME with much higher recall and F-measure, and with a computational speed-up of up to 665 times. When PD-APCn was applied to datasets from the Cytochrome C and Ubiquitin protein families, all key binding sites conserved in the families were captured in the APC outputs. In sequence-based interaction analysis, there is also a lack of a model for co-occurring functional regions with mutations, where co-occurring functional regions between interaction sequences are indicative of binding sites. This thesis proposes a new representation model, Co-Occurrence APCs, to capture co-occurring functional regions with mutations from interaction sequences in database transaction format. Applications to Protein-DNA and Protein-Protein interaction validated the capability of Co-Occurrence APCs. In Protein-DNA interaction, a new representation model, Protein-DNA Co-Occurrence APC, was developed for modeling Protein-DNA binding cores. The new model is more compact than the traditional one-to-one pattern associations, as it packs many-to-many associations in one model, yet it is detailed enough to allow site-specific variants. An algorithm, based on Co-Support Score, was also developed to discover Protein-DNA Co-Occurrence APCs from Protein-DNA interaction sequences. This algorithm is 1600x faster in run-time than its contemporaries. 
New Protein-DNA binding cores indicated by Protein-DNA Co-Occurrence APCs were also discovered via homology modeling as a proof-of-concept. In Protein-Protein interaction, a new representation model, Protein-Protein Co-Occurrence APC, was developed for modeling the co-occurring sequence patterns in Protein-Protein Interaction between two protein sequences. A new algorithm, WeMine-P2P, was developed for sequence-based Protein-Protein Interaction machine learning prediction by constructing feature vectors leveraging Protein-Protein Co-Occurrence APCs, based on novel scores such as Match Score, MaxMatch Score and APC-PPI score. Through 40 independent experiments, it outperformed the well-known algorithm, PIPE2, which also uses co-occurring functional regions but does not allow variable widths and mutations. Both applications, to Protein-Protein and Protein-DNA interaction, indicate the potential use of Co-Occurrence APC for exploring other types of biosequence interaction in the future.
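
The seed-pattern step described in this abstract (patterns located and recorded on an address table) can be illustrated with a minimal sketch. This is not the PD-APCn implementation: the function name `build_address_table`, the fixed seed `width`, and the `min_support` threshold are all assumptions made for illustration, and the example sequences are invented.

```python
from collections import defaultdict

def build_address_table(sequences, width, min_support):
    """Map each frequent fixed-width substring to its (sequence index, offset) positions.

    Hypothetical helper: a seed is kept when it occurs in at least
    `min_support` distinct sequences.
    """
    table = defaultdict(list)
    for seq_idx, seq in enumerate(sequences):
        for offset in range(len(seq) - width + 1):
            table[seq[offset:offset + width]].append((seq_idx, offset))
    # Keep only substrings seen in enough distinct sequences.
    return {
        pattern: positions
        for pattern, positions in table.items()
        if len({seq_idx for seq_idx, _ in positions}) >= min_support
    }

# Invented toy sequences sharing the conserved region "HCH".
sequences = ["MKCAHCHT", "GGCAHCHA", "TTCSHCHK"]
seeds = build_address_table(sequences, width=3, min_support=3)
print(seeds)
```

Because the table stores every occurrence of a seed, a later extension step could look left and right of each recorded offset without rescanning the sequences, which is the kind of lookup the abstract attributes to the address table.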

    SAERMA: Stacked Autoencoder Rule Mining Algorithm for the Interpretation of Epistatic Interactions in GWAS for Extreme Obesity

    One of the most important challenges in the analysis of high-throughput genetic data is the development of efficient computational methods to identify statistically significant Single Nucleotide Polymorphisms (SNPs). Genome-wide association studies (GWAS) use single-locus analysis where each SNP is independently tested for association with phenotypes. The limitation of this approach, however, is its inability to explain genetic variation in complex diseases. Alternative approaches are required to model the intricate relationships between SNPs. Our proposed approach extends GWAS by combining deep learning stacked autoencoders (SAEs) and association rule mining (ARM) to identify epistatic interactions between SNPs. Following traditional GWAS quality control and association analysis, the most significant SNPs are selected and used in the subsequent analysis to investigate epistasis. SAERMA controls the classification results produced in the final fully connected multi-layer feedforward artificial neural network (MLP) by manipulating the interestingness measures, support and confidence, in the rule generation process. The best classification results were achieved with 204 SNPs compressed to 100 units (77% AUC, 77% SE, 68% SP, 53% Gini, logloss=0.58, and MSE=0.20), although it was possible to achieve 73% AUC (77% SE, 63% SP, 45% Gini, logloss=0.62, and MSE=0.21) with 50 hidden units, both supported by close model interpretation.
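
The two interestingness measures named in this abstract, support and confidence, have standard definitions in association rule mining, sketched below. The item names and transactions are invented for illustration and are not taken from the SAERMA study.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent) as support(A and C) / support(A)."""
    joint = support(transactions, set(antecedent) | set(consequent))
    base = support(transactions, antecedent)
    return joint / base if base > 0 else 0.0

# Invented toy data: each transaction lists the items observed for one sample.
transactions = [
    frozenset({"snp_a", "snp_b", "case"}),
    frozenset({"snp_a", "snp_b", "case"}),
    frozenset({"snp_a", "control"}),
    frozenset({"snp_b", "control"}),
]

print(support(transactions, {"snp_a", "snp_b"}))               # 0.5
print(confidence(transactions, {"snp_a", "snp_b"}, {"case"}))  # 1.0
print(confidence(transactions, {"snp_a"}, {"case"}))
```

Raising the minimum support or confidence threshold shrinks the set of rules that survive, which is the lever the abstract describes for controlling what reaches the final classifier.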

    A computational framework for protein-DNA binding discovery.

    Wong, Ka Chun. Thesis (M.Phil.), Chinese University of Hong Kong, 2010. Includes bibliographical references (leaves 109-121). Abstracts in English and Chinese.

    Contents:
    Abstract
    Acknowledgements
    List of Figures
    List of Tables
    Chapter 1. Introduction
      1.1 Motivation
      1.2 Objective
      1.3 Methodology
      1.4 Bioinformatics
      1.5 Computational Methods
        1.5.1 Evolutionary Algorithms
        1.5.2 Data Mining for TF-TFBS bindings
    Chapter 2. Background
      2.1 Gene Transcription
        2.1.1 Protein-DNA Binding
        2.1.2 Existing Methods
        2.1.3 Related Databases
          2.1.3.1 TRANSFAC - Experimentally Determined Database
          2.1.3.2 cisRED - Computational Determined Database
          2.1.3.3 ORegAnno - Community Driven Database
      2.2 Evolutionary Algorithms
        2.2.1 Representation
        2.2.2 Parent Selection
        2.2.3 Crossover Operators
        2.2.4 Mutation Operators
        2.2.5 Survival Selection
        2.2.6 Termination Condition
        2.2.7 Discussion
        2.2.8 Examples
          2.2.8.1 Genetic Algorithm
          2.2.8.2 Genetic Programming
          2.2.8.3 Differential Evolution
          2.2.8.4 Evolution Strategy
          2.2.8.5 Swarm Intelligence
      2.3 Association Rule Mining
        2.3.1 Objective
        2.3.2 Apriori Algorithm
        2.3.3 Partition Algorithm
        2.3.4 DHP
        2.3.5 Sampling
        2.3.6 Frequent Pattern Tree
    Chapter 3. Discovering Protein-DNA Binding Sequence Patterns Using Association Rule Mining
      3.1 Materials and Methods
        3.1.1 Association Rule Mining and Apriori Algorithm
        3.1.2 Discovering associated TF-TFBS sequence patterns
        3.1.3 Data Preparation
      3.2 Results and Analysis
        3.2.1 Rules Discovered
        3.2.2 Quantitative Analysis
        3.2.3 Annotation Analysis
        3.2.4 Empirical Analysis
        3.2.5 Experimental Analysis
      3.3 Verifications
        3.3.1 Verification by PDB
        3.3.2 Verification by Homology Modeling
        3.3.3 Verification by Random Analysis
      3.4 Discussion
    Chapter 4. Designing Evolutionary Algorithms for Multimodal Optimization
      4.1 Introduction
      4.2 Problem Definition
        4.2.1 Minimization
        4.2.2 Maximization
      4.3 An Evolutionary Algorithm with Species-specific Explosion for Multimodal Optimization
        4.3.1 Background
          4.3.1.1 Species Conserving Genetic Algorithm
        4.3.2 Evolutionary Algorithm with Species-specific Explosion
          4.3.2.1 Species Identification
          4.3.2.2 Species Seed Delta Evaluation
          4.3.2.3 Stage Switching Condition
          4.3.2.4 Species-specific Explosion
          4.3.2.5 Calculate Explosion Weights
        4.3.3 Experiments
          4.3.3.1 Performance measurement
          4.3.3.2 Parameter settings
          4.3.3.3 Results
        4.3.4 Conclusion
      4.4 A Crowding Genetic Algorithm with Spatial Locality for Multimodal Optimization
        4.4.1 Background
          4.4.1.1 Crowding Genetic Algorithm
          4.4.1.2 Locality of Reference
        4.4.2 Crowding Genetic Algorithm with Spatial Locality
          4.4.2.1 Motivation
          4.4.2.2 Offspring generation with spatial locality
        4.4.3 Experiments
          4.4.3.1 Performance measurements
          4.4.3.2 Parameter setting
          4.4.3.3 Results
        4.4.4 Conclusion
    Chapter 5. Generalizing Protein-DNA Binding Sequence Representations and Learning using an Evolutionary Algorithm for Multimodal Optimization
      5.1 Introduction and Background
      5.2 Problem Definition
      5.3 Crowding Genetic Algorithm with Spatial Locality
        5.3.1 Representation
        5.3.2 Crossover Operators
        5.3.3 Mutation Operators
        5.3.4 Fitness Function
        5.3.5 Distance Metric
      5.4 Experiments
        5.4.1 Parameter Setting
        5.4.2 Search Space Estimation
        5.4.3 Experimental Procedure
        5.4.4 Results and Analysis
          5.4.4.1 Generalization Analysis
          5.4.4.2 Verification By PDB
      5.5 Conclusion
    Chapter 6. Predicting Protein Structures on a Lattice Model using an Evolutionary Algorithm for Multimodal Optimization
      6.1 Introduction
      6.2 Problem Definition
      6.3 Representation
      6.4 Related Works
      6.5 Crowding Genetic Algorithm with Spatial Locality
        6.5.1 Motivation
        6.5.2 Customization
          6.5.2.1 Distance metrics
          6.5.2.2 Handling infeasible conformations
      6.6 Experiments
        6.6.1 Performance Metrics
        6.6.2 Parameter Settings
        6.6.3 Results
      6.7 Conclusion
    Chapter 7. Conclusion and Future Work
      7.1 Thesis Contribution
      7.2 Future Work
    Appendix A
      A.1 Problem Definition in Chapter 3
    Bibliography
    Author's Publications

    Bioinformatics

    This book is divided into research areas relevant to Bioinformatics, such as biological networks, next generation sequencing, high performance computing, molecular modeling, structural bioinformatics and intelligent data analysis. Each book section introduces the basic concepts and then explains their application to problems of great relevance, so both novice and expert readers can benefit from the information and research works presented here.

    Biological Networks: Modeling and Structural Analysis

    Biological networks are receiving increased attention due to their importance in understanding life at the cellular level. There exist many different kinds of biological networks, and different models have been proposed for them. In this dissertation we focus on finding suitable network models for representing experimental data on protein interaction networks and protein complex networks (protein complexes are groups of proteins that associate to accomplish some function in the cell), and on designing algorithms for exploring such networks. Our goal is to enable biologists to identify the general principles that govern the organization of protein-protein interaction networks and protein complex networks. For protein complex networks, we propose a hypergraph model which represents the data more accurately than earlier models. We define the concept of k-cores in hypergraphs, which are highly connected subhypergraphs, and design an algorithm for computing k-cores in hypergraphs. A major challenge in computational systems biology is to understand the modular structure of biological networks. We construct computational models for predicting functional modules through the use of graph clustering techniques. The application of earlier graph clustering techniques to proteomic networks does not yield good results due to the high error rates present, and the small-world and power-law properties of these networks. We discuss the various requirements that clusterings of biological networks must satisfy, design an algorithm for computing a clustering, and show that our clustering approach is robust and scalable. Moreover, we design a new algorithm to compute overlapping clusterings rather than exclusive clusterings. Our approach identifies a set of clusters and a set of bridge proteins that form the overlap among the clusters. Finally, we assess the quality of our proposed clusterings using different reference sets.
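
The idea of a k-core in a hypergraph can be illustrated with a simple peeling procedure. The abstract does not give the dissertation's exact definition, so this sketch assumes one: a vertex stays only while it belongs to at least k surviving hyperedges, and a hyperedge (for example, a protein complex) is discarded as soon as any of its vertices is removed. The function name and the example data are hypothetical.

```python
def hypergraph_k_core(hyperedges, k):
    """Peel vertices of degree < k; under the assumed definition, a hyperedge
    is dropped whenever it loses a vertex.

    Returns (surviving vertices, surviving hyperedges).
    """
    edges = {i: frozenset(e) for i, e in enumerate(hyperedges)}
    changed = True
    while changed:
        changed = False
        # Degree of a vertex = number of surviving hyperedges containing it.
        degree = {}
        for e in edges.values():
            for v in e:
                degree[v] = degree.get(v, 0) + 1
        weak = {v for v, d in degree.items() if d < k}
        if weak:
            edges = {i: e for i, e in edges.items() if not (e & weak)}
            changed = True
    vertices = set().union(*edges.values()) if edges else set()
    return vertices, [set(e) for e in edges.values()]

# Invented example: five "complexes" over proteins A to E.
hyperedges = [
    {"A", "B", "C"}, {"A", "B", "D"}, {"A", "C", "D"},
    {"B", "C", "D"}, {"D", "E"},
]
vertices, edges = hypergraph_k_core(hyperedges, 2)
print(sorted(vertices))  # ['A', 'B', 'C', 'D']
```

Other definitions shrink a hyperedge instead of deleting it when a vertex is removed; the peeling loop stays the same and only the update of `edges` changes.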

    Multi-species integrative biclustering

    We describe an algorithm, multi-species cMonkey, for the simultaneous biclustering of heterogeneous multiple-species data collections and apply the algorithm to a group of bacteria containing Bacillus subtilis, Bacillus anthracis, and Listeria monocytogenes. The algorithm reveals evolutionary insights into the surprisingly high degree of conservation of regulatory modules across these three species and allows data and insights from well-studied organisms to complement the analysis of related but less well-studied organisms.

    Machine Learning Applications for Drug Repurposing

    The cost of bringing a drug to market is astounding and the failure rate is intimidating. Drug discovery has had limited success under the conventional reductionist one-drug-one-gene-one-disease paradigm, where a single disease-associated gene is identified and a molecular binder to the specific target is subsequently designed. Under this simplistic paradigm, a drug molecule is assumed to interact only with its intended target. However, small-molecule drugs often interact with multiple targets, and those off-target interactions are not considered under the conventional paradigm. As a result, drug-induced side effects and adverse reactions are often neglected until a very late stage of drug discovery, where the discovery of drug-induced side effects and potential drug resistance can decrease the value of the drug and even completely invalidate its use. Thus, a new paradigm in drug discovery is needed. Structural systems pharmacology is a new paradigm in drug discovery in which drug activities are studied through data-driven large-scale models that consider the structures and drugs. Structural systems pharmacology will model, on a genome scale, the energetic and dynamic modifications of protein targets by drug molecules as well as the subsequent collective effects of drug-target interactions on the phenotypic drug responses. To date, however, few experimental and computational methods can determine genome-wide protein-ligand interaction networks and the clinical outcomes mediated by them. As a result, the small-molecule ligands of the majority of proteins have not been charted, and we have a limited understanding of drug actions. To address this challenge, this dissertation seeks to develop and experimentally validate innovative computational methods to infer genome-wide protein-ligand interactions and multi-scale drug-phenotype associations, including drug-induced side effects. 
The hypothesis is that the integration of data-driven bioinformatics tools with structure-and-mechanism-based molecular modeling methods will lead to an optimal tool for accurately predicting drug actions and drug-associated phenotypic responses, such as side effects. This dissertation starts by reviewing the current status of computational drug discovery for complex diseases in Chapter 1. In Chapter 2, we present REMAP, a one-class collaborative filtering method to predict off-target interactions from a protein-ligand interaction network. In our later work, REMAP was integrated with structural genomics and statistical machine learning methods to design a dual-indication polypharmacological anticancer therapy. In Chapter 3, we extend REMAP, the core method in Chapter 2, into a multi-ranked collaborative filtering algorithm, WINTF, and present relevant mathematical justifications. Chapter 4 is an application of WINTF to repurpose an FDA-approved drug, diazoxide, as a potential treatment for triple negative breast cancer, a deadly subtype of breast cancer. In Chapter 5, we present a multilayer extension of REMAP, applied to predict drug-induced side effects and the associated biological pathways. In Chapter 6, we close this dissertation by presenting a deep learning application to learn biochemical features from protein sequence representation using a natural language processing method.

    Deep learning methods for mining genomic sequence patterns

    Nowadays, with the growing availability of large-scale genomic datasets and advanced computational techniques, more and more data-driven computational methods have been developed to analyze genomic data and help to solve incompletely understood biological problems. Among them, deep learning methods have been proposed to automatically learn and recognize the functional activity of DNA sequences from genomics data. Techniques for efficiently mining genomic sequence patterns will help to improve our understanding of gene regulation, and thus accelerate our progress toward using personal genomes in medicine. This dissertation focuses on the development of deep learning methods for mining genomic sequences. First, we compare the performance of deep learning models and traditional machine learning methods in recognizing various genomic sequence patterns. Through extensive experiments on both simulated data and real genomic sequence data, we demonstrate that an appropriate deep learning model can generally be built to recognize various genomic sequence patterns successfully. Next, we develop deep learning methods to help solve two specific biological problems: (1) inference of the polyadenylation code and (2) tRNA gene detection and functional prediction. Polyadenylation is a pervasive mechanism used by eukaryotes for regulating mRNA transcription, localization, and translation efficiency. Polyadenylation signals in plants are particularly noisy and challenging to decipher. A deep convolutional neural network approach, DeepPolyA, is proposed to predict poly(A) sites from genomic sequences of the plant Arabidopsis thaliana. It employs various deep neural network architectures and demonstrates its superiority in comparison with competing methods, including classical machine learning algorithms and several popular deep learning models. Transfer RNAs (tRNAs) represent a highly complex class of genes and play a central role in protein translation. 
There remains a de facto tool, tRNAscan-SE, for identifying tRNA genes encoded in genomes. Despite its popularity and success, tRNAscan-SE is still not powerful enough to separate tRNAs from pseudo-tRNAs, and a significant number of false positives can be output as a result. To address this issue, tRNA-DL, a hybrid approach combining a convolutional neural network and a recurrent neural network, is proposed. It is shown that the proposed method can help to substantially reduce the false positive rate of the state-of-the-art tRNA prediction tool tRNAscan-SE. Coupled with tRNAscan-SE, tRNA-DL can serve as a useful complementary tool for tRNA annotation. Taken together, the experiments and applications demonstrate the superiority of deep learning in automatic feature generation for characterizing genomic sequence patterns.

    Computational Methods for the Analysis of Genomic Data and Biological Processes

    In recent decades, new technologies have made remarkable progress in helping to understand biological systems. Rapid advances in genomic profiling techniques such as microarrays or high-performance sequencing have brought new opportunities and challenges in the fields of computational biology and bioinformatics. Such genetic sequencing techniques allow large amounts of data to be produced, whose analysis and cross-integration could provide a complete view of organisms. As a result, it is necessary to develop new techniques and algorithms that analyze these data reliably and efficiently. This Special Issue collected the latest advances in the field of computational methods for the analysis of gene expression data, and, in particular, the modeling of biological processes. Here we present eleven works selected to be published in this Special Issue due to their interest, quality, and originality.

    Application of machine learning techniques on the discovery and annotation of transposons in genomes

    Integrated master's thesis. Informatics and Computing Engineering. Faculdade de Engenharia. Universidade do Porto. 201