184 research outputs found

    Recognition of short functional motifs in protein sequences

    Get PDF
    The main goal of this study was to develop a method for computational de novo prediction of short linear motifs (SLiMs) in protein sequences that would provide advantages over existing solutions for the users. The users are typically biological laboratory researchers, who want to elucidate the function of a protein that is possibly mediated by a short motif. Such a process can be subcellular localization, secretion, post-translational modification or degradation of proteins. Conducting such studies only with experimental techniques is often associated with high costs and risks of uncertainty. Preliminary prediction of putative motifs with computational methods, them being fast and much less expensive, provides possibilities for generating hypotheses and therefore, more directed and efficient planning of experiments. To meet this goal, I have developed HH-MOTiF – a web-based tool for de novo discovery of SLiMs in a set of protein sequences. While working on the project, I have also detected patterns in sequence properties of certain SLiMs that make their de novo prediction easier. As some of these patterns are not yet described in the literature, I am sharing them in this thesis. While evaluating and comparing motif prediction results, I have identified conceptual gaps in theoretical studies, as well as existing practical solutions for comparing two sets of positional data annotating the same set of biological sequences. To close this gap and to be able to carry out in-depth performance analyses of HH-MOTiF in comparison to other predictors, I have developed a corresponding statistical method, SLALOM (for StatisticaL Analysis of Locus Overlap Method). It is currently available as a standalone command line tool

    Characterizing and Accelerating Bioinformatics Workloads on Modern Microarchitectures

    Get PDF
    Bioinformatics, the use of computer techniques to analyze biological data, has been a particularly active research field in the last two decades. Advances in this field have contributed to the collection of enormous amounts of data, and the sheer amount of available data has started to overtake the processing capability possible with current computer systems. Clearly, computer architects need to have a better understanding of how bioinformatics applications work and what kind of architectural techniques could be used to accelerate these important scientific workloads on future processors. In this dissertation, we develop a bioinformatic benchmark suite and provide a detailed characterization of these applications in common use today from a computer architect's point of view. We analyze a wide range of detailed execution characteristics including instruction mix, IPC measurements, L1 and L2 cache misses on a real architecture; and proceed to analyze the workloads' memory access characteristics. We then concentrate on accelerating a particularly computationally intensive bioinformatics workload on the novel Cell Broadband Engine multiprocessor architecture. The HMMER workload is used for protein profile searching using hidden Markov models, and most of its execution time is spent running the Viterbi algorithm. We parallelize and partition the HMMER application to implement it on the Cell Broadband Engine. In order to run the Viterbi algorithm on the 256KB local stores of the Cell BE synergistic processing units (SPEs), we present a method to develop a fast SIMD implementation of the Viterbi algorithm that reduces the storage requirements significantly. Our HMMER implementation for the Cell BE architecture, Cell-HMMER, exploits the multiple levels of parallelism inherent in this application, and can run protein profile searches up to 27.98 times faster than a modern dual-core x86 microprocessor

    Probabilistic grammatical model of protein language and its application to helix-helix contact site classification

    Get PDF
    BACKGROUND: Hidden Markov Models power many state‐of‐the‐art tools in the field of protein bioinformatics. While excelling in their tasks, these methods of protein analysis do not convey directly information on medium‐ and long‐range residue‐residue interactions. This requires an expressive power of at least context‐free grammars. However, application of more powerful grammar formalisms to protein analysis has been surprisingly limited. RESULTS: In this work, we present a probabilistic grammatical framework for problem‐specific protein languages and apply it to classification of transmembrane helix‐helix pairs configurations. The core of the model consists of a probabilistic context‐free grammar, automatically inferred by a genetic algorithm from only a generic set of expert‐based rules and positive training samples. The model was applied to produce sequence based descriptors of four classes of transmembrane helix‐helix contact site configurations. The highest performance of the classifiers reached AUCROC of 0.70. The analysis of grammar parse trees revealed the ability of representing structural features of helix‐helix contact sites. CONCLUSIONS: We demonstrated that our probabilistic context‐free framework for analysis of protein sequences outperforms the state of the art in the task of helix‐helix contact site classification. However, this is achieved without necessarily requiring modeling long range dependencies between interacting residues. A significant feature of our approach is that grammar rules and parse trees are human‐readable. Thus they could provide biologically meaningful information for molecular biologists

    Recognition of short functional motifs in protein sequences

    Get PDF
    The main goal of this study was to develop a method for computational de novo prediction of short linear motifs (SLiMs) in protein sequences that would provide advantages over existing solutions for the users. The users are typically biological laboratory researchers, who want to elucidate the function of a protein that is possibly mediated by a short motif. Such a process can be subcellular localization, secretion, post-translational modification or degradation of proteins. Conducting such studies only with experimental techniques is often associated with high costs and risks of uncertainty. Preliminary prediction of putative motifs with computational methods, them being fast and much less expensive, provides possibilities for generating hypotheses and therefore, more directed and efficient planning of experiments. To meet this goal, I have developed HH-MOTiF – a web-based tool for de novo discovery of SLiMs in a set of protein sequences. While working on the project, I have also detected patterns in sequence properties of certain SLiMs that make their de novo prediction easier. As some of these patterns are not yet described in the literature, I am sharing them in this thesis. While evaluating and comparing motif prediction results, I have identified conceptual gaps in theoretical studies, as well as existing practical solutions for comparing two sets of positional data annotating the same set of biological sequences. To close this gap and to be able to carry out in-depth performance analyses of HH-MOTiF in comparison to other predictors, I have developed a corresponding statistical method, SLALOM (for StatisticaL Analysis of Locus Overlap Method). It is currently available as a standalone command line tool

    Evolutionary Computation

    Get PDF
    This book presents several recent advances on Evolutionary Computation, specially evolution-based optimization methods and hybrid algorithms for several applications, from optimization and learning to pattern recognition and bioinformatics. This book also presents new algorithms based on several analogies and metafores, where one of them is based on philosophy, specifically on the philosophy of praxis and dialectics. In this book it is also presented interesting applications on bioinformatics, specially the use of particle swarms to discover gene expression patterns in DNA microarrays. Therefore, this book features representative work on the field of evolutionary computation and applied sciences. The intended audience is graduate, undergraduate, researchers, and anyone who wishes to become familiar with the latest research work on this field

    ApHMM: Accelerating Profile Hidden Markov Models for Fast and Energy-Efficient Genome Analysis

    Full text link
    Profile hidden Markov models (pHMMs) are widely employed in various bioinformatics applications to identify similarities between biological sequences, such as DNA or protein sequences. In pHMMs, sequences are represented as graph structures. These probabilities are subsequently used to compute the similarity score between a sequence and a pHMM graph. The Baum-Welch algorithm, a prevalent and highly accurate method, utilizes these probabilities to optimize and compute similarity scores. However, the Baum-Welch algorithm is computationally intensive, and existing solutions offer either software-only or hardware-only approaches with fixed pHMM designs. We identify an urgent need for a flexible, high-performance, and energy-efficient HW/SW co-design to address the major inefficiencies in the Baum-Welch algorithm for pHMMs. We introduce ApHMM, the first flexible acceleration framework designed to significantly reduce both computational and energy overheads associated with the Baum-Welch algorithm for pHMMs. ApHMM tackles the major inefficiencies in the Baum-Welch algorithm by 1) designing flexible hardware to accommodate various pHMM designs, 2) exploiting predictable data dependency patterns through on-chip memory with memoization techniques, 3) rapidly filtering out negligible computations using a hardware-based filter, and 4) minimizing redundant computations. ApHMM achieves substantial speedups of 15.55x - 260.03x, 1.83x - 5.34x, and 27.97x when compared to CPU, GPU, and FPGA implementations of the Baum-Welch algorithm, respectively. ApHMM outperforms state-of-the-art CPU implementations in three key bioinformatics applications: 1) error correction, 2) protein family search, and 3) multiple sequence alignment, by 1.29x - 59.94x, 1.03x - 1.75x, and 1.03x - 1.95x, respectively, while improving their energy efficiency by 64.24x - 115.46x, 1.75x, 1.96x.Comment: Accepted to ACM TAC

    The 5th Conference of PhD Students in Computer Science

    Get PDF

    Integrated data-driven techniques for environmental pollution monitoring

    Get PDF
    The adverse health e_x000B_ffects of tropospheric ozone around urban zones indicate a substantial risk for many segments of the population. This necessitates the short term forecast in order to take evasive action on days conducive to ozone formation. Therefore it is important to study the ozone formation mechanisms and predict the ozone levels in a geographic region. Multivariate statistical techniques provide a very e_x000B_ffective framework for the classifi_x000C_cation and monitoring of systems with multiple variables. Cluster analysis, sequence analysis and hidden Markov models (HMMs) are statistical methods which have been used in a wide range of studies to model the data structure. In this dissertation, we propose to formulate, implement and apply a data-driven computational framework for air quality monitoring and forecasting with application to ozone formation. The proposed framework integrates, in a unique way, advanced statistical data processing and analysis tools to investigate ozone formation mechanisms and predict the ozone levels in a geographic region. This dissertation focuses on cluster analysis for identi_x000C_fication and classi_x000C_fication of underlying mechanisms of a system and HMMs for predicting the occurrence of an extreme event in a system. The usefulness of the proposed methodology in air quality monitoring is demonstrated by applying it to study the ozone problem in Houston, Texas and Baton Rouge, Louisiana regions. Hierarchical clustering is used to visualize air flow patterns at two time scales relevant for ozone buildup. First, clustering is performed at the hourly time scale to identify surface flow patterns. Then, sequencing is performed at the daily time scale to identify groups of days sharing similar diurnal cycles for the surface flow. Selection of appropriate numbers of air flow patterns allowed inference of regional transport and dispersion patterns for understanding population exposure to ozone. This dissertation proposes to build HMMs for ozone prediction using air quality and meteorological measurements obtained from a network of surface monitors. The case study of the Houston, Texas region for the 2004 and 2005 ozone seasons showed that the results indicate the capability of HMMs as a simpler forecasting tool
    • 

    corecore