
    The EM Algorithm and the Rise of Computational Biology

    In the past decade computational biology has grown from a cottage industry with a handful of researchers to an attractive interdisciplinary field, catching the attention and imagination of many quantitatively-minded scientists. Of interest to us is the key role played by the EM algorithm during this transformation. We survey the use of the EM algorithm in a few important computational biology problems surrounding the "central dogma" of molecular biology: from DNA to RNA and then to proteins. Topics of this article include sequence motif discovery, protein sequence alignment, population genetics, evolutionary models and mRNA expression microarray data analysis. Comment: Published at http://dx.doi.org/10.1214/09-STS312 in Statistical Science (http://www.imstat.org/sts/) by the Institute of Mathematical Statistics (http://www.imstat.org

    Malware Detection Using Dynamic Analysis

    In this research, we explore the field of dynamic analysis, which has shown promising results in malware detection. Here, we extract dynamic software birthmarks during malware execution and apply machine learning based detection techniques to the resulting feature set. Specifically, we consider Hidden Markov Models and Profile Hidden Markov Models. To determine the effectiveness of this dynamic analysis approach, we compare our detection results to the results obtained by using static analysis. We show that in some cases, significantly stronger results can be obtained using our dynamic approach.
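
    As a rough illustration of HMM-based scoring, the sketch below computes the log-likelihood of an observed symbol sequence with the scaled forward algorithm and compares the per-symbol score to a threshold. The two-state model, the two-symbol alphabet, the threshold, and the helper names are all hypothetical; the abstract does not specify the features, models, or decision rule actually used.

```python
import math

# Hypothetical toy model: 2 hidden states, 2 observable symbols.
START = [0.6, 0.4]                 # START[i] = P(first state is i)
TRANS = [[0.7, 0.3], [0.4, 0.6]]   # TRANS[i][j] = P(next state j | state i)
EMIT = [[0.9, 0.1], [0.2, 0.8]]    # EMIT[i][s] = P(symbol s | state i)


def forward_log_likelihood(obs, start, trans, emit):
    """Return log P(obs | model) using the scaled forward algorithm."""
    if not obs:
        return 0.0  # the empty sequence has probability 1
    n = len(start)
    alpha = [start[i] * emit[i][obs[0]] for i in range(n)]
    log_prob = 0.0
    for t in range(len(obs)):
        if t > 0:
            # Propagate one step, then weight by the emission probability.
            alpha = [
                sum(alpha[i] * trans[i][j] for i in range(n)) * emit[j][obs[t]]
                for j in range(n)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")  # sequence is impossible under the model
        log_prob += math.log(scale)
        alpha = [a / scale for a in alpha]  # rescale to avoid underflow
    return log_prob


def matches_model(obs, threshold=-0.9):
    """Hypothetical decision rule: per-symbol log-likelihood above a threshold."""
    return forward_log_likelihood(obs, START, TRANS, EMIT) / len(obs) >= threshold
```

    Normalising by sequence length keeps scores comparable across traces of different lengths; in practice the threshold would be chosen on held-out data.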

    Profile Context-Sensitive HMMs for Probabilistic Modeling of Sequences With Complex Correlations

    The profile hidden Markov model is a specific type of HMM that is well suited for describing the common features of a set of related sequences. It has been extensively used in computational biology, where it is still one of the most popular tools. In this paper, we propose a new model called the profile context-sensitive HMM. Unlike traditional profile-HMMs, the proposed model is capable of describing complex long-range correlations between distant symbols in a consensus sequence. We also introduce a general algorithm that can be used for finding the optimal state-sequence of an observed symbol sequence based on the given profile-csHMM. The proposed model has an important application in RNA sequence analysis, especially in modeling and analyzing RNA pseudoknots.
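
    For orientation, finding the optimal state sequence in an ordinary HMM is done with the Viterbi algorithm, sketched below in log space. This is not the paper's algorithm for profile-csHMMs, which must also handle long-range correlations; the toy model and all names here are assumptions made for illustration.

```python
import math

# Hypothetical toy model: 2 hidden states, 2 observable symbols.
START = [0.6, 0.4]
TRANS = [[0.7, 0.3], [0.4, 0.6]]
EMIT = [[0.9, 0.1], [0.2, 0.8]]


def _log(p):
    """Logarithm that maps probability 0 to -infinity instead of raising."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi(obs, start, trans, emit):
    """Return (most likely state path, its log-probability) for a plain HMM."""
    n = len(start)
    score = [_log(start[i]) + _log(emit[i][obs[0]]) for i in range(n)]
    back = []  # back[t][j] = best predecessor of state j at step t + 1
    for symbol in obs[1:]:
        prev, score, pointers = score, [], []
        for j in range(n):
            best_i = max(range(n), key=lambda i: prev[i] + _log(trans[i][j]))
            pointers.append(best_i)
            score.append(prev[best_i] + _log(trans[best_i][j]) + _log(emit[j][symbol]))
        back.append(pointers)
    # Trace back from the best final state.
    state = max(range(n), key=lambda i: score[i])
    best_score = score[state]
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path, best_score
```

    Calling `viterbi([0, 0, 1, 1], START, TRANS, EMIT)` on this toy model returns the path `[0, 0, 1, 1]` together with its log-probability.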

    DeepSF: deep convolutional neural network for mapping protein sequences to folds

    Motivation: Protein fold recognition is an important problem in structural bioinformatics. Almost all traditional fold recognition methods use sequence (homology) comparison to indirectly predict the fold of a target protein based on the fold of a template protein with known structure, which cannot explain the relationship between sequence and fold. Only a few methods have been developed to classify protein sequences into a small number of folds due to methodological limitations, and these are not generally useful in practice. Results: We develop a deep 1D-convolution neural network (DeepSF) to directly classify any protein sequence into one of 1195 known folds, which is useful for both fold recognition and the study of the sequence-structure relationship. Unlike traditional sequence alignment (comparison) based methods, our method automatically extracts fold-related features from a protein sequence of any length and maps it to the fold space. We train and test our method on the datasets curated from SCOP1.75, yielding a classification accuracy of 80.4%. On the independent testing dataset curated from SCOP2.06, the classification accuracy is 77.0%. We compare our method with a top profile-profile alignment method, HHSearch, on hard template-based and template-free modeling targets of CASP9-12 in terms of fold recognition accuracy. The accuracy of our method is 14.5%-29.1% higher than HHSearch on template-free modeling targets and 4.5%-16.7% higher on hard template-based modeling targets for top 1, 5, and 10 predicted folds. The hidden features extracted from sequence by our method are robust against sequence mutation, insertion, deletion and truncation, and can be used for other protein pattern recognition problems such as protein clustering, comparison and ranking. Comment: 28 pages, 13 figures
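
    One way to see how a 1D convolution can map a sequence of any length to a fixed-size representation is the single-layer sketch below: one-hot encode the residues, slide each filter along the sequence, apply a ReLU, and keep the maximum activation per filter. This is not the DeepSF architecture, whose layers and pooling scheme are not described in the abstract; the hand-written motif filters and function names are hypothetical stand-ins for learned weights.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot(sequence):
    """Encode each residue as a 20-dimensional indicator vector."""
    return [[1.0 if aa == residue else 0.0 for aa in AMINO_ACIDS]
            for residue in sequence]


def motif_filter(motif):
    """Hypothetical filter that responds most strongly to an exact motif."""
    return one_hot(motif)  # one weight row per filter position


def conv1d_global_max(encoded, filters):
    """Apply each filter along the sequence, then ReLU and global max pooling.

    The output has one value per filter, regardless of sequence length.
    """
    features = []
    for kernel in filters:
        width = len(kernel)
        best = 0.0  # ReLU floor: negative activations are clipped to zero
        for start in range(len(encoded) - width + 1):
            activation = sum(
                kernel[k][c] * encoded[start + k][c]
                for k in range(width)
                for c in range(len(AMINO_ACIDS))
            )
            best = max(best, activation)
        features.append(best)
    return features


FILTERS = [motif_filter("AC"), motif_filter("KLM")]
short = conv1d_global_max(one_hot("GGACG"), FILTERS)
long = conv1d_global_max(one_hot("GGKLMGGGGGGACGG"), FILTERS)
```

    Both `short` and `long` have length 2 (one entry per filter), which is what allows a downstream classifier to accept sequences of arbitrary length.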

    Genetic Barcode Identification With Profile Hidden Markov Models

    DNA barcoding is a method that uses an organism’s DNA to identify its species. The gene cytochrome c oxidase I (COI) has been used effectively as a DNA barcode to identify organisms and elucidate relationships among species [1]. The Barcode Of Life Database (BOLD) contains COI sequences used for DNA barcoding for more than 1 million different species. Using BOLD to identify samples that have a match in the database is an uncomplicated process. However, this method cannot identify samples that are absent from the database. Given a sample that is not represented in BOLD but is similar to a represented sequence, it would be valuable to describe the sample at a higher taxonomic classification. Since COI is represented as long character sequences of amino acids, Hidden Markov Models (HMMs) can be used to associate an unknown DNA sequence with a taxonomic rank. In this work, I show that dynamically created Profile HMMs are an effective tool for such identification.