5,061 research outputs found

Sequence alignment, mutual information, and dissimilarity measures for constructing phylogenies

Author: A Kraskov
A Milosavljević
G Navarro
J Felsenstein
J Lake
J Rissanen
J Thompson
J Varre
Konrad Scheffler
L Allison
M Brudno
M Cao
M Li
M Mahoney
M Nei
M Steel
Maya Paczuski
N Bray
N Saitou
Orion Penner
P Buneman
P Lockhart
P Viola
Peter Grassberger
R Cilibrasi
R Durbin
S Altschul
S McGinnis
S Vinga
T Cover
T Lassmann
W Press
X Chen
Publication venue: Public Library of Science (PLoS)
Publication date: 19/08/2010

Existing sequence alignment algorithms use heuristic scoring schemes which cannot be used as objective distance metrics. Therefore one relies on measures like the p- or log-det distances, or makes explicit, and often simplistic, assumptions about sequence evolution. Information theory provides an alternative, in the form of mutual information (MI) which is, in principle, an objective and model-independent similarity measure. MI can be estimated by concatenating and zipping sequences, thereby yielding the "normalized compression distance". So far this has produced promising results, but with uncontrolled errors. We describe a simple approach to get robust estimates of MI from global pairwise alignments. Using standard alignment algorithms, this gives for animal mitochondrial DNA estimates that are strikingly close to estimates obtained from the alignment-free methods mentioned above. Our main result uses algorithmic (Kolmogorov) information theory, but we show that similar results can also be obtained from Shannon theory. Because it is not additive, normalized compression distance is not an optimal metric for phylogenetics, but we propose a simple modification that overcomes the issue of additivity. We test several versions of our MI-based distance measures on a large number of randomly chosen quartets and demonstrate that they all perform better than traditional measures like the Kimura or log-det (resp. paralinear) distances. Even a simplified version based on single-letter Shannon entropies, which can be easily incorporated in existing software packages, gave superior results throughout the entire animal kingdom. But we see the main virtue of our approach in a more general way. For example, it can also help to judge the relative merits of different alignment algorithms, by estimating the significance of specific alignments. Comment: 19 pages + 16 pages of supplementary material
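The "concatenate and zip" estimate mentioned in this abstract can be illustrated with a minimal sketch. It assumes Python's standard `zlib` as a stand-in compressor (the abstract does not say which compressor was used), and the function names `compressed_size`, `mi_estimate`, and `ncd` are illustrative only. It does not reproduce the authors' alignment-based estimator or their proposed additive modification.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (assumed stand-in compressor)."""
    return len(zlib.compress(data, 9))


def mi_estimate(x: bytes, y: bytes) -> int:
    """Rough mutual-information proxy: C(x) + C(y) - C(xy), in bytes."""
    return compressed_size(x) + compressed_size(y) - compressed_size(x + y)


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance: (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)
```

Closely related sequences should give a small `ncd` and a large `mi_estimate`, while unrelated sequences should give an `ncd` near 1. With a general-purpose compressor such as zlib the values are only approximate, which is in line with the "uncontrolled errors" the abstract mentions.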

arXiv.org e-Print Archive

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

IMT Institutional Repository

Back-translation for discovering distant protein homologies

Author: A. Pedersen
B. Oostra
C. Kosiol
J. Leluk
J. Raes
K. Okamura
L. Arvestad
L. Delaye
M. Clamp
M. Pellegrini
P. Harrison
P. Lio
R. Blake
S. Altschul
Y. Hahn
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2009

Frameshift mutations in protein-coding DNA sequences produce a drastic change in the resulting protein sequence, which prevents classic protein alignment methods from revealing the proteins' common origin. Moreover, when a large number of substitutions are additionally involved in the divergence, homology detection becomes difficult even at the DNA level. To cope with this situation, we propose a novel method to infer distant homology relations between two proteins that accounts for frameshift and point mutations that may have affected the coding sequences. We design a dynamic programming alignment algorithm over memory-efficient graph representations of the complete set of putative DNA sequences of each protein, with the goal of determining the two putative DNA sequences which have the best scoring alignment under a powerful scoring system designed to reflect the most probable evolutionary process. This allows us to uncover evolutionary information that is not captured by traditional alignment methods, which is confirmed by biologically significant examples. Comment: The 9th International Workshop on Algorithms in Bioinformatics (WABI), Philadelphia, United States (2009)
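To see why this abstract calls for memory-efficient graph representations, it helps to count how many DNA sequences can encode a given protein. The sketch below assumes the standard genetic code (61 sense codons) and simply multiplies the per-residue codon counts; the names `CODON_COUNTS` and `count_back_translations` are hypothetical, and this is not the authors' alignment algorithm.

```python
# Number of codons per amino acid in the standard genetic code (assumption:
# standard code, stop codons excluded). The counts sum to 61 sense codons.
CODON_COUNTS = {
    "L": 6, "R": 6, "S": 6,
    "A": 4, "G": 4, "P": 4, "T": 4, "V": 4,
    "I": 3,
    "C": 2, "D": 2, "E": 2, "F": 2, "H": 2, "K": 2, "N": 2, "Q": 2, "Y": 2,
    "M": 1, "W": 1,
}


def count_back_translations(protein: str) -> int:
    """Number of distinct DNA sequences that translate to `protein`."""
    total = 1
    for residue in protein:
        total *= CODON_COUNTS[residue]  # raises KeyError on unknown letters
    return total
```

Because the count is a product over residues, it grows exponentially with protein length, so enumerating every putative DNA sequence explicitly is impractical for all but very short peptides.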

arXiv.org e-Print Archive

CiteSeerX

HAL - Lille 3

Crossref

INRIA a CCSD electronic archive server

Sequence-specific sequence comparison using pairwise statistical significance

Author: Agrawal Ankit
Publication venue: Iowa State University Digital Repository
Publication date: 01/01/2009

Sequence comparison is one of the most fundamental computational problems in bioinformatics, for which many approaches have been and are still being developed. In particular, pairwise sequence alignment forms the crux of both DNA and protein sequence comparison techniques, which in turn form the basis of many other applications in bioinformatics. Pairwise sequence alignment methods align two sequences using a substitution matrix consisting of pairwise scores for aligning different residues with each other (like BLOSUM62), and give an alignment score for the given sequence pair. Biologists routinely use such pairwise alignment programs to identify similar, or more specifically, related sequences (having a common ancestor). It is widely accepted that the relatedness of two sequences is better judged by the statistical significance of the alignment score than by the alignment score alone. This research addresses the problem of accurately estimating the statistical significance of pairwise alignment for the purpose of identifying related sequences, by making the sequence comparison process more sequence-specific. The major contributions of this research work are as follows. Firstly, sequence-specific strategies for pairwise sequence alignment are used in conjunction with sequence-specific strategies for statistical significance estimation, wherein accurate methods for pairwise statistical significance estimation using standard, sequence-specific, and position-specific substitution matrices are developed. Secondly, pairwise statistical significance is used to improve the performance of the most popular database search program, PSI-BLAST. Thirdly, heuristics are designed and implemented to speed up pairwise statistical significance estimation by a factor of more than 200. The implementation of all the methods developed in this work is freely available online.
With the all-pervasive application of sequence alignment methods in bioinformatics using the ever-increasing sequence data, this work is expected to offer useful contributions to the research community.
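The idea that an alignment score is better judged by its statistical significance than by its raw value can be sketched with a simple shuffle test. The code below is an illustration only and rests on several assumptions: it uses a plain match/mismatch scheme with linear gaps instead of a substitution matrix such as BLOSUM62, and it reports an empirical p-value from shuffled sequences rather than the estimation methods developed in the thesis. The names `local_alignment_score` and `empirical_p_value` are hypothetical.

```python
import random


def local_alignment_score(a: str, b: str, match: int = 2,
                          mismatch: int = -1, gap: int = -2) -> int:
    """Best Smith-Waterman local alignment score (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def empirical_p_value(a: str, b: str, n_shuffles: int = 200, seed: int = 0):
    """Return (observed score, fraction of shuffles of b scoring at least as high)."""
    rng = random.Random(seed)
    observed = local_alignment_score(a, b)
    letters = list(b)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(letters)  # keeps the composition of b, destroys its order
        if local_alignment_score(a, "".join(letters)) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return observed, (hits + 1) / (n_shuffles + 1)
```

The p-value here depends on the specific pair being compared, since the null distribution is built from shuffles of one of the two sequences. That is the sense in which such an estimate is pairwise and sequence-specific.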

Digital Repository @ Iowa State University (ISU)

Sequence Alignment in Molecular Biology

Author: Apostolico Alberto
Giancarlo Raffaele
Publication venue: Purdue University (bepress)
Publication date: 01/11/1995

Purdue E-Pubs

Pattern-based phylogenetic distance estimation and tree reconstruction

Author: Höhl Michael
Ragan Mark A.
Rigoutsos Isidore
Publication date: 01/01/2006

We have developed an alignment-free method that calculates phylogenetic distances using a maximum likelihood approach for a model of sequence change on patterns that are discovered in unaligned sequences. To evaluate the phylogenetic accuracy of our method, and to conduct a comprehensive comparison of existing alignment-free methods (freely available as Python package decaf+py at http://www.bioinformatics.org.au), we have created a dataset of reference trees covering a wide range of phylogenetic distances. Amino acid sequences were evolved along the trees and input to the tested methods; from their calculated distances we inferred trees whose topologies we compared to the reference trees. We find our pattern-based method statistically superior to all other tested alignment-free methods on this dataset. We also demonstrate the general advantage of alignment-free methods over an approach based on automated alignments when sequences violate the assumption of collinearity. Similarly, we compare methods on empirical data from an existing alignment benchmark set that we used to derive reference distances and trees. Our pattern-based approach yields distances that show a linear relationship to reference distances over a substantially longer range than other alignment-free methods. The pattern-based approach outperforms alignment-free methods and its phylogenetic accuracy is statistically indistinguishable from alignment-based distances. Comment: 21 pages, 3 figures, 2 tables

arXiv.org e-Print Archive

CiteSeerX

Directory of Open Access Journals

PubMed Central

University of Queensland eSpace

Protein Fold Recognition Using Neural Networks

Author: Lin Guang
Publication date: 01/01/2003

Accurately predicting the three-dimensional (3D) structures of proteins from their amino acid sequences alone remains a challenging problem. However, using protein fold recognition tools, it is often possible to achieve good models or at least to gain some more information, to aid scientists in their research. This thesis describes the development of TUNE (Threading Using Neural Networks), a fold recognition program using artificial neural network (ANN) models. A new method to generate amino acid substitution matrices is described in chapter two. It uses an ANN to generalise amino acid substitutions observed in protein structure alignments. Matrices for alignment scoring from this approach were compared with classic alignment scoring schemes. From these neural network models, a series of encoding schemes were constructed. These schemes describe the amino acid types with a few numbers. They were generated to replace the orthogonal encoding scheme, so that smaller, faster and more accurate neural network models can be applied to bioinformatic problems. The TUNE model was introduced in chapter four to measure protein sequence-structure compatibility. Given the integrated residue structural environment descriptions, the model predicts probabilities of observing amino acid types in such environments. Using this model, a scoring function to measure the fitness of a residue in a protein structure model can be made for protein threading programs. The model in chapter two was extended by including the residue structural environment descriptions for predictions. A simple protein fold recognition program with a dynamic programming algorithm was developed using this model. The program was then tested in the fourth round of the Critical Assessment of protein Structure Prediction methods (CASP4) and produced reasonably good results.

Open Research Online (The Open University)

OpenGrey Repository

Optimal Sequence Alignment and Its Relationship with Phylogeny

Author: Atoosa Ghahremani
Mahmood A. Mahdavi
Publication venue: IntechOpen
Publication date: 02/11/2011

IntechOpen

Alignment-free Genomic Analysis via a Big Data Spark Platform

Author: Cattaneo Giuseppe
Giancarlo Raffaele
Palini Francesco
Petrillo Umberto Ferraro
Publication venue: Oxford University Press (OUP)
Publication date: 01/01/2021

Motivation: Alignment-free distance and similarity functions (AF functions, for short) are a well-established alternative to pairwise and multiple sequence alignments for many genomic, metagenomic and epigenomic tasks. Due to data-intensive applications, the computation of AF functions is a Big Data problem, with the recent literature indicating that the development of fast and scalable algorithms computing AF functions is a high-priority task. Somewhat surprisingly, despite the increasing popularity of Big Data technologies in Computational Biology, the development of a Big Data platform for those tasks has not been pursued, possibly due to its complexity. Results: We fill this important gap by introducing FADE, the first extensible, efficient and scalable Spark platform for alignment-free genomic analysis. It natively supports eighteen of the best performing AF functions coming out of a recent hallmark benchmarking study. FADE's development and potential impact comprise novel aspects of interest, namely: (a) a considerable effort on distributed algorithms, the most tangible result being a much faster execution time of reference methods like MASH and FSWM; (b) a software design that makes FADE user-friendly and easily extendable by Spark non-specialists; (c) its ability to support data- and compute-intensive tasks. On this last point, we provide a novel and much needed analysis of how informative and robust AF functions are, in terms of the statistical significance of their output. Our findings naturally extend those of the highly regarded benchmarking study, since the functions that can really be used are reduced to a handful of the eighteen included in FADE.
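As a concrete picture of what an AF function computes, here is a minimal single-machine sketch of a k-mer Jaccard similarity and the Mash-style distance derived from it. It assumes exact k-mer sets rather than the MinHash sketches MASH uses, it is not distributed, and it is unrelated to FADE's own implementation. The names `kmers`, `jaccard`, and `mash_style_distance`, and the default `k=4`, are illustrative choices.

```python
import math


def kmers(seq: str, k: int) -> set:
    """Set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: str, b: str, k: int = 4) -> float:
    """Jaccard similarity of the two k-mer sets (0.0 if both are empty)."""
    ka, kb = kmers(a, k), kmers(b, k)
    union = ka | kb
    if not union:
        return 0.0
    return len(ka & kb) / len(union)


def mash_style_distance(a: str, b: str, k: int = 4) -> float:
    """Mash-style distance D = -(1/k) * ln(2j / (1 + j)) from exact Jaccard j."""
    j = jaccard(a, b, k)
    if j == 0.0:
        return 1.0  # no shared k-mers: report the maximum distance
    if j == 1.0:
        return 0.0  # identical k-mer sets
    return -math.log(2 * j / (1 + j)) / k
```

Each pairwise comparison is cheap, but the number of pairs grows quadratically with the number of sequences and the k-mer sets grow with genome size, which is the kind of workload the abstract describes as a Big Data problem.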

arXiv.org e-Print Archive

Crossref

Archivio della ricerca- Università di Roma La Sapienza