Retrieval accuracy, statistical significance and compositional similarity in protein sequence database searches
Protein sequence database search programs may be evaluated both for their retrieval accuracy—the ability to separate meaningful from chance similarities—and for the accuracy of their statistical assessments of reported alignments. However, methods for improving statistical accuracy can degrade retrieval accuracy by discarding compositional evidence of sequence relatedness. This evidence may be preserved by combining essentially independent measures of alignment and compositional similarity into a unified measure of sequence similarity. A version of the BLAST protein database search program, modified to employ this new measure, outperforms the baseline program in both retrieval and statistical accuracy on ASTRAL, a SCOP-based test set
Pattern-based phylogenetic distance estimation and tree reconstruction
We have developed an alignment-free method that calculates phylogenetic
distances using a maximum likelihood approach for a model of sequence change on
patterns that are discovered in unaligned sequences. To evaluate the
phylogenetic accuracy of our method, and to conduct a comprehensive comparison
of existing alignment-free methods (freely available as Python package decaf+py
at http://www.bioinformatics.org.au), we have created a dataset of reference
trees covering a wide range of phylogenetic distances. Amino acid sequences
were evolved along the trees and input to the tested methods; from their
calculated distances we inferred trees whose topologies we compared to the
reference trees.
We find our pattern-based method statistically superior to all other tested
alignment-free methods on this dataset. We also demonstrate the general
advantage of alignment-free methods over an approach based on automated
alignments when sequences violate the assumption of collinearity. Similarly, we
compare methods on empirical data from an existing alignment benchmark set that
we used to derive reference distances and trees. Our pattern-based approach
yields distances that show a linear relationship to reference distances over a
substantially longer range than other alignment-free methods. The pattern-based
approach outperforms the other alignment-free methods, and its phylogenetic accuracy is
statistically indistinguishable from that of alignment-based distances.
Comment: 21 pages, 3 figures, 2 table
Div-BLAST: Diversification of sequence search results
Sequence similarity tools, such as BLAST, seek sequences most similar to a query from a database of
sequences. They return results that are significantly similar to the query sequence and typically highly
similar to each other. Most sequence analysis tasks in bioinformatics require an exploratory approach,
where the initial results guide the user to new searches. However, diversity has not yet been considered an
integral component of sequence search tools for this discipline. Some redundancy can be avoided by
introducing non-redundancy during database construction, but it is not feasible to dynamically set a level
of non-redundancy tailored to a query sequence. We introduce the problem of diverse search and browsing
in sequence databases that produce non-redundant results optimized for any given query. We define
diversity measures for sequences and propose methods to obtain diverse results extracted from current
sequence similarity search tools. We also propose a new measure to evaluate the diversity of a set of
sequences that is returned as a result of a sequence similarity query. We evaluate the effectiveness of the
proposed methods in post-processing BLAST and PSI-BLAST results. We also assess the functional
diversity of the returned results based on available Gene Ontology annotations. Additionally, we include a
comparison with a current redundancy elimination tool, CD-HIT. Our experiments show that the proposed
methods are able to achieve more diverse yet significant result sets compared to static non-redundancy
approaches. In both sequence-based and functional diversity evaluation, the proposed diversification
methods significantly outperform original BLAST results and other baselines. A web-based tool
implementing the proposed methods, Div-BLAST, can be accessed at cedar.cs.bilkent.edu.tr/Div-BLAS
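The idea of post-processing a ranked hit list for diversity can be illustrated with a minimal sketch. This is not the Div-BLAST method, whose diversity measures are not given in the abstract. It is a simple greedy filter in which the hit tuples, the `max_similarity` threshold, and the use of Python's `difflib` ratio as a stand-in for sequence similarity are all assumptions made for illustration.

```python
from difflib import SequenceMatcher
from typing import List, Tuple

Hit = Tuple[str, str, float]  # (identifier, sequence, score); hypothetical layout


def similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; a real tool would use an alignment score."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def diversify(hits: List[Hit], max_similarity: float = 0.5) -> List[Hit]:
    """Greedy filter: visit hits from best to worst score and keep a hit only
    if it is not too similar to any hit that has already been kept."""
    kept: List[Hit] = []
    for hit in sorted(hits, key=lambda h: h[2], reverse=True):
        if all(similarity(hit[1], other[1]) <= max_similarity for other in kept):
            kept.append(hit)
    return kept


if __name__ == "__main__":
    hits = [
        ("a", "MKTAYIAKQR", 100.0),
        ("b", "MKTAYIAKQR", 90.0),   # duplicate of "a"
        ("c", "GGSWLPHHNE", 50.0),   # unrelated sequence
    ]
    print([h[0] for h in diversify(hits)])
```

Unlike a static non-redundant database, a filter of this kind runs per query, so the threshold can be tuned to the result set at hand.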
Short-range template switching in great ape genomes explored using pair hidden Markov models.
Many complex genomic rearrangements arise through template switch errors, which occur in DNA replication when there is a transient polymerase switch to an alternate template nearby in three-dimensional space. While typically investigated at kilobase-to-megabase scales, the genomic and evolutionary consequences of this mutational process are not well characterised at smaller scales, where they are often interpreted as clusters of independent substitutions, insertions and deletions. Here we present an improved statistical approach using pair hidden Markov models, and use it to detect and describe short-range template switches underlying clusters of mutations in the multi-way alignment of hominid genomes. Using robust statistics derived from evolutionary genomic simulations, we show that template switch events have been widespread in the evolution of the great apes' genomes and provide a parsimonious explanation for the presence of many complex mutation clusters in their phylogenetic context. Larger-scale mechanisms of genome rearrangement are typically associated with structural features around breakpoints, and accordingly we show that atypical patterns of secondary structure formation and DNA bending are present at the initial template switch loci. Our methods improve on previous non-probabilistic approaches for computational detection of template switch mutations, allowing the statistical significance of events to be assessed. By specifying realistic evolutionary parameters based on the genomes and taxa involved, our methods can be readily adapted to other intra- or inter-species comparisons
The accuracy of several multiple sequence alignment programs for proteins
BACKGROUND: There have been many algorithms and software programs implemented for the inference of multiple sequence alignments of protein and DNA sequences. The "true" alignment is usually unknown due to the incomplete knowledge of the evolutionary history of the sequences, making it difficult to gauge the relative accuracy of the programs. RESULTS: We tested nine of the most often used protein alignment programs and compared their results using sequences generated with the simulation software Simprot which creates known alignments under realistic and controlled evolutionary scenarios. We have simulated more than 30000 alignment sets using various evolutionary histories in order to define strengths and weaknesses of each program tested. We found that alignment accuracy is extremely dependent on the number of insertions and deletions in the sequences, and that indel size has a weaker effect. We also considered benchmark alignments from the latest version of BAliBASE and the results relative to BAliBASE- and Simprot-generated data sets were consistent in most cases. CONCLUSION: Our results indicate that employing Simprot's simulated sequences allows the creation of a more flexible and broader range of alignment classes than the usual methods for alignment accuracy assessment. Simprot also allows for a quick and efficient analysis of a wider range of possible evolutionary histories that might not be present in currently available alignment sets. Among the nine programs tested, the iterative approach available in Mafft (L-INS-i) and ProbCons were consistently the most accurate, with Mafft being the faster of the two
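When a known reference alignment is available, as with simulated data, accuracy is commonly measured with a sum-of-pairs score: the fraction of residue pairs aligned in the reference that are also aligned in the test alignment. The abstract does not state which measure the study used, so the sketch below is only an illustration of that common measure. It assumes alignments are given as lists of equal-length gapped strings with `-` as the gap character and the sequences in the same order in both alignments.

```python
from itertools import combinations
from typing import List, Set, Tuple

Residue = Tuple[int, int]  # (sequence index, ungapped residue index)


def aligned_pairs(alignment: List[str]) -> Set[Tuple[Residue, Residue]]:
    """Collect every pair of residues that share a column."""
    if len({len(row) for row in alignment}) > 1:
        raise ValueError("all rows of an alignment must have the same length")
    counters = [0] * len(alignment)  # next residue index for each sequence
    pairs: Set[Tuple[Residue, Residue]] = set()
    n_cols = len(alignment[0]) if alignment else 0
    for col in range(n_cols):
        residues = []
        for seq_idx, row in enumerate(alignment):
            if row[col] != "-":
                residues.append((seq_idx, counters[seq_idx]))
                counters[seq_idx] += 1
        pairs.update(combinations(residues, 2))
    return pairs


def sum_of_pairs_score(test: List[str], reference: List[str]) -> float:
    """Fraction of reference residue pairs recovered by the test alignment."""
    ref_pairs = aligned_pairs(reference)
    if not ref_pairs:
        raise ValueError("reference alignment contains no aligned residue pairs")
    return len(aligned_pairs(test) & ref_pairs) / len(ref_pairs)


if __name__ == "__main__":
    reference = ["AC-GT", "ACAGT"]
    test = ["ACGT-", "ACAGT"]
    print(sum_of_pairs_score(test, reference))
```

Because a misplaced gap shifts every downstream residue pair, a score of this kind drops quickly as indels accumulate.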
Statistical analysis of short template switch mutations in human genomes
Many complex rearrangements arise in human genomes through template switch mutations, which occur during DNA replication when there is a transient polymerase switch to an alternate template nearby in three-dimensional space. These variants are routinely captured at kilobase-to-megabase scales in studies of genetic variation by using methods for structural variant calling. However, the genomic and evolutionary consequences of replication-based rearrangements remain poorly characterised at smaller scales, where they are usually interpreted as complex clusters of independent substitutions, insertions and deletions. In this thesis, I describe statistical methods for the detection and interpretation of short template switch mutations within DNA sequence data. I then use my methods to explore small-scale template switch mutagenesis within human genome evolution, population variation, and cancer. I show that small-scale, replication-based rearrangements are a ubiquitous feature of the germline and somatic mutational landscape of human genomes.
European Molecular Biology Laboratory
National Institute for Health Researc
Alignment-free Genomic Analysis via a Big Data Spark Platform
Motivation: Alignment-free distance and similarity functions (AF functions,
for short) are a well-established alternative to pairwise and multiple sequence
alignments for many genomic, metagenomic and epigenomic tasks. Due to
data-intensive applications, the computation of AF functions is a Big Data
problem, with the recent literature indicating that the development of fast and
scalable algorithms computing AF functions is a high-priority task. Somewhat
surprisingly, despite the increasing popularity of Big Data technologies in
Computational Biology, the development of a Big Data platform for those tasks
has not been pursued, possibly due to its complexity. Results: We fill this
important gap by introducing FADE, the first extensible, efficient and scalable
Spark platform for alignment-free genomic analysis. It natively supports
eighteen of the best-performing AF functions coming out of a recent hallmark
benchmarking study. FADE's development and potential impact comprise novel
aspects of interest, namely: (a) a considerable effort on distributed
algorithms, the most tangible result being a much faster execution time of
reference methods like MASH and FSWM; (b) a software design that makes FADE
user-friendly and easily extendable by Spark non-specialists; (c) its ability
to support data- and compute-intensive tasks. On this last point, we provide a novel
and much-needed analysis of how informative and robust AF functions are, in
terms of the statistical significance of their output. Our findings naturally
extend those of the highly regarded benchmarking study, since the functions
that can really be used are reduced to a handful of the eighteen included in
FADE
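To make the notion of an AF function concrete, the following single-machine sketch computes a k-mer Jaccard index and converts it to a distance using the Mash distance formula, D = -(1/k) ln(2j / (1 + j)). It is not FADE code and does not use Spark. For simplicity it computes the exact Jaccard index over all k-mers rather than estimating it from a MinHash sketch as MASH does, and the function names and default `k` are assumptions made for illustration.

```python
import math
from typing import Set


def kmers(seq: str, k: int) -> Set[str]:
    """Set of all overlapping substrings of length k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: str, b: str, k: int) -> float:
    """Exact Jaccard index of the two k-mer sets."""
    ka, kb = kmers(a, k), kmers(b, k)
    union = ka | kb
    if not union:
        raise ValueError("both sequences are shorter than k")
    return len(ka & kb) / len(union)


def mash_style_distance(a: str, b: str, k: int = 4) -> float:
    """Mash distance formula applied to an exact (not sketched) Jaccard index."""
    j = jaccard(a, b, k)
    if j == 0.0:
        return 1.0  # no shared k-mers: report the maximum distance
    d = -math.log(2.0 * j / (1.0 + j)) / k
    return max(0.0, min(1.0, d))  # clamp into [0, 1]


if __name__ == "__main__":
    print(mash_style_distance("ACGTACGTGA", "ACGTACGTGC"))
```

Each pairwise comparison is independent of the others, which is what makes all-against-all AF computations a natural fit for distributed execution.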