MissMax: Alignment-free sequence comparison with mismatches through filtering and heuristics
BACKGROUND: Measuring sequence similarity is central for many problems in bioinformatics. In several contexts alignment-free techniques based on exact occurrences of substrings are faster, but also less accurate, than alignment-based approaches. Recently, several studies attempted to bridge the accuracy gap by introducing approximate matches into the definition of composition-based similarity measures. RESULTS: In this work we present MissMax, an exact algorithm for the computation of the longest common substring with mismatches between each suffix of a sequence x and a sequence y. This collection of statistics is useful for the computation of two similarity measures: the longest and the average common substring with k mismatches. As a further contribution we provide a “relaxed” version of MissMax that does not guarantee the exact solution, but is faster in practice and still very precise.
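As a rough illustration of the statistic this abstract describes, the Python sketch below computes, for each suffix of x, the length of its longest prefix that matches some substring of y with at most k mismatches. It is a brute-force baseline, not the MissMax algorithm (no filtering, no heuristics). All function names are hypothetical, and taking the average as a plain mean of the per-suffix values is an assumption about how the measure is defined.

```python
def lcs_k_per_suffix(x, y, k):
    """Brute force: for each start i in x, the longest prefix of x[i:]
    that matches a substring of y with at most k mismatches."""
    result = []
    for i in range(len(x)):
        best = 0
        for j in range(len(y)):
            mismatches = 0
            length = 0
            # Extend the pair (i, j) until the (k+1)-th mismatch or a string end.
            while i + length < len(x) and j + length < len(y):
                if x[i + length] != y[j + length]:
                    if mismatches == k:
                        break
                    mismatches += 1
                length += 1
            best = max(best, length)
        result.append(best)
    return result


def longest_k(x, y, k):
    """Longest common substring with k mismatches (max over suffixes)."""
    return max(lcs_k_per_suffix(x, y, k), default=0)


def average_k(x, y, k):
    """Mean of the per-suffix values (assumed form of the average measure)."""
    values = lcs_k_per_suffix(x, y, k)
    return sum(values) / len(values) if values else 0.0


if __name__ == "__main__":
    print(lcs_k_per_suffix("abc", "abd", 0))  # [2, 1, 0]
    print(lcs_k_per_suffix("abc", "abd", 1))  # [3, 2, 1]
```

The triple loop is cubic in the worst case, so it is only useful as a reference for checking a faster implementation on short inputs.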
Estimating seed sensitivity on homogeneous alignments
We address the problem of estimating the sensitivity of seed-based similarity
search algorithms. In contrast to approaches based on Markov models [18, 6, 3,
4, 10], we study the estimation based on homogeneous alignments. We describe an
algorithm for counting and random generation of those alignments and an
algorithm for exact computation of the sensitivity for a broad class of seed
strategies. We provide experimental results demonstrating a bias introduced by
ignoring the homogeneousness condition.
Parametric Alignment of Drosophila Genomes
The classic algorithms of Needleman-Wunsch and Smith-Waterman find a
maximum a posteriori probability alignment for a pair hidden Markov model
(PHMM). In order to process large genomes that have undergone complex genome
rearrangements, almost all existing whole genome alignment methods apply fast
heuristics to divide genomes into small pieces which are suitable for
Needleman-Wunsch alignment. In these alignment methods, it is standard
practice to fix the parameters and to produce a single alignment for subsequent
analysis by biologists.
Our main result is the construction of a whole genome parametric alignment of
Drosophila melanogaster and Drosophila pseudoobscura. Parametric alignment
resolves the issue of robustness to changes in parameters by finding all
optimal alignments for all possible parameters in a PHMM. Our alignment draws
on existing heuristics for dividing whole genomes into small pieces for
alignment, and it relies on advances we have made in computing convex polytopes
that allow us to parametrically align non-coding regions using biologically
realistic models. We demonstrate the utility of our parametric alignment for
biological inference by showing that cis-regulatory elements are more conserved
between Drosophila melanogaster and Drosophila pseudoobscura than previously
thought. We also show how whole genome parametric alignment can be used to
quantitatively assess the dependence of branch length estimates on alignment
parameters.
The alignment polytopes, software, and supplementary material can be
downloaded at http://bio.math.berkeley.edu/parametric/.
Comment: 19 pages, 3 figures
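To make the dependence on parameters concrete, here is a minimal Python sketch of Needleman-Wunsch global alignment scoring with a linear gap penalty. It is not the parametric method of the paper (no pair HMM, no polytope computation); the function name and the default parameter values are assumptions chosen for illustration. It only shows that the optimal score, and hence the preferred alignment, can change when the parameters change.

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score of a and b (linear gap penalty)."""
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            cur.append(max(prev[j - 1] + sub,   # align a[i-1] with b[j-1]
                           prev[j] + gap,       # gap in b
                           cur[j - 1] + gap))   # gap in a
        prev = cur
    return prev[-1]


if __name__ == "__main__":
    # With a mild mismatch penalty, substituting C/G is optimal.
    print(nw_score("AC", "AG", mismatch=-1, gap=-1))  # 0
    # With a harsh mismatch penalty, two gaps become preferable.
    print(nw_score("AC", "AG", mismatch=-3, gap=-1))  # -1
```

A fixed-parameter pipeline would report only one of these two alignments; a parametric analysis instead asks which alignment is optimal in each region of the parameter space.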
PIntron: a Fast Method for Gene Structure Prediction via Maximal Pairings of a Pattern and a Text
Current computational methods for exon-intron structure prediction from a
cluster of transcript (EST, mRNA) data do not exhibit the time and space
efficiency necessary to process large clusters of more than 20,000 ESTs and
genes longer than 1 Mb. Guaranteeing both accuracy and efficiency seems to be a
computational goal that is still far from achieved, since accuracy is strictly related
to exploiting the inherent redundancy of information present in a large
cluster. We propose a fast method for the problem that combines two ideas: a
novel algorithm of provably small time complexity for computing spliced
alignments of a transcript against a genome, and an efficient algorithm that
exploits the inherent redundancy of information in a cluster of transcripts to
select, among all possible factorizations of EST sequences, those that allow
the inference of splice site junctions highly confirmed by the input data. The
EST alignment procedure is based on the construction of maximal embeddings, which
are sequences obtained from paths of a graph structure called the Embedding Graph,
whose vertices are the maximal pairings of a genomic sequence T and an EST P.
The procedure runs in time linear in the size of P, T, and the output.
PIntron, the software tool implementing our methodology, is able to process in
a few seconds some critical genes that are not manageable by other gene
structure prediction tools. At the same time, PIntron exhibits high accuracy
(sensitivity and specificity) when compared with ENCODE data. Detailed
experimental data, additional results and PIntron software are available at
http://www.algolab.eu/PIntron
Linear-time Computation of Minimal Absent Words Using Suffix Array
An absent word of a word y of length n is a word that does not occur in y. It
is a minimal absent word if all its proper factors occur in y. Minimal absent
words have been computed in genomes of organisms from all domains of life;
their computation provides a fast alternative for measuring approximation in
sequence comparison. There exists an O(n)-time and O(n)-space algorithm for
computing all minimal absent words on a fixed-sized alphabet based on the
construction of suffix automata (Crochemore et al., 1998). No implementation of
this algorithm is publicly available. There also exists an O(n^2)-time and
O(n)-space algorithm for the same problem based on the construction of suffix
arrays (Pinho et al., 2009). An implementation of this algorithm was also
provided by the authors and is currently the fastest available. In this
article, we bridge this unpleasant gap by presenting an O(n)-time and
O(n)-space algorithm for computing all minimal absent words based on the
construction of suffix arrays. Experimental results using real and synthetic
data show that the corresponding implementation outperforms the one by Pinho et
al.
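The definition of a minimal absent word can be checked directly on small inputs. The Python sketch below enumerates all minimal absent words of y by brute force from the set of its factors: a word a+u+b is minimal absent when a+u and u+b both occur in y but a+u+b does not. This is neither the suffix-automaton nor the suffix-array algorithm mentioned in the abstract, and it takes quadratic space or worse; the function name and the optional alphabet argument are hypothetical.

```python
def minimal_absent_words(y, alphabet=None):
    """Brute-force enumeration of the minimal absent words of y."""
    if alphabet is None:
        alphabet = sorted(set(y))
    # All factors (substrings) of y, including the empty word.
    factors = {y[i:j] for i in range(len(y)) for j in range(i, len(y) + 1)}
    factors.add("")
    # Length-1 case: letters of the alphabet that never occur in y.
    maws = {a for a in alphabet if a not in factors}
    # Length >= 2: a+u+b is absent while both a+u and u+b occur.
    for u in factors:
        for a in alphabet:
            if a + u not in factors:
                continue
            for b in alphabet:
                if u + b in factors and a + u + b not in factors:
                    maws.add(a + u + b)
    return sorted(maws)


if __name__ == "__main__":
    print(minimal_absent_words("abaab"))  # ['aaa', 'aaba', 'bab', 'bb']
```

Checking only a+u and u+b suffices because every proper factor of a+u+b is a factor of one of those two words.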