139 research outputs found
A Coverage Criterion for Spaced Seeds and its Applications to Support Vector Machine String Kernels and k-Mer Distances
Spaced seeds have recently been shown to not only detect more alignments, but
also to give a more accurate measure of phylogenetic distances (Boden et al.,
2013, Horwege et al., 2014, Leimeister et al., 2014), and to provide a lower
misclassification rate when used with Support Vector Machines (SVMs) (Onodera
and Shibuya, 2013). We confirm these two results by independent experiments,
and propose in this article to use a coverage criterion (Benson and Mak, 2008,
Martin, 2013, Martin and Noé, 2014) to measure the seed efficiency in both
cases in order to design better seed patterns. We first show how this coverage
criterion can be directly measured by a full automaton-based approach. We then
illustrate how this criterion performs when compared with two other frequently
used criteria, namely the single-hit and multiple-hit criteria, through
correlation coefficients with the correct classification/the true distance.
Finally, for alignment-free distances, we propose an extension by adopting the
coverage criterion, show how it performs, and indicate how it can be
efficiently computed.
Comment: http://online.liebertpub.com/doi/abs/10.1089/cmb.2014.017
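The three criteria named in this abstract can be illustrated on a toy alignment. The sketch below is not the automaton-based method of the article; it is a brute-force illustration under stated assumptions: the alignment is encoded as a string of `1` (match) and `0` (mismatch), the seed uses `1` for a must-match position and `*` for a don't-care position, and coverage is taken to be the number of alignment positions covered by a must-match position of at least one seed hit. The function names are hypothetical.

```python
def seed_hits(alignment: str, seed: str) -> list[int]:
    """Start positions where every must-match ('1') seed position
    falls on a match ('1') of the alignment (assumed encoding)."""
    span = len(seed)
    return [
        i
        for i in range(len(alignment) - span + 1)
        if all(s != "1" or alignment[i + j] == "1" for j, s in enumerate(seed))
    ]


def coverage(alignment: str, seed: str) -> int:
    """Number of alignment positions covered by a must-match
    position of at least one seed hit (assumed definition)."""
    covered = set()
    for i in seed_hits(alignment, seed):
        for j, s in enumerate(seed):
            if s == "1":
                covered.add(i + j)
    return len(covered)


alignment = "1101111011"
seed = "11*1"
hits = seed_hits(alignment, seed)
print("single-hit criterion satisfied:", len(hits) >= 1)
print("number of hits (multiple-hit view):", len(hits))
print("coverage:", coverage(alignment, seed))
```

Under these assumptions, the single-hit criterion only records whether a hit exists, the multiple-hit view counts hits, and coverage additionally accounts for how much the hits overlap.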
Reconsidering the significance of genomic word frequency
We propose that the distribution of DNA words in genomic sequences can be
primarily characterized by a double Pareto-lognormal distribution, which
explains lognormal and power-law features found across all known genomes. Such
a distribution may be the result of completely random sequence evolution by
duplication processes. The parametrization of genomic word frequencies allows
for an assessment of significance for frequent or rare sequence motifs.
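The quantity being modelled here is the frequency of DNA words (k-mers) in a sequence. The sketch below only shows how such word counts and their frequency spectrum could be tabulated; it does not fit a double Pareto-lognormal distribution, and the function names and the toy sequence are assumptions made for illustration.

```python
from collections import Counter


def word_counts(sequence: str, k: int) -> Counter:
    """Count every overlapping word of length k in the sequence."""
    seq = sequence.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def frequency_spectrum(counts: Counter) -> Counter:
    """Map each occurrence count to the number of distinct words
    that occur exactly that many times."""
    return Counter(counts.values())


counts = word_counts("ACGTACGTAC", 2)
print(counts)
print(frequency_spectrum(counts))
```

The frequency spectrum is the empirical distribution to which a parametric model such as the one proposed in the abstract would be fitted.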
YASS: enhancing the sensitivity of DNA similarity search
YASS is a DNA local alignment tool based on an efficient and sensitive filtering algorithm. It applies transition-constrained seeds to specify the most probable conserved motifs between homologous sequences, combined with a flexible hit criterion used to identify groups of seeds that are likely to exhibit significant alignments. A web interface is available to upload input sequences in FASTA format, query the program and visualize the results obtained in several forms (dot-plot, tabular output and others). A standalone version is available for download from the web page.
Improved hit criteria for DNA local alignment
BACKGROUND: The hit criterion is a key component of heuristic local alignment algorithms. It specifies a class of patterns assumed to witness a potential similarity, and this choice is decisive for the selectivity and sensitivity of the whole method. RESULTS: In this paper, we propose two ways to improve the hit criterion. First, we define the group criterion, which combines the advantages of the single-seed and double-seed approaches used in existing algorithms. Second, we introduce transition-constrained seeds, which extend spaced seeds with the possibility of distinguishing transition and transversion mismatches. We provide analytical data as well as experimental results, obtained with the YASS software, supporting both improvements. CONCLUSIONS: The proposed algorithmic ideas yield a significant gain in the sensitivity of similarity search without an increase in execution time. The method has been implemented in YASS software available at
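A minimal sketch of how a transition-constrained seed could be tested against a pair of aligned DNA words. The symbol convention is an assumption made for this example: `#` requires a match, `@` accepts a match or a transition (A<->G or C<->T), and `-` accepts anything. The function name is hypothetical, and this is not the YASS implementation.

```python
# Transitions are purine<->purine (A<->G) or pyrimidine<->pyrimidine (C<->T).
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def seed_matches(seed: str, a: str, b: str) -> bool:
    """Check whether two equal-length DNA words satisfy the seed.
    '#': match required; '@': match or transition; '-': don't care
    (symbol convention assumed for this example)."""
    if not (len(seed) == len(a) == len(b)):
        raise ValueError("seed and both words must have the same length")
    for s, x, y in zip(seed, a, b):
        if s == "#" and x != y:
            return False
        if s == "@" and x != y and (x, y) not in TRANSITIONS:
            return False
    return True


print(seed_matches("#@-#", "ACGT", "ATTT"))  # C/T at '@' is a transition
print(seed_matches("#@-#", "ACGT", "AGTT"))  # C/G at '@' is a transversion
```

Compared with a plain spaced seed, the `@` symbol is stricter than a don't-care position but more tolerant than a must-match position, which is what lets the seed distinguish transitions from transversions.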
Back-translation for discovering distant protein homologies in the presence of frameshift mutations
Background: Frameshift mutations in protein-coding DNA sequences produce a drastic change in the resulting protein sequence, which prevents classic protein alignment methods from revealing the proteins' common origin. Moreover, when a large number of substitutions are additionally involved in the divergence, the homology detection becomes difficult even at the DNA level.
Results: We developed a novel method to infer distant homology relations of two proteins that accounts for frameshift and point mutations that may have affected the coding sequences. We design a dynamic programming alignment algorithm over memory-efficient graph representations of the complete set of putative DNA sequences of each protein, with the goal of determining the two putative DNA sequences which have the best scoring alignment under a powerful scoring system designed to reflect the most probable evolutionary process. Our implementation is freely available at http://bioinfo.lifl.fr/path/.
Conclusions: Our approach makes it possible to uncover evolutionary information that is not captured by traditional alignment methods, which is confirmed by biologically significant examples.
Designing Efficient Spaced Seeds for SOLiD Read Mapping
The advent of high-throughput sequencing technologies constituted
a major advance in genomic studies, offering new prospects in a
wide range of applications. We propose a rigorous and flexible algorithmic
solution to mapping SOLiD color-space reads to a reference genome. The
solution relies on an advanced method of seed design that uses a faithful
probabilistic model of read matches and on a novel
seeding principle especially adapted to read mapping. Our method can
handle both lossy and lossless frameworks and is able to distinguish, at
the level of seed design, between SNPs and reading errors. We illustrate
our approach by several seed designs and demonstrate their efficiency.
Efficient seeding techniques for protein similarity search
We apply the concept of subset seeds proposed in [1] to similarity search in
protein sequences. The main question studied is the design of efficient seed
alphabets to construct seeds with optimal sensitivity/selectivity trade-offs.
We propose several different design methods and use them to construct several
alphabets. We then perform an analysis of seeds built over those alphabets and
compare them with the standard Blastp seeding method [2,3], as well as with the
family of vector seeds proposed in [4]. While the formalism of subset seeds is
less expressive (but less costly to implement) than the accumulative principle
used in Blastp and vector seeds, our seeds show a similar or even better
performance than Blastp on Bernoulli models of proteins compatible with the
common BLOSUM62 matrix.
YASS: Similarity search in DNA sequences
We describe YASS -- a new tool for finding local similarities in DNA sequences. The YASS algorithm first scans the sequence(s) and creates on the fly groups of small exact repeats obtained by hashing, according to statistically-founded criteria. Then it tries to extend those groups into similarity regions on the basis of a new extension criterion. The method can be seen as a compromise between single-seed and multiple-seed approaches, and achieves a gain in both sensitivity and selectivity. The method is flexible and can be made more efficient by using spaced seeds, and in particular transition-constrained spaced seeds. We provide examples of applying YASS to Saccharomyces cerevisiae and Drosophila melanogaster chromosomes.
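The first step described above, finding small exact repeats by hashing and grouping them, can be sketched as follows. This is a simplified illustration, not the YASS algorithm: grouping hits by alignment diagonal is an assumption that stands in for the statistically-founded grouping criteria of the abstract, and the function names are hypothetical.

```python
from collections import defaultdict


def kmer_hits(s: str, t: str, k: int) -> list[tuple[int, int]]:
    """Index all k-mers of s in a hash table, then scan t and report
    every pair (i, j) such that s[i:i+k] == t[j:j+k]."""
    index = defaultdict(list)
    for i in range(len(s) - k + 1):
        index[s[i:i + k]].append(i)
    hits = []
    for j in range(len(t) - k + 1):
        for i in index.get(t[j:j + k], ()):
            hits.append((i, j))
    return hits


def group_by_diagonal(hits: list[tuple[int, int]]) -> dict[int, list[tuple[int, int]]]:
    """Group hits by diagonal j - i (simplifying assumption: hits on the
    same diagonal are candidates for one ungapped similarity region)."""
    groups = defaultdict(list)
    for i, j in hits:
        groups[j - i].append((i, j))
    return dict(groups)


hits = kmer_hits("ACGTACGT", "TTACGTAA", 4)
print(hits)
print(group_by_diagonal(hits))
```

A practical tool would also tolerate small diagonal shifts within a group to allow for indels, and would then attempt to extend each group into an alignment.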