Pairwise and incremental multi-stage alignment of metagenomes: A new proposal

Author: Pérez-Wohlfeil Esteban
Torreno Oscar
Trelles-Salazar Oswaldo Rogelio
Publication venue: 'Springer Fachmedien Wiesbaden GmbH'
Publication date: 01/01/2017

Traditional comparisons between metagenomes are often performed using reference databases as intermediary templates from which to obtain distance metrics. However, to fully exploit the information contained within metagenomes, it is desirable to remove any intermediate agent that is prone to introduce errors or biased results. In this work, we analyze the state-of-the-art methods and conclude that fine-grained methods are necessary to assess similarity between metagenomes. In addition, we propose a method that we developed for accurate and fast matching of reads. Universidad de Málaga. Campus de Excelencia Internacional Andalucía Tech.

Repositorio Institucional Universidad de Málaga

Scaling Laws and Similarity Detection in Sequence Alignment with Gaps

Author: Drasdo Dirk
Hwa Terence
Lassig Michael
Publication date: 01/01/1998

We study the problem of similarity detection by sequence alignment with gaps, using a recently established theoretical framework based on the morphology of alignment paths. Alignments of sequences without mutual correlations are found to have scale-invariant statistics. This is the basis for a scaling theory of alignments of correlated sequences. Using a simple Markov model of evolution, we generate sequences with well-defined mutual correlations and quantify the fidelity of an alignment in an unambiguous way. The scaling theory predicts the dependence of the fidelity on the alignment parameters and on the statistical evolution parameters characterizing the sequence correlations. Specific criteria for the optimal choice of alignment parameters emerge from this theory. The results are verified by extensive numerical simulations. Comment: 25 pages, 11 figures

arXiv.org e-Print Archive

CiteSeerX

CERN Document Server

A Coverage Criterion for Spaced Seeds and its Applications to Support Vector Machine String Kernels and k-Mer Distances

Author: Laurent Noé
Donald E.K. Martin
Apostolico A.
Bassino F.
Boden M.
Břinda K.
Burkhardt S.
Egidi L.
Gambin A.
Leslie C.S.
Martin D.E.K.
Martin D.E.K.
Régnier M.
Simon I.
Zhou L.
Publication venue: 'Mary Ann Liebert Inc'
Publication date: 01/01/2010

Spaced seeds have recently been shown not only to detect more alignments, but also to give a more accurate measure of phylogenetic distances (Boden et al., 2013, Horwege et al., 2014, Leimeister et al., 2014), and to provide a lower misclassification rate when used with Support Vector Machines (SVMs) (Onodera and Shibuya, 2013). We confirm these two results by independent experiments, and propose in this article to use a coverage criterion (Benson and Mak, 2008, Martin, 2013, Martin and Noé, 2014) to measure seed efficiency in both cases in order to design better seed patterns. We first show how this coverage criterion can be directly measured by a full automaton-based approach. We then illustrate how this criterion performs when compared with two other frequently used criteria, namely the single-hit and multiple-hit criteria, through correlation coefficients with the correct classification/the true distance. Finally, for alignment-free distances, we propose an extension that adopts the coverage criterion, show how it performs, and indicate how it can be efficiently computed. Comment: http://online.liebertpub.com/doi/abs/10.1089/cmb.2014.017
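The difference between counting seed hits and measuring coverage can be illustrated with a minimal sketch. It is not the automaton-based computation described in the abstract; it is a brute-force illustration that assumes an alignment is encoded as a string of `1` (match) and `0` (mismatch), and that a seed is a pattern in which `1` marks a position that must match and `0` a "don't care" position. The function names `seed_hits` and `coverage` are hypothetical, and coverage is taken here to mean the number of alignment positions that fall under a must-match position of at least one seed hit.

```python
def seed_hits(alignment: str, seed: str) -> list[int]:
    """Return the start offsets at which the seed hits the alignment.

    A hit requires every '1' (must-match) position of the seed to fall
    on a '1' (match) of the alignment; '0' seed positions are ignored.
    """
    care = [i for i, c in enumerate(seed) if c == "1"]
    return [
        start
        for start in range(len(alignment) - len(seed) + 1)
        if all(alignment[start + i] == "1" for i in care)
    ]


def coverage(alignment: str, seed: str) -> int:
    """Count alignment positions covered by a must-match position of any hit.

    This is an assumed, simplified reading of the coverage criterion.
    """
    care = [i for i, c in enumerate(seed) if c == "1"]
    covered = set()
    for start in seed_hits(alignment, seed):
        covered.update(start + i for i in care)
    return len(covered)


if __name__ == "__main__":
    aln = "1101101"  # made-up alignment: matches broken by two mismatches
    print(seed_hits(aln, "1101"), coverage(aln, "1101"))  # spaced seed
    print(seed_hits(aln, "111"), coverage(aln, "111"))    # contiguous seed
```

On this made-up alignment the spaced seed `1101` hits twice and covers five positions, while the contiguous seed `111` of the same weight does not hit at all.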

arXiv.org e-Print Archive

HAL - Lille 3

CiteSeerX

Crossref

INRIA a CCSD electronic archive server

Copenhagen University Research Information System

A Coverage Criterion for Spaced Seeds and its Applications to Support Vector Machine String Kernels and k-Mer Distances

Author: Martin Donald E. K.
Noé Laurent
Publication venue: 'Mary Ann Liebert Inc'
Publication date: 01/01/2014

arXiv.org e-Print Archive

HAL - Lille 3

CiteSeerX

INRIA a CCSD electronic archive server

PubMed Central

The Gumbel pre-factor k for gapped local alignment can be estimated from simulations of global alignment

Author: Altschul
Altschul
Altschul
Bundschuh
Collins
Gotoh
Henikoff
J. L. Spouge
Karlin
Mott
Mott
Mott
Needleman
Robinson
S. Sheetlin
Smith
Smith
Storey
Waterman
Y. Park
Yu
Publication venue: Oxford University Press
Publication date: 06/09/2005

The optimal gapped local alignment score of two random sequences follows a Gumbel distribution. The Gumbel distribution has two parameters, the scale parameter λ and the pre-factor k. Presently, the basic local alignment search tool (BLAST) programs (BLASTP (BLAST for proteins), PSI-BLAST, etc.) all use time-consuming computer simulations to determine the Gumbel parameters. Because the simulations must be done offline, BLAST users are restricted in their choice of alignment scoring schemes. The ultimate aim of this paper is to speed the simulations, to determine the Gumbel parameters online, and to remove the corresponding restrictions on BLAST users. Simulations for the scale parameter λ can be as much as five times faster if they use global instead of local alignment [R. Bundschuh (2002) J. Comput. Biol., 9, 243–260]. Unfortunately, the acceleration does not extend to determining the Gumbel pre-factor k, because k has no known mathematical relationship to global alignment. This paper relates k to global alignment and exploits the relationship to show that for the BLASTP defaults, 10 000 realizations with sequences of average length 140 suffice to estimate both Gumbel parameters λ and k within the errors required (λ, 0.8%; k, 10%). For the BLASTP defaults, simulations for both Gumbel parameters now take less than 30 s on a 2.8 GHz Pentium 4 processor.
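To see why both parameters matter, the sketch below shows how λ and k are typically used once estimated. It assumes the standard Karlin-Altschul form, which the abstract does not spell out: the expected number of chance alignments scoring at least S between sequences of lengths m and n is E = k·m·n·exp(−λS), and the corresponding P-value is 1 − exp(−E). The function names and the numeric values of λ and k in the usage lines are illustrative placeholders, not values taken from the paper.

```python
import math


def expected_hits(score: float, m: int, n: int, lam: float, k: float) -> float:
    """E-value under the assumed Gumbel form: E = k * m * n * exp(-lam * score)."""
    return k * m * n * math.exp(-lam * score)


def p_value(score: float, m: int, n: int, lam: float, k: float) -> float:
    """Probability of at least one chance alignment with this score or better.

    Uses P = 1 - exp(-E); expm1 keeps precision when E is very small.
    """
    return -math.expm1(-expected_hits(score, m, n, lam, k))


if __name__ == "__main__":
    # Placeholder parameters for illustration only.
    lam, k = 0.27, 0.04
    for s in (30, 50, 70):
        print(s, expected_hits(s, 140, 140, lam, k), p_value(s, 140, 140, lam, k))
```

Because the score enters through exp(−λS), a small relative error in λ shifts E far more than the same relative error in k, which is consistent with the much tighter error bound the abstract quotes for λ than for k.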

Crossref

PubMed Central

Back-translation for discovering distant protein homologies

Author: A. Pedersen
B. Oostra
C. Kosiol
J. Leluk
J. Leluk
J. Raes
K. Okamura
L. Arvestad
L. Delaye
M. Clamp
M. Pellegrini
P. Harrison
P. Lio
R. Blake
S. Altschul
S. Altschul
S. Altschul
Y. Hahn
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2009

Frameshift mutations in protein-coding DNA sequences produce a drastic change in the resulting protein sequence, which prevents classic protein alignment methods from revealing the proteins' common origin. Moreover, when a large number of substitutions are additionally involved in the divergence, homology detection becomes difficult even at the DNA level. To cope with this situation, we propose a novel method to infer distant homology relations between two proteins that accounts for frameshift and point mutations that may have affected the coding sequences. We design a dynamic programming alignment algorithm over memory-efficient graph representations of the complete set of putative DNA sequences of each protein, with the goal of determining the two putative DNA sequences that have the best-scoring alignment under a powerful scoring system designed to reflect the most probable evolutionary process. This allows us to uncover evolutionary information that is not captured by traditional alignment methods, which is confirmed by biologically significant examples. Comment: The 9th International Workshop on Algorithms in Bioinformatics (WABI), Philadelphia, United States of America (2009)

arXiv.org e-Print Archive

CiteSeerX

HAL - Lille 3

Crossref

INRIA a CCSD electronic archive server

Optimization of miRNA-seq data preprocessing.

Author: McPherson John D
Tam Shirley
Tsao Ming-Sound
Publication venue: eScholarship, University of California
Publication date: 17/04/2015

The past two decades of microRNA (miRNA) research have solidified the role of these small non-coding RNAs as key regulators of many biological processes and promising biomarkers for disease. The concurrent development of high-throughput profiling technology has further advanced our understanding of the impact of their dysregulation on a global scale. Currently, next-generation sequencing is the platform of choice for the discovery and quantification of miRNAs. Despite this, there is no clear consensus on how the data should be preprocessed before conducting downstream analyses. Although often overlooked, data preprocessing is an essential step in data analysis: the presence of unreliable features and noise can affect the conclusions drawn from downstream analyses. Using a spike-in dilution study, we evaluated the effects of several general-purpose aligners (BWA, Bowtie, Bowtie 2 and Novoalign) and normalization methods (counts-per-million, total count scaling, upper quartile scaling, Trimmed Mean of M, DESeq, linear regression, cyclic loess and quantile) with respect to the final miRNA count data distribution, variance, bias and accuracy of differential expression analysis. We make practical recommendations on the optimal preprocessing methods for the extraction and interpretation of miRNA count data from small RNA-sequencing experiments.
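As a concrete reference point for the simplest normalization method listed, counts-per-million rescales each raw count by the sample's total so that the values sum to one million. The sketch below is a minimal illustration and not the study's pipeline; the function name `counts_per_million` is hypothetical, and the miRNA labels and read counts in the usage lines are made up.

```python
def counts_per_million(counts: dict[str, int]) -> dict[str, float]:
    """Scale raw read counts so that one sample sums to 1,000,000."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample has no mapped reads; CPM is undefined")
    return {name: count * 1_000_000 / total for name, count in counts.items()}


if __name__ == "__main__":
    # Made-up counts for a single sample, for illustration only.
    sample = {"miR-A": 750, "miR-B": 250}
    print(counts_per_million(sample))
```

Because the scaling factor is the sample total, a few very abundant miRNAs can dominate it, which is one motivation for the quantile-based alternatives the abstract also evaluates.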

CiteSeerX

PubMed Central

eScholarship - University of California