64 research outputs found
Pattern-based phylogenetic distance estimation and tree reconstruction
We have developed an alignment-free method that calculates phylogenetic
distances using a maximum likelihood approach for a model of sequence change on
patterns that are discovered in unaligned sequences. To evaluate the
phylogenetic accuracy of our method, and to conduct a comprehensive comparison
of existing alignment-free methods (freely available as Python package decaf+py
at http://www.bioinformatics.org.au), we have created a dataset of reference
trees covering a wide range of phylogenetic distances. Amino acid sequences
were evolved along the trees and input to the tested methods; from their
calculated distances we infered trees whose topologies we compared to the
reference trees.
We find our pattern-based method statistically superior to all other tested
alignment-free methods on this dataset. We also demonstrate the general
advantage of alignment-free methods over an approach based on automated
alignments when sequences violate the assumption of collinearity. Similarly, we
compare methods on empirical data from an existing alignment benchmark set that
we used to derive reference distances and trees. Our pattern-based approach
yields distances that show a linear relationship to reference distances over a
substantially longer range than other alignment-free methods. The pattern-based
approach outperforms alignment-free methods and its phylogenetic accuracy is
statistically indistinguishable from alignment-based distances.Comment: 21 pages, 3 figures, 2 table
Sympathetic cooling in a mixture of diamagnetic and paramagnetic atoms
We have experimentally realized a hybrid trap for ultracold paramagnetic
rubidium and diamagnetic ytterbium atoms by combining a bichromatic optical
dipole trap for ytterbium with a Ioffe-Pritchard-type magnetic trap for
rubidium. In this hybrid trap, sympathetic cooling of five different ytterbium
isotopes through elastic collisions with rubidium was achieved. A strong
dependence of the interspecies collisional cross section on the mass of the
ytterbium isotope was observed.Comment: 4 pages, 4 figure
Predicting the influence of a p2-symmetric substrate on molecular self-organization with an interaction-site model
An interaction-site model can a priori predict molecular selforganisation on a new substrate in Monte Carlo simulations. This is experimentally confirmed with scanning tunnelling microscopy on Fre´chet dendrons of a pentacontane template. Local and global ordering motifs, inclusion molecules and a rotated unit cell are correctly predicted
Evolutionary distances in the twilight zone -- a rational kernel approach
Phylogenetic tree reconstruction is traditionally based on multiple sequence
alignments (MSAs) and heavily depends on the validity of this information
bottleneck. With increasing sequence divergence, the quality of MSAs decays
quickly. Alignment-free methods, on the other hand, are based on abstract
string comparisons and avoid potential alignment problems. However, in general
they are not biologically motivated and ignore our knowledge about the
evolution of sequences. Thus, it is still a major open question how to define
an evolutionary distance metric between divergent sequences that makes use of
indel information and known substitution models without the need for a multiple
alignment. Here we propose a new evolutionary distance metric to close this
gap. It uses finite-state transducers to create a biologically motivated
similarity score which models substitutions and indels, and does not depend on
a multiple sequence alignment. The sequence similarity score is defined in
analogy to pairwise alignments and additionally has the positive semi-definite
property. We describe its derivation and show in simulation studies and
real-world examples that it is more accurate in reconstructing phylogenies than
competing methods. The result is a new and accurate way of determining
evolutionary distances in and beyond the twilight zone of sequence alignments
that is suitable for large datasets.Comment: to appear in PLoS ON
Simultaneous identification of long similar substrings in large sets of sequences
<p>Abstract</p> <p>Background</p> <p>Sequence comparison faces new challenges today, with many complete genomes and large libraries of transcripts known. Gene annotation pipelines match these sequences in order to identify genes and their alternative splice forms. However, the software currently available cannot simultaneously compare sets of sequences as large as necessary especially if errors must be considered.</p> <p>Results</p> <p>We therefore present a new algorithm for the identification of almost perfectly matching substrings in very large sets of sequences. Its implementation, called ClustDB, is considerably faster and can handle 16 times more data than VMATCH, the most memory efficient exact program known today. ClustDB simultaneously generates large sets of exactly matching substrings of a given minimum length as seeds for a novel method of match extension with errors. It generates alignments of maximum length with a considered maximum number of errors within each overlapping window of a given size. Such alignments are not optimal in the usual sense but faster to calculate and often more appropriate than traditional alignments for genomic sequence comparisons, EST and full-length cDNA matching, and genomic sequence assembly. The method is used to check the overlaps and to reveal possible assembly errors for 1377 <it>Medicago truncatula </it>BAC-size sequences published at <url>http://www.medicago.org/genome/assembly_table.php?chr=1</url>.</p> <p>Conclusion</p> <p>The program ClustDB proves that window alignment is an efficient way to find long sequence sections of homogenous alignment quality, as expected in case of random errors, and to detect systematic errors resulting from sequence contaminations. Such inserts are systematically overlooked in long alignments controlled by only tuning penalties for mismatches and gaps.</p> <p>ClustDB is freely available for academic use.</p
Fast algorithms for computing sequence distances by exhaustive substring composition
The increasing throughput of sequencing raises growing needs for methods of sequence analysis and comparison on a genomic scale, notably, in connection with phylogenetic tree reconstruction. Such needs are hardly fulfilled by the more traditional measures of sequence similarity and distance, like string edit and gene rearrangement, due to a mixture of epistemological and computational problems. Alternative measures, based on the subword composition of sequences, have emerged in recent years and proved to be both fast and effective in a variety of tested cases. The common denominator of such measures is an underlying information theoretic notion of relative compressibility. Their viability depends critically on computational cost. The present paper describes as a paradigm the extension and efficient implementation of one of the methods in this class. The method is based on the comparison of the frequencies of all subwords in the two input sequences, where frequencies are suitably adjusted to take into account the statistical background
SeqAn An efficient, generic C++ library for sequence analysis
<p>Abstract</p> <p>Background</p> <p>The use of novel algorithmic techniques is pivotal to many important problems in life science. For example the sequencing of the human genome <abbrgrp><abbr bid="B1">1</abbr></abbrgrp> would not have been possible without advanced assembly algorithms. However, owing to the high speed of technological progress and the urgent need for bioinformatics tools, there is a widening gap between state-of-the-art algorithmic techniques and the actual algorithmic components of tools that are in widespread use.</p> <p>Results</p> <p>To remedy this trend we propose the use of SeqAn, a library of efficient data types and algorithms for sequence analysis in computational biology. SeqAn comprises implementations of existing, practical state-of-the-art algorithmic components to provide a sound basis for algorithm testing and development. In this paper we describe the design and content of SeqAn and demonstrate its use by giving two examples. In the first example we show an application of SeqAn as an experimental platform by comparing different exact string matching algorithms. The second example is a simple version of the well-known MUMmer tool rewritten in SeqAn. Results indicate that our implementation is very efficient and versatile to use.</p> <p>Conclusion</p> <p>We anticipate that SeqAn greatly simplifies the rapid development of new bioinformatics tools by providing a collection of readily usable, well-designed algorithmic components which are fundamental for the field of sequence analysis. This leverages not only the implementation of new algorithms, but also enables a sound analysis and comparison of existing algorithms.</p
Complete Genome Viral Phylogenies Suggests the Concerted Evolution of Regulatory Cores and Accessory Satellites
We consider the concerted evolution of viral genomes in four families of DNA viruses. Given the high rate of horizontal gene transfer among viruses and their hosts, it is an open question as to how representative particular genes are of the evolutionary history of the complete genome. To address the concerted evolution of viral genes, we compared genomic evolution across four distinct, extant viral families. For all four viral families we constructed DNA-dependent DNA polymerase-based (DdDp) phylogenies and in addition, whole genome sequence, as quantitative descriptions of inter-genome relationships. We found that the history of the polymerase gene was highly predictive of the history of the genome as a whole, which we explain in terms of repeated, co-divergence events of the core DdDp gene accompanied by a number of satellite, accessory genetic loci. We also found that the rate of gene gain in baculovirus and poxviruses proceeds significantly more quickly than the rate of gene loss and that there is convergent acquisition of satellite functions promoting contextual adaptation when distinct viral families infect related hosts. The congruence of the genome and polymerase trees suggests that a large set of viral genes, including polymerase, derive from a phylogenetically conserved core of genes of host origin, secondarily reinforced by gene acquisition from common hosts or co-infecting viruses within the host. A single viral genome can be thought of as a mutualistic network, with the core genes acting as an effective host and the satellite genes as effective symbionts. Larger virus genomes show a greater departure from linkage equilibrium between core and satellites functions
- …