Whole Genome Phylogenetic Tree Reconstruction Using Colored de Bruijn Graphs
We present kleuren, a novel assembly-free method to reconstruct phylogenetic
trees using the Colored de Bruijn Graph. kleuren works by constructing the
Colored de Bruijn Graph and then traversing it, finding bubble structures in
the graph that provide phylogenetic signal. The bubbles are then aligned and
concatenated to form a supermatrix, from which a phylogenetic tree is inferred.
We introduce the algorithms that kleuren uses to accomplish this task, and show
its performance on reconstructing the phylogenetic tree of 12 Drosophila
species. kleuren reconstructed the established phylogenetic tree accurately,
and is a viable tool for phylogenetic tree reconstruction using whole genome
sequences. Software package available at: https://github.com/Colelyman/kleuren
Comment: 6 pages, 3 figures, accepted at BIBE 2017. Minor modifications to the
text due to reviewer feedback and fixed typo
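The idea of a colored de Bruijn graph with bubbles can be illustrated with a small sketch. This is not kleuren's implementation: the function names, the toy sequences, and the rule "walk each color from one shared k-mer to the next shared k-mer" are assumptions made for illustration only.

```python
from collections import defaultdict

def build_colored_dbg(genomes, k):
    """Map each k-mer to the set of genome names (colors) that contain it."""
    colors = defaultdict(set)
    for name, seq in genomes.items():
        for i in range(len(seq) - k + 1):
            colors[seq[i:i + k]].add(name)
    return colors

def successors(colors, kmer, color):
    """k-mers carrying `color` that overlap `kmer` by k-1 bases."""
    return [kmer[1:] + b for b in "ACGT"
            if color in colors.get(kmer[1:] + b, ())]

def find_bubble(colors, start, all_colors, max_len=50):
    """Starting at a k-mer shared by every color, follow each color's
    unambiguous path until it rejoins a k-mer shared by every color.
    Returns {color: path} if the paths differ, otherwise None."""
    paths = {}
    for color in all_colors:
        kmer, path = start, start
        for _ in range(max_len):
            nxt = successors(colors, kmer, color)
            if len(nxt) != 1:          # dead end or ambiguous branch
                return None
            kmer = nxt[0]
            path += kmer[-1]
            if colors[kmer] == all_colors:   # rejoined a shared k-mer
                break
        else:
            return None                # no shared k-mer within max_len
        paths[color] = path
    return paths if len(set(paths.values())) > 1 else None

# Toy input (hypothetical): two sequences differing by a single base.
genomes = {"a": "GATTACAGGC", "b": "GATTTCAGGC"}
graph = build_colored_dbg(genomes, k=4)
bubble = find_bubble(graph, "GATT", set(genomes))
print(bubble)
```

In this toy case the two paths form one bubble whose branches differ at a single position; the abstract describes aligning and concatenating many such bubbles into a supermatrix before tree inference.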
The Collatz conjecture and De Bruijn graphs
We study variants of the well-known Collatz graph, by considering the action
of the 3n+1 function on congruence classes. For moduli equal to powers of 2,
these graphs are shown to be isomorphic to binary De Bruijn graphs. Unlike the
Collatz graph, these graphs are very structured, and have several interesting
properties. We then look at a natural generalization of these finite graphs to
the 2-adic integers, and show that the isomorphism between these infinite
graphs is exactly the conjugacy map previously studied by Bernstein and
Lagarias. Finally, we show that for generalizations of the 3n+1 function, we
get similar relations with 2-adic and p-adic De Bruijn graphs.
Comment: 9 pages, 8 figures
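The claimed isomorphism for moduli that are powers of 2 can be checked by brute force for small cases. The sketch below assumes the "shortcut" form of the 3n+1 function, T(n) = n/2 for even n and (3n+1)/2 for odd n, and assumes an edge a -> b exists whenever some integer in class a (mod 2^k) is mapped into class b; both choices, and all function names, are illustrative assumptions rather than definitions quoted from the paper.

```python
from itertools import permutations

def collatz_graph(k):
    """Edges a -> T(n) mod 2^k for integers n congruent to a mod 2^k."""
    m = 2 ** k
    edges = set()
    for a in range(m):
        # T(n) mod 2^k depends only on n mod 2^(k+1), so two
        # representatives per class are enough.
        for n in (a, a + m):
            t = n // 2 if n % 2 == 0 else (3 * n + 1) // 2
            edges.add((a, t % m))
    return edges

def de_bruijn_graph(k):
    """Binary De Bruijn graph on k-bit words: x -> (2x + b) mod 2^k."""
    m = 2 ** k
    return {(x, (2 * x + b) % m) for x in range(m) for b in (0, 1)}

def isomorphic(e1, e2, n):
    """Brute-force isomorphism test; only feasible for very small n."""
    for perm in permutations(range(n)):
        if {(perm[u], perm[v]) for u, v in e1} == e2:
            return True
    return False

for k in (1, 2, 3):
    print(k, isomorphic(collatz_graph(k), de_bruijn_graph(k), 2 ** k))
```

The brute-force search tries every relabeling of the nodes, so it is practical only up to about k = 3 (8 nodes).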
Cerulean: A hybrid assembly using high throughput short and long reads
Genome assembly using high throughput data with short reads, arguably,
remains an unresolvable task in repetitive genomes, since when the length of a
repeat exceeds the read length, it becomes difficult to unambiguously connect
the flanking regions. The emergence of third-generation sequencing (Pacific
Biosciences) with long reads offers the opportunity to resolve complicated
repeats that could not be resolved by the short-read data. However, these long
reads have a high error rate, and it is an uphill task to assemble the genome
without using additional high-quality short reads. Recently, Koren et al. (2012)
proposed an approach that uses high-quality short-read data to correct these long
reads and thus make assembly from long reads possible. However, due to
the large size of both datasets (short and long reads), error correction of
these long reads requires excessively high computational resources, even on
small bacterial genomes. In this work, instead of error correction of long
reads, we first assemble the short reads and later map these long reads on the
assembly graph to resolve repeats.
Contribution: We present a hybrid assembly approach that is both
computationally effective and produces high quality assemblies. Our algorithm
first operates with a simplified version of the assembly graph consisting only
of long contigs and gradually improves the assembly by adding smaller contigs
in each iteration. In contrast to the state-of-the-art long reads error
correction technique, which requires high computational resources and long
running time on a supercomputer even for bacterial genome datasets, our
software can produce comparable assembly using only a standard desktop in a
short running time.
Comment: Peer-reviewed and presented as part of the 13th Workshop on
Algorithms in Bioinformatics (WABI2013)
Number of cycles in the graph of 312-avoiding permutations
The graph of overlapping permutations is defined in a way analogous to the De
Bruijn graph on strings of symbols. That is, for every permutation
$\pi = \pi_1 \pi_2 \cdots \pi_{n+1}$ there is a directed edge from the
standardization of $\pi_1 \pi_2 \cdots \pi_n$ to the standardization of
$\pi_2 \pi_3 \cdots \pi_{n+1}$. We give a formula for the number of cycles of
length $d$ in the subgraph of overlapping 312-avoiding permutations. Using this
we also give a refinement of the enumeration of 312-avoiding affine
permutations and point out some open problems on this graph, which so far has
been little studied.
Comment: To appear in the Journal of Combinatorial Theory - Series
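The construction can be sketched in a few lines. By analogy with the De Bruijn graph, the sketch assumes that each permutation of length n+1 contributes one edge, from the standardization of its first n entries to the standardization of its last n entries; the function names are hypothetical, and the code only enumerates edges rather than counting cycles.

```python
from itertools import permutations

def standardize(seq):
    """Replace the entries of seq by their ranks 1..len(seq)."""
    ranks = {v: i + 1 for i, v in enumerate(sorted(seq))}
    return tuple(ranks[v] for v in seq)

def avoids_312(perm):
    """True if no positions i < j < k have perm[j] < perm[k] < perm[i]."""
    n = len(perm)
    return not any(perm[j] < perm[k] < perm[i]
                   for i in range(n)
                   for j in range(i + 1, n)
                   for k in range(j + 1, n))

def overlap_edges(n, keep=lambda p: True):
    """One edge per permutation of length n+1 that satisfies `keep`:
    (standardized prefix of length n, standardized suffix of length n)."""
    return [(standardize(p[:-1]), standardize(p[1:]))
            for p in permutations(range(1, n + 2)) if keep(p)]

edges = overlap_edges(3, avoids_312)
print(len(edges))
print(edges[:3])
```

Because every sub-pattern of a 312-avoiding permutation also avoids 312, both endpoints of each retained edge are themselves 312-avoiding, so the retained edges form a subgraph on the 312-avoiding permutations of length n.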
Telescoper: de novo assembly of highly repetitive regions.
Motivation: With advances in sequencing technology, it has become faster and cheaper to obtain short-read data from which to assemble genomes. Although there has been considerable progress in the field of genome assembly, producing high-quality de novo assemblies from short reads remains challenging, primarily because of the complex repeat structures found in the genomes of most higher organisms. The telomeric regions of many genomes are particularly difficult to assemble, though much could be gained from the study of these regions, as their evolution has not been fully characterized and they have been linked to aging.
Results: In this article, we tackle the problem of assembling highly repetitive regions by developing a novel algorithm that iteratively extends long paths through a series of read-overlap graphs and evaluates them based on a statistical framework. Our algorithm, Telescoper, uses short- and long-insert libraries in an integrated way throughout the assembly process. Results on real and simulated data demonstrate that our approach can effectively resolve much of the complex repeat structures found in the telomeres of yeast genomes, especially when longer long-insert libraries are used.
Availability: Telescoper is publicly available for download at sourceforge.net/p/[email protected]
Supplementary information: Supplementary data are available at Bioinformatics online.
Perfect Necklaces
We introduce a variant of de Bruijn words that we call perfect necklaces. Fix
a finite alphabet. Recall that a word is a finite sequence of symbols in the
alphabet and a circular word, or necklace, is the equivalence class of a word
under rotations. For positive integers k and n, we call a necklace
(k,n)-perfect if each word of length k occurs exactly n times at positions
which are different modulo n for any convention on the starting point. We call
a necklace perfect if it is (k,k)-perfect for some k. We prove that every
arithmetic sequence with difference coprime with the alphabet size induces a
perfect necklace. In particular, the concatenation of all words of the same
length in lexicographic order yields a perfect necklace. For each k and n, we
give a closed formula for the number of (k,n)-perfect necklaces. Finally, we
prove that every infinite periodic sequence whose period coincides with some
(k,n)-perfect necklace for any n, passes all statistical tests of size up to k,
but not all larger tests. This last theorem motivated this work.
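The definition of a (k,n)-perfect necklace can be checked directly on small examples. The sketch below is a straightforward reading of the definition above, with hypothetical function names; it assumes the necklace is given as a string of length n times the number of words of length k, and it indexes positions from 0.

```python
from itertools import product

def is_perfect(necklace, alphabet, k, n):
    """Check that every word of length k occurs exactly n times in the
    circular word, at positions that are pairwise different modulo n."""
    if len(necklace) != n * len(alphabet) ** k:
        return False
    circular = necklace + necklace[:k - 1]      # wrap around the end
    residues = {}
    for i in range(len(necklace)):
        residues.setdefault(circular[i:i + k], []).append(i % n)
    # n occurrences with distinct residues means the residues are 0..n-1.
    return (len(residues) == len(alphabet) ** k
            and all(sorted(r) == list(range(n)) for r in residues.values()))

def lex_concat(alphabet, k):
    """Concatenate all words of length k in lexicographic order."""
    return "".join("".join(w) for w in product(sorted(alphabet), repeat=k))

print(lex_concat("01", 2))                              # 00011011
print(is_perfect(lex_concat("01", 2), "01", 2, 2))      # True
print(is_perfect("00110011", "01", 2, 2))               # False
```

The second example fails because the word 00 occurs at positions 0 and 4, which agree modulo 2, even though every word of length 2 occurs exactly twice.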