Detecting Superbubbles in Assembly Graphs
We introduce a new concept of a subgraph class called a superbubble for analyzing assembly graphs, and propose an efficient algorithm for detecting superbubbles. Most assembly algorithms utilize assembly graphs, such as the de Bruijn graph or the overlap graph constructed from reads. From these graphs, many assembly algorithms first detect simple local graph structures (motifs), such as tips and bubbles, mainly to find sequencing errors. These motifs are easy to detect, but they are sometimes too simple to deal with more complex errors. The superbubble is an extension of the bubble and is likewise important for analyzing assembly graphs. Though superbubbles are much more complex than ordinary bubbles, we show that they can be enumerated efficiently. We propose an algorithm that runs in average-case linear time (i.e., O(n+m) for a graph with n vertices and m edges) under a reasonable graph model, though its worst-case time complexity is quadratic (i.e., O(n(n+m))). Moreover, the algorithm is very fast in practice: our experiments show that it runs in reasonable time with a single CPU core even on a very large graph of a whole human genome.
Comment: Peer-reviewed and presented as part of the 13th Workshop on Algorithms in Bioinformatics (WABI 2013).
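For intuition, the core test such an enumeration can build on looks roughly like the following: starting from a candidate entrance, repeatedly absorb vertices all of whose parents are already absorbed, and succeed if the frontier collapses to a single exit with nothing left over. This is a minimal sketch in Python; the names (superbubble_exit, succs, preds) and the simplifications are ours, not the paper's pseudocode.

    # Sketch: find the exit of a superbubble starting at `entrance`, if any.
    # Graph given as successor/predecessor dicts; illustrative only
    # (trivial single-edge "bubbles" are reported too).
    def superbubble_exit(entrance, succs, preds):
        seen = {entrance}     # reached but not yet absorbed
        visited = set()       # absorbed: all parents inside the bubble
        stack = [entrance]
        while stack:
            v = stack.pop()
            visited.add(v)
            seen.discard(v)
            if not succs.get(v):
                return None           # dead end (tip): a path escapes
            for u in succs[v]:
                if u == entrance:
                    return None       # cycle back through the entrance
                seen.add(u)
                if all(p in visited for p in preds.get(u, ())):
                    stack.append(u)
            if len(stack) == 1 and seen == {stack[0]}:
                t = stack[0]          # frontier collapsed to a single vertex
                # a back edge t -> entrance would make the region cyclic
                return t if entrance not in succs.get(t, ()) else None
        return None

    # Simple bubble: s -> {a, b} -> t
    succs = {'s': ['a', 'b'], 'a': ['t'], 'b': ['t'], 't': []}
    preds = {'s': [], 'a': ['s'], 'b': ['s'], 't': ['a', 'b']}
    assert superbubble_exit('s', succs, preds) == 't'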
Indexing Graph Search Trees and Applications
We consider the problem of compactly representing the Depth First Search (DFS) tree of a given undirected or directed graph with n vertices and m edges while supporting various DFS-related queries efficiently in the RAM model with logarithmic word size. We study this problem in two well-known models: the indexing and encoding models. While most of these queries can be supported easily in constant time using O(n lg n) bits of extra space, our goal here is, more specifically, to beat this trivial O(n lg n)-bit space bound without compromising too much on the query time. In the indexing model, the space bound of our solution involves the quantity m, hence we obtain different bounds for sparse and dense graphs. In the encoding model, we first give a space lower bound, followed by an almost optimal data structure with extremely fast query time. Central to our algorithm is a partitioning of the DFS tree into connected subtrees, and a compact way to store these connections. Finally, we also apply these techniques to compactly index the shortest-path structure and biconnectivity structures, among others.
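For reference, the trivial O(n lg n)-bit solution mentioned above can be as simple as storing the pre- and post-order number of each vertex in the DFS tree, which already answers ancestor queries in constant time. A minimal sketch of that baseline (our own illustration, not the paper's data structure):

    # Baseline index: pre/post-order numbers of a DFS tree, O(n log n) bits.
    def dfs_orders(adj, root=0):
        n = len(adj)
        pre, post = [-1] * n, [-1] * n
        timer = 0
        pre[root] = timer; timer += 1
        stack = [(root, iter(adj[root]))]
        while stack:
            v, it = stack[-1]
            u = next(it, None)
            if u is None:                 # all neighbors handled: retreat
                post[v] = timer; timer += 1
                stack.pop()
            elif pre[u] == -1:            # tree edge: descend
                pre[u] = timer; timer += 1
                stack.append((u, iter(adj[u])))
        return pre, post

    def is_ancestor(u, v, pre, post):
        # u is a DFS-tree ancestor of v iff u's interval encloses v's
        return pre[u] <= pre[v] and post[v] <= post[u]

    adj = [[1, 2], [0], [0, 3], [2]]      # a small undirected graph
    pre, post = dfs_orders(adj)
    assert is_ancestor(2, 3, pre, post) and not is_ancestor(1, 3, pre, post)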
Compression with the tudocomp Framework
We present a framework facilitating the implementation and comparison of text compression algorithms. We evaluate its features in a case study of two novel compression algorithms based on the Lempel-Ziv compression schemes that perform well on highly repetitive texts.
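As a purely hypothetical illustration of the comparison idea (the real framework is written in C++, and none of these names come from tudocomp), a harness might pit compressors against a repetitive input like this:

    # Hypothetical harness; tudocomp's actual C++ API differs.
    import lzma
    import zlib

    def compare(compressors, text: bytes):
        for name, fn in compressors.items():
            out = fn(text)
            print(f"{name}: {len(text)} -> {len(out)} bytes "
                  f"({len(out) / len(text):.2%})")

    repetitive = b"abcabx" * 10000        # highly repetitive input
    compare({"zlib": zlib.compress, "lzma": lzma.compress}, repetitive)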
Storing Set Families More Compactly with Top ZDDs
Zero-suppressed Binary Decision Diagrams (ZDDs) are data structures for representing set families in a compressed form. With ZDDs, many valuable operations on set families can be done in time polynomial in the ZDD size. In some cases, however, the ZDDs representing large set families become too large to store in main memory.
This paper proposes the top ZDD, a novel representation of ZDDs that uses less space than existing ones. The top ZDD extends the top tree, a data structure that compresses trees, to compress directed acyclic graphs by sharing identical subgraphs. We prove that navigational operations on ZDDs can be done in time poly-logarithmic in the ZDD size, and show that there exist set families for which the top ZDD is exponentially smaller than the ZDD. We also show experimentally that our top ZDDs are smaller than the corresponding ZDDs on real data.
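For concreteness, a ZDD is a DAG of (variable, lo-child, hi-child) nodes over terminals 0 and 1, and membership of a set follows a single root-to-terminal path, which is the kind of navigational operation meant above. A minimal sketch under our own simplified tuple encoding (not the paper's representation):

    # Node = (var, lo, hi); terminals: 0 = empty family, 1 = {emptyset}.
    # Zero-suppression (assumed, not enforced here): no node has hi == 0.
    def contains(node, items):
        """Does the family rooted at `node` contain the set `items`?"""
        items = set(items)
        while node not in (0, 1):
            var, lo, hi = node
            if var in items:
                items.remove(var)
                node = hi             # var present: take the hi branch
            else:
                node = lo             # var absent: take the lo branch
        return node == 1 and not items

    # Family {{1}, {1, 2}} with variable order 1 < 2:
    n2 = (2, 1, 1)                    # 2 optional once 1 is taken
    root = (1, 0, n2)                 # 1 is mandatory
    assert contains(root, {1}) and contains(root, {1, 2})
    assert not contains(root, {2}) and not contains(root, set())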
MEGAHIT: An ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph
MEGAHIT is an NGS de novo assembler for assembling large and complex metagenomics data in a time- and cost-efficient manner. It finished assembling a 252 Gbp soil metagenomics dataset in 44.1 hours and 99.6 hours on a single computing node with and without a GPU, respectively. MEGAHIT assembles the data as a whole, i.e., it avoids pre-processing such as partitioning and normalization, which might compromise the integrity of the result. MEGAHIT generates an assembly 3 times larger, with longer contig N50 and average contig length, than the previous assembly. 55.8% of the reads were aligned to the assembly, 4 times higher than with the previous assembly. The source code of MEGAHIT is freely available at https://github.com/voutcn/megahit under the GPLv3 license.
Comment: 2 pages, 2 tables, 1 figure, submitted to Oxford Bioinformatics as an Application Note.
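For readers new to the underlying structure: a de Bruijn graph links every k-mer to the k-mers that follow it in some read, and contigs correspond to unbranched paths in it. A toy hash-based sketch is below; MEGAHIT's contribution is a succinct representation of this graph, which the toy version makes no attempt at.

    from collections import defaultdict

    # Toy de Bruijn graph: an edge between consecutive k-mers of each read.
    def de_bruijn(reads, k):
        succs = defaultdict(set)
        for read in reads:
            for i in range(len(read) - k):
                succs[read[i:i + k]].add(read[i + 1:i + k + 1])
        return succs

    graph = de_bruijn(["ACGTAC", "CGTACG"], k=3)
    assert graph["ACG"] == {"CGT"} and graph["GTA"] == {"TAC"}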
Improving the Speed of LZ77 Compression by Hashing and Suffix Sorting
Two new algorithms for improving the speed of LZ77 compression are proposed. One is based on a new hashing algorithm named two-level hashing that enables fast longest-match searching in a sliding dictionary; the other uses suffix sorting. The former is suitable for small dictionaries and significantly improves the speed of gzip, which uses a naive hashing algorithm. The latter is suitable for large dictionaries, which improve the compression ratio for large files. We also experiment on the compression ratio and speed of block-sorting compression, which uses suffix sorting in its compression algorithm. The results show that LZ77 with the two-level hash is suitable for small dictionaries; LZ77 with suffix sorting is good for large dictionaries when fast decompression and efficient use of memory are necessary; and block sorting is good for large dictionaries.
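For context, the gzip-style baseline chains together earlier positions whose first few bytes hash alike and scans each candidate for the longest match; the two-level hash and the suffix-sorting variant both replace this search. A minimal sketch of that baseline only (the paper's own schemes are not reproduced here):

    # Greedy LZ77 with a hash chain over `min_len`-grams: the naive-hash
    # baseline that the proposed algorithms accelerate.
    def lz77_parse(data, min_len=3, window=32768):
        head, prev = {}, {}       # gram -> latest position; pos -> older pos
        out, i = [], 0
        while i < len(data):
            gram = data[i:i + min_len]
            best_len, best_pos = 0, -1
            j = head.get(gram, -1)
            while j >= 0 and j >= i - window:   # walk the candidate chain
                length = 0
                while (i + length < len(data)
                       and data[j + length] == data[i + length]):
                    length += 1
                if length > best_len:
                    best_len, best_pos = length, j
                j = prev.get(j, -1)
            if best_len >= min_len:
                out.append((i - best_pos, best_len))   # (distance, length)
                step = best_len
            else:
                out.append(data[i])                    # literal
                step = 1
            for p in range(i, i + step):               # index new positions
                g = data[p:p + min_len]
                prev[p] = head.get(g, -1)
                head[g] = p
            i += step
        return out

    assert lz77_parse("abcabcabc") == ["a", "b", "c", (3, 6)]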