22,982 research outputs found

    On-Disk Data Processing: Issues and Future Directions

    Get PDF
    In this paper, we present a survey of "on-disk" data processing (ODDP). ODDP, which is a form of near-data processing, refers to the computing arrangement where the secondary storage drives have the data processing capability. Proposed ODDP schemes vary widely in terms of the data processing capability, target applications, architecture and the kind of storage drive employed. Some ODDP schemes provide only a specific but heavily used operation like sort whereas some provide a full range of operations. Recently, with the advent of Solid State Drives, powerful and extensive ODDP solutions have been proposed. In this paper, we present a thorough review of architectures developed for different on-disk processing approaches along with current and future challenges and also identify the future directions which ODDP can take.Comment: 24 pages, 17 Figures, 3 Table

    Simple I/O-efficient flow accumulation on grid terrains

    Full text link
    The flow accumulation problem for grid terrains takes as input a matrix of flow directions, that specifies for each cell of the grid to which of its eight neighbours any incoming water would flow. The problem is to compute, for each cell c, from how many cells of the terrain water would reach c. We show that this problem can be solved in O(scan(N)) I/Os for a terrain of N cells. Taking constant factors in the I/O-efficiency into account, our algorithm may be an order of magnitude faster than the previously known algorithm that is based on time-forward processing and needs O(sort(N)) I/Os.Comment: This paper is an exact copy of the paper that appeared in the abstract collection of the Workshop on Massive Data Algorithms, Aarhus, 200

    Gerbil: A Fast and Memory-Efficient kk-mer Counter with GPU-Support

    Get PDF
    A basic task in bioinformatics is the counting of kk-mers in genome strings. The kk-mer counting problem is to build a histogram of all substrings of length kk in a given genome sequence. We present the open source kk-mer counting software Gerbil that has been designed for the efficient counting of kk-mers for k≥32k\geq32. Given the technology trend towards long reads of next-generation sequencers, support for large kk becomes increasingly important. While existing kk-mer counting tools suffer from excessive memory resource consumption or degrading performance for large kk, Gerbil is able to efficiently support large kk without much loss of performance. Our software implements a two-disk approach. In the first step, DNA reads are loaded from disk and distributed to temporary files that are stored at a working disk. In a second step, the temporary files are read again, split into kk-mers and counted via a hash table approach. In addition, Gerbil can optionally use GPUs to accelerate the counting step. For large kk, we outperform state-of-the-art open source kk-mer counting tools for large genome data sets.Comment: A short version of this paper will appear in the proceedings of WABI 201

    FORM version 4.0

    Full text link
    We present version 4.0 of the symbolic manipulation system FORM. The most important new features are manipulation of rational polynomials and the factorization of expressions. Many other new functions and commands are also added; some of them are very general, while others are designed for building specific high level packages, such as one for Groebner bases. New is also the checkpoint facility, that allows for periodic backups during long calculations. Lastly, FORM 4.0 has become available as open source under the GNU General Public License version 3.Comment: 26 pages. Uses axodra

    I/O-optimal algorithms on grid graphs

    Full text link
    Given a graph of which the n vertices form a regular two-dimensional grid, and in which each (possibly weighted and/or directed) edge connects a vertex to one of its eight neighbours, the following can be done in O(scan(n)) I/Os, provided M = Omega(B^2): computation of shortest paths with non-negative edge weights from a single source, breadth-first traversal, computation of a minimum spanning tree, topological sorting, time-forward processing (if the input is a plane graph), and an Euler tour (if the input graph is a tree). The minimum-spanning tree algorithm is cache-oblivious. The best previously published algorithms for these problems need Theta(sort(n)) I/Os. Estimates of the actual I/O volume show that the new algorithms may often be very efficient in practice.Comment: 12 pages' extended abstract plus 12 pages' appendix with details, proofs and calculations. Has not been published in and is currently not under review of any conference or journa
    • …
    corecore