On-Disk Data Processing: Issues and Future Directions
In this paper, we present a survey of "on-disk" data processing (ODDP). ODDP,
which is a form of near-data processing, refers to a computing arrangement
in which the secondary storage drives have data processing capability.
Proposed ODDP schemes vary widely in terms of the data processing capability,
target applications, architecture and the kind of storage drive employed. Some
ODDP schemes provide only a specific but heavily used operation like sort
whereas some provide a full range of operations. Recently, with the advent of
Solid State Drives, powerful and extensive ODDP solutions have been proposed.
In this paper, we present a thorough review of architectures developed for
different on-disk processing approaches along with current and future
challenges and also identify the future directions which ODDP can take.
Comment: 24 pages, 17 Figures, 3 Tables
Simple I/O-efficient flow accumulation on grid terrains
The flow accumulation problem for grid terrains takes as input a matrix of
flow directions that specifies, for each cell of the grid, to which of its eight
neighbours any incoming water would flow. The problem is to compute, for each
cell c, from how many cells of the terrain water would reach c. We show that
this problem can be solved in O(scan(N)) I/Os for a terrain of N cells. Taking
constant factors in the I/O-efficiency into account, our algorithm may be an
order of magnitude faster than the previously known algorithm that is based on
time-forward processing and needs O(sort(N)) I/Os.
Comment: This paper is an exact copy of the paper that appeared in the
abstract collection of the Workshop on Massive Data Algorithms, Aarhus, 200
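
The problem statement above can be illustrated with a small in-memory sketch. This is not the I/O-efficient algorithm from the paper; it only shows what is being computed, using a plain topological pass. The direction encoding (an index into `OFFSETS`, or `None` for a cell with no outflow), the function name, and the choice to count each cell as reaching itself are all assumptions made for this example, and the flow directions are assumed to contain no cycles.

```python
from collections import deque

# Offsets of the eight neighbours, indexed 0..7 (encoding assumed for this sketch).
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]


def flow_accumulation(direction):
    """Return, for each cell, the number of cells whose water reaches it.

    direction[r][c] is an index into OFFSETS, or None if the cell has no outflow.
    Each cell is counted as reaching itself (an assumption of this sketch).
    """
    rows, cols = len(direction), len(direction[0])

    def target(r, c):
        d = direction[r][c]
        if d is None:
            return None
        tr, tc = r + OFFSETS[d][0], c + OFFSETS[d][1]
        if 0 <= tr < rows and 0 <= tc < cols:
            return tr, tc
        return None  # water leaves the grid

    # Count how many neighbours drain directly into each cell.
    indegree = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            t = target(r, c)
            if t is not None:
                indegree[t[0]][t[1]] += 1

    acc = [[1] * cols for _ in range(rows)]
    queue = deque((r, c) for r in range(rows) for c in range(cols)
                  if indegree[r][c] == 0)

    # Process cells only after all of their upstream cells are final.
    while queue:
        r, c = queue.popleft()
        t = target(r, c)
        if t is None:
            continue
        acc[t[0]][t[1]] += acc[r][c]
        indegree[t[0]][t[1]] -= 1
        if indegree[t[0]][t[1]] == 0:
            queue.append(t)
    return acc
```

For a single row in which every cell drains east, `flow_accumulation([[4, 4, None]])` accumulates one more cell at each step along the row.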
Gerbil: A Fast and Memory-Efficient k-mer Counter with GPU-Support
A basic task in bioinformatics is the counting of k-mers in genome strings.
The k-mer counting problem is to build a histogram of all substrings of
length k in a given genome sequence. We present the open source k-mer
counting software Gerbil that has been designed for the efficient counting of
k-mers for large k. Given the technology trend towards long reads of
next-generation sequencers, support for large k becomes increasingly
important. While existing k-mer counting tools suffer from excessive memory
resource consumption or degrading performance for large k, Gerbil is able to
efficiently support large k without much loss of performance. Our software
implements a two-disk approach. In the first step, DNA reads are loaded from
disk and distributed to temporary files that are stored on a working disk. In a
second step, the temporary files are read again, split into k-mers and
counted via a hash table approach. In addition, Gerbil can optionally use GPUs
to accelerate the counting step. For large k, we outperform state-of-the-art
open source k-mer counting tools for large genome data sets.
Comment: A short version of this paper will appear in the proceedings of WABI
201
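
The two-step scheme described above (distribute, then count each part with a hash table) can be sketched roughly as follows. This is not Gerbil's implementation: in-memory lists stand in for the temporary files, k-mers rather than read fragments are distributed, the partition key is a plain CRC32 of the k-mer, and reverse complements are not merged. The function name `count_kmers` and the parameter `num_buckets` are hypothetical.

```python
import zlib
from collections import Counter


def count_kmers(reads, k, num_buckets=4):
    """Build a histogram of all length-k substrings of the given reads."""
    # Step 1: distribute k-mers to buckets (stand-ins for temporary files).
    # Equal k-mers always hash to the same bucket, so buckets are independent.
    buckets = [[] for _ in range(num_buckets)]
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            buckets[zlib.crc32(kmer.encode()) % num_buckets].append(kmer)

    # Step 2: count each bucket separately with a hash table and merge.
    histogram = {}
    for bucket in buckets:
        histogram.update(Counter(bucket))
    return histogram
```

Because each bucket holds only a fraction of the distinct k-mers, the hash table used in the second step needs to hold only one bucket's keys at a time, which is the point of partitioning first.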
FORM version 4.0
We present version 4.0 of the symbolic manipulation system FORM. The most
important new features are manipulation of rational polynomials and the
factorization of expressions. Many other new functions and commands are also
added; some of them are very general, while others are designed for building
specific high-level packages, such as one for Groebner bases. Also new is the
checkpoint facility, which allows for periodic backups during long calculations.
Lastly, FORM 4.0 has become available as open source under the GNU General
Public License version 3.
Comment: 26 pages. Uses axodra
I/O-optimal algorithms on grid graphs
Given a graph of which the n vertices form a regular two-dimensional grid,
and in which each (possibly weighted and/or directed) edge connects a vertex to
one of its eight neighbours, the following can be done in O(scan(n)) I/Os,
provided M = Omega(B^2): computation of shortest paths with non-negative edge
weights from a single source, breadth-first traversal, computation of a minimum
spanning tree, topological sorting, time-forward processing (if the input is a
plane graph), and an Euler tour (if the input graph is a tree). The
minimum-spanning tree algorithm is cache-oblivious. The best previously
published algorithms for these problems need Theta(sort(n)) I/Os. Estimates of
the actual I/O volume show that the new algorithms may often be very efficient
in practice.
Comment: 12 pages' extended abstract plus 12 pages' appendix with details,
proofs and calculations. Has not been published in, and is not currently under
review at, any conference or journal