bdbms -- A Database Management System for Biological Data
Biologists are increasingly using databases for storing and managing their
data. Biological databases typically consist of a mixture of raw data,
metadata, sequences, annotations, and related data obtained from various
sources. Current database technology lacks several functionalities that are
needed by biological databases. In this paper, we introduce bdbms, an
extensible prototype database management system for supporting biological data.
bdbms extends the functionalities of current DBMSs to include: (1) Annotation
and provenance management including storage, indexing, manipulation, and
querying of annotation and provenance as first-class objects in bdbms, (2)
Local dependency tracking to track the dependencies and derivations among data
items, (3) Update authorization to support data curation via content-based
authorization, in contrast to identity-based authorization, and (4) New access
methods and their supporting operators that enable pattern matching on various
types of compressed biological data. This paper presents the design of
bdbms along with the techniques proposed to support these functionalities
including an extension to SQL. We also outline some open issues in building
bdbms.

Comment: This article is published under a Creative Commons License Agreement (http://creativecommons.org/licenses/by/2.5/). You may copy, distribute, display, and perform the work, make derivative works and make commercial use of the work, but you must attribute the work to the author and CIDR 2007. 3rd Biennial Conference on Innovative Data Systems Research (CIDR), January 7-10, 2007, Asilomar, California, US.
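The content-based update authorization in (3) lends itself to a small illustration: the system decides whether an update is acceptable by checking a predicate over the proposed new value, rather than the identity of the user submitting it. The sketch below is hypothetical Python, not bdbms's actual interface; class and rule names are invented for the example.

```python
from typing import Callable, Dict, Any

# Hypothetical sketch of content-based update authorization:
# an update is accepted or rejected based on the *new value*,
# not on the identity of the user issuing it.
ContentRule = Callable[[Any], bool]

class CuratedTable:
    def __init__(self) -> None:
        self.rows: Dict[str, Dict[str, Any]] = {}
        self.rules: Dict[str, ContentRule] = {}

    def add_rule(self, column: str, rule: ContentRule) -> None:
        """Register a predicate that every new value for `column` must satisfy."""
        self.rules[column] = rule

    def update(self, key: str, column: str, value: Any) -> bool:
        """Apply the update only if the content-based rule accepts the value."""
        rule = self.rules.get(column)
        if rule is not None and not rule(value):
            return False  # rejected: content fails the curation rule
        self.rows.setdefault(key, {})[column] = value
        return True

# Example: only accept DNA sequences over the alphabet {A, C, G, T}.
table = CuratedTable()
table.add_rule("sequence", lambda v: isinstance(v, str) and set(v) <= set("ACGT"))
assert table.update("gene42", "sequence", "ACGTTGCA")      # accepted
assert not table.update("gene42", "sequence", "ACGXXGCA")  # rejected
```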
Geometry-Oblivious FMM for Compressing Dense SPD Matrices
We present GOFMM (geometry-oblivious FMM), a novel method that creates a
hierarchical low-rank approximation, "compression," of an arbitrary dense
symmetric positive definite (SPD) matrix. For many applications, GOFMM enables
an approximate matrix-vector multiplication in O(N \log N) or even O(N) time,
where N is the matrix size. Compression requires O(N \log N) storage and work.
In general, our scheme belongs to the family of hierarchical matrix
approximation methods. In particular, it generalizes the fast multipole method
(FMM) to a purely algebraic setting by only requiring the ability to sample
matrix entries. Neither geometric information (i.e., point coordinates) nor
knowledge of how the matrix entries have been generated is required, thus the
term "geometry-oblivious." Also, we introduce a shared-memory parallel scheme
for hierarchical matrix computations that reduces synchronization barriers. We
present results on the Intel Knights Landing and Haswell architectures, and on
the NVIDIA Pascal architecture for a variety of matrices.

Comment: 13 pages, accepted by SC'17.
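As a toy illustration of the "only sample matrix entries" idea, the sketch below builds a Nyström-style low-rank approximation of an SPD kernel matrix purely from sampled entries and uses it for fast approximate matrix-vector products. This is not GOFMM itself, which organizes such approximations hierarchically; it shows only the flat low-rank building block, with all names invented for the example.

```python
import numpy as np

def nystrom_compress(sample_entry, n, landmarks):
    """Nystrom-style low-rank factors from sampled entries of an SPD matrix.

    sample_entry(i, j) returns K[i, j]; only sampled entries are touched,
    so no geometric information about the matrix's origin is needed.
    """
    idx = np.asarray(landmarks)
    C = np.array([[sample_entry(i, j) for j in idx] for i in range(n)])  # n x m
    W = C[idx, :]                  # m x m block at the landmark rows
    W_pinv = np.linalg.pinv(W)     # pseudo-inverse for numerical stability
    return C, W_pinv               # K is approximated by C @ W_pinv @ C.T

def approx_matvec(C, W_pinv, x):
    """Approximate K @ x in O(n m) time instead of O(n^2)."""
    return C @ (W_pinv @ (C.T @ x))

# Example: a Gaussian kernel matrix, accessed only through entry sampling.
rng = np.random.default_rng(0)
pts = rng.normal(size=(500, 3))
entry = lambda i, j: np.exp(-np.sum((pts[i] - pts[j]) ** 2))

C, W_pinv = nystrom_compress(entry, n=500,
                             landmarks=rng.choice(500, 50, replace=False))
x = rng.normal(size=500)
y_approx = approx_matvec(C, W_pinv, x)
```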
Data Streams from the Low Frequency Instrument On-Board the Planck Satellite: Statistical Analysis and Compression Efficiency
The expected data rate produced by the Low Frequency Instrument (LFI), planned
to fly on the ESA Planck mission in 2007, is more than a factor of 8 larger
than the bandwidth allowed by the spacecraft transmission system for
downloading the LFI
data. We discuss the application of lossless compression to Planck/LFI data
streams in order to reduce the overall data flow. We perform both theoretical
analysis and experimental tests using realistically simulated data streams in
order to determine the statistical properties of the signal and the maximal
compression rate allowed by several lossless compression algorithms. We study
the influence of signal composition and of acquisition parameters on the
compression rate Cr and develop a semiempirical formalism to account for it.
The best-performing compressor tested so far is arithmetic compression of
order 1, designed to optimize the compression of white-noise-like signals,
which achieves an overall compression rate Cr = 2.65 +/- 0.02. We find that
this result is not improved by other lossless compressors, since the signal
is dominated by nearly white noise. Lossless compression algorithms alone
will not solve the bandwidth problem and need to be combined with other
techniques.

Comment: May 3, 2000 release, 61 pages, 6 figures coded as eps, 9 tables (4 included as eps), LaTeX 2.09 + assms4.sty, style file included, submitted for publication in PASP May 3, 2000.
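The compression rate Cr used above is simply the ratio of uncompressed to compressed size. The sketch below runs that style of measurement on a simulated white-noise-dominated stream, with zlib standing in for the paper's order-1 arithmetic coder (which Python's standard library does not provide); the printed figure is illustrative only.

```python
import zlib
import numpy as np

def compression_rate(samples: np.ndarray) -> float:
    """Cr = uncompressed size / compressed size for a quantized signal stream."""
    # Quantize to signed 16-bit integers, mimicking digitized detector output.
    raw = samples.astype(np.int16).tobytes()
    packed = zlib.compress(raw, level=9)  # stand-in for an order-1 arithmetic coder
    return len(raw) / len(packed)

# Simulated stream: white noise plus a small slowly varying component.
rng = np.random.default_rng(1)
n = 2 ** 20
signal = 200.0 * rng.normal(size=n) + 50.0 * np.sin(np.linspace(0, 40 * np.pi, n))

print(f"Cr = {compression_rate(signal):.2f}")
```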
Shift Aggregate Extract Networks
We introduce an architecture based on deep hierarchical decompositions to
learn effective representations of large graphs. Our framework extends classic
R-decompositions used in kernel methods, enabling nested "part-of-part"
relations. Unlike recursive neural networks, which unroll a template on input
graphs directly, we unroll a neural network template over the decomposition
hierarchy, allowing us to deal with the high degree variability that typically
characterizes social network graphs. Deep hierarchical decompositions are also
amenable to domain compression, a technique that reduces both space and time
complexity by exploiting symmetries. We show empirically that our approach is
competitive with current state-of-the-art graph classification methods,
particularly when dealing with social network datasets.
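A toy sketch of the shift-aggregate-extract pattern over a two-level part-of-part hierarchy follows: each part's representation is formed by shifting its children's representations with a role-specific matrix, aggregating by summation, and extracting through a small nonlinear map. Shapes, roles, and initialization below are invented for illustration and omit details of the actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # representation width (illustrative)

# Role-specific "shift" matrices and a one-layer "extract" network.
W_role = {"member": rng.normal(scale=0.1, size=(D, D)),
          "neighbor": rng.normal(scale=0.1, size=(D, D))}
W_out = rng.normal(scale=0.1, size=(D, D))

def shift_aggregate_extract(children):
    """children: list of (role, representation) pairs for one part."""
    # Shift: apply the role-specific matrix; Aggregate: sum over children.
    h = sum(W_role[role] @ rep for role, rep in children)
    # Extract: a small nonlinear map producing the part's representation.
    return np.tanh(W_out @ h)

# Level 0: leaf objects (e.g., vertices) with random features.
leaves = [rng.normal(size=D) for _ in range(5)]
# Level 1: a part made of leaves; Level 2: a part made of parts.
part = shift_aggregate_extract([("member", v) for v in leaves])
root = shift_aggregate_extract([("member", part), ("neighbor", leaves[0])])
print(root.shape)  # (8,)
```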
On optimally partitioning a text to improve its compression
In this paper we investigate the problem of partitioning an input string T in
such a way that compressing individually its parts via a base-compressor C gets
a compressed output that is shorter than applying C over the entire T at once.
This problem was introduced in the context of table compression, and then
further elaborated and extended to strings and trees. Unfortunately, the
literature offers poor solutions: namely, we know either a cubic-time algorithm
for computing the optimal partition based on dynamic programming, or a few
heuristics that do not guarantee any bounds on the efficacy of their computed
partition, or algorithms that are efficient but work in some specific scenarios
(such as the Burrows-Wheeler Transform) and achieve compression performance
that might be worse than the optimal partitioning by an \Omega(\sqrt{\log n})
factor. Therefore, efficiently computing the optimal solution is still open. In
this paper we provide the first algorithm that is guaranteed to compute in
O(n \log_{1+\epsilon} n) time a partition of T whose compressed output is no
more than (1+\epsilon)-worse than the optimal one, where \epsilon may be any
positive constant.
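The cubic-time dynamic program mentioned above is straightforward to state: the best cost of a prefix is obtained by choosing the best split point and adding the cost of compressing the final segment on its own. The sketch below implements that DP with zlib standing in for the base compressor C; it is purely illustrative and does not attempt the paper's O(n \log_{1+\epsilon} n) approximation scheme.

```python
import zlib

def optimal_partition(T: bytes):
    """Cubic-time DP: minimize total compressed size over all partitions of T.

    cost[j] = best total compressed size of T[:j];
    each candidate split compresses the last segment T[i:j] independently.
    """
    n = len(T)
    seg = lambda i, j: len(zlib.compress(T[i:j]))  # base compressor C (stand-in)
    cost = [0] + [float("inf")] * n
    back = [0] * (n + 1)
    for j in range(1, n + 1):
        for i in range(j):
            c = cost[i] + seg(i, j)
            if c < cost[j]:
                cost[j], back[j] = c, i
    # Recover the split points by walking the back-pointers.
    cuts, j = [], n
    while j > 0:
        cuts.append(j)
        j = back[j]
    return cost[n], sorted(cuts)

# Example: a string whose two halves have very different statistics,
# so splitting them before compression should pay off.
T = b"a" * 256 + bytes(range(256))
total, cuts = optimal_partition(T)
print(total, cuts)
```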