The SBC-Tree: An Index for Run-Length Compressed Sequences
Run-Length-Encoding (RLE) is a data compression technique that is used in various applications, e.g., biological sequence databases, multimedia, and facsimile transmission. One of the main challenges is how to operate, e.g., indexing, searching, and retrieval, on the compressed data without decompressing it. In this paper, we present the String B-tree for Compressed sequences, termed the SBC-tree, for indexing and searching RLE-compressed sequences of arbitrary length. The SBC-tree is a two-level index structure based on the well-known String B-tree and a 3-sided range query structure. The SBC-tree supports substring as well as prefix matching, and range search operations over RLE-compressed sequences. The SBC-tree has an optimal external-memory space complexity of O(N/B) pages, where N is the total length of the compressed sequences, and B is the disk page size. The insertion and deletion of all suffixes of a compressed sequence of length m takes O(m log_B(N + m)) I/O operations. Substring matching, prefix matching, and range search execute in an optimal O(log_B N + (|p| + T)/B) I/O operations, where |p| is the length of the compressed query pattern and T is the query output size. We also present two variants of the SBC-tree: the SBC-tree that is based on an R-tree instead of the 3-sided structure, and the one-level SBC-tree that does not use a two-dimensional index. These variants do not have provable worst-case theoretical bounds for search operations, but perform well in practice. The SBC-tree index is realized inside PostgreSQL in the context of a biological protein database application. Performance results illustrate that using the SBC-tree to index RLE-compressed sequences achieves up to an order of magnitude reduction in storage, up to 30% reduction in I/Os for the insertion operations, and retains the optimal search performance achieved by the String B-tree over the uncompressed sequences.
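To make the idea of operating on RLE-compressed data without decompressing it concrete, here is a minimal Python sketch. It is not the SBC-tree: it is a plain linear scan with no index, and the function names `rle_encode` and `rle_contains` are hypothetical. It only illustrates that a substring test can be carried out by comparing runs: the first and last runs of the pattern may be covered by longer runs in the text, while the inner runs must match exactly.

```python
from itertools import groupby


def rle_encode(s):
    """Return the run-length encoding of s as a list of (char, run_length) pairs."""
    return [(ch, len(list(run))) for ch, run in groupby(s)]


def rle_contains(text_runs, pattern_runs):
    """Test whether the pattern occurs as a substring, comparing runs only.

    Both arguments are assumed to be proper RLE lists (adjacent runs
    hold different characters). Neither sequence is decompressed.
    """
    m = len(pattern_runs)
    if m == 0:
        return True
    first_ch, first_len = pattern_runs[0]
    last_ch, last_len = pattern_runs[-1]
    for i in range(len(text_runs) - m + 1):
        window = text_runs[i:i + m]
        # The first pattern run must be a suffix of a text run.
        first_ok = window[0][0] == first_ch and window[0][1] >= first_len
        # The last pattern run must be a prefix of a text run.
        last_ok = window[-1][0] == last_ch and window[-1][1] >= last_len
        # Inner runs are bounded on both sides, so they must match exactly.
        middle_ok = window[1:-1] == pattern_runs[1:-1]
        if first_ok and last_ok and middle_ok:
            return True
    return False


text = rle_encode("AAABBBBC")
print(rle_contains(text, rle_encode("ABBBBC")))
print(rle_contains(text, rle_encode("ABBC")))
```

The scan costs time proportional to the number of runs in the text for each query; an index such as the one described in the abstract exists precisely to avoid that linear pass.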
Compressive Mining: Fast and Optimal Data Mining in the Compressed Domain
Real-world data typically contain repeated and periodic patterns. This
suggests that they can be effectively represented and compressed using only a
few coefficients of an appropriate basis (e.g., Fourier, Wavelets, etc.).
However, distance estimation when the data are represented using different sets
of coefficients is still a largely unexplored area. This work studies the
optimization problems related to obtaining the tightest lower/upper
bound on Euclidean distances when each data object is potentially compressed
using a different set of orthonormal coefficients. Our technique leads to
tighter distance estimates, which translates into more accurate search,
learning and mining operations directly in the compressed domain.
We formulate the problem of estimating lower/upper distance bounds as an
optimization problem. We establish the properties of optimal solutions, and
leverage the theoretical analysis to develop a fast algorithm to obtain an
exact solution to the problem. The suggested solution provides the
tightest estimation of the L2-norm or the correlation. We show that typical
data-analysis operations, such as k-NN search or k-Means clustering, can
operate more accurately using the proposed compression and distance
reconstruction technique. We compare it with many other prevalent compression
and reconstruction techniques, including random projections and PCA-based
techniques. We highlight a surprising result, namely that when the data are
highly sparse in some basis, our technique may even outperform PCA-based
compression.
The contributions of this work are generic as our methodology is applicable
to any sequential or high-dimensional data as well as to any orthogonal data
transformation used for the underlying data compression scheme.
Comment: 25 pages, 20 figures, accepted in VLD
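The notion of lower and upper distance bounds in the compressed domain can be illustrated with a much simpler case than the one the abstract addresses. The Python sketch below assumes that both objects keep the same first k coefficients of an orthonormal basis, plus the energy (squared norm) of the discarded coefficients. It is not the optimal algorithm described above, which handles different coefficient sets per object. The names `compress` and `distance_bounds` are hypothetical. The bounds follow from the triangle inequality applied to the discarded part.

```python
import math


def compress(coeffs, k):
    """Keep the first k coefficients and the energy of the discarded ones."""
    kept = list(coeffs[:k])
    residual_energy = sum(c * c for c in coeffs[k:])
    return kept, residual_energy


def distance_bounds(a, b):
    """Lower and upper bounds on the Euclidean distance of two compressed objects.

    Assumes both objects kept the same coefficient positions of an
    orthonormal basis, so distances are preserved by the transform.
    """
    kept_a, energy_a = a
    kept_b, energy_b = b
    # Exact contribution of the coefficients that both objects stored.
    d_kept_sq = sum((x - y) ** 2 for x, y in zip(kept_a, kept_b))
    # Norms of the discarded parts; only these are known, not their direction.
    ra, rb = math.sqrt(energy_a), math.sqrt(energy_b)
    lower = math.sqrt(d_kept_sq + (ra - rb) ** 2)
    upper = math.sqrt(d_kept_sq + (ra + rb) ** 2)
    return lower, upper


x = [3.0, 1.0, 0.5, -0.2]
y = [2.5, -1.0, 0.1, 0.4]
true_distance = math.dist(x, y)
lower, upper = distance_bounds(compress(x, 2), compress(y, 2))
print(lower <= true_distance <= upper)
```

Storing the residual energy costs one extra number per object and is what makes an upper bound possible at all; without it only the lower bound from the kept coefficients is available.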
Sequential Compressed Sensing
Compressed sensing allows perfect recovery of sparse signals (or signals
sparse in some basis) using only a small number of random measurements.
Existing results in compressed sensing literature have focused on
characterizing the achievable performance by bounding the number of samples
required for a given level of signal sparsity. However, using these bounds to
minimize the number of samples requires a-priori knowledge of the sparsity of
the unknown signal, or the decay structure for near-sparse signals.
Furthermore, there are some popular recovery methods for which no such bounds
are known.
In this paper, we investigate an alternative scenario where observations are
available in sequence. For any recovery method, this means that there is now a
sequence of candidate reconstructions. We propose a method to estimate the
reconstruction error directly from the samples themselves, for every candidate
in this sequence. This estimate is universal in the sense that it is based only
on the measurement ensemble, and not on the recovery method or any assumed
level of sparsity of the unknown signal. With these estimates, one can now stop
observations as soon as there is reasonable certainty of either exact or
sufficiently accurate reconstruction. They also provide a way to obtain
"run-time" guarantees for recovery methods that otherwise lack a-priori
performance bounds.
We investigate both continuous (e.g. Gaussian) and discrete (e.g. Bernoulli)
random measurement ensembles, both for exactly sparse and general near-sparse
signals, and with both noisy and noiseless measurements.
Comment: to appear in IEEE transactions on Special Topics in Signal Processin
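The idea of estimating the reconstruction error from the samples themselves can be sketched for the noiseless Gaussian case. If a fresh measurement vector a has i.i.d. standard normal entries, then the difference between the new observation and the value predicted by the current reconstruction has variance equal to the squared reconstruction error. The Python sketch below averages such squared differences; it is an illustration of that observation only, not the paper's estimator or stopping rule, and the name `estimate_error` is hypothetical. The true signal appears in the code only to simulate the incoming observations.

```python
import math
import random


def estimate_error(x_true, x_hat, num_extra, rng):
    """Estimate ||x_true - x_hat||_2 from num_extra fresh Gaussian measurements.

    x_true is used only to simulate each new observation y = <a, x_true>;
    the estimate itself depends on y, the measurement vector a, and x_hat.
    """
    total = 0.0
    for _ in range(num_extra):
        a = [rng.gauss(0.0, 1.0) for _ in x_true]
        y = sum(ai * xi for ai, xi in zip(a, x_true))        # new observation
        y_pred = sum(ai * xi for ai, xi in zip(a, x_hat))    # predicted by candidate
        total += (y - y_pred) ** 2
    # Each squared deviation has expectation ||x_true - x_hat||^2.
    return math.sqrt(total / num_extra)


rng = random.Random(0)
x_true = [1.0, 0.0, 0.0, -2.0, 0.0, 0.0, 0.5, 0.0]
x_hat = [0.9, 0.0, 0.0, -1.7, 0.0, 0.1, 0.5, 0.0]
true_error = math.dist(x_true, x_hat)
estimate = estimate_error(x_true, x_hat, 2000, rng)
print(abs(estimate - true_error) / true_error < 0.2)
```

The estimate does not depend on how `x_hat` was produced, which is the sense in which such an estimate is independent of the recovery method.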
Compressed sensing reconstruction using Expectation Propagation
Many interesting problems in fields ranging from telecommunications to
computational biology can be formalized in terms of large underdetermined
systems of linear equations with additional constraints or regularizers. One of
the most studied ones, the Compressed Sensing problem (CS), consists in finding
the solution with the smallest number of non-zero components of a given system
of linear equations for a known measurement vector and sensing matrix. Here, we
will address the compressed sensing problem within a Bayesian inference
framework where the sparsity constraint is remapped into a singular prior
distribution (called Spike-and-Slab or Bernoulli-Gauss). Solution to the
problem is attempted through the computation of marginal distributions via
Expectation Propagation (EP), an iterative computational scheme originally
developed in Statistical Physics. We will show that this strategy is
comparatively more accurate than the alternatives in solving instances of CS
generated from statistically correlated measurement matrices. For computational
strategies based on the Bayesian framework such as variants of Belief
Propagation, this is to be expected, as they implicitly rely on the hypothesis
of statistical independence among the entries of the sensing matrix. Perhaps
surprisingly, the method also uniformly outperforms all the other
state-of-the-art methods in our tests.
Comment: 20 pages, 6 figure