Towards optimal symbolization for time series comparisons
The abundance and value of mining large time series data sets has long been acknowledged. Ubiquitous in fields ranging from astronomy and biology to web science, these datasets continue to grow in size and number, a situation exacerbated by the exponential growth of our digital footprints. The prevalence and potential utility of this data has led to a vast number of time-series data mining techniques, many of which require symbolization of the raw time series as a pre-processing step, for which a number of well-used, pre-existing approaches from the literature are typically employed. In this work we note that these standard approaches are sub-optimal in (at least) the broad application area of time series comparison, leading to unnecessary data corruption and potential performance loss before any real data mining takes place. Addressing this, we present a novel quantizer based upon optimization of comparison fidelity, together with a computationally tractable algorithm for its implementation on big datasets. We demonstrate empirically that our new approach provides a statistically significant reduction in the amount of error introduced by the symbolization process compared to the current state of the art. The approach therefore provides a more accurate input for the vast number of data mining techniques in the literature, offering the potential of increased real-world performance across a wide range of existing data mining algorithms and applications.
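To illustrate the kind of symbolization step this abstract refers to, here is a minimal sketch of a conventional segment-and-quantize scheme (piecewise aggregation followed by breakpoint quantization, in the style of SAX); it is an illustrative baseline, not the paper's optimized quantizer, and the breakpoints are supplied by the caller:

```python
import numpy as np

def symbolize(series, n_segments, breakpoints):
    # Piecewise-aggregate the series into n_segments segment means,
    # then map each mean to a discrete symbol via the given quantizer
    # breakpoints (a hypothetical simple scheme for illustration).
    segments = np.array_split(np.asarray(series, dtype=float), n_segments)
    means = np.array([s.mean() for s in segments])
    return np.digitize(means, breakpoints)

# Example: three segments, two breakpoints -> a 3-symbol alphabet.
symbols = symbolize([0, 0, 1, 1, 4, 4], n_segments=3, breakpoints=[0.5, 2.0])
```

Any error introduced here propagates into every downstream comparison, which is the loss the paper's fidelity-optimized quantizer aims to reduce.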
Quantization using permutation codes with a uniform source
Permutation coding is a block coding/quantization scheme where the codebook is comprised entirely of permutations of a single starting vector. Permutation codes for the uniform source are developed using a simple algorithm. The performance of these codes is compared against scalar codes and permutation codes developed by different methodologies. It is shown that the algorithm produces codes as good as other, more complex methods. Theoretical predictions of code design parameters and code performance are verified by numerical simulations.
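The encoding step shared by all permutation codes can be sketched in a few lines: under squared error, the nearest codeword to an input x is the permutation of the starting vector whose components are ordered the same way as x, so encoding reduces to sorting. A minimal sketch (the starting vector mu is an arbitrary example, not a code from the paper):

```python
import numpy as np

def permutation_encode(x, mu):
    # Nearest codeword under MSE: permute the sorted starting vector mu
    # so its components follow the same ordering as the input x.
    ranks = np.argsort(np.argsort(x))  # rank of each component of x
    return np.sort(mu)[ranks]

# Example: the largest component of x receives the largest entry of mu.
codeword = permutation_encode(np.array([0.9, 0.1, 0.5]),
                              np.array([-1.0, 0.0, 1.0]))
```

The transmitted index is just the ordering, which is why the encoding cost is dominated by a sort rather than a full codebook search.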
Concentric Permutation Source Codes
Permutation codes are a class of structured vector quantizers with a computationally simple encoding procedure based on sorting the scalar components. Using a codebook comprising several permutation codes as subcodes preserves the simplicity of encoding while increasing the number of rate-distortion operating points, improving the convex hull of operating points, and increasing design complexity. We show that when the subcodes are designed with the same composition, optimization of the codebook reduces to a lower-dimensional vector quantizer design within a single cone. Heuristics for reducing design complexity are presented, including an optimization of the rate allocation in a shape-gain vector quantizer with a gain-dependent wrapped spherical shape codebook.
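The multi-subcode idea can be sketched as follows: encode the input against each permutation subcode separately (each encode is just a sort, as above) and keep the subcode whose codeword has the smallest squared error. The starting vectors below are illustrative placeholders, not designed codes:

```python
import numpy as np

def permutation_encode(x, mu):
    # Nearest codeword in a single permutation code: order the sorted
    # starting vector mu the same way as x.
    ranks = np.argsort(np.argsort(x))
    return np.sort(mu)[ranks]

def concentric_encode(x, subcode_mus):
    # Try every permutation subcode and keep the minimum-MSE codeword;
    # encoding stays sort-based, with one pass per subcode.
    return min((permutation_encode(x, mu) for mu in subcode_mus),
               key=lambda c: float(np.sum((x - c) ** 2)))

x = np.array([0.9, 0.1, 0.5])
best = concentric_encode(x, [np.array([-1.0, 0.0, 1.0]),
                             np.array([0.1, 0.5, 0.9])])
```

Each subcode contributes its own rate-distortion operating point, which is how the union improves the convex hull relative to a single permutation code.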
Frame Permutation Quantization
Frame permutation quantization (FPQ) is a new vector quantization technique using finite frames. In FPQ, a vector is encoded by using a permutation source code to quantize its frame expansion; the encoding is thus a partial ordering of the frame expansion coefficients. Compared to ordinary permutation source coding, FPQ produces a greater number of possible quantization rates and a higher maximum rate. Various representations for the partitions induced by FPQ are presented, and reconstruction algorithms based on linear programming, quadratic programming, and recursive orthogonal projection are derived. Implementations of the linear and quadratic programming algorithms for uniform and Gaussian sources show performance improvements over entropy-constrained scalar quantization for certain combinations of vector dimension and coding rate. Monte Carlo evaluation of the recursive algorithm shows that mean-squared error (MSE) decays as 1/M^4 for an M-element frame, which is consistent with previous results on optimal decay of MSE. Reconstruction using the canonical dual frame is also studied, and several results relate properties of the analysis frame to whether linear reconstruction techniques provide consistent reconstructions.
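The FPQ encoder can be sketched in a few lines: expand the N-dimensional input with an M x N analysis frame (M > N), then apply a permutation source code to the M coefficients, so the transmitted index is their ordering. The frame F and starting vector mu below are toy examples, and the decoder shown is only the simple linear (canonical dual frame) reconstruction, not the LP/QP algorithms from the paper:

```python
import numpy as np

def fpq_encode(x, F, mu):
    # Frame-expand x with analysis frame F (M x N, M > N), then encode
    # the coefficient vector with a permutation source code: the index
    # is the ranks (ordering) of the frame coefficients.
    y = F @ x
    ranks = np.argsort(np.argsort(y))
    return ranks, np.sort(mu)[ranks]

def fpq_linear_decode(coeff_estimate, F):
    # Simple linear reconstruction via the canonical dual frame,
    # i.e. the pseudoinverse of the analysis frame.
    return np.linalg.pinv(F) @ coeff_estimate

F = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # toy 3x2 frame
ranks, q = fpq_encode(np.array([1.0, 2.0]), F, np.array([-1.0, 0.0, 1.0]))
```

The redundancy M/N of the frame is what yields extra rate points beyond ordinary permutation coding, and the choice of reconstruction algorithm governs whether the decoded point is consistent with the transmitted ordering.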
Neural Distributed Compressor Discovers Binning
We consider lossy compression of an information source when the decoder has lossless access to a correlated one. This setup, also known as the Wyner-Ziv problem, is a special case of distributed source coding. To this day, practical approaches for the Wyner-Ziv problem have been neither fully developed nor heavily investigated. We propose a data-driven method based on machine learning that leverages the universal function approximation capability of artificial neural networks. We find that our neural-network-based compression scheme, based on variational vector quantization, recovers some principles of the optimal theoretical solution of the Wyner-Ziv setup for exemplary sources, such as binning in the source space as well as optimal combination of the quantization index and side information. These behaviors emerge although no structure exploiting knowledge of the source distributions was imposed. Binning is a widely used tool in information-theoretic proofs and methods, and to our knowledge this is the first time it has been explicitly observed to emerge from data-driven learning.
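To make the binning idea concrete, here is a minimal hand-crafted scalar sketch (not the paper's learned scheme): the encoder sends only its quantizer index modulo a small number of bins, and the decoder resolves the resulting ambiguity using its side information, which is assumed close to the source:

```python
def binned_encode(x, step, n_bins):
    # Scalar-quantize x, then transmit only the index modulo n_bins --
    # the "binning" that the learned compressor rediscovers.
    q = int(round(x / step))
    return q % n_bins

def binned_decode(bin_idx, side_info, step, n_bins):
    # The decoder picks, among indices in the received bin, the one
    # whose reconstruction lies closest to the side information.
    k0 = int(round(side_info / step))
    candidates = [k for k in range(k0 - n_bins, k0 + n_bins + 1)
                  if k % n_bins == bin_idx]
    k = min(candidates, key=lambda k: abs(k * step - side_info))
    return k * step

# Example: x = 1.0 is recovered from its bin plus side info 0.9.
b = binned_encode(1.0, step=0.25, n_bins=4)
x_hat = binned_decode(b, side_info=0.9, step=0.25, n_bins=4)
```

Transmitting log2(n_bins) bits instead of the full quantizer index is the rate saving that binning buys when the side information is strong enough to disambiguate the bin.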