18 research outputs found
Sketching for Large-Scale Learning of Mixture Models
Learning parameters from voluminous data can be prohibitive in terms of
memory and computational requirements. We propose a "compressive learning"
framework where we estimate model parameters from a sketch of the training
data. This sketch is a collection of generalized moments of the underlying
probability distribution of the data. It can be computed in a single pass over
the training set, and is easily computed on streams or distributed datasets.
The proposed framework shares similarities with compressive sensing, which aims
at drastically reducing the dimension of high-dimensional signals while
preserving the ability to reconstruct them. To perform the estimation task, we
derive an iterative algorithm analogous to sparse reconstruction algorithms in
the context of linear inverse problems. We exemplify our framework with the
compressive estimation of a Gaussian Mixture Model (GMM), providing heuristics
on the choice of the sketching procedure and theoretical guarantees of
reconstruction. We experimentally show on synthetic data that the proposed
algorithm yields results comparable to the classical Expectation-Maximization
(EM) technique while requiring significantly less memory and fewer computations
when the number of database elements is large. We further demonstrate the
potential of the approach on real large-scale data (over 10^8 training samples)
for the task of model-based speaker verification. Finally, we draw some
connections between the proposed framework and approximate Hilbert space
embedding of probability distributions using random features. We show that the
proposed sketching operator can be seen as an innovative method to design
translation-invariant kernels adapted to the analysis of GMMs. We also use this
theoretical framework to derive information preservation guarantees, in the
spirit of infinite-dimensional compressive sensing.
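To make the sketching operation concrete: one canonical choice of generalized moments is samples of the empirical characteristic function, i.e. averages of complex exponentials of random projections of the data (see also the sketched clustering abstract below). The following minimal NumPy illustration assumes Gaussian random frequencies for simplicity (the paper provides heuristics for this choice); all names are illustrative:

    import numpy as np

    def compute_sketch(X, Omega):
        # Empirical characteristic function of the data X (n, d),
        # sampled at m random frequencies Omega (d, m).
        # One pass over the data; trivially streamable or distributable,
        # since the average can be updated sample by sample.
        return np.exp(1j * X @ Omega).mean(axis=0)

    rng = np.random.default_rng(0)
    X = rng.normal(size=(10_000, 2))       # toy training set
    Omega = rng.normal(size=(2, 50))       # m = 50 random frequencies (assumed Gaussian)
    z = compute_sketch(X, Omega)           # fixed-size summary, independent of n

Model parameters (e.g. GMM means and covariances) are then fitted to z by the iterative sparse-reconstruction-like algorithm described above, rather than to the raw data.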
Projected gradient descent for non-convex sparse spike estimation
We propose a new algorithm for sparse spike estimation from Fourier
measurements. Based on theoretical results on non-convex optimization
techniques for off-the-grid sparse spike estimation, we present a projected
gradient descent algorithm coupled with a spectral initialization procedure.
Our algorithm makes it possible to estimate the positions of large numbers of
Diracs in 2D from random Fourier measurements. Alongside the algorithm, we
present qualitative theoretical insights explaining its success. This opens a
new direction for practical off-the-grid spike estimation with theoretical
guarantees in imaging applications.
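For intuition, a bare-bones 1D analogue of such a projected gradient scheme can be written in a few lines. This is a hedged sketch, not the paper's 2D algorithm: the spectral initialization is replaced by a small perturbation of the ground truth, the step sizes are illustrative, and the projection merely keeps amplitudes non-negative:

    import numpy as np

    rng = np.random.default_rng(1)
    K, J = 3, 64                                   # number of spikes, Fourier measurements
    t_true = np.sort(rng.uniform(0.0, 1.0, K))     # spike positions in [0, 1)
    a_true = rng.uniform(0.5, 1.5, K)              # positive amplitudes
    f = rng.uniform(-20.0, 20.0, J)                # random measurement frequencies

    def forward(t, a):
        # Fourier measurements of the spike train sum_k a_k * delta(x - t_k)
        return np.exp(-2j * np.pi * np.outer(f, t)) @ a

    y = forward(t_true, a_true)                    # noiseless measurements

    # Stand-in for the paper's spectral initialization: a small perturbation
    t = t_true + 0.005 * rng.normal(size=K)
    a = np.ones(K)
    lr_t, lr_a = 1e-6, 1e-3                        # illustrative step sizes
    for _ in range(5_000):
        E = np.exp(-2j * np.pi * np.outer(f, t))   # (J, K) Fourier atoms
        r = E @ a - y                              # residual
        grad_a = 2.0 * np.real(E.conj().T @ r)
        grad_t = 2.0 * a * np.real((E * (-2j * np.pi * f[:, None])).conj().T @ r)
        a = np.maximum(a - lr_a * grad_a, 0.0)     # projection: keep amplitudes >= 0
        t -= lr_t * grad_t
    # t and a now approximate t_true and a_true (positions stay continuous:
    # estimation is off the grid)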
Quantized Compressive K-Means
The recent framework of compressive statistical learning aims at designing
tractable learning algorithms that use only a heavily compressed
representation, or sketch, of massive datasets. Compressive K-Means (CKM) is such
a method: it estimates the centroids of data clusters from pooled, non-linear,
random signatures of the learning examples. While this approach significantly
reduces computational time on very large datasets, its digital implementation
wastes acquisition resources because the learning examples are compressed only
after the sensing stage. The present work generalizes the sketching procedure
initially defined in Compressive K-Means to a large class of periodic
nonlinearities including hardware-friendly implementations that compressively
acquire entire datasets. This idea is exemplified in a Quantized Compressive
K-Means procedure, a variant of CKM that leverages 1-bit universal quantization
(i.e. retaining the least significant bit of a standard uniform quantizer) as
the periodic sketch nonlinearity. Trading the usual complex-exponential
nonlinearity for this resource-efficient signature (standard in most
acquisition schemes) has almost no impact on clustering performance, as
illustrated by numerical experiments.
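To make the periodic nonlinearity concrete, a minimal NumPy version of a 1-bit universal-quantization sketch might look as follows (the dither and quantization step are illustrative choices, not the paper's calibrated ones):

    import numpy as np

    def universal_1bit(u, delta=1.0):
        # Least significant bit of a uniform quantizer with step delta,
        # mapped to +/-1: a square wave of period 2*delta in u.
        return 2.0 * (np.floor(u / delta) % 2) - 1.0

    def quantized_sketch(X, Omega, dither, delta=1.0):
        # Pooled 1-bit signatures: the periodic nonlinearity replaces the
        # complex exponential used in the original CKM sketch.
        return universal_1bit(X @ Omega + dither, delta).mean(axis=0)

    rng = np.random.default_rng(0)
    X = rng.normal(size=(5_000, 2))            # toy dataset
    Omega = rng.normal(size=(2, 100))          # random projection directions
    dither = rng.uniform(0.0, 2.0, size=100)   # random dither, one per projection
    z = quantized_sketch(X, Omega, dither)     # each example contributes only bits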
A Sketching Framework for Reduced Data Transfer in Photon Counting Lidar
Single-photon lidar has become a prominent tool for depth imaging in recent
years. At the core of the technique, the depth of a target is measured by
constructing a histogram of time delays between emitted light pulses and
detected photon arrivals. A major data processing bottleneck arises on the
device when either the number of photons per pixel is large or the resolution
of the time stamp is fine, as both the space requirement and the complexity of
the image reconstruction algorithms scale with these parameters. We solve this
limiting bottleneck of existing lidar techniques by sampling the characteristic
function of the time of flight (ToF) model to build a compressive statistic, a
so-called sketch of the time delay distribution, which is sufficient to infer
the spatial distance and intensity of the object. The size of the sketch scales
with the degrees of freedom of the ToF model (number of objects) and not,
fundamentally, with the number of photons or the time stamp resolution.
Moreover, the sketch is highly amenable to on-chip online processing. We show
theoretically that the information loss due to compression is controlled, and
that the mean squared error of the inference quickly converges towards the
optimal Cramér-Rao bound (i.e. no loss of information) for modest sketch sizes. The
proposed compressed single-photon lidar framework is tested and evaluated on
real life datasets of complex scenes where it is shown that a compression rate
of up to 150 is achievable in practice without sacrificing the overall
resolution of the reconstructed image.
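As an illustration of the statistic involved, the per-pixel sketch is the empirical characteristic function of the photon time stamps, sampled at a few Fourier frequencies of the repetition period. A minimal NumPy version (the number of frequencies, the noise level, and the naive phase-based depth read-out are illustrative assumptions, not the paper's estimator):

    import numpy as np

    def tof_sketch(timestamps, K, T):
        # Empirical characteristic function of the time-delay distribution,
        # sampled at the first K harmonics of the repetition period T.
        # Updatable photon by photon, so no full histogram is ever stored.
        k = np.arange(1, K + 1)
        return np.exp(2j * np.pi * np.outer(timestamps, k) / T).mean(axis=0)

    rng = np.random.default_rng(0)
    T = 100e-9                                          # repetition period (s)
    t = (40e-9 + 0.5e-9 * rng.normal(size=2_000)) % T   # arrivals from one surface
    z = tof_sketch(t, K=4, T=T)                         # 4 complex values per pixel
    delay = (np.angle(z[0]) % (2 * np.pi)) * T / (2 * np.pi)   # ~ 40 ns

The sketch size here is 4 complex numbers regardless of how many photons arrive or how finely the time stamps are quantized, which is the source of the compression.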
Sketching for nearfield acoustic imaging of heavy-tailed sources
We propose a probabilistic model for acoustic source localization with known but arbitrary geometry of the microphone array. The approach has several features. First, it relies on a simple nearfield acoustic model for wave propagation. Second, it does not require the number of active sources to be known; instead, it produces a heat map representing the energy of a large set of candidate locations, thus imaging the acoustic field. Third, it relies on a heavy-tailed alpha-stable probabilistic model, whose most important feature is to yield an estimation strategy where the multichannel signals need to be processed only once, in a simple online procedure called sketching. This sketching produces a fixed-size representation of the data that is then analyzed for localization. The resulting algorithm has small computational complexity, and in this paper we demonstrate that it compares favorably with the state of the art for localization in realistic simulations of reverberant environments.
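To fix ideas about the imaging output, the toy example below builds a heat map over candidate locations under a nearfield (spherical-wave) propagation model. It deliberately replaces the paper's alpha-stable sketching estimator with plain narrowband matched filtering, so only the array geometry and the heat-map idea carry over; all values are made up:

    import numpy as np

    c, freq = 343.0, 1000.0                        # speed of sound (m/s), analysis freq (Hz)
    mics = np.array([[0.0, 0.0], [0.1, 0.0],
                     [0.0, 0.1], [0.1, 0.1]])      # made-up 4-mic array geometry (m)
    src = np.array([0.5, 0.3])                     # hidden source position (m)

    def steering(p):
        # Nearfield (spherical-wave) steering vector: per-mic delay + 1/r decay
        r = np.linalg.norm(mics - p, axis=1)
        v = np.exp(-2j * np.pi * freq * r / c) / r
        return v / np.linalg.norm(v)

    x = steering(src)                              # one noiseless narrowband snapshot
    gx, gy = np.meshgrid(np.linspace(0, 1, 50), np.linspace(0, 1, 50))
    heat = np.array([abs(np.vdot(steering(p), x)) ** 2
                     for p in np.column_stack([gx.ravel(), gy.ravel()])])
    # heat.reshape(50, 50) peaks near src: an "image" of the acoustic field,
    # with no prior on the number of sources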
Sketched Clustering via Hybrid Approximate Message Passing
In sketched clustering, the dataset is first sketched down to a vector of modest size, from which the cluster centers are subsequently extracted. The goal is to perform clustering more efficiently than with methods that operate on the full training data, such as k-means++. For the sketching methodology recently proposed by Keriven, Gribonval, et al., which can be interpreted as a random sampling of the empirical characteristic function, we propose a cluster recovery algorithm based on simplified hybrid generalized approximate message passing (SHyGAMP). Numerical experiments suggest that our approach is more efficient than state-of-the-art sketched clustering algorithms (in both computational and sample complexity) and more efficient than k-means++ in certain regimes.
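For reference, CKM-style solvers and the proposed SHyGAMP approach can loosely be viewed as fitting a weighted mixture of centroid "atoms" to the same characteristic-function sketch, differing in how they search for the centroids. A hedged NumPy illustration of that shared residual (names, shapes, and the toy data are assumptions):

    import numpy as np

    def sketch_residual(z, Omega, C, alpha):
        # Residual between the dataset sketch z (m,) and the sketch of a
        # mixture of point masses at centroids C (K, d) with weights alpha (K,)
        model = np.exp(1j * C @ Omega).T @ alpha
        return np.linalg.norm(z - model)

    rng = np.random.default_rng(0)
    Omega = rng.normal(size=(2, 50))                        # m = 50 random frequencies
    X = np.vstack([rng.normal(loc=m, scale=0.1, size=(500, 2))
                   for m in (-2.0, 2.0)])                   # two tight clusters
    z = np.exp(1j * X @ Omega).mean(axis=0)                 # one-pass dataset sketch
    true_C = np.array([[-2.0, -2.0], [2.0, 2.0]])
    w = np.array([0.5, 0.5])
    # The true centroids explain the sketch far better than a wrong guess:
    assert sketch_residual(z, Omega, true_C, w) < sketch_residual(z, Omega, np.zeros((2, 2)), w)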