Training Gaussian Mixture Models at Scale via Coresets
How can we train a statistical mixture model on a massive data set? In this
work we show how to construct coresets for mixtures of Gaussians. A coreset is
a weighted subset of the data, which guarantees that models fitting the coreset
also provide a good fit for the original data set. We show that, perhaps
surprisingly, Gaussian mixtures admit coresets of size polynomial in dimension
and the number of mixture components, while being independent of the data set
size. Hence, one can harness computationally intensive algorithms to compute a
good approximation on a significantly smaller data set. More importantly, such
coresets can be efficiently constructed both in distributed and streaming
settings and do not impose restrictions on the data generating process. Our
results rely on a novel reduction of statistical estimation to problems in
computational geometry and new combinatorial complexity results for mixtures of
Gaussians. Empirical evaluation on several real-world datasets suggests that
our coreset-based approach enables a significant reduction in training time with
negligible approximation error.
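The core idea can be illustrated with a toy sketch (an illustration only, not the paper's actual sensitivity-based construction): sample points with a probability that mixes a uniform term and a distance-based term, then reweight each sampled point so that weighted statistics on the small coreset approximate statistics of the full data set.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for a massive data set: two Gaussian clusters in 2-D.
X = np.vstack([
    rng.normal(-5.0, 1.0, size=(5000, 2)),
    rng.normal(+5.0, 1.0, size=(5000, 2)),
])
n = len(X)

# Crude sensitivity proxy: mix a uniform term with distance from the overall
# mean, so outlying points (which influence a fit most) are sampled more often.
d = np.linalg.norm(X - X.mean(axis=0), axis=1)
p = 0.5 / n + 0.5 * d / d.sum()

# Draw the coreset; weight 1 / (m * p_i) makes weighted sums unbiased.
m = 500
idx = rng.choice(n, size=m, replace=True, p=p)
C, w = X[idx], 1.0 / (m * p[idx])

# Weighted coreset statistics should approximate full-data statistics,
# even though the coreset is 20x smaller than the data.
full_mean = X.mean(axis=0)
core_mean = (w[:, None] * C).sum(axis=0) / w.sum()
```

Any fitting procedure that accepts per-point weights (e.g. weighted EM) can then be run on `(C, w)` instead of `X`.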
Real-time data exploitation supported by model- and event-driven architecture to enhance situation awareness, application to crisis management
An effective crisis response requires up-to-date information, so the crisis cell must reach out to new, external data sources. However, new data bring new issues: their volume, veracity, variety, and velocity cannot be managed by humans alone, especially under high stress and time pressure. This paper proposes (i) a framework to enhance situation awareness while managing the 5 Vs of Big Data, (ii) general principles to be followed, and (iii) a new architecture implementing the proposed framework. The latter merges event-driven and model-driven architectures. It has been tested on a realistic flood scenario set up by official French services.
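The merged event-driven and model-driven idea can be sketched minimally (all names here are hypothetical, not taken from the paper's architecture): incoming events are dispatched through a bus, and handlers keep a shared situation model current without a human in the loop.

```python
from collections import defaultdict

class EventBus:
    """Minimal publish/subscribe bus (hypothetical sketch, not the paper's system)."""

    def __init__(self):
        self.handlers = defaultdict(list)

    def subscribe(self, topic, handler):
        self.handlers[topic].append(handler)

    def publish(self, topic, event):
        for handler in self.handlers[topic]:
            handler(event)

# Model-driven element: handlers update a shared situation model,
# which the crisis cell reads instead of raw data streams.
situation = {"water_level_cm": 0}

def on_sensor_reading(event):
    situation["water_level_cm"] = max(situation["water_level_cm"],
                                      event["level_cm"])

bus = EventBus()
bus.subscribe("flood/sensor", on_sensor_reading)
bus.publish("flood/sensor", {"level_cm": 120})
```

New sources can then be integrated by subscribing additional handlers, without changing how the situation model is consumed.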
Intelligent Reference Curation for Visual Place Recognition via Bayesian Selective Fusion
A key challenge in visual place recognition (VPR) is recognizing places
despite drastic visual appearance changes due to factors such as time of day,
season, weather or lighting conditions. Numerous approaches based on
deep-learnt image descriptors, sequence matching, domain translation, and
probabilistic localization have had success in addressing this challenge, but
most rely on the availability of carefully curated representative reference
images of the possible places. In this paper, we propose a novel approach,
dubbed Bayesian Selective Fusion, for actively selecting and fusing informative
reference images to determine the best place match for a given query image. The
selective element of our approach avoids the counterproductive fusion of every
reference image and enables the dynamic selection of informative reference
images in environments with changing visual conditions (such as indoors with
flickering lights, outdoors during sunshowers or over the day-night cycle). The
probabilistic element of our approach provides a means of fusing multiple
reference images that accounts for their varying uncertainty via a novel
training-free likelihood function for VPR. On difficult query images from two
benchmark datasets, we demonstrate that our approach matches and exceeds the
performance of several alternative fusion approaches along with
state-of-the-art techniques that are provided with prior (unfair) knowledge of
the best reference images. Our approach is well suited for long-term robot
autonomy where dynamic visual environments are commonplace since it is
training-free, descriptor-agnostic, and complements existing techniques such as
sequence matching.
Comment: 8 pages, 10 figures; accepted in IEEE Robotics and Automation Letters.
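The two elements of the approach can be sketched on toy numbers (the scores and thresholds below are assumptions for illustration; real scores would come from an image descriptor, and this is not the paper's exact likelihood function): first select references whose score distributions are peaked, then fuse the selected ones probabilistically in log space.

```python
import numpy as np

# Hypothetical similarity scores S[r, q]: rows are reference images,
# columns are candidate places for one query image.
S = np.array([
    [0.2, 0.1, 0.9, 0.3],  # informative reference, peaks at place 2
    [0.1, 0.2, 0.8, 0.2],  # informative reference, peaks at place 2
    [0.5, 0.5, 0.5, 0.5],  # uninformative reference, flat scores
    [0.9, 0.2, 0.7, 0.1],  # peaked but misleading reference
])

# Selective step: keep references with peaked score distributions
# (high contrast), instead of fusing every reference indiscriminately.
contrast = S.max(axis=1) - S.mean(axis=1)
selected = contrast > 0.3

# Probabilistic step: treat softmax-normalized scores as per-reference
# likelihoods and fuse the selected ones in log space (naive-Bayes style).
logp = S[selected] - np.log(np.exp(S[selected]).sum(axis=1, keepdims=True))
posterior = logp.sum(axis=0)
best = int(posterior.argmax())  # place 2 wins despite one misleading reference
```

The flat reference contributes nothing here because selection drops it, while fusion lets the two informative references outvote the single misleading one.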
Coresets for Visual Summarization with Applications to Loop Closure
In continuously operating robotic systems, efficient representation of the previously seen camera feed is crucial. Using a highly efficient compression coreset method, we formulate a new method for hierarchical retrieval of frames from large video streams collected online by a moving robot. We demonstrate how to utilize the resulting structure for efficient loop closure via a novel sampling approach that is adaptive to the structure of the video. The same structure also allows us to create a highly effective search tool for large-scale videos, which we demonstrate in this paper. We show the efficiency of the proposed approaches for retrieval and loop closure on standard datasets and on a large-scale video from a mobile camera.
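The idea of sampling that adapts to the structure of the video can be sketched with a deliberately simple keyframe selector (an illustration only, not the paper's coreset construction): static stretches of video collapse to few frames, while fast-changing segments keep more.

```python
import numpy as np

def select_keyframes(frames, threshold=1.0):
    """Keep a frame only when its descriptor differs enough from the
    last kept frame; a crude stand-in for coreset-style compression."""
    kept = [0]
    for i in range(1, len(frames)):
        if np.linalg.norm(frames[i] - frames[kept[-1]]) > threshold:
            kept.append(i)
    return kept

# Simulated frame descriptors: a long static scene, then rapid change.
static = np.zeros((50, 8))
moving = np.cumsum(np.ones((10, 8)), axis=0)  # descriptor drifts each frame
frames = np.vstack([static, moving])

keys = select_keyframes(frames, threshold=1.0)
```

Here 50 static frames compress to a single keyframe, while every frame of the moving segment is retained; candidate loop closures would then be checked against the retained frames only.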
Coresets for Time Series Clustering
We study the problem of constructing coresets for clustering problems with time series data. This problem has gained importance across many fields including biology, medicine, and economics due to the proliferation of sensors for real-time measurement and the rapid drop in storage costs. In particular, we consider the setting where the time series data on N entities is generated from a Gaussian mixture model with autocorrelations over k clusters in R^d. Our main contribution is an algorithm to construct coresets for the maximum likelihood objective for this mixture model. Our algorithm is efficient, and, under a mild assumption on the covariance matrices of the Gaussians, the size of the coreset is independent of the number of entities N and the number of observations for each entity, and depends only polynomially on k, d, and 1/ε, where ε is the error parameter. We empirically assess the performance of our coresets with synthetic data.
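What "independent of N" buys can be illustrated with a simplified sketch (an assumption-laden toy, not the paper's algorithm: uniform sampling over entities, an isotropic mixture, and no autocorrelation structure): a fixed-size weighted subset of entities approximates the full maximum-likelihood objective.

```python
import numpy as np

rng = np.random.default_rng(0)

# N entities, each a length-T series drawn around one of k = 2 cluster means.
N, T, k = 2000, 20, 2
labels = rng.integers(0, k, size=N)
centers = np.array([-3.0, 3.0])
X = centers[labels][:, None] + rng.normal(0.0, 1.0, size=(N, T))

def entity_loglik(x, mu=centers, sigma=1.0, pi=(0.5, 0.5)):
    """Log-likelihood of one series under a 2-component isotropic mixture."""
    comp = np.array([
        np.log(p) - 0.5 * np.sum((x - m) ** 2) / sigma**2
        - 0.5 * len(x) * np.log(2.0 * np.pi * sigma**2)
        for p, m in zip(pi, mu)
    ])
    top = comp.max()
    return top + np.log(np.exp(comp - top).sum())  # stable log-sum-exp

full = sum(entity_loglik(x) for x in X)

# Uniform coreset over entities: size m chosen independently of N;
# each sampled entity carries weight N / m in the objective.
m = 200
idx = rng.choice(N, size=m, replace=False)
core = (N / m) * sum(entity_loglik(X[i]) for i in idx)

rel_err = abs(core - full) / abs(full)
```

The weighted coreset objective tracks the full objective closely while touching only a tenth of the entities; the paper's contribution is a sampling scheme with such guarantees that also handles autocorrelations and general covariances.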