(Quasi)Periodicity Quantification in Video Data, Using Topology
This work introduces a novel framework for quantifying the presence and
strength of recurrent dynamics in video data. Specifically, we provide
continuous measures of periodicity (perfect repetition) and quasiperiodicity
(superposition of periodic modes with non-commensurate periods), in a way that
does not require segmentation, training, object tracking, or 1-dimensional
surrogate signals; our methodology operates directly on the video data. The
approach combines ideas from nonlinear time series analysis (delay embeddings)
and computational topology (persistent homology), by translating the problem of
finding recurrent dynamics in video data, into the problem of determining the
circularity or toroidality of an associated geometric space. Through extensive
testing, we show that our scores are robust to several noise models and noise
levels, that our periodicity score outperforms other methods when compared
against human-generated periodicity rankings, and that our quasiperiodicity
score clearly indicates the presence of biphonation in videos of vibrating
vocal folds, a quantitative end-to-end capability not demonstrated before.
Comment: 27 pages, 1 column, 23 figures, SIAM Journal on Imaging Sciences, 201
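For intuition, the pipeline can be sketched in a few lines: build a sliding-window (delay) embedding of the raw frames and measure how circular the resulting point cloud is via persistent homology. This is a hedged sketch, not the paper's implementation; the `ripser` package, the window parameters, and the unnormalized max-persistence score are all assumptions:

```python
import numpy as np
from ripser import ripser  # persistent homology library; assumed available

def periodicity_score(frames, dim=20, tau=1):
    """Hedged sketch: sliding-window (delay) embedding of raw video frames,
    scored by the maximum 1-dimensional persistence. The paper's exact
    normalization and its quasiperiodicity (torus/H2) score are omitted."""
    X = np.stack([np.asarray(f, dtype=float).ravel() for f in frames])
    X -= X.mean(axis=0)  # remove the static background component
    n = len(X) - (dim - 1) * tau
    emb = np.stack([X[i:i + dim * tau:tau].ravel() for i in range(n)])
    dgm_h1 = ripser(emb, maxdim=1)['dgms'][1]  # H1 persistence diagram
    if len(dgm_h1) == 0:
        return 0.0
    return float(np.max(dgm_h1[:, 1] - dgm_h1[:, 0]))  # large = more periodic
```

A perfectly periodic video traces out a topological circle in the embedding space, which appears as a single long-lived H1 class; quasiperiodicity would instead require detecting a torus, i.e., additional H1 and H2 classes.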
Row Sampling by Lewis Weights
We give a simple algorithm to efficiently sample the rows of a matrix while
preserving the p-norms of its product with vectors. Given an n-by-d matrix
A, we find, with high probability and in input-sparsity time, a matrix A'
consisting of about d log d rescaled rows of A such that ||A'x||_1 is close to
||Ax||_1 for all vectors x. We also show similar results for all l_p norms that
give nearly optimal sample bounds in input-sparsity time. Our results are based
on sampling by "Lewis weights", which can be viewed as statistical leverage
scores of a reweighted matrix. We also give an elementary proof of the
guarantees of this sampling process for l_1.
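For concreteness, the l_1 Lewis weights admit a simple fixed-point computation: iterate w_i <- sqrt(a_i^T (A^T W^{-1} A)^{-1} a_i). The following is a dense-algebra sketch of that iteration plus the sampling step, not the input-sparsity-time algorithm; the iteration count and sample size are heuristic assumptions:

```python
import numpy as np

def lewis_weights_l1(A, n_iter=30):
    """Fixed-point iteration for l1 Lewis weights (dense sketch). At the
    fixed point, w_i is the leverage score of row i of W^{-1/2} A, i.e.
    w_i^2 = a_i^T (A^T W^{-1} A)^{-1} a_i."""
    n, d = A.shape
    w = np.ones(n)
    for _ in range(n_iter):
        M = A.T @ (A / w[:, None])                 # A^T W^{-1} A
        Minv = np.linalg.inv(M)
        q = np.einsum('ij,jk,ik->i', A, Minv, A)   # a_i^T Minv a_i
        w = np.sqrt(q)
    return w

# Sample rows with probability proportional to w and rescale so that
# E[||A'x||_1] = ||Ax||_1 for every x.
rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 10))
w = lewis_weights_l1(A)
p = w / w.sum()
m = 200                                            # heuristic, about d log d
idx = rng.choice(len(A), size=m, p=p)
A_sampled = A[idx] / (m * p[idx])[:, None]         # rescaled rows
```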
Coresets and Sketches
Geometric data summarization has become an essential tool both in geometric
approximation algorithms and in big data problems with a geometric component.
In linear or near-linear time large data sets can be compressed into a summary,
and then more intricate algorithms can be run on the summaries whose results
approximate those of the full data set. Coresets and sketches are the two most
important classes of these summaries. We survey five types of coresets and
sketches: shape-fitting, density estimation, high-dimensional vectors,
high-dimensional point sets / matrices, and clustering.
Comment: Near-final version of Chapter 49 in Handbook on Discrete and
Computational Geometry, 3rd edition
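As one concrete example from the "high-dimensional point sets / matrices" category, here is a hedged sketch of the Frequent Directions matrix sketch, a well-known member of this family; the simple shrink-by-smallest-singular-value variant and its parameters are illustrative choices:

```python
import numpy as np

def frequent_directions(A, ell):
    """Streaming sketch B (ell x d, assuming ell <= d) such that B^T B
    approximates A^T A. Simple variant: after inserting each row, shrink
    all singular values by the smallest one, freeing a zero row."""
    n, d = A.shape
    B = np.zeros((ell, d))
    for a in A:                     # one pass over the rows (the stream)
        B[-1] = a                   # the last row is kept zero between steps
        _, s, Vt = np.linalg.svd(B, full_matrices=False)
        s = np.sqrt(np.maximum(s ** 2 - s[-1] ** 2, 0.0))
        B = s[:, None] * Vt         # smallest direction zeroed -> free row
    return B
```

The small matrix B then stands in for A in downstream computations such as approximate PCA, with approximation error shrinking as ell grows.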
Learning Latent Variable Gaussian Graphical Models
Gaussian graphical models (GGM) have been widely used in many
high-dimensional applications ranging from biological and financial data to
recommender systems. Sparsity in GGMs plays a central role both statistically
and computationally. Unfortunately, real-world data are often not well modeled
by sparse graphical models. In this paper, we focus on a family of latent variable
Gaussian graphical models (LVGGM), where the model is conditionally sparse
given latent variables, but marginally non-sparse. In LVGGM, the inverse
covariance matrix has a low-rank plus sparse structure, and can be learned in a
regularized maximum likelihood framework. We derive novel parameter estimation
error bounds for LVGGM under mild conditions in the high-dimensional setting.
These results complement the existing theory on structural learning and open
up new possibilities for using LVGGM in statistical inference.
Comment: To appear in the 31st International Conference on Machine Learning
(ICML 2014)
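The low-rank-plus-sparse estimator described above can be written as a convex program. Below is a hedged cvxpy sketch of that regularized maximum likelihood problem; the penalty weights are placeholders, and practical implementations typically use specialized first-order solvers rather than a generic one:

```python
import cvxpy as cp
import numpy as np

def fit_lvggm(Sigma_hat, lam=0.1, mu=1.0):
    """Regularized MLE with a low-rank-plus-sparse precision matrix:
    Theta = S - L, with S sparse (l1 penalty) capturing conditional
    structure and L PSD low rank (trace penalty) capturing latent effects."""
    d = Sigma_hat.shape[0]
    S = cp.Variable((d, d), symmetric=True)  # sparse conditional structure
    L = cp.Variable((d, d), PSD=True)        # effect of latent variables
    Theta = S - L                            # marginal precision matrix
    obj = cp.trace(Theta @ Sigma_hat) - cp.log_det(Theta) \
        + lam * cp.sum(cp.abs(S)) + mu * cp.trace(L)
    cp.Problem(cp.Minimize(obj), [Theta >> 0]).solve()
    return S.value, L.value
```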
CLEAR: A Consistent Lifting, Embedding, and Alignment Rectification Algorithm for Multi-View Data Association
Many robotics applications require alignment and fusion of observations
obtained at multiple views to form a global model of the environment. Multi-way
data association methods provide a mechanism to improve alignment accuracy of
pairwise associations and ensure their consistency. However, existing methods
that solve this computationally challenging problem are often too slow for
real-time applications. Furthermore, some of the existing techniques can
violate the cycle consistency principle, thus drastically reducing the fusion
accuracy. This work presents the CLEAR (Consistent Lifting, Embedding, and
Alignment Rectification) algorithm to address these issues. By leveraging
insights from the multi-way matching and spectral graph clustering literature,
CLEAR provides cycle consistent and accurate solutions in a computationally
efficient manner. Numerical experiments on both synthetic and real datasets are
carried out to demonstrate the scalability and superior performance of our
algorithm in real-world problems. This algorithmic framework can significantly
improve the accuracy and efficiency of solutions to existing discrete
assignment problems, which traditionally use pairwise (but potentially
inconsistent) correspondences. An implementation of CLEAR is publicly
available online.
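To make the spectral flavor concrete, here is a toy sketch in the spirit of CLEAR rather than a faithful implementation: embed the aggregate association matrix via the normalized Laplacian, then enforce cycle consistency by giving each view a one-to-one labeling with the Hungarian method. The actual algorithm's pivot selection and universe-size estimation are omitted; the universe size m is assumed known here:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def multiview_labels(P, views, m, seed=0):
    """Toy spectral multi-way association (not the exact CLEAR algorithm).
    P: (n x n) aggregate pairwise-association matrix; views: list of index
    arrays, one per view, each with at most m observations; m: assumed
    universe size (CLEAR estimates this from the spectrum)."""
    d = P.sum(axis=1)
    dis = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    L = np.eye(P.shape[0]) - dis[:, None] * P * dis[None, :]  # norm. Laplacian
    _, vecs = np.linalg.eigh(L)
    U = vecs[:, :m]                                   # m smallest eigenvectors
    U /= np.linalg.norm(U, axis=1, keepdims=True) + 1e-12
    rng = np.random.default_rng(seed)
    centers = U[rng.choice(len(U), m, replace=False)] # crude pivot choice
    labels = np.full(P.shape[0], -1, dtype=int)
    for idx in views:
        idx = np.asarray(idx)
        r, c = linear_sum_assignment(-U[idx] @ centers.T)  # maximize similarity
        labels[idx[r]] = c       # one-to-one per view => cycle consistent
    return labels
```

Because each view receives an injective labeling into the same universe of m objects, any cycle of induced pairwise associations composes consistently by construction.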
Hinge-Loss Markov Random Fields and Probabilistic Soft Logic
A fundamental challenge in developing high-impact machine learning
technologies is balancing the need to model rich, structured domains with the
ability to scale to big data. Many important problem areas are both richly
structured and large scale, from social and biological networks, to knowledge
graphs and the Web, to images, video, and natural language. In this paper, we
introduce two new formalisms for modeling structured data, and show that they
can both capture rich structure and scale to big data. The first, hinge-loss
Markov random fields (HL-MRFs), is a new kind of probabilistic graphical model
that generalizes different approaches to convex inference. We unite three
approaches from the randomized algorithms, probabilistic graphical models, and
fuzzy logic communities, showing that all three lead to the same inference
objective. We then define HL-MRFs by generalizing this unified objective. The
second new formalism, probabilistic soft logic (PSL), is a probabilistic
programming language that makes HL-MRFs easy to define using a syntax based on
first-order logic. We introduce an algorithm for inferring most-probable
variable assignments (MAP inference) that is much more scalable than
general-purpose convex optimization methods, because it uses message passing to
take advantage of sparse dependency structures. We then show how to learn the
parameters of HL-MRFs. The learned HL-MRFs are as accurate as analogous
discrete models, but much more scalable. Together, these algorithms enable
HL-MRFs and PSL to model rich, structured data at scales not previously
possible.
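For intuition, a hinge-loss potential is a (possibly squared) hinge of a linear function of continuous truth values in [0,1], so MAP inference is convex minimization. Below is a minimal toy example of one rule under the Lukasiewicz relaxation; the rule, weights, and evidence values are invented for illustration, and PSL itself compiles many such potentials and optimizes them with ADMM-style message passing rather than a generic solver:

```python
import numpy as np
from scipy.optimize import minimize

# Toy HL-MRF: the Lukasiewicz relaxation of the rule
#   Friends(anna, bob) & Smokes(anna) -> Smokes(bob)
# yields the hinge-loss potential max(0, friends + smokes_anna - 1 - y),
# where y in [0, 1] is the soft truth value of Smokes(bob).
friends, smokes_anna = 0.9, 1.0    # observed soft evidence (made up)
w_rule, w_prior = 2.0, 0.5         # illustrative rule weights

def energy(y):
    rule = max(0.0, friends + smokes_anna - 1.0 - y[0]) ** 2  # squared hinge
    return w_rule * rule + w_prior * y[0]   # weak prior against smoking

res = minimize(energy, x0=[0.5], bounds=[(0.0, 1.0)])
print(res.x)                       # MAP soft truth value of Smokes(bob)
```

Since the objective is convex in y, the minimizer balances the rule's pull toward y = 0.9 against the prior, landing strictly between the two.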
Perturbations of Christoffel-Darboux kernels. I: detection of outliers
Two central objects in constructive approximation, the Christoffel-Darboux
kernel and the Christoffel function, encode ample information about the
associated moment data and, ultimately, about the possible generating measures.
We develop a multivariate theory of the Christoffel-Darboux kernel in C^d, with
emphasis on the perturbation of Christoffel functions and their level sets with
respect to perturbations of small norm or low rank. The statistical notion of
leverage score provides a quantitative criterion for the detection of outliers
in large data. Using the refined theory of Bergman orthogonal polynomials, we
illustrate the main results, including some numerical simulations, in the case
of finite atomic perturbations of the area measure of a 2D region. Methods
from function theory of a complex variable and (pluri)potential theory are
widely used in the derivation of our perturbation formulas.
Comment: second version, 53 pages
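The outlier-detection criterion can be illustrated numerically: the diagonal of the Christoffel-Darboux kernel, v(x)^T M^{-1} v(x) for the empirical moment matrix M, equals n times the statistical leverage score of the point, and points where the Christoffel function is small are flagged as outliers. A hedged 2-D sketch, where the monomial feature map and degree are arbitrary choices:

```python
import numpy as np

def christoffel_scores(X, degree=3):
    """Empirical Christoffel function on 2-D data; small values flag outliers."""
    def monomials(P):  # all monomials x^i y^j with i + j <= degree
        cols = [P[:, 0] ** i * P[:, 1] ** j
                for i in range(degree + 1) for j in range(degree + 1 - i)]
        return np.stack(cols, axis=1)
    V = monomials(X)
    M = V.T @ V / len(X)                        # empirical moment matrix
    Minv = np.linalg.pinv(M)
    K = np.einsum('ij,jk,ik->i', V, Minv, V)    # CD kernel diagonal K_n(x, x)
    return 1.0 / K    # Christoffel function; K_i = n * (leverage score of x_i)
```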
Conjecturing-Based Computational Discovery of Patterns in Data
Modern machine learning methods are designed to exploit complex patterns in
data regardless of their form, while not necessarily revealing them to the
investigator. Here we demonstrate situations where modern machine learning
methods are ill-equipped to reveal feature interaction effects and other
nonlinear relationships. We propose the use of a conjecturing machine that
generates feature relationships, in the form of bounds for numerical features
and Boolean expressions for nominal features, of a kind that machine
learning algorithms overlook. The proposed framework is demonstrated for a
classification problem with an interaction effect and a nonlinear regression
problem. In both settings, true underlying relationships are revealed and
generalization performance improves. The framework is then applied to
patient-level data regarding COVID-19 outcomes to suggest possible risk
factors.
Comment: 25 pages, 6 figures
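The idea of a conjecturing machine can be illustrated with a toy generate-and-filter loop: propose simple symbolic bounds between features from a small grammar and keep those that hold on every observation. This is only a cartoon of the Fajtlowicz-style conjecturing the paper builds on; the grammar and acceptance rule are invented for illustration:

```python
import itertools
import numpy as np

def conjecture_bounds(X, names):
    """Cartoon conjecturing machine: propose bounds feature_i <= f(feature_j)
    from a tiny grammar and keep those that hold on every observation."""
    grammar = {'{v} + 1': lambda v: v + 1,
               '2*{v}':   lambda v: 2 * v,
               '{v}**2':  lambda v: v ** 2}
    kept = []
    for i, j in itertools.permutations(range(X.shape[1]), 2):
        for tmpl, f in grammar.items():
            if np.all(X[:, i] <= f(X[:, j])):   # bound never violated
                kept.append(f"{names[i]} <= {tmpl.format(v=names[j])}")
    return kept

# e.g. conjecture_bounds(np.array([[1., 2.], [3., 4.]]), ['a', 'b'])
```

Surviving conjectures can then be handed to a downstream learner as candidate feature interactions, which is the role they play in the paper's experiments.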
How to Train Your Energy-Based Models
Energy-Based Models (EBMs), also known as non-normalized probabilistic
models, specify probability density or mass functions up to an unknown
normalizing constant. Unlike most other probabilistic models, EBMs place no
restriction on the tractability of the normalizing constant; they are thus more
flexible to parameterize and can model a more expressive family of probability
distributions. However, the unknown normalizing constant of EBMs makes training
particularly difficult. Our goal is to provide a friendly introduction to
modern approaches for EBM training. We start by explaining maximum likelihood
training with Markov chain Monte Carlo (MCMC), and proceed to elaborate on
MCMC-free approaches, including Score Matching (SM) and Noise Contrastive
Estimation (NCE). We highlight theoretical connections among these three
approaches, and end with a brief survey on alternative training methods, which
are still under active research. Our tutorial is targeted at an audience with
a basic understanding of generative models who wants to apply EBMs or start a
research project in this direction.
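As a taste of the MCMC-free approaches mentioned above, here is a hedged PyTorch sketch of denoising score matching: the EBM's score (the negative gradient of its energy) is regressed onto the score of a Gaussian noising kernel. The architecture, noise scale, and single training step are illustrative only:

```python
import torch
import torch.nn as nn

# Minimal denoising score matching, one MCMC-free way to train an EBM.
energy = nn.Sequential(nn.Linear(2, 64), nn.SiLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(energy.parameters(), lr=1e-3)
sigma = 0.1                               # noise scale (made up)

def dsm_loss(x):
    noise = torch.randn_like(x) * sigma
    x_noisy = (x + noise).requires_grad_(True)
    e = energy(x_noisy).sum()
    score = -torch.autograd.grad(e, x_noisy, create_graph=True)[0]
    target = -noise / sigma ** 2          # grad log N(x_noisy | x, sigma^2 I)
    return ((score - target) ** 2).sum(dim=-1).mean()

x = torch.randn(256, 2)                   # stand-in for real training data
loss = dsm_loss(x)
opt.zero_grad(); loss.backward(); opt.step()
```

Matching the model score to the noising kernel's score avoids the normalizing constant entirely, which is exactly why SM-style objectives sidestep the training difficulty highlighted in the abstract.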
…