Maximum Inner-Product Search using Tree Data-structures
The problem of efficiently finding the best match for a query in a given set
with respect to the Euclidean distance or the cosine similarity has been
extensively studied in the literature. However, the closely related problem of
efficiently finding the best match with respect to the inner product has, to
the best of our knowledge, never been explored in the general setting. In this
paper we consider this general problem and contrast it with the existing
best-match algorithms. First, we propose a general branch-and-bound algorithm
using a tree data structure. Subsequently, we present a dual-tree algorithm
for the case where there are multiple queries. Finally, we present a new data
structure for increasing the efficiency of the dual-tree algorithm. These
branch-and-bound algorithms involve novel bounds suited for the purpose of
best-matching with inner products. We evaluate our proposed algorithms on a
variety of data sets from various applications, and exhibit up to five orders
of magnitude improvement in query time over the naive search technique.
Comment: Under submission in KDD 201
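The branch-and-bound idea sketched in this abstract can be made concrete. Below
is a minimal, hypothetical Python illustration (not the paper's algorithm or
code): points are indexed in a simple ball tree, and a subtree rooted at a ball
with center c and radius r is pruned whenever the Cauchy-Schwarz bound
q·c + ||q||·r cannot beat the best inner product found so far.

```python
import numpy as np

class Ball:
    """A ball-tree node: center, radius, and (for leaves) the points it holds."""
    def __init__(self, points):
        self.points = points
        self.center = points.mean(axis=0)
        self.radius = np.max(np.linalg.norm(points - self.center, axis=1))
        self.left = self.right = None
        if len(points) > 8:  # split larger nodes along the widest coordinate
            dim = np.argmax(np.ptp(points, axis=0))
            order = np.argsort(points[:, dim])
            half = len(points) // 2
            self.left = Ball(points[order[:half]])
            self.right = Ball(points[order[half:]])

def mips(node, q, best=(-np.inf, None)):
    """Branch-and-bound maximum inner-product search for query q."""
    # Upper bound on <q, p> over any p in this ball (Cauchy-Schwarz).
    bound = q @ node.center + np.linalg.norm(q) * node.radius
    if bound <= best[0]:
        return best  # no point in this subtree can beat the incumbent
    if node.left is None:  # leaf: scan the points exhaustively
        scores = node.points @ q
        i = int(np.argmax(scores))
        if scores[i] > best[0]:
            best = (scores[i], node.points[i])
        return best
    # Visit the child with the larger bound first to tighten pruning.
    kids = sorted((node.left, node.right), reverse=True,
                  key=lambda n: q @ n.center + np.linalg.norm(q) * n.radius)
    for kid in kids:
        best = mips(kid, q, best)
    return best

# Usage: exact argmax_p <q, p> over a random point set.
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 16))
q = rng.normal(size=16)
score, p = mips(Ball(X), q)
assert np.isclose(score, (X @ q).max())
```

Because the bound is exact for a ball, the search never discards the true
best match; pruning only removes subtrees that provably cannot contain it.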
On the Sample Complexity of Predictive Sparse Coding
The goal of predictive sparse coding is to learn a representation of examples
as sparse linear combinations of elements from a dictionary, such that a
learned hypothesis linear in the new representation performs well on a
predictive task. Predictive sparse coding algorithms have recently
demonstrated impressive performance on a variety of supervised tasks, but
their generalization properties have not been studied. We establish the first
generalization error bounds for predictive sparse coding, covering two
settings: 1) the overcomplete setting, where the number of features $k$
exceeds the original dimensionality $d$; and 2) the high- or
infinite-dimensional setting, where only dimension-free bounds are useful.
Both learning bounds intimately depend on stability properties of the learned
sparse encoder, as measured on the training sample. Consequently, we first
present a fundamental stability result for the LASSO, a result characterizing
the stability of the sparse codes with respect to perturbations to the
dictionary. In the overcomplete setting, we present an estimation error bound
that decays as $\tilde{O}(\sqrt{dk/m})$ with respect to $d$ and $k$. In the
high- or infinite-dimensional setting, we show a dimension-free bound that is
$\tilde{O}(\sqrt{k^2 s/m})$ with respect to $k$ and $s$, where $s$ is an upper
bound on the number of non-zeros in the sparse code for any training data
point.
Comment: Sparse Coding Stability Theorem from version 1 has been relaxed
considerably using a new notion of coding margin. Old Sparse Coding Stability
Theorem still in new version, now as Theorem 2. Presentation of all proofs
simplified/improved considerably. Paper reorganized. Empirical analysis
showing new coding margin is non-trivial on real datasets
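The stability property these bounds rest on can be probed numerically. Here is
a minimal sketch using scikit-learn's Lasso as the sparse encoder; the
dictionary D, regularization weight, and perturbation scale are illustrative
choices of ours, not the paper's. It encodes a point against a dictionary and
against a slightly perturbed copy, then compares the two sparse codes.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
d, k, lam = 20, 50, 0.1   # ambient dim, dictionary size (overcomplete: k > d)

# Dictionary with unit-norm atoms, and a data point to encode.
D = rng.normal(size=(d, k))
D /= np.linalg.norm(D, axis=0)
x = rng.normal(size=d)

def encode(D, x):
    """LASSO sparse code: argmin_z (1/(2d)) ||x - D z||^2 + lam ||z||_1."""
    model = Lasso(alpha=lam, fit_intercept=False, max_iter=10_000)
    model.fit(D, x)
    return model.coef_

# Perturb the dictionary slightly and measure how much the code moves.
E = rng.normal(size=D.shape)
D_pert = D + 1e-3 * E / np.linalg.norm(E, axis=0)
z, z_pert = encode(D, x), encode(D_pert, x)
print("non-zeros in z:", np.count_nonzero(z))
print("code shift ||z - z'||:", np.linalg.norm(z - z_pert))
```

A small code shift under a small dictionary perturbation is exactly the kind
of encoder stability, measured on the sample, that the bounds depend on.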
Automatic Derivation of Statistical Algorithms: The EM Family and Beyond
Machine learning has reached a point where many probabilistic methods can be understood as variations, extensions and combinations of a much smaller set of abstract themes, e.g., as different instances of the EM algorithm. This enables the systematic derivation of algorithms customized for different models. Here, we describe the AUTOBAYES system, which takes a high-level statistical model specification, uses powerful symbolic techniques based on schema-based program synthesis and computer algebra to derive an efficient specialized algorithm for learning that model, and generates executable code implementing that algorithm. This capability is far beyond that of code collections such as Matlab toolboxes or even tools for model-independent optimization such as BUGS for Gibbs sampling: complex new algorithms can be generated without new programming, algorithms can be highly specialized and tightly crafted for the exact structure of the model and data, and efficient and commented code can be generated for different languages or systems. We present automatically-derived algorithms ranging from closed-form solutions of Bayesian textbook problems to recently-proposed EM algorithms for clustering, regression, and a multinomial form of PCA.
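As a reminder of the "abstract theme" this abstract appeals to, here is a
hand-written textbook EM for a two-component one-dimensional Gaussian mixture
in Python. It is a generic instance of the EM pattern only, not output of, or
code from, the AUTOBAYES system.

```python
import numpy as np

def em_gmm_1d(x, n_iter=50):
    """Textbook EM for a two-component 1-D Gaussian mixture."""
    # Crude initialization from the data.
    w = np.array([0.5, 0.5])             # mixing weights
    mu = np.array([x.min(), x.max()])    # component means
    var = np.array([x.var(), x.var()])   # component variances
    for _ in range(n_iter):
        # E-step: posterior responsibility of each component for each point.
        dens = (w / np.sqrt(2 * np.pi * var) *
                np.exp(-(x[:, None] - mu) ** 2 / (2 * var)))
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters from weighted sufficient statistics.
        n_k = resp.sum(axis=0)
        w = n_k / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / n_k
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / n_k
    return w, mu, var

# Usage: recover two well-separated clusters.
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2, 0.5, 500), rng.normal(3, 1.0, 500)])
print(em_gmm_1d(x))
```

Systems like the one described would derive the E- and M-step updates above
symbolically from a model specification rather than by hand.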
Multibody Multipole Methods
A three-body potential function can account for interactions among triples of
particles which are uncaptured by pairwise interaction functions such as
Coulombic or Lennard-Jones potentials. Likewise, a multibody potential of
order $n$ can account for interactions among $n$-tuples of particles
uncaptured by interaction functions of lower orders. To date, the computation
of multibody potential functions for a large number of particles has not been
possible due to its $O(N^n)$ scaling cost. In this paper we describe a fast
tree-code for efficiently approximating multibody potentials that can be
factorized as products of functions of pairwise distances. For the first time,
we show how to derive a Barnes-Hut type algorithm for handling interactions
among more than two particles. Our algorithm uses two approximation schemes:
1) a deterministic series expansion-based method; 2) a Monte Carlo-based
approximation based on the central limit theorem. Our approach guarantees a
user-specified bound on the absolute or relative error in the computed
potential with an asymptotic probability guarantee. We provide speedup results
on a three-body dispersion potential, the Axilrod-Teller potential.
Comment: To appear in Journal of Computational Physics
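For orientation, the brute-force baseline that such a tree-code accelerates
looks like the following Python sketch of the Axilrod-Teller triple-dipole
potential. The coefficient nu and the explicit triple loop are illustrative
assumptions; this is the standard formula, not the paper's tree-code.

```python
import numpy as np
from itertools import combinations

def axilrod_teller(pos, nu=1.0):
    """Brute-force O(N^3) Axilrod-Teller energy:
    sum over triples of nu * (1 + 3 cos a cos b cos c) / (r_ij r_jk r_ik)^3,
    where a, b, c are the interior angles of triangle (i, j, k)."""
    total = 0.0
    for i, j, k in combinations(range(len(pos)), 3):
        rij = np.linalg.norm(pos[i] - pos[j])
        rjk = np.linalg.norm(pos[j] - pos[k])
        rik = np.linalg.norm(pos[i] - pos[k])
        # Cosines of the triangle's interior angles via the law of cosines.
        ca = (rij**2 + rik**2 - rjk**2) / (2 * rij * rik)   # angle at i
        cb = (rij**2 + rjk**2 - rik**2) / (2 * rij * rjk)   # angle at j
        cc = (rjk**2 + rik**2 - rij**2) / (2 * rjk * rik)   # angle at k
        total += nu * (1 + 3 * ca * cb * cc) / (rij * rjk * rik) ** 3
    return total

# Usage: energy of a small random cluster (the cubic cost caps feasible N).
rng = np.random.default_rng(2)
print(axilrod_teller(rng.uniform(size=(50, 3))))
```

The cubic loop is exactly the $O(N^n)$ cost (here $n = 3$) that makes a
Barnes-Hut style approximation attractive for large particle counts.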