151,793 research outputs found
Hashing for Similarity Search: A Survey
Similarity search (nearest neighbor search) is a problem of pursuing the data
items whose distances to a query item are the smallest from a large database.
Various methods have been developed to address this problem, and recently a lot
of efforts have been devoted to approximate search. In this paper, we present a
survey on one of the main solutions, hashing, which has been widely studied
since the pioneering work locality sensitive hashing. We divide the hashing
algorithms two main categories: locality sensitive hashing, which designs hash
functions without exploring the data distribution and learning to hash, which
learns hash functions according the data distribution, and review them from
various aspects, including hash function design and distance measure and search
scheme in the hash coding space
Improved Asymmetric Locality Sensitive Hashing (ALSH) for Maximum Inner Product Search (MIPS)
Recently it was shown that the problem of Maximum Inner Product Search (MIPS)
is efficient and it admits provably sub-linear hashing algorithms. Asymmetric
transformations before hashing were the key in solving MIPS which was otherwise
hard. In the prior work, the authors use asymmetric transformations which
convert the problem of approximate MIPS into the problem of approximate near
neighbor search which can be efficiently solved using hashing. In this work, we
provide a different transformation which converts the problem of approximate
MIPS into the problem of approximate cosine similarity search which can be
efficiently solved using signed random projections. Theoretical analysis show
that the new scheme is significantly better than the original scheme for MIPS.
Experimental evaluations strongly support the theoretical findings.Comment: arXiv admin note: text overlap with arXiv:1405.586
Maximum Inner-Product Search using Tree Data-structures
The problem of {\em efficiently} finding the best match for a query in a
given set with respect to the Euclidean distance or the cosine similarity has
been extensively studied in literature. However, a closely related problem of
efficiently finding the best match with respect to the inner product has never
been explored in the general setting to the best of our knowledge. In this
paper we consider this general problem and contrast it with the existing
best-match algorithms. First, we propose a general branch-and-bound algorithm
using a tree data structure. Subsequently, we present a dual-tree algorithm for
the case where there are multiple queries. Finally we present a new data
structure for increasing the efficiency of the dual-tree algorithm. These
branch-and-bound algorithms involve novel bounds suited for the purpose of
best-matching with inner products. We evaluate our proposed algorithms on a
variety of data sets from various applications, and exhibit up to five orders
of magnitude improvement in query time over the naive search technique.Comment: Under submission in KDD 201
Penalized estimation in large-scale generalized linear array models
Large-scale generalized linear array models (GLAMs) can be challenging to
fit. Computation and storage of its tensor product design matrix can be
impossible due to time and memory constraints, and previously considered design
matrix free algorithms do not scale well with the dimension of the parameter
vector. A new design matrix free algorithm is proposed for computing the
penalized maximum likelihood estimate for GLAMs, which, in particular, handles
nondifferentiable penalty functions. The proposed algorithm is implemented and
available via the R package \verb+glamlasso+. It combines several ideas --
previously considered separately -- to obtain sparse estimates while at the
same time efficiently exploiting the GLAM structure. In this paper the
convergence of the algorithm is treated and the performance of its
implementation is investigated and compared to that of \verb+glmnet+ on
simulated as well as real data. It is shown that the computation time fo
- …