A Review for Weighted MinHash Algorithms
Data similarity (or distance) computation is a fundamental research topic
which underpins many high-level applications based on similarity measures in
machine learning and data mining. However, in large-scale real-world scenarios,
the exact similarity computation has become daunting due to the "3V" nature
(volume, velocity, and variety) of big data. In such cases, hashing
techniques have proved effective, in both theory and practice, for
efficient similarity estimation. Currently, MinHash is a popular technique
for efficiently estimating the Jaccard similarity of binary sets and
furthermore, weighted MinHash is generalized to estimate the generalized
Jaccard similarity of weighted sets. This review focuses on categorizing and
discussing the existing weighted MinHash algorithms. We mainly categorize
them into quantization-based approaches, "active index"-based ones, and
others, and show the evolution and inherent connections of these algorithms,
from integer-weighted MinHash to real-valued weighted MinHash (particularly
the Consistent Weighted Sampling scheme). We have also developed a Python
toolbox for the algorithms and released it on GitHub. Based on the toolbox,
we conduct a comprehensive experimental comparison of the standard MinHash
algorithm and the weighted MinHash ones.
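For reference, the estimator that standard MinHash implements — the fraction of matching signature entries is an unbiased estimate of the Jaccard similarity — can be sketched in a few lines of Python. Seeding the built-in hash per function is an illustrative stand-in for true random permutations, not the toolbox's actual implementation:

```python
import random

def minhash_signature(s, num_hashes=128, seed=0):
    # One entry per hash function: the minimum hash value over the
    # set's elements, a cheap proxy for a random permutation's minimum.
    rng = random.Random(seed)
    hash_seeds = [rng.getrandbits(32) for _ in range(num_hashes)]
    return [min(hash((h, x)) for x in s) for h in hash_seeds]

def estimate_jaccard(sig_a, sig_b):
    # Pr[min-hash values collide] = Jaccard(A, B), so the fraction of
    # matching entries estimates the similarity.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = set(range(0, 80))
b = set(range(40, 120))          # true Jaccard = 40/120 = 1/3
est = estimate_jaccard(minhash_signature(a), minhash_signature(b))
```

With 128 hash functions the standard error of the estimate is roughly sqrt(J(1-J)/128), about 0.04 for these sets.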
Bayesian Neighbourhood Component Analysis
Learning a good distance metric in feature space potentially improves the
performance of the KNN classifier and is useful in many real-world
applications. However, many metric learning algorithms are based on the
point estimation of a quadratic optimization problem, which is
time-consuming, susceptible to overfitting, and lacks a natural mechanism
for reasoning about parameter uncertainty, a property that is especially
useful when the training set is small and/or noisy. To deal with these
issues, we present a
novel Bayesian metric learning method, called Bayesian NCA, based on the
well-known Neighbourhood Component Analysis method, in which the metric
posterior is characterized by the local label consistency constraints of
observations, encoded with a similarity graph instead of independent pairwise
constraints. For efficient Bayesian optimization, we explore the variational
lower bound over the log-likelihood of the original NCA objective. Experiments
on several publicly available datasets demonstrate that the proposed method is
able to learn robust metrics from small datasets and/or from challenging
training sets whose labels are contaminated by errors. The proposed method
is also shown to outperform a previous pairwise-constrained Bayesian metric
learning method.
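For context, the classical (point-estimate) NCA objective that the Bayesian treatment builds on maximizes the expected number of correctly classified points under stochastic neighbour selection. A minimal NumPy sketch, illustrative rather than the authors' code:

```python
import numpy as np

def nca_objective(A, X, y):
    # NCA: the probability that point i selects j as its neighbour is a
    # softmax over negative squared distances in the projected space.
    Z = X @ A.T                           # project into the metric space
    d2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d2, np.inf)          # a point never selects itself
    P = np.exp(-d2)
    P /= P.sum(axis=1, keepdims=True)
    same = (y[:, None] == y[None, :])
    # Expected number of correctly classified points under stochastic
    # nearest-neighbour selection; NCA maximizes this over A.
    return (P * same).sum()

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, (10, 2)),
               rng.normal(3.0, 0.3, (10, 2))])
y = np.array([0] * 10 + [1] * 10)
score = nca_objective(np.eye(2), X, y)    # well-separated: near 20
```

Bayesian NCA replaces the single optimized A with a posterior over metrics, characterized by the local label-consistency constraints described above.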
Adaptive Discrete Smoothing for High-Dimensional and Nonlinear Panel Data
In this paper we develop a data-driven smoothing technique for
high-dimensional and non-linear panel data models. We allow for
individual-specific (non-linear) functions and estimation with econometric or
machine learning methods by using weighted observations from other
individuals. The weights are determined in a data-driven way and depend on
the similarity between the corresponding functions, measured from initial
estimates. The key feature of the procedure is that it clusters individuals
based on the distance/similarity between them, estimated in a first stage.
Our estimation method can be combined with various statistical estimation
procedures, in particular modern machine learning methods, which are
especially fruitful in the high-dimensional case and with complex,
heterogeneous data. The approach can be interpreted as "soft clustering", in
contrast to traditional "hard clustering", which assigns each individual to
exactly one group. We conduct a simulation study which shows that prediction
can be greatly improved by using our estimator. Finally, we analyze a big
data set from didichuxing.com, a leading company in the transportation
industry, to predict the gap between supply and demand based on a large set
of covariates. Our estimator clearly performs much better in out-of-sample
prediction than existing linear panel data estimators.
(Comment: 18 pages, 1 figure, 6 tables)
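The weighting idea can be illustrated with a short sketch: distances between first-stage estimates are turned into soft-clustering weights. The Gaussian kernel below is an assumed choice for illustration; the paper's exact weighting rule may differ:

```python
import numpy as np

def soft_cluster_weights(first_stage, bandwidth=1.0):
    # first_stage[i]: initial estimate for individual i (e.g. a
    # coefficient vector from a preliminary regression). The weight
    # individual j receives when estimating i decays with the distance
    # between their first-stage estimates: "soft" clustering, since
    # every individual contributes with some positive weight.
    F = np.asarray(first_stage, dtype=float)
    d2 = ((F[:, None, :] - F[None, :, :]) ** 2).sum(axis=-1)
    W = np.exp(-d2 / (2.0 * bandwidth ** 2))
    return W / W.sum(axis=1, keepdims=True)   # each row sums to one

# Two similar individuals and one dissimilar one:
F = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]]
W = soft_cluster_weights(F)
```

A second-stage estimator for individual i then reuses other individuals' observations, weighted by row W[i].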
Query-driven learning for predictive analytics of data subspace cardinality
Fundamental to many predictive analytics tasks is the ability to estimate the cardinality (number of data items) of multi-dimensional data subspaces, defined by query selections over datasets. This is crucial for data analysts dealing with, e.g., interactive data subspace exploration, data subspace visualization, and query processing optimization. However, in many modern data systems, predictive analytics may be (i) too costly money-wise, e.g., in clouds, (ii) unreliable, e.g., in modern Big Data query engines, where accurate statistics are difficult to obtain/maintain, or (iii) infeasible, e.g., for privacy reasons. We contribute a novel, query-driven, function estimation model of analyst-defined data subspace cardinality. The proposed estimation model is highly accurate in terms of prediction and accommodates the well-known selection queries: multi-dimensional range and distance-nearest-neighbors (radius) queries. Our function estimation model: (i) quantizes the vectorial query space by learning the analysts' access patterns over a data space, (ii) associates query vectors with the corresponding cardinalities of the analyst-defined data subspaces, (iii) abstracts and employs query vectorial similarity to predict the cardinality of an unseen/unexplored data subspace, and (iv) identifies and adapts to possible changes of the query subspaces based on the theory of optimal stopping. The proposed model is decentralized, facilitating the scaling-out of such predictive analytics queries. The research significance of the model lies in that (i) it is an attractive solution when data-driven statistical techniques are undesirable or infeasible, (ii) it offers a scale-out, decentralized training solution, (iii) it is applicable to different selection query types, and (iv) it offers performance superior to that of data-driven approaches.
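Steps (ii)-(iii) — pairing query vectors with observed cardinalities and predicting unseen queries by vectorial similarity — can be sketched as follows. The Gaussian similarity kernel is an illustrative assumption, not the paper's exact model:

```python
import numpy as np

class CardinalityEstimator:
    # Query-driven sketch: store (query vector, observed cardinality)
    # pairs, then predict an unseen query's cardinality as a
    # similarity-weighted average over past queries -- no access to the
    # underlying data is needed at prediction time.
    def __init__(self, bandwidth=0.5):
        self.queries, self.cards, self.h = [], [], bandwidth

    def observe(self, q, cardinality):
        self.queries.append(np.asarray(q, dtype=float))
        self.cards.append(float(cardinality))

    def predict(self, q):
        Q = np.stack(self.queries)
        d2 = ((Q - np.asarray(q, dtype=float)) ** 2).sum(axis=1)
        w = np.exp(-d2 / (2.0 * self.h ** 2))  # similarity to past queries
        return float(w @ np.asarray(self.cards) / w.sum())

est = CardinalityEstimator()
est.observe([0.0, 0.0], 100)
est.observe([0.1, 0.0], 110)
est.observe([5.0, 5.0], 10)
pred = est.predict([0.05, 0.0])   # close to the first two queries
```

Because the model needs only past (query, cardinality) pairs, it stays attractive exactly when data-driven statistics are costly, unreliable, or infeasible.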
Generalization Bounds for Metric and Similarity Learning
Recently, metric learning and similarity learning have attracted a large
amount of interest. Many models and optimisation algorithms have been proposed.
However, there is relatively little work on the generalization analysis of such
methods. In this paper, we derive novel generalization bounds of metric and
similarity learning. In particular, we first show that the generalization
analysis reduces to the estimation of the Rademacher average over
"sums-of-i.i.d." sample-blocks related to the specific matrix norm. Then, we
derive generalization bounds for metric/similarity learning with different
matrix-norm regularisers by estimating their specific Rademacher complexities.
Our analysis indicates that sparse metric/similarity learning with L1-norm
regularisation could lead to significantly better bounds than those with
Frobenius-norm regularisation. Our novel generalization analysis develops and
refines the techniques of U-statistics and Rademacher complexity analysis.
(Comment: 20 pages)
Guaranteed Classification via Regularized Similarity Learning
Learning an appropriate (dis)similarity function from the available data is a
central problem in machine learning, since the success of many machine learning
algorithms critically depends on the choice of a similarity function to compare
examples. Although many approaches to similarity metric learning have been
proposed, there is little theoretical study of the links between similarity
metric learning and the classification performance of the resulting classifier.
In this paper, we propose a regularized similarity learning formulation
associated with general matrix-norms, and establish their generalization
bounds. We show that the generalization error of the resulting linear separator
can be bounded by the derived generalization bound of similarity learning. This
shows that a good gen- eralization of the learnt similarity function guarantees
a good classification of the resulting linear classifier. Our results extend
and improve those obtained by Bellet at al. [3]. Due to the techniques
dependent on the notion of uniform stability [6], the bound obtained there
holds true only for the Frobenius matrix- norm regularization. Our techniques
using the Rademacher complexity [5] and its related Khinchin-type inequality
enable us to establish bounds for regularized similarity learning formulations
associated with general matrix-norms including sparse L 1 -norm and mixed
(2,1)-norm
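The kind of similarity-based linear separator whose generalization such bounds control can be sketched generically; the dot-product similarity, landmark set, and weights below are placeholders for a learnt similarity function and learnt coefficients:

```python
import numpy as np

def similarity_classifier(S, alphas, landmarks, x):
    # Linear separator built on a (learnt) similarity function S:
    #   h(x) = sign( sum_i alpha_i * S(x, x_i) )
    # over a set of landmark points, as in similarity-based learning.
    scores = np.array([S(x, xi) for xi in landmarks])
    return float(np.sign(alphas @ scores))

S = lambda u, v: float(np.dot(u, v))            # placeholder similarity
landmarks = np.array([[1.0, 0.0], [-1.0, 0.0]])
alphas = np.array([1.0, -1.0])
label = similarity_classifier(S, alphas, landmarks, np.array([2.0, 0.0]))
```

The paper's result says, roughly, that if the learnt S generalizes well under the chosen matrix-norm regularizer, a separator of this form classifies well.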
Similarity-based Text Recognition by Deeply Supervised Siamese Network
In this paper, we propose a new text recognition model based on measuring the
visual similarity of text and predicting the content of unlabeled texts. First
a Siamese convolutional network is trained with deep supervision on a labeled
training dataset. This network projects texts into a similarity manifold. The
Deeply Supervised Siamese network learns visual similarity of texts. Then a
K-nearest neighbor classifier is used to predict unlabeled text based on
similarity distance to labeled texts. The performance of the model is evaluated
on three datasets of machine-print and hand-written text combined. We
demonstrate that the model reduces the cost of human estimation by .
The error of the system is less than . The proposed model outperforms the
conventional Siamese network by finding visually similar barely-readable and
readable text (e.g. machine-printed, handwritten), due to deep supervision.
The results also demonstrate that the predicted labels are sometimes better
than human labels, e.g. spelling correction.
(Comment: Accepted for presentation at the Future Technologies Conference
(FTC 2016), San Francisco, December 6-7, 2016)
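The second stage described above — k-nearest-neighbour prediction in the learned similarity manifold — reduces to a few lines once embeddings exist; the toy vectors below stand in for Siamese-network outputs:

```python
import numpy as np
from collections import Counter

def knn_predict(query_emb, labeled_embs, labels, k=3):
    # After a Siamese network embeds texts into a similarity manifold,
    # prediction is k-nearest-neighbour voting in that space.
    dists = np.linalg.norm(labeled_embs - query_emb, axis=1)
    nearest = np.argsort(dists)[:k]
    return Counter(labels[i] for i in nearest).most_common(1)[0][0]

embs = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
labels = np.array([0, 0, 1, 1])
pred = knn_predict(np.array([0.05, 0.0]), embs, labels, k=3)
```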
Learned Multi-Patch Similarity
Estimating a depth map from multiple views of a scene is a fundamental task
in computer vision. As soon as more than two viewpoints are available, one
faces the basic question of how to measure similarity across more than two
image patches. Surprisingly, no direct solution exists; instead, it is
common to fall back to more or less robust averaging of two-view
similarities. Encouraged by
the success of machine learning, and in particular convolutional neural
networks, we propose to learn a matching function which directly maps multiple
image patches to a scalar similarity score. Experiments on several multi-view
datasets demonstrate that this approach has advantages over methods based on
pairwise patch similarity.
(Comment: 10 pages, 7 figures, accepted at ICCV 2017)
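The fallback the paper argues against — averaging two-view similarities over all pairs — is easy to state; normalized cross-correlation is an assumed choice of two-view score here:

```python
import numpy as np
from itertools import combinations

def avg_pairwise_similarity(patches):
    # Baseline multi-patch score: average a two-view similarity
    # (normalized cross-correlation) over all pairs of patches.
    def ncc(a, b):
        a = (a - a.mean()) / (a.std() + 1e-8)
        b = (b - b.mean()) / (b.std() + 1e-8)
        return float((a * b).mean())
    return float(np.mean([ncc(p, q) for p, q in combinations(patches, 2)]))

rng = np.random.default_rng(1)
p = rng.normal(size=(8, 8))
score = avg_pairwise_similarity([p, p.copy(), p.copy()])  # identical patches
```

The paper's learned matching function replaces this hand-crafted pairwise average with a CNN that maps all the patches jointly to a single scalar score.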
Dual-reference Face Retrieval
Face retrieval has received much attention over the past few decades, and
many efforts have been made in retrieving face images against pose,
illumination, and expression variations. However, conventional works fail to
meet the requirements of a potential and novel task: retrieving a person's
face image at a specific age, especially when the specific 'age' is not
given as a numeral, i.e. 'retrieving someone's image at a similar age period
to that shown in another person's image'. To tackle this problem, we propose a
dual reference face retrieval framework in this paper, where the system takes
two inputs: an identity reference image which indicates the target identity and
an age reference image which reflects the target age. In our framework, the raw
images are first projected on a joint manifold, which preserves both the age
and identity locality. Then two similarity metrics of age and identity are
exploited and optimized by utilizing our proposed quartet-based model. The
experiments show promising results, outperforming hierarchical methods.
(Comment: Accepted at AAAI 2018)
Deep Convolutional Networks on Graph-Structured Data
Deep Learning's recent successes have mostly relied on Convolutional
Networks, which exploit fundamental statistical properties of images, sounds
and video data: local stationarity and multi-scale compositional structure,
which allow long-range interactions to be expressed in terms of shorter,
localized ones. However, there exist other important examples, such as text
documents or bioinformatic data, that may lack some or all of these strong
statistical regularities.
In this paper we consider the general question of how to construct deep
architectures with small learning complexity on general non-Euclidean domains,
which are typically unknown and need to be estimated from the data. In
particular, we develop an extension of Spectral Networks which incorporates
a Graph Estimation procedure, which we test on large-scale classification
problems, matching or improving on Dropout Networks with far fewer
parameters to estimate.
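A single spectral filtering step of the kind Spectral Networks stack can be sketched with NumPy: transform node features into the graph Fourier basis (the Laplacian's eigenvectors), scale each frequency by a coefficient, and transform back. In the actual networks the coefficients theta are learned; here they are given:

```python
import numpy as np

def spectral_filter(X, A, theta):
    # X: node features (n x d); A: symmetric adjacency matrix (n x n);
    # theta: one multiplier per graph frequency, shape (n,).
    d = A.sum(axis=1)
    L = np.diag(d) - A                # combinatorial graph Laplacian
    _, U = np.linalg.eigh(L)          # columns form the graph Fourier basis
    return U @ (theta[:, None] * (U.T @ X))

A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])          # 3-node path graph
X = np.arange(6.0).reshape(3, 2)
out = spectral_filter(X, A, np.ones(3))   # identity filter: returns X
```

When the graph itself is unknown, as in the paper, the adjacency A is first estimated from the data before such filters are applied.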