A Review for Weighted MinHash Algorithms
Data similarity (or distance) computation is a fundamental research topic
which underpins many high-level applications based on similarity measures in
machine learning and data mining. However, in large-scale real-world scenarios,
the exact similarity computation has become daunting due to the "3V" nature
(volume, velocity, and variety) of big data. In such cases, hashing
techniques have proved effective, in both theory and practice, for
efficient similarity estimation. Currently, MinHash is a popular technique
for efficiently estimating the Jaccard similarity of binary sets and
furthermore, weighted MinHash is generalized to estimate the generalized
Jaccard similarity of weighted sets. This review focuses on categorizing and
discussing the existing weighted MinHash algorithms. We mainly categorize
them into quantization-based approaches, "active index"-based ones, and
others, and show the evolution and inherent connections of these algorithms,
from integer-weighted MinHash to real-valued weighted MinHash (particularly
the Consistent Weighted Sampling scheme). We have also developed a Python
toolbox for the algorithms and released it on GitHub. Based on the toolbox,
we conduct a comprehensive experimental comparison of the standard MinHash
algorithm and the weighted MinHash ones.
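For reference, the estimator that standard MinHash implements — the fraction of matching signature entries is an unbiased estimate of the Jaccard similarity — can be sketched in a few lines of Python. Seeding the built-in hash per function is an illustrative stand-in for true random permutations, not the toolbox's actual implementation:

```python
import random

def minhash_signature(s, num_hashes=128, seed=0):
    # One entry per hash function: the minimum hash value over the
    # set's elements, a cheap proxy for a random permutation's minimum.
    rng = random.Random(seed)
    hash_seeds = [rng.getrandbits(32) for _ in range(num_hashes)]
    return [min(hash((h, x)) for x in s) for h in hash_seeds]

def estimate_jaccard(sig_a, sig_b):
    # Pr[min-hash values collide] = Jaccard(A, B), so the fraction of
    # matching entries estimates the similarity.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = set(range(0, 80))
b = set(range(40, 120))          # true Jaccard = 40/120 = 1/3
est = estimate_jaccard(minhash_signature(a), minhash_signature(b))
```

With 128 hash functions the standard error of the estimate is roughly sqrt(J(1-J)/128), about 0.04 for these sets.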
Bayesian Neighbourhood Component Analysis
Learning a good distance metric in feature space potentially improves the
performance of the KNN classifier and is useful in many real-world
applications. However, many metric learning algorithms are based on the
point estimation of a quadratic optimization problem, which is
time-consuming, susceptible to overfitting, and lacks a natural mechanism
for reasoning about parameter uncertainty, a property that is especially
useful when the training set is small and/or noisy. To deal with these
issues, we present a
novel Bayesian metric learning method, called Bayesian NCA, based on the
well-known Neighbourhood Component Analysis method, in which the metric
posterior is characterized by the local label consistency constraints of
observations, encoded with a similarity graph instead of independent pairwise
constraints. For efficient Bayesian optimization, we explore the variational
lower bound over the log-likelihood of the original NCA objective. Experiments
on several publicly available datasets demonstrate that the proposed method is
able to learn robust metrics from small datasets and/or from challenging
training sets whose labels are contaminated by errors. The proposed method
is also shown to outperform a previous pairwise-constrained Bayesian metric
learning method.
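For context, the classical (point-estimate) NCA objective that the Bayesian treatment builds on maximizes the expected number of correctly classified points under stochastic neighbour selection. A minimal NumPy sketch, illustrative rather than the authors' code:

```python
import numpy as np

def nca_objective(A, X, y):
    # NCA: the probability that point i selects j as its neighbour is a
    # softmax over negative squared distances in the projected space.
    Z = X @ A.T                           # project into the metric space
    d2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d2, np.inf)          # a point never selects itself
    P = np.exp(-d2)
    P /= P.sum(axis=1, keepdims=True)
    same = (y[:, None] == y[None, :])
    # Expected number of correctly classified points under stochastic
    # nearest-neighbour selection; NCA maximizes this over A.
    return (P * same).sum()

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, (10, 2)),
               rng.normal(3.0, 0.3, (10, 2))])
y = np.array([0] * 10 + [1] * 10)
score = nca_objective(np.eye(2), X, y)    # well-separated: near 20
```

Bayesian NCA replaces the single optimized A with a posterior over metrics, characterized by the local label-consistency constraints described above.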
Adaptive Discrete Smoothing for High-Dimensional and Nonlinear Panel Data
In this paper we develop a data-driven smoothing technique for
high-dimensional and non-linear panel data models. We allow for
individual-specific (non-linear) functions and estimation with econometric or
machine learning methods by using weighted observations from other
individuals. The weights are determined in a data-driven way and depend on
the similarity between the corresponding functions, measured from initial
estimates. The key feature of the procedure is that it clusters individuals
based on the distance/similarity between them, estimated in a first stage.
Our estimation method can be combined with various statistical estimation
procedures, in particular modern machine learning methods, which are
especially fruitful in the high-dimensional case and with complex,
heterogeneous data. The approach can be interpreted as "soft clustering", in
contrast to traditional "hard clustering", which assigns each individual to
exactly one group. We conduct a simulation study which shows that prediction
can be greatly improved by using our estimator. Finally, we analyze a big
data set from didichuxing.com, a leading company in the transportation
industry, to predict the gap between supply and demand based on a large set
of covariates. Our estimator clearly performs much better in out-of-sample
prediction than existing linear panel data estimators.
(Comment: 18 pages, 1 figure, 6 tables)
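The weighting idea can be illustrated with a short sketch: distances between first-stage estimates are turned into soft-clustering weights. The Gaussian kernel below is an assumed choice for illustration; the paper's exact weighting rule may differ:

```python
import numpy as np

def soft_cluster_weights(first_stage, bandwidth=1.0):
    # first_stage[i]: initial estimate for individual i (e.g. a
    # coefficient vector from a preliminary regression). The weight
    # individual j receives when estimating i decays with the distance
    # between their first-stage estimates: "soft" clustering, since
    # every individual contributes with some positive weight.
    F = np.asarray(first_stage, dtype=float)
    d2 = ((F[:, None, :] - F[None, :, :]) ** 2).sum(axis=-1)
    W = np.exp(-d2 / (2.0 * bandwidth ** 2))
    return W / W.sum(axis=1, keepdims=True)   # each row sums to one

# Two similar individuals and one dissimilar one:
F = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]]
W = soft_cluster_weights(F)
```

A second-stage estimator for individual i then reuses other individuals' observations, weighted by row W[i].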
Query-driven learning for predictive analytics of data subspace cardinality
Fundamental to many predictive analytics tasks is the ability to estimate the cardinality (number of data items) of multi-dimensional data subspaces, defined by query selections over datasets. This is crucial for data analysts dealing with, e.g., interactive data subspace exploration, data subspace visualization, and query processing optimization. However, in many modern data systems, predictive analytics may be (i) too costly money-wise, e.g., in clouds, (ii) unreliable, e.g., in modern Big Data query engines, where accurate statistics are difficult to obtain/maintain, or (iii) infeasible, e.g., for privacy reasons. We contribute a novel, query-driven, function estimation model of analyst-defined data subspace cardinality. The proposed estimation model is highly accurate in terms of prediction and accommodates the well-known selection queries: multi-dimensional range and distance-nearest-neighbors (radius) queries. Our function estimation model: (i) quantizes the vectorial query space by learning the analysts' access patterns over a data space, (ii) associates query vectors with the corresponding cardinalities of the analyst-defined data subspaces, (iii) abstracts and employs query vectorial similarity to predict the cardinality of an unseen/unexplored data subspace, and (iv) identifies and adapts to possible changes of the query subspaces based on the theory of optimal stopping. The proposed model is decentralized, facilitating the scaling-out of such predictive analytics queries. The research significance of the model lies in that (i) it is an attractive solution when data-driven statistical techniques are undesirable or infeasible, (ii) it offers a scale-out, decentralized training solution, (iii) it is applicable to different selection query types, and (iv) it offers performance superior to that of data-driven approaches.
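Steps (ii)-(iii) — pairing query vectors with observed cardinalities and predicting unseen queries by vectorial similarity — can be sketched as follows. The Gaussian similarity kernel is an illustrative assumption, not the paper's exact model:

```python
import numpy as np

class CardinalityEstimator:
    # Query-driven sketch: store (query vector, observed cardinality)
    # pairs, then predict an unseen query's cardinality as a
    # similarity-weighted average over past queries -- no access to the
    # underlying data is needed at prediction time.
    def __init__(self, bandwidth=0.5):
        self.queries, self.cards, self.h = [], [], bandwidth

    def observe(self, q, cardinality):
        self.queries.append(np.asarray(q, dtype=float))
        self.cards.append(float(cardinality))

    def predict(self, q):
        Q = np.stack(self.queries)
        d2 = ((Q - np.asarray(q, dtype=float)) ** 2).sum(axis=1)
        w = np.exp(-d2 / (2.0 * self.h ** 2))  # similarity to past queries
        return float(w @ np.asarray(self.cards) / w.sum())

est = CardinalityEstimator()
est.observe([0.0, 0.0], 100)
est.observe([0.1, 0.0], 110)
est.observe([5.0, 5.0], 10)
pred = est.predict([0.05, 0.0])   # close to the first two queries
```

Because the model needs only past (query, cardinality) pairs, it stays attractive exactly when data-driven statistics are costly, unreliable, or infeasible.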
Generalization Bounds for Metric and Similarity Learning
Recently, metric learning and similarity learning have attracted a large
amount of interest. Many models and optimisation algorithms have been proposed.
However, there is relatively little work on the generalization analysis of such
methods. In this paper, we derive novel generalization bounds of metric and
similarity learning. In particular, we first show that the generalization
analysis reduces to the estimation of the Rademacher average over
"sums-of-i.i.d." sample-blocks related to the specific matrix norm. Then, we
derive generalization bounds for metric/similarity learning with different
matrix-norm regularisers by estimating their specific Rademacher complexities.
Our analysis indicates that sparse metric/similarity learning with L1-norm
regularisation could lead to significantly better bounds than those with
Frobenius-norm regularisation. Our novel generalization analysis develops and
refines the techniques of U-statistics and Rademacher complexity analysis.
(Comment: 20 pages)
Guaranteed Classification via Regularized Similarity Learning
Learning an appropriate (dis)similarity function from the available data is a
central problem in machine learning, since the success of many machine learning
algorithms critically depends on the choice of a similarity function to compare
examples. Although many approaches to similarity metric learning have been
proposed, there is little theoretical study of the links between similarity
metric learning and the classification performance of the resulting classifier.
In this paper, we propose a regularized similarity learning formulation
associated with general matrix-norms, and establish their generalization
bounds. We show that the generalization error of the resulting linear separator
can be bounded by the derived generalization bound of similarity learning. This
shows that a good gen- eralization of the learnt similarity function guarantees
a good classification of the resulting linear classifier. Our results extend
and improve those obtained by Bellet at al. [3]. Due to the techniques
dependent on the notion of uniform stability [6], the bound obtained there
holds true only for the Frobenius matrix- norm regularization. Our techniques
using the Rademacher complexity [5] and its related Khinchin-type inequality
enable us to establish bounds for regularized similarity learning formulations
associated with general matrix-norms including sparse L 1 -norm and mixed
(2,1)-norm
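The kind of similarity-based linear separator whose generalization such bounds control can be sketched generically; the dot-product similarity, landmark set, and weights below are placeholders for a learnt similarity function and learnt coefficients:

```python
import numpy as np

def similarity_classifier(S, alphas, landmarks, x):
    # Linear separator built on a (learnt) similarity function S:
    #   h(x) = sign( sum_i alpha_i * S(x, x_i) )
    # over a set of landmark points, as in similarity-based learning.
    scores = np.array([S(x, xi) for xi in landmarks])
    return float(np.sign(alphas @ scores))

S = lambda u, v: float(np.dot(u, v))            # placeholder similarity
landmarks = np.array([[1.0, 0.0], [-1.0, 0.0]])
alphas = np.array([1.0, -1.0])
label = similarity_classifier(S, alphas, landmarks, np.array([2.0, 0.0]))
```

The paper's result says, roughly, that if the learnt S generalizes well under the chosen matrix-norm regularizer, a separator of this form classifies well.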
Similarity-based Text Recognition by Deeply Supervised Siamese Network
In this paper, we propose a new text recognition model based on measuring the
visual similarity of text and predicting the content of unlabeled texts. First
a Siamese convolutional network is trained with deep supervision on a labeled
training dataset. This network projects texts into a similarity manifold. The
Deeply Supervised Siamese network learns visual similarity of texts. Then a
K-nearest neighbor classifier is used to predict unlabeled text based on
similarity distance to labeled texts. The performance of the model is evaluated
on three datasets of machine-print and hand-written text combined. We
demonstrate that the model reduces the cost of human estimation by .
The error of the system is less than . The proposed model outperforms the
conventional Siamese network by finding visually similar barely-readable and
readable text (e.g. machine-printed, handwritten), due to deep supervision.
The results also demonstrate that the predicted labels are sometimes better
than human labels, e.g. spelling correction.
(Comment: Accepted for presentation at the Future Technologies Conference
(FTC 2016), San Francisco, December 6-7, 2016)
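The second stage described above — k-nearest-neighbour prediction in the learned similarity manifold — reduces to a few lines once embeddings exist; the toy vectors below stand in for Siamese-network outputs:

```python
import numpy as np
from collections import Counter

def knn_predict(query_emb, labeled_embs, labels, k=3):
    # After a Siamese network embeds texts into a similarity manifold,
    # prediction is k-nearest-neighbour voting in that space.
    dists = np.linalg.norm(labeled_embs - query_emb, axis=1)
    nearest = np.argsort(dists)[:k]
    return Counter(labels[i] for i in nearest).most_common(1)[0][0]

embs = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
labels = np.array([0, 0, 1, 1])
pred = knn_predict(np.array([0.05, 0.0]), embs, labels, k=3)
```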
Learned Multi-Patch Similarity
Estimating a depth map from multiple views of a scene is a fundamental task
in computer vision. As soon as more than two viewpoints are available, one
faces the basic question of how to measure similarity across more than two
image patches. Surprisingly, no direct solution exists; instead, it is
common to fall back to more or less robust averaging of two-view
similarities. Encouraged by
the success of machine learning, and in particular convolutional neural
networks, we propose to learn a matching function which directly maps multiple
image patches to a scalar similarity score. Experiments on several multi-view
datasets demonstrate that this approach has advantages over methods based on
pairwise patch similarity.
(Comment: 10 pages, 7 figures, accepted at ICCV 2017)
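The fallback the paper argues against — averaging two-view similarities over all pairs — is easy to state; normalized cross-correlation is an assumed choice of two-view score here:

```python
import numpy as np
from itertools import combinations

def avg_pairwise_similarity(patches):
    # Baseline multi-patch score: average a two-view similarity
    # (normalized cross-correlation) over all pairs of patches.
    def ncc(a, b):
        a = (a - a.mean()) / (a.std() + 1e-8)
        b = (b - b.mean()) / (b.std() + 1e-8)
        return float((a * b).mean())
    return float(np.mean([ncc(p, q) for p, q in combinations(patches, 2)]))

rng = np.random.default_rng(1)
p = rng.normal(size=(8, 8))
score = avg_pairwise_similarity([p, p.copy(), p.copy()])  # identical patches
```

The paper's learned matching function replaces this hand-crafted pairwise average with a CNN that maps all the patches jointly to a single scalar score.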
Dual-reference Face Retrieval
Face retrieval has received much attention over the past few decades, and
many efforts have been made in retrieving face images against pose,
illumination, and expression variations. However, conventional works fail to
meet the requirements of a potential and novel task: retrieving a person's
face image at a specific age, especially when the specific 'age' is not
given as a numeral, i.e. 'retrieving someone's image at a similar age period
to that shown in another person's image'. To tackle this problem, we propose a
dual reference face retrieval framework in this paper, where the system takes
two inputs: an identity reference image which indicates the target identity and
an age reference image which reflects the target age. In our framework, the raw
images are first projected on a joint manifold, which preserves both the age
and identity locality. Then two similarity metrics of age and identity are
exploited and optimized by utilizing our proposed quartet-based model. The
experiments show promising results, outperforming hierarchical methods.
(Comment: Accepted at AAAI 2018)
Deep Convolutional Networks on Graph-Structured Data
Deep Learning's recent successes have mostly relied on Convolutional
Networks, which exploit fundamental statistical properties of images, sounds
and video data: local stationarity and multi-scale compositional structure,
which allow long-range interactions to be expressed in terms of shorter,
localized ones. However, there exist other important examples, such as text
documents or bioinformatic data, that may lack some or all of these strong
statistical regularities.
In this paper we consider the general question of how to construct deep
architectures with small learning complexity on general non-Euclidean domains,
which are typically unknown and need to be estimated from the data. In
particular, we develop an extension of Spectral Networks which incorporates
a Graph Estimation procedure, which we test on large-scale classification
problems, matching or improving on Dropout Networks with far fewer
parameters to estimate.
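A single spectral filtering step of the kind Spectral Networks stack can be sketched with NumPy: transform node features into the graph Fourier basis (the Laplacian's eigenvectors), scale each frequency by a coefficient, and transform back. In the actual networks the coefficients theta are learned; here they are given:

```python
import numpy as np

def spectral_filter(X, A, theta):
    # X: node features (n x d); A: symmetric adjacency matrix (n x n);
    # theta: one multiplier per graph frequency, shape (n,).
    d = A.sum(axis=1)
    L = np.diag(d) - A                # combinatorial graph Laplacian
    _, U = np.linalg.eigh(L)          # columns form the graph Fourier basis
    return U @ (theta[:, None] * (U.T @ X))

A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])          # 3-node path graph
X = np.arange(6.0).reshape(3, 2)
out = spectral_filter(X, A, np.ones(3))   # identity filter: returns X
```

When the graph itself is unknown, as in the paper, the adjacency A is first estimated from the data before such filters are applied.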