17 research outputs found
Implicit Density Estimation by Local Moment Matching to Sample from Auto-Encoders
Recent work suggests that some auto-encoder variants do a good job of
capturing the local manifold structure of the unknown data generating density.
This paper contributes to the mathematical understanding of this phenomenon and
helps define better justified sampling algorithms for deep learning based on
auto-encoder variants. We consider an MCMC where each step samples from a
Gaussian whose mean and covariance matrix depend on the previous state, defines
through its asymptotic distribution a target density. First, we show that good
choices (in the sense of consistency) for these mean and covariance functions
are the local expected value and local covariance under that target density.
Then we show that an auto-encoder with a contractive penalty captures
estimators of these local moments in its reconstruction function and its
Jacobian. A contribution of this work is thus a novel alternative to
maximum-likelihood density estimation, which we call local moment matching. It
also justifies a recently proposed sampling algorithm for the Contractive
Auto-Encoder and extends it to the Denoising Auto-Encoder
Local Component Analysis
Kernel density estimation, a.k.a. Parzen windows, is a popular density
estimation method, which can be used for outlier detection or clustering. With
multivariate data, its performance is heavily reliant on the metric used within
the kernel. Most earlier work has focused on learning only the bandwidth of the
kernel (i.e., a scalar multiplicative factor). In this paper, we propose to
learn a full Euclidean metric through an expectation-minimization (EM)
procedure, which can be seen as an unsupervised counterpart to neighbourhood
component analysis (NCA). In order to avoid overfitting with a fully
nonparametric density estimator in high dimensions, we also consider a
semi-parametric Gaussian-Parzen density model, where some of the variables are
modelled through a jointly Gaussian density, while others are modelled through
Parzen windows. For these two models, EM leads to simple closed-form updates
based on matrix inversions and eigenvalue decompositions. We show empirically
that our method leads to density estimators with higher test-likelihoods than
natural competing methods, and that the metrics may be used within most
unsupervised learning techniques that rely on such metrics, such as spectral
clustering or manifold learning methods. Finally, we present a stochastic
approximation scheme which allows for the use of this method in a large-scale
setting
A Graph-Based Semi-Supervised k Nearest-Neighbor Method for Nonlinear Manifold Distributed Data Classification
Nearest Neighbors (NN) is one of the most widely used supervised
learning algorithms to classify Gaussian distributed data, but it does not
achieve good results when it is applied to nonlinear manifold distributed data,
especially when a very limited amount of labeled samples are available. In this
paper, we propose a new graph-based NN algorithm which can effectively
handle both Gaussian distributed data and nonlinear manifold distributed data.
To achieve this goal, we first propose a constrained Tired Random Walk (TRW) by
constructing an -level nearest-neighbor strengthened tree over the graph,
and then compute a TRW matrix for similarity measurement purposes. After this,
the nearest neighbors are identified according to the TRW matrix and the class
label of a query point is determined by the sum of all the TRW weights of its
nearest neighbors. To deal with online situations, we also propose a new
algorithm to handle sequential samples based a local neighborhood
reconstruction. Comparison experiments are conducted on both synthetic data
sets and real-world data sets to demonstrate the validity of the proposed new
NN algorithm and its improvements to other version of NN algorithms.
Given the widespread appearance of manifold structures in real-world problems
and the popularity of the traditional NN algorithm, the proposed manifold
version NN shows promising potential for classifying manifold-distributed
data.Comment: 32 pages, 12 figures, 7 table
Novel Perspectives and Applications of Knowledge Graph Embeddings: From Link Prediction to Risk Assessment and Explainability
Knowledge graph representation is an important embedding technology that supports a variety of machine learning related applications. By learning the distributed representation of multi-relational data, knowledge embedding models are supposed to efficiently deal with the semantic relatedness of their constituents. However, failing in the fundamental task of creating an appropriate form to represent knowledge harms any attempt of designing subsequent machine learning tasks. Several knowledge embedding methods have been proposed in the last decade. Although there is a consensus on the idea that enhanced approaches are more efficient, more complex projections in the hyperspace that indeed favor link prediction (or knowledge graph completion) can result in a loss of semantic similarity. We propose a new evaluation task that aims at performing risk assessment on domain-specific categorized multi-relational datasets, designed as a classification problem based on the resulting embeddings. We assess the quality of embedding representations based on the synergy of the resulting clusters of target subjects. We show that more sophisticated embedding approaches do not necessarily favor embedding quality, and the traditional link prediction validation protocol is a weak metric to measure the quality of embedding representation. Finally, we present insights about using the synergy analysis to provide risk assessment explainability based on the probability distribution of feature-value pairs within embedded clusters
Local Component Analysis
International audienceKernel density estimation, a.k.a. Parzen windows, is a popular density estimation method, which can be used for outlier detection or clustering. With multivariate data, its performance is heavily reliant on the metric used within the kernel. Most earlier work has focused on learning only the bandwidth of the kernel (i.e., a scalar multiplicative factor). In this paper, we propose to learn a full Euclidean metric through an expectation-minimization (EM) procedure, which can be seen as an unsupervised counterpart to neighbourhood component analysis (NCA). In order to avoid overfitting with a fully nonparametric density estimator in high dimensions, we also consider a semi-parametric Gaussian-Parzen density model, where some of the variables are modelled through a jointly Gaussian density, while others are modelled through Parzen windows. For these two models, EM leads to simple closed-form updates based on matrix inversions and eigenvalue decompositions. We show empirically that our method leads to density estimators with higher test-likelihoods than natural competing methods, and that the metrics may be used within most unsupervised learning techniques that rely on such metrics, such as spectral clustering or manifold learning methods. Finally, we present a stochastic approximation scheme which allows for the use of this method in a large-scale setting