17,243 research outputs found
Semantic Information G Theory and Logical Bayesian Inference for Machine Learning
An important problem with machine learning is that when label number n\u3e2, it is very difficult to construct and optimize a group of learning functions, and we wish that optimized learning functions are still useful when prior distribution P(x) (where x is an instance) is changed. To resolve this problem, the semantic information G theory, Logical Bayesian Inference (LBI), and a group of Channel Matching (CM) algorithms together form a systematic solution. MultilabelMultilabel A semantic channel in the G theory consists of a group of truth functions or membership functions. In comparison with likelihood functions, Bayesian posteriors, and Logistic functions used by popular methods, membership functions can be more conveniently used as learning functions without the above problem. In Logical Bayesian Inference (LBI), every labelâs learning is independent. For Multilabel learning, we can directly obtain a group of optimized membership functions from a big enough sample with labels, without preparing different samples for different labels. A group of Channel Matching (CM) algorithms are developed for machine learning. For the Maximum Mutual Information (MMI) classification of three classes with Gaussian distributions on a two-dimensional feature space, 2-3 iterations can make mutual information between three classes and three labels surpass 99% of the MMI for most initial partitions. For mixture models, the Expectation-Maxmization (EM) algorithm is improved and becomes the CM-EM algorithm, which can outperform the EM algorithm when mixture ratios are imbalanced, or local convergence exists. The CM iteration algorithm needs to combine neural networks for MMI classifications on high-dimensional feature spaces. LBI needs further studies for the unification of statistics and logic
The Bernstein Function: A Unifying Framework of Nonconvex Penalization in Sparse Estimation
In this paper we study nonconvex penalization using Bernstein functions.
Since the Bernstein function is concave and nonsmooth at the origin, it can
induce a class of nonconvex functions for high-dimensional sparse estimation
problems. We derive a threshold function based on the Bernstein penalty and
give its mathematical properties in sparsity modeling. We show that a
coordinate descent algorithm is especially appropriate for penalized regression
problems with the Bernstein penalty. Additionally, we prove that the Bernstein
function can be defined as the concave conjugate of a -divergence and
develop a conjugate maximization algorithm for finding the sparse solution.
Finally, we particularly exemplify a family of Bernstein nonconvex penalties
based on a generalized Gamma measure and conduct empirical analysis for this
family
Sketching for Large-Scale Learning of Mixture Models
Learning parameters from voluminous data can be prohibitive in terms of
memory and computational requirements. We propose a "compressive learning"
framework where we estimate model parameters from a sketch of the training
data. This sketch is a collection of generalized moments of the underlying
probability distribution of the data. It can be computed in a single pass on
the training set, and is easily computable on streams or distributed datasets.
The proposed framework shares similarities with compressive sensing, which aims
at drastically reducing the dimension of high-dimensional signals while
preserving the ability to reconstruct them. To perform the estimation task, we
derive an iterative algorithm analogous to sparse reconstruction algorithms in
the context of linear inverse problems. We exemplify our framework with the
compressive estimation of a Gaussian Mixture Model (GMM), providing heuristics
on the choice of the sketching procedure and theoretical guarantees of
reconstruction. We experimentally show on synthetic data that the proposed
algorithm yields results comparable to the classical Expectation-Maximization
(EM) technique while requiring significantly less memory and fewer computations
when the number of database elements is large. We further demonstrate the
potential of the approach on real large-scale data (over 10 8 training samples)
for the task of model-based speaker verification. Finally, we draw some
connections between the proposed framework and approximate Hilbert space
embedding of probability distributions using random features. We show that the
proposed sketching operator can be seen as an innovative method to design
translation-invariant kernels adapted to the analysis of GMMs. We also use this
theoretical framework to derive information preservation guarantees, in the
spirit of infinite-dimensional compressive sensing
Probability density estimation with tunable kernels using orthogonal forward regression
A generalized or tunable-kernel model is proposed for probability density function estimation based on an orthogonal forward regression procedure. Each stage of the density estimation process determines a tunable kernel, namely, its center vector and diagonal covariance matrix, by minimizing a leave-one-out test criterion. The kernel mixing weights of the constructed sparse density estimate are finally updated using the multiplicative nonnegative quadratic programming algorithm to ensure the nonnegative and unity constraints, and this weight-updating process additionally has the desired ability to further reduce the model size. The proposed tunable-kernel model has advantages, in terms of model generalization capability and model sparsity, over the standard fixed-kernel model that restricts kernel centers to the training data points and employs a single common kernel variance for every kernel. On the other hand, it does not optimize all the model parameters together and thus avoids the problems of high-dimensional ill-conditioned nonlinear optimization associated with the conventional finite mixture model. Several examples are included to demonstrate the ability of the proposed novel tunable-kernel model to effectively construct a very compact density estimate accurately
Algorithms for nonnegative matrix factorization with the beta-divergence
This paper describes algorithms for nonnegative matrix factorization (NMF)
with the beta-divergence (beta-NMF). The beta-divergence is a family of cost
functions parametrized by a single shape parameter beta that takes the
Euclidean distance, the Kullback-Leibler divergence and the Itakura-Saito
divergence as special cases (beta = 2,1,0, respectively). The proposed
algorithms are based on a surrogate auxiliary function (a local majorization of
the criterion function). We first describe a majorization-minimization (MM)
algorithm that leads to multiplicative updates, which differ from standard
heuristic multiplicative updates by a beta-dependent power exponent. The
monotonicity of the heuristic algorithm can however be proven for beta in (0,1)
using the proposed auxiliary function. Then we introduce the concept of
majorization-equalization (ME) algorithm which produces updates that move along
constant level sets of the auxiliary function and lead to larger steps than MM.
Simulations on synthetic and real data illustrate the faster convergence of the
ME approach. The paper also describes how the proposed algorithms can be
adapted to two common variants of NMF : penalized NMF (i.e., when a penalty
function of the factors is added to the criterion function) and convex-NMF
(when the dictionary is assumed to belong to a known subspace).Comment: \`a para\^itre dans Neural Computatio
- âŠ