Sharp analysis of low-rank kernel matrix approximations
We consider supervised learning problems within the positive-definite kernel
framework, such as kernel ridge regression, kernel logistic regression or the
support vector machine. With kernels leading to infinite-dimensional feature
spaces, a common practical limiting difficulty is the necessity of computing
the kernel matrix, which most frequently leads to algorithms with running time
at least quadratic in the number of observations n, i.e., O(n^2). Low-rank
approximations of the kernel matrix are often considered as they allow the
reduction of running time complexities to O(p^2 n), where p is the rank of the
approximation. The practicality of such methods thus depends on the required
rank p. In this paper, we show that in the context of kernel ridge regression,
for approximations based on a random subset of columns of the original kernel
matrix, the rank p may be chosen to be linear in the degrees of freedom
associated with the problem, a quantity which is classically used in the
statistical analysis of such methods, and is often seen as the implicit number
of parameters of non-parametric estimators. This result enables simple
algorithms that have sub-quadratic running time complexity, but provably
exhibit the same predictive performance as existing algorithms, for any given
problem instance, and not only for worst-case situations.
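
To make the column-sampling idea concrete, here is a minimal sketch, not the paper's own code. It assumes a Gaussian (RBF) kernel, uniform sampling of p columns, and the usual definition of the degrees of freedom as tr(K (K + n*lam*I)^{-1}). The function names `rbf_kernel`, `degrees_of_freedom`, and `nystrom_krr_fit` are hypothetical and chosen for illustration only.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # Gaussian kernel exp(-gamma * ||a - b||^2); the kernel choice is an assumption.
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def degrees_of_freedom(K, lam):
    # d = tr(K (K + n*lam*I)^{-1}), computed from the full kernel matrix.
    n = K.shape[0]
    return np.trace(np.linalg.solve(K + n * lam * np.eye(n), K))

def nystrom_krr_fit(X, y, p, lam, gamma=1.0, seed=0):
    # Fitted values of kernel ridge regression when K is replaced by the
    # low-rank approximation K[:, I] K[I, I]^{-1} K[I, :], with I a uniformly
    # sampled set of p column indices.
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    idx = rng.choice(n, size=p, replace=False)
    K_nI = rbf_kernel(X, X[idx], gamma)      # n x p block of columns
    K_II = K_nI[idx]                         # p x p block
    w, V = np.linalg.eigh(K_II)
    w = np.maximum(w, 1e-10)                 # guard against tiny eigenvalues
    Phi = K_nI @ (V / np.sqrt(w))            # n x p, Phi @ Phi.T is the approximation
    A = Phi.T @ Phi + n * lam * np.eye(p)    # p x p system, O(p^2 n) to form
    beta = np.linalg.solve(A, Phi.T @ y)
    return Phi @ beta
```

Only the n x p block of the kernel matrix is ever formed, which is where the O(p^2 n) cost comes from. In this sketch `degrees_of_freedom` still needs the full matrix, so it is useful for checking a small example rather than for choosing p at scale.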
Naive Bayes Data Complexity and Characterization of Optima of the Unsupervised Expected Likelihood
The naive Bayes model is a simple model that has been used for many decades, often as a baseline, for both supervised and unsupervised learning. With a latent class variable it is one of the simplest latent variable models, and is often used for clustering. The estimation of its parameters by maximum likelihood (e.g., using gradient ascent or expectation maximization) is subject to local optima since the objective is non-concave. However, the conditions under which global optimality can be guaranteed are currently unknown. I provide a first characterization of the optima of the naive Bayes model. For problems with up to three features, I describe comprehensive conditions that ensure global optimality. For more than three features, I show that all stationary points exhibit marginal distributions with respect to the features that match those of the training data. In a second line of work, I consider the naive Bayes model with an observed class variable, which is often used for classification. Well-known results provide some upper bounds on the order of the sample complexity for agnostic PAC learning; however, exact bounds are unknown. Exact bounds would show precisely how much data is needed for model training using a particular algorithm. I detail the framework for determining an exact tight bound on sample complexity, and prove some of the sub-theorems that this framework rests on. I also provide some insight into the nature of the distributions that are hardest to model within specified accuracy parameters.
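
The marginal-matching property of stationary points can be checked numerically on a small example. The sketch below is not the author's analysis. It assumes binary features and fits the latent-class naive Bayes model by expectation maximization; the function name `em_naive_bayes` and all parameter names are hypothetical. After each M-step, the model's feature marginals, sum over classes c of pi[c] * theta[c, j], equal the empirical feature means by construction.

```python
import numpy as np

def em_naive_bayes(X, k, n_iter=50, seed=0):
    # EM for a naive Bayes mixture over binary features (an assumption).
    # pi[c] is the class prior, theta[c, j] = P(x_j = 1 | class c).
    rng = np.random.default_rng(seed)
    n, d = X.shape
    pi = np.full(k, 1.0 / k)
    theta = rng.uniform(0.25, 0.75, size=(k, d))
    for _ in range(n_iter):
        # E-step: posterior responsibilities r[i, c], computed in log space.
        log_post = (X @ np.log(theta).T
                    + (1 - X) @ np.log(1 - theta).T
                    + np.log(pi))
        log_post -= log_post.max(axis=1, keepdims=True)
        r = np.exp(log_post)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: responsibility-weighted counts.
        nk = r.sum(axis=0)
        pi = nk / n
        theta = (r.T @ X) / nk[:, None]
        theta = np.clip(theta, 1e-9, 1 - 1e-9)  # keep the logs finite
    return pi, theta
```

Comparing `pi @ theta` with `X.mean(axis=0)` after fitting shows the two agree up to the clipping tolerance, whichever local optimum the run converges to.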