thesis

Copula models in machine learning

Abstract

The introduction of copulas, which allow separating the dependence structure of a multivariate distribution from its marginal behaviour, was a major advance in dependence modelling. Copulas brought new theoretical insights to the concept of dependence and enabled the construction of a variety of new multivariate distributions. Despite their popularity in statistics and financial modelling, copulas remained largely unknown in the machine learning community until recently. This thesis investigates the use of copula models, in particular Gaussian copulas, for solving various machine learning problems and makes contributions in the domains of dependence detection between datasets, compression based on side information, and variable selection. Our first contribution is the introduction of a copula mixture model to perform dependency-seeking clustering for co-occurring samples from different data sources. The model takes advantage of the flexibility offered by the copula framework to extend mixtures of Canonical Correlation Analyzers to multivariate data with arbitrary continuous marginal densities. We formulate our model as a non-parametric Bayesian mixture and provide an efficient Markov chain Monte Carlo inference algorithm for it. Experiments on real and synthetic data demonstrate that the increased flexibility of the copula mixture significantly improves the quality of the clustering and the interpretability of the results. The second contribution is a reformulation of the information bottleneck (IB) problem in terms of a copula, using the equivalence between mutual information and negative copula entropy. Focusing on the Gaussian copula, we extend the analytical IB solution available for the multivariate Gaussian case to meta-Gaussian distributions, which retain a Gaussian dependence structure but allow arbitrary marginal densities. The resulting approach extends the range of applicability of IB to non-Gaussian continuous data and is less sensitive to outliers than the original IB formulation. Our third and final contribution is the development of a novel sparse compression technique based on the IB principle, which takes into account side information. We achieve this by introducing a sparse variant of IB that compresses the data by preserving the information in only a few selected input dimensions. By assuming a Gaussian copula, we can capture arbitrary non-Gaussian marginals, continuous or discrete. We use our model to select a subset of biomarkers relevant to the evolution of malignant melanoma and show that our sparse selection provides reliable predictors.
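
The equivalence between mutual information and negative copula entropy mentioned above can be illustrated with a small sketch. It is not taken from the thesis: the function names `normal_scores` and `gaussian_copula_mi` are hypothetical, and the rank-based normal-scores estimator is only one common way to fit a Gaussian copula. Under a Gaussian copula with correlation matrix R, the mutual information between two blocks of variables reduces to 0.5 * (log det R_x + log det R_y - log det R), which depends on the data only through ranks and is therefore unaffected by strictly increasing transformations of the marginals.

```python
import numpy as np
from scipy.stats import norm, rankdata


def normal_scores(x):
    """Map each column of x to approximate standard-normal scores via its ranks."""
    x = np.asarray(x, dtype=float)
    n = x.shape[0]
    ranks = np.apply_along_axis(rankdata, 0, x)  # ranks 1..n within each column
    u = ranks / (n + 1.0)                        # pseudo-observations in (0, 1)
    return norm.ppf(u)


def gaussian_copula_mi(x, y):
    """Mutual information (in nats) between x and y, ASSUMING a Gaussian copula.

    x and y are arrays of shape (n_samples, n_features).
    """
    zx, zy = normal_scores(x), normal_scores(y)
    p = zx.shape[1]
    r = np.corrcoef(np.hstack([zx, zy]), rowvar=False)  # copula correlation estimate
    _, logdet_joint = np.linalg.slogdet(r)
    _, logdet_x = np.linalg.slogdet(r[:p, :p])
    _, logdet_y = np.linalg.slogdet(r[p:, p:])
    return 0.5 * (logdet_x + logdet_y - logdet_joint)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    rho = 0.8
    z = rng.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]], size=5000)
    # Non-Gaussian marginals obtained by strictly increasing transformations.
    x = np.exp(z[:, :1])
    y = z[:, 1:] ** 3
    print("estimated MI:", gaussian_copula_mi(x, y))
    print("Gaussian reference value:", -0.5 * np.log(1.0 - rho**2))
```

Because the estimate uses only ranks, applying it to the transformed data gives the same value as applying it to the underlying Gaussian sample, which is the sense in which the dependence structure is separated from the marginals.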
