Model-based clustering using copulas with applications
The majority of model-based clustering techniques are based on multivariate normal models and their variants. In this paper, copulas are used for the construction of flexible families of models for clustering applications. The use of copulas in model-based clustering offers two direct advantages over current methods: (i) the appropriate choice of copulas provides the ability to obtain a range of exotic shapes for the clusters, and (ii) the explicit choice of marginal distributions for the clusters allows the modelling of multivariate data of various modes (either discrete or continuous) in a natural way. This paper introduces and studies the framework of copula-based finite mixture models for clustering applications. Estimation in the general case can be performed using standard EM, and, depending on the mode of the data, more efficient procedures are provided that can fully exploit the copula structure. The closure properties of the mixture models under marginalization are discussed, and for continuous, real-valued data parametric rotations in the sample space are introduced, with a parallel discussion on parameter identifiability depending on the choice of copulas for the components. The exposition of the methodology is accompanied and motivated by the analysis of real and artificial data.
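To make the idea of a copula-based mixture component concrete, the following is a minimal sketch rather than the paper's implementation. It assumes a bivariate Gaussian copula, an exponential marginal for the first coordinate and a normal marginal for the second; the function names and parameter values are hypothetical. Each component density is the copula density evaluated at the marginal CDFs, multiplied by the marginal densities, and the responsibilities correspond to the E-step of a standard EM iteration.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, used for the copula's quantile transform


def gaussian_copula_density(u, v, rho):
    """Density of the bivariate Gaussian copula at (u, v), with 0 < u, v < 1."""
    a = _STD_NORMAL.inv_cdf(u)
    b = _STD_NORMAL.inv_cdf(v)
    det = 1.0 - rho * rho
    exponent = -(rho * rho * (a * a + b * b) - 2.0 * rho * a * b) / (2.0 * det)
    return math.exp(exponent) / math.sqrt(det)


def component_density(x, y, rho, rate, mu, sigma):
    """Joint density of one cluster: copula density times marginal densities.

    Assumed marginals (illustrative only): x ~ Exponential(rate), requiring x > 0,
    and y ~ Normal(mu, sigma).
    """
    fx = rate * math.exp(-rate * x)   # exponential pdf
    Fx = 1.0 - math.exp(-rate * x)    # exponential cdf
    normal_y = NormalDist(mu, sigma)
    copula = gaussian_copula_density(Fx, normal_y.cdf(y), rho)
    return copula * fx * normal_y.pdf(y)


def responsibilities(x, y, weights, params):
    """E-step for one observation: posterior probability of each component."""
    parts = [w * component_density(x, y, **p) for w, p in zip(weights, params)]
    total = sum(parts)
    return [part / total for part in parts]


# Hypothetical two-component mixture with different dependence and marginals.
weights = [0.4, 0.6]
params = [
    {"rho": 0.7, "rate": 1.0, "mu": 0.0, "sigma": 1.0},
    {"rho": -0.3, "rate": 0.2, "mu": 3.0, "sigma": 2.0},
]
print(responsibilities(1.5, 0.8, weights, params))
```

Swapping the copula family or either marginal changes the cluster shape without touching the rest of the mixture machinery, which is the flexibility the abstract describes.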
Bayesian Structural Learning with Parametric Marginals for Count Data: An Application to Microbiota Systems
High dimensional and heterogeneous count data are collected in various
applied fields. In this paper, we look closely at high-resolution sequencing
data on the microbiome, which have enabled researchers to study the genomes of
entire microbial communities. Revealing the underlying interactions between
these communities is of vital importance to learn how microbes influence human
health. To perform structural learning from multivariate count data such as
these, we develop a novel Gaussian copula graphical model with two key
elements. Firstly, we employ parametric regression to characterize the marginal
distributions. This step is crucial for accommodating the impact of external
covariates. Neglecting this adjustment could potentially introduce distortions
in the inference of the underlying network of dependences. Secondly, we advance
a Bayesian structure learning framework, based on a computationally efficient
search algorithm that is suited to high dimensionality. The approach returns
simultaneous inference of the marginal effects and of the dependence structure,
including graph uncertainty estimates. A simulation study and a real data
analysis of microbiome data highlight the applicability of the proposed
approach at inferring networks from multivariate count data in general, and its
relevance to microbiome analyses in particular. The proposed method is
implemented in the R package BDgraph.
Risk Management Lessons from the Global Financial Crisis for Derivative Exchanges
During the global financial turmoil of 2007 and 2008, no major derivative clearing house in the world encountered distress while many banks were pushed to the brink and beyond. An important reason for this is that derivative exchanges have avoided using value at risk, normal distributions and linear correlations. This is an important lesson. The global financial crisis has also taught us that in risk management, robustness is more important than sophistication and that it is dangerous to use models that are over-calibrated to short time series of market prices. The paper applies these lessons to the important exchange-traded derivatives in India and recommends major changes to the current margining systems to improve their robustness. It also discusses directions in which global best practices in exchange risk management could be improved to take advantage of recent advances in computing power and finance theory. The paper argues that risk management should evolve towards explicit models based on coherent risk measures (like expected shortfall), fat-tailed distributions and nonlinear dependence structures (copulas).
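The contrast between value at risk and expected shortfall can be illustrated with a small empirical sketch. This is not the paper's margining method; it assumes one simple historical-simulation convention (other quantile conventions exist), and the function names and loss figures are hypothetical.

```python
import math


def value_at_risk(losses, alpha):
    """Empirical VaR: the smallest loss with at least a fraction alpha of losses at or below it."""
    ordered = sorted(losses)
    k = math.ceil(alpha * len(ordered)) - 1
    return ordered[k]


def expected_shortfall(losses, alpha):
    """Empirical ES: the average of the losses at or beyond the VaR position."""
    ordered = sorted(losses)
    k = math.ceil(alpha * len(ordered)) - 1
    tail = ordered[k:]
    return sum(tail) / len(tail)


# Hypothetical daily losses: two books with the same VaR but different tails.
thin_tail = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
fat_tail = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 60.0]

for name, book in (("thin", thin_tail), ("fat", fat_tail)):
    print(name, value_at_risk(book, 0.9), expected_shortfall(book, 0.9))
```

Under this convention both books report the same 90% VaR, while expected shortfall differs because it averages over the tail beyond that point.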
Cumulative Distribution Functions As The Foundation For Probabilistic Models
This thesis discusses applications of probabilistic and connectionist models for
constructing and training cumulative distribution functions (CDFs). First, it is shown
how existing tools from the copula literature can be combined to build probabilistic
models. It is found that this simple construction leads to numerical and scalability
issues that make training and inference challenging.
Next, several innovative ideas combining neural networks, automatic differentiation
and copula functions show how to assemble black-box probabilistic
models. The basic building block is a cumulative distribution function that is straightforward
to construct, composed of arithmetic operations and nonlinear functions.
There is no need to assume any specific parametric probability density function
(PDF), making the model flexible and normalisation unnecessary. The only requirement
is to design a computational graph that parameterises monotonically
non-decreasing functions with a constrained range. Training can then be performed
using standard tools from any neural network software library.
Finally, factorial hidden Markov models (FHMMs) for sequential data are
presented. It is shown how to leverage cumulative distribution functions in the
form of the Gaussian copula and an amortised stochastic variational method to encode
hidden Markov chains coherently. This approach enables efficient learning and
inference to model long sequences of high-dimensional data with long-range dependencies.
Tackling such complex problems was impossible with the established
FHMM approximate inference algorithm.
It is empirically verified on several problems that some of the estimators introduced
in this work can perform comparably to or better than the currently popular
models, especially for tasks requiring tail-area or marginal probabilities that can be
read directly from a cumulative distribution function.
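As a rough illustration of a computational graph that parameterises a monotonically non-decreasing function with a constrained range, the sketch below uses a softmax-weighted sum of sigmoids with positive slopes. This particular architecture is an assumption chosen for brevity, not the construction used in the thesis, and all names and parameter values are hypothetical. Positive slopes guarantee monotonicity, and the softmax weights keep the output between 0 and 1 with the correct limits, so no normalisation step is needed.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def softmax(logits):
    """Map unconstrained logits to positive weights that sum to one."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def monotone_cdf(x, log_slopes, shifts, mix_logits):
    """F(x) = sum_k w_k * sigmoid(exp(a_k) * x + b_k), with w = softmax(mix_logits).

    exp(a_k) > 0 makes every term non-decreasing in x; the weights sum to one,
    so F tends to 0 as x -> -inf and to 1 as x -> +inf.
    """
    weights = softmax(mix_logits)
    return sum(
        w * sigmoid(math.exp(a) * x + b)
        for w, a, b in zip(weights, log_slopes, shifts)
    )


def tail_probability(t, log_slopes, shifts, mix_logits):
    """P(X > t), read directly from the CDF."""
    return 1.0 - monotone_cdf(t, log_slopes, shifts, mix_logits)


# Hypothetical, untrained parameters; all are unconstrained real numbers.
log_slopes = [0.0, 1.0, -0.5]
shifts = [0.0, -2.0, 1.0]
mix_logits = [0.2, -0.1, 0.5]

print(monotone_cdf(0.5, log_slopes, shifts, mix_logits))
print(tail_probability(2.0, log_slopes, shifts, mix_logits))
```

Because every parameter is unconstrained, such a function could in principle be fitted with the gradient-based tools of a neural network library, and tail-area probabilities follow from a single evaluation.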
Appropriate, accessible and appealing probabilistic graphical models
Appropriate - Many multivariate probabilistic models either use independent distributions or dependent Gaussian distributions. Yet, many real-world datasets contain count-valued or non-negative skewed data, e.g. bag-of-words text data and biological sequencing data. Thus, we develop novel probabilistic graphical models for use on count-valued and non-negative data including Poisson graphical models and multinomial graphical models. We develop one generalization that allows for triple-wise or k-wise graphical models going beyond the normal pairwise formulation. Furthermore, we also explore Gaussian-copula graphical models and derive closed-form solutions for the conditional distributions and marginal distributions (both before and after conditioning). Finally, we derive mixture and admixture, or topic model, generalizations of these graphical models to introduce more power and interpretability.
Accessible - Previous multivariate models, especially related to text data, often have complex dependencies without a closed form and require complex inference algorithms that have limited theoretical justification. For example, hierarchical Bayesian models often require marginalizing over many latent variables. We show that our novel graphical models (even the k-wise interaction models) have simple and intuitive estimation procedures based on node-wise regressions that likely have similar theoretical guarantees as previous work in graphical models. For the copula-based graphical models, we show that simple approximations could still provide useful models; these copula models also come with closed-form conditional and marginal distributions, which make them amenable to exploratory inspection and manipulation. The parameters of these models are easy to interpret and thus may be accessible to a wide audience.
Appealing - High-level visualization and interpretation of graphical models with even 100 variables has often been difficult even for a graphical model expert, despite visualization being one of the original motivators for graphical models. This difficulty is likely due to the lack of collaboration between graphical model experts and visualization experts. To begin bridging this gap, we develop a novel "what if?" interaction that manipulates and leverages the probabilistic power of graphical models. Our approach defines: the probabilistic mechanism via conditional probability; the query language to map text input to a conditional probability query; and the formal underlying probabilistic model. We then propose to visualize these query-specific probabilistic graphical models by combining the intuitiveness of force-directed layouts with the beauty and readability of word clouds, which pack many words into valuable screen space while ensuring words do not overlap via pixel-level collision detection. Although both the force-directed layout and the pixel-level packing problems are challenging in their own right, we approximate both simultaneously via adaptive simulated annealing starting from careful initialization. For visualizing mixture distributions, we also design a meaningful mapping from the properties of the mixture distribution to a color in the perceptually uniform CIELUV color space. Finally, we demonstrate our approach via illustrative visualizations of several real-world datasets.
Computer Science