Clustering and cluster inference of complex data structures
Finite mixtures provide a flexible and powerful tool for fitting univariate and multivariate distributions that cannot be captured by standard statistical distributions. In particular, multivariate mixtures have been widely used for modelling and cluster analysis of high-dimensional data in a wide range of applications. Modes of mixture densities have been used with great success for organizing mixture components into homogeneous groups, but these results are limited to normal mixtures. Beyond the clustering application, existing research in this area has provided fundamental results on the upper bound of the number of modes, but these too are limited to normal mixtures.
This thesis provides new modality theorems and important analytical results on the upper bound of the number of modes for multivariate t-mixtures and compares them with existing results on normal mixtures. Graphical tools for merging t-mixtures and the effect of degrees-of-freedom are also thoroughly examined.
The most important contributions of this thesis are a set of fundamental results on the modality of skew normal distributions and skew normal mixtures. First, we show that the topography of high-dimensional skew normal mixtures can be analyzed rigorously in lower dimensions by defining the corresponding ridgeline manifold, which contains all critical points as well as the ridges of the density. Unlike for normal or t-mixtures, however, an implicit equation must be solved to obtain this manifold. The plot of the elevations on the ridgeline can still be used to develop tools for exploring the number of modes and for merging mixture components. Although analytical results on the number of modes are no longer available, the elevation plots lead to a new conjecture on the upper bound on the number of modes of skew normal mixtures.
Unlike the normal and t-distributions, skew normal distributions have very interesting modal features even in the one-component case. First, as the modes cannot be written in closed form, we design and provide software tools to calculate the modes in any dimension. We also provide a thorough study of the relationship between the means and modes of skew normals, together with fundamental results on the limiting behaviour of the mean and mode as the skewness parameter increases. A further new result shows that, although the mean can vary widely as the skewness parameter varies, the mode is a much more robust measure of central tendency, as the mode of a skew normal distribution varies only within a smaller range.
Two R packages, available on GitHub, are provided as part of this thesis; they contain the numerical tools for calculating the modes of skew normals and functions specific to merging skew normal components. Additionally, the merging tool developed for skew normal mixtures is demonstrated on flow cytometry data.
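The mean-versus-mode behaviour described in this abstract can be illustrated in the simplest setting. The following is a minimal sketch, assuming the standard univariate skew normal density 2φ(x)Φ(αx) with zero location and unit scale; it is not the thesis's R software, and the function names are hypothetical. The mean has the closed form δ√(2/π) with δ = α/√(1+α²), while the mode is located numerically by a golden-section search, which relies on the density being unimodal.

```python
import math


def skew_normal_pdf(x, alpha):
    """Standard univariate skew normal density 2 * phi(x) * Phi(alpha * x)."""
    phi = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    big_phi = 0.5 * (1.0 + math.erf(alpha * x / math.sqrt(2.0)))
    return 2.0 * phi * big_phi


def skew_normal_mean(alpha):
    """Closed-form mean: delta * sqrt(2 / pi), with delta = alpha / sqrt(1 + alpha^2)."""
    delta = alpha / math.sqrt(1.0 + alpha * alpha)
    return delta * math.sqrt(2.0 / math.pi)


def skew_normal_mode(alpha, lo=-5.0, hi=5.0, tol=1e-10):
    """Numerical mode via golden-section search on [lo, hi].

    Assumes the density is unimodal and that the mode lies in [lo, hi].
    Negative alpha is handled by symmetry: mode(-alpha) = -mode(alpha).
    """
    if alpha < 0:
        return -skew_normal_mode(-alpha, lo, hi, tol)
    ratio = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c = b - ratio * (b - a)
    d = a + ratio * (b - a)
    while b - a > tol:
        if skew_normal_pdf(c, alpha) > skew_normal_pdf(d, alpha):
            b = d  # maximum lies in [a, d]
        else:
            a = c  # maximum lies in [c, b]
        c = b - ratio * (b - a)
        d = a + ratio * (b - a)
    return 0.5 * (a + b)


# Compare mean and mode over a range of skewness values.
for alpha in (0.0, 1.0, 2.0, 5.0, 20.0, 100.0):
    print(alpha, skew_normal_mean(alpha), skew_normal_mode(alpha))
```

In this standardized case the mean increases towards √(2/π) as α grows, whereas the numerically computed mode stays in a narrower band and eventually moves back towards zero. The multivariate results of the thesis are not reproduced by this sketch.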
Flexible modelling in statistics: past, present and future
In times where more and more data become available and where the data exhibit
rather complex structures (significant departure from symmetry, heavy or light
tails), flexible modelling has become an essential task for statisticians as
well as researchers and practitioners from domains such as economics, finance
or environmental sciences. This is reflected by the wealth of existing
proposals for flexible distributions; well-known examples are Azzalini's
skew-normal, Tukey's g-and-h, mixture and two-piece distributions, to name
but a few. My aim in the present paper is to provide an introduction to this
research field, intended to be useful both for novices and professionals of the
domain. After a description of the research stream itself, I will narrate the
gripping history of flexible modelling, starring emblematic heroes from the
past such as Edgeworth and Pearson, then depict three of the most used flexible
families of distributions, and finally provide an outlook on future flexible
modelling research by posing challenging open questions. Comment: 27 pages, 4 figures
EMMIXcskew: an R Package for the Fitting of a Mixture of Canonical Fundamental Skew t-Distributions
This paper presents an R package EMMIXcskew for the fitting of the canonical
fundamental skew t-distribution (CFUST) and finite mixtures of this
distribution (FM-CFUST) via maximum likelihood (ML). The CFUST distribution
provides a flexible family of models to handle non-normal data, with parameters
for capturing skewness and heavy-tails in the data. It formally encompasses the
normal, t, and skew-normal distributions as special and/or limiting cases. A
few other versions of the skew t-distributions are also nested within the CFUST
distribution. In this paper, an Expectation-Maximization (EM) algorithm is
described for computing the ML estimates of the parameters of the FM-CFUST
model, and different strategies for initializing the algorithm are discussed
and illustrated. The methodology is implemented in the EMMIXcskew package, and
examples are presented using two real datasets. The EMMIXcskew package contains
functions to fit the FM-CFUST model, including procedures for generating
different initial values. Additional features include random sample generation
and contour visualization in 2D and 3D.
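To illustrate the general shape of an EM fit for a finite mixture, here is a minimal sketch for a much simpler model: a two-component univariate normal mixture. It is not the FM-CFUST algorithm of the paper and does not use the EMMIXcskew API; the function names and the quartile-based initialization are assumptions made purely for illustration.

```python
import math
import random


def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def em_two_normals(data, n_iter=200):
    """EM for a two-component univariate normal mixture (illustrative only)."""
    xs = sorted(data)
    n = len(xs)
    # Crude initialization (an assumption): quartiles as starting means,
    # the overall standard deviation for both components, equal weights.
    mu = [xs[n // 4], xs[(3 * n) // 4]]
    mean_all = sum(xs) / n
    sd_all = math.sqrt(sum((x - mean_all) ** 2 for x in xs) / n)
    sigma = [sd_all, sd_all]
    weight = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: posterior probability that each point belongs to each component.
        resp = []
        for x in xs:
            w = [weight[k] * normal_pdf(x, mu[k], sigma[k]) for k in range(2)]
            total = w[0] + w[1]
            if total == 0.0:
                resp.append([0.5, 0.5])  # guard against numerical underflow
            else:
                resp.append([w[0] / total, w[1] / total])
        # M-step: responsibility-weighted updates of weights, means and scales.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weight[k] = nk / n
            mu[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, xs)) / nk
            sigma[k] = math.sqrt(max(var, 1e-12))
    return weight, mu, sigma


# Usage on simulated data with a fixed seed.
rng = random.Random(0)
sample = [rng.gauss(-2.0, 1.0) for _ in range(300)]
sample += [rng.gauss(3.0, 1.0) for _ in range(300)]
weight, mu, sigma = em_two_normals(sample)
print(weight, mu, sigma)
```

The same E-step and M-step alternation underlies EM for richer component families, but the updates for skew t components involve additional conditional expectations that this sketch does not attempt to reproduce.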
Bayesian modelling of skewness and kurtosis with two-piece scale and shape distributions
We formalise and generalise the definition of the family of univariate double
two-piece distributions, obtained by using a density-based transformation of
unimodal symmetric continuous distributions with a shape parameter. The
resulting distributions contain five interpretable parameters that control the
mode, as well as the scale and shape in each direction. Four-parameter
subfamilies of this class of distributions that capture different types of
asymmetry are discussed. We propose interpretable scale and location-invariant
benchmark priors and derive conditions for the propriety of the corresponding
posterior distribution. The prior structures used allow for meaningful
comparisons through Bayes factors within flexible families of distributions.
These distributions are applied to data from finance, internet traffic and
medicine, comparing them with appropriate competitors.
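The idea of separate scales on each side of the mode can be sketched with the simplest member of this kind of construction, the two-piece normal density. This is a simplification and not the paper's five-parameter double two-piece family: only the scale differs across the mode, there is no shape parameter, and the parameter names below are hypothetical.

```python
import math


def two_piece_normal_pdf(x, mu, sigma_left, sigma_right):
    """Two-piece normal density with mode mu and a separate scale on each side.

    The factor 2 / (sigma_left + sigma_right) makes the density continuous
    at mu and makes it integrate to one.
    """
    sigma = sigma_left if x < mu else sigma_right
    z = (x - mu) / sigma
    phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return 2.0 * phi / (sigma_left + sigma_right)


def trapezoid(f, lo, hi, n=20000):
    """Composite trapezoid rule, used here only as a numerical sanity check."""
    h = (hi - lo) / n
    total = 0.5 * (f(lo) + f(hi))
    for i in range(1, n):
        total += f(lo + i * h)
    return total * h


# Example: a right-skewed density with mode 1, left scale 0.5, right scale 2.
def density(x):
    return two_piece_normal_pdf(x, 1.0, 0.5, 2.0)


total_mass = trapezoid(density, -10.0, 25.0)
left_mass = trapezoid(density, -10.0, 1.0)
print(total_mass, left_mass)
```

Under this parameterization the probability mass to the left of the mode is sigma_left / (sigma_left + sigma_right), so the ratio of the two scales directly controls the asymmetry.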