Detecting spatial patterns with the cumulant function. Part I: The theory
In climate studies, detecting spatial patterns that deviate strongly from the
sample mean remains a statistical challenge. Although Principal
Component Analysis (PCA), or equivalently an Empirical Orthogonal Function
(EOF) decomposition, is often applied for this purpose, it can only provide
meaningful results if the underlying multivariate distribution is Gaussian.
Indeed, PCA is based on optimizing second-order moment quantities, and the
covariance matrix can only capture the full dependence structure for
multivariate Gaussian vectors. Whenever the application at hand cannot satisfy
this normality hypothesis (e.g. precipitation data), alternatives and/or
improvements to PCA have to be developed and studied. To go beyond this
second-order constraint, which limits the applicability of PCA, we take
advantage of the cumulant function, which carries higher-order moment
information. This cumulant function, well known in the statistical literature,
allows us to propose a new, simple and fast procedure to identify spatial
patterns for non-Gaussian data. Our algorithm consists of maximizing the
cumulant function. To illustrate our approach, we implement it on three
families of multivariate random vectors for which explicit computations are
available. In addition, we show that our algorithm corresponds to selecting
the directions along which the projected data display the largest spread over
the marginal probability density tails.
Comment: 9 pages, 3 figures
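The idea of maximizing the cumulant function over directions can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the empirical cumulant function log mean(exp(s * theta.x)) on centred data, restricts the search to unit vectors in two dimensions, and uses a plain grid over angles. The names `empirical_cumulant`, `max_cumulant_direction_2d`, the scale `s` and the toy data are all assumptions made for the example.

```python
import numpy as np

def empirical_cumulant(X, theta, s=0.5):
    """Empirical cumulant function log mean(exp(s * theta.x)) of centred data.

    X is an (n, d) array, theta a unit vector of length d, s a fixed scale
    (an assumed tuning constant, not a value taken from the abstract).
    """
    z = s * ((X - X.mean(axis=0)) @ theta)
    m = z.max()  # log-sum-exp shift for numerical stability
    return m + np.log(np.mean(np.exp(z - m)))

def max_cumulant_direction_2d(X, s=0.5, n_angles=360):
    """Grid search over unit directions in the plane (2D data only)."""
    angles = np.linspace(0.0, 2.0 * np.pi, n_angles, endpoint=False)
    dirs = np.column_stack([np.cos(angles), np.sin(angles)])
    values = np.array([empirical_cumulant(X, d, s) for d in dirs])
    best = int(np.argmax(values))
    return dirs[best], values[best]

# Toy data: a right-skewed first coordinate and a narrow Gaussian second one.
rng = np.random.default_rng(0)
n = 20000
X = np.column_stack([
    rng.gamma(shape=4.0, scale=0.5, size=n),  # skewed, variance 1
    rng.normal(0.0, 0.5, size=n),             # symmetric, variance 0.25
])
direction, value = max_cumulant_direction_2d(X)
```

On this toy data the selected direction should point along the positive first axis, where the projected data have the heaviest tail. Unlike PCA, the sign of the direction matters here, because the cumulant function is not symmetric under theta -> -theta for skewed data.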
Robust Orthogonal Complement Principal Component Analysis
Recently, the robustification of principal component analysis has attracted
considerable attention from statisticians, engineers and computer scientists. In
this work we study the type of outliers that are not necessarily apparent in
the original observation space but can seriously affect the principal subspace
estimation. Based on a mathematical formulation of such transformed outliers, a
novel robust orthogonal complement principal component analysis (ROC-PCA) is
proposed. The framework combines the popular sparsity-enforcing and low-rank
regularization techniques to deal with row-wise outliers as well as
element-wise outliers. A non-asymptotic oracle inequality guarantees the
accuracy and high breakdown performance of ROC-PCA in finite samples. To tackle
the computational challenges, an efficient algorithm is developed on the basis
of Stiefel manifold optimization and iterative thresholding. Furthermore, a
batch variant is proposed to significantly reduce the cost in ultra-high
dimensions. The paper also points out a pitfall in the common practice of SVD
reduction in robust PCA. Experiments show the effectiveness and efficiency of
ROC-PCA on both synthetic and real data.
Statistical inference with anchored Bayesian mixture of regressions models: A case study analysis of allometric data
We present a case study in which we use a mixture of regressions model to
improve on an ill-fitting simple linear regression model relating log brain
mass to log body mass for 100 placental mammalian species. The slope of this
regression model is of particular scientific interest because it corresponds to
a constant that governs a hypothesized allometric power law relating brain mass
to body mass. A specific line of investigation is to determine whether the
regression parameters vary across subgroups of related species.
We model these data using an anchored Bayesian mixture of regressions model,
which modifies the standard Bayesian Gaussian mixture by pre-assigning small
subsets of observations to given mixture components with probability one. These
observations (called anchor points) break the relabeling invariance typical of
exchangeable model specifications (the so-called label-switching problem). A
careful choice of which observations to pre-classify to which mixture
components is key to the specification of a well-fitting anchor model.
In this article, we compare three strategies for the selection of anchor
points. The first assumes that the underlying mixture of regressions model
holds and assigns anchor points to different components to maximize the
information about their labeling. The second makes no assumption about the
relationship between x and y and instead identifies anchor points using a
bivariate Gaussian mixture model. The third strategy begins with the assumption
that there is only one mixture regression component and identifies anchor
points that are representative of a clustering structure based on case-deletion
importance sampling weights. We compare the performance of the three strategies
on the allometric data set and use auxiliary taxonomic information about the
species to evaluate the model-based classifications estimated from these
models.
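The effect of anchoring on component membership can be sketched as follows. This is only an illustration of one conditional step, assuming Gaussian errors and fixed parameter values. It is not the authors' sampler, and the function name `allocation_probs`, the parameter values and the toy data are hypothetical.

```python
import numpy as np

def allocation_probs(x, y, intercepts, slopes, sigmas, weights, anchors):
    """Membership probabilities in a K-component mixture of regressions.

    anchors maps an observation index to the component it is pre-assigned
    to with probability one (the anchor points).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    mu = intercepts[None, :] + slopes[None, :] * x[:, None]  # (n, K) means
    log_dens = (-0.5 * ((y[:, None] - mu) / sigmas) ** 2
                - np.log(sigmas) - 0.5 * np.log(2.0 * np.pi))
    log_post = np.log(weights) + log_dens
    log_post -= log_post.max(axis=1, keepdims=True)  # stabilise the exponent
    probs = np.exp(log_post)
    probs /= probs.sum(axis=1, keepdims=True)
    # Anchored observations keep their pre-assigned label with probability one.
    for i, k in anchors.items():
        probs[i] = 0.0
        probs[i, k] = 1.0
    return probs

# Hypothetical two-component example on a log-log scale.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.1, 0.8, 1.4, 3.1])
probs = allocation_probs(
    x, y,
    intercepts=np.array([0.0, 0.0]),
    slopes=np.array([0.7, 1.0]),
    sigmas=np.array([0.2, 0.2]),
    weights=np.array([0.5, 0.5]),
    anchors={0: 1},  # observation 0 is anchored to component 1
)
```

Because the anchored rows are fixed, swapping the component labels no longer leaves the model unchanged, which is how anchor points break the label-switching symmetry.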
Detecting spatial patterns with the cumulant function. Part II: An application to El Nino
The spatial coherence of a measured variable (e.g. temperature or pressure)
is often studied to determine the regions where this variable varies the most
or to find teleconnections, i.e. correlations between specific regions. While
usual methods to find spatial patterns, such as Principal Components Analysis
(PCA), are constrained by linear symmetries, the dependence of variables such
as temperature or pressure at different locations is generally nonlinear. In
particular, large deviations from the sample mean are expected to be strongly
affected by such nonlinearities. Here we apply a newly developed nonlinear
technique (Maxima of Cumulant Function, MCF) for the detection of typical
spatial patterns that deviate strongly from the mean. To test the
technique and to introduce the methodology, we focus on the El Nino/Southern
Oscillation and its spatial patterns. We find nonsymmetric temperature patterns
corresponding to El Nino and La Nina, and we compare the results of MCF with
other techniques, such as the symmetric solutions of PCA, and the nonsymmetric
solutions of Nonlinear PCA (NLPCA). We find that MCF solutions are more
reliable than the NLPCA fits, and can capture mixtures of principal components.
Finally, we apply Extreme Value Theory to the temporal variations extracted
with our methodology. We find that the tail of the distribution of extreme
temperatures during La Nina episodes is bounded, while the tail during El Nino
episodes is less likely to be bounded. This implies that both the mean spatial
patterns of the two phases and the behaviour of their extremes are asymmetric.
Comment: 15 pages, 7 figures
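Whether a tail is bounded can be read from the shape parameter of an extreme value model. The abstract does not say which model is fitted, so the following uses one standard formulation, the generalized Pareto distribution for exceedances over a high threshold. The symbols u (threshold), sigma (scale) and xi (shape) are notation introduced here.

```latex
\[
  P(X - u \le y \mid X > u) \;\approx\;
  H_{\xi,\sigma}(y) \;=\; 1 - \left(1 + \frac{\xi\, y}{\sigma}\right)_{+}^{-1/\xi},
  \qquad y \ge 0,\quad \sigma > 0 .
\]
% xi < 0 : bounded tail, with finite upper endpoint  u - sigma / xi
% xi = 0 : limit case  H(y) = 1 - exp(-y / sigma)   (unbounded, light tail)
% xi > 0 : unbounded, heavy (power-law) tail
```

In this formulation, a bounded tail corresponds to a negative shape parameter, and a tail that is "less likely to be bounded" corresponds to an estimate whose confidence interval includes zero or positive values.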
What are the Best Hierarchical Descriptors for Complex Networks?
This work reviews several hierarchical measurements of the topology of
complex networks and then applies feature selection concepts and methods in
order to quantify the relative importance of each measurement with respect to
the discrimination between four representative theoretical network models,
namely Erdős-Rényi, Barabási-Albert, Watts-Strogatz as well as a
geographical type of network. The results confirm that the four
models can be well separated by using a combination of measurements. In
addition, the relative contribution of each considered feature for the overall
discrimination of the models was quantified in terms of the respective weights
in the canonical projection into two dimensions, with the traditional
clustering coefficient, hierarchical clustering coefficient and neighborhood
clustering coefficient proving particularly effective. Interestingly, the
average shortest path length and hierarchical node degrees contributed little
to the separation of the four network models.
Comment: 9 pages, 4 figures
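Of the measurements named above, the traditional clustering coefficient is the simplest to compute. The following minimal sketch assumes an undirected graph stored as a dictionary of neighbour sets. The hierarchical and neighborhood variants are not shown, and the function names and toy graph are illustrative only.

```python
from itertools import combinations

def local_clustering(adj, node):
    """Fraction of pairs of neighbours of `node` that are themselves linked."""
    neighbours = adj[node]
    k = len(neighbours)
    if k < 2:
        return 0.0  # convention used here: undefined cases count as zero
    links = sum(1 for u, v in combinations(neighbours, 2) if v in adj[u])
    return links / (k * (k - 1) / 2)

def average_clustering(adj):
    """Mean of the local clustering coefficients over all nodes."""
    return sum(local_clustering(adj, n) for n in adj) / len(adj)

# Toy undirected graph: triangle 0-1-2 with a pendant node 3 attached to node 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
coefficients = {n: local_clustering(adj, n) for n in adj}
```

Per-node values such as these, together with other measurements, would form the feature vector on which a feature selection or canonical projection step operates.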