Exact Dimensionality Selection for Bayesian PCA
We present a Bayesian model selection approach to estimate the intrinsic
dimensionality of a high-dimensional dataset. To this end, we introduce a novel
formulation of the probabilistic principal component analysis model based on a
normal-gamma prior distribution. In this context, we exhibit a closed-form
expression of the marginal likelihood, which allows us to infer an optimal
number of components. We also propose a heuristic based on the expected shape
of the marginal likelihood curve to choose the hyperparameters. In
non-asymptotic frameworks, we show on simulated data that this exact
dimensionality selection approach is competitive with both Bayesian and
frequentist state-of-the-art methods.
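The paper's closed-form normal-gamma evidence is its own contribution, but the general recipe it instantiates — score each candidate dimensionality by an (approximate) log marginal likelihood and keep the argmax — can be sketched with the classical BIC approximation to the PPCA evidence. This is a generic stand-in, not the authors' formula; the parameter count is the usual rough one for PPCA:

```python
import numpy as np

def ppca_log_likelihood(eigvals, k, n):
    """Maximized PPCA log-likelihood with k components (Tipping & Bishop).
    eigvals: eigenvalues of the sample covariance, sorted descending."""
    d = len(eigvals)
    sigma2 = eigvals[k:].mean()  # ML estimate of the isotropic noise variance
    return -0.5 * n * (np.sum(np.log(eigvals[:k]))
                       + (d - k) * np.log(sigma2)
                       + d * (1.0 + np.log(2.0 * np.pi)))

def select_dimension(X):
    """Pick k maximizing a BIC-style approximation to the log evidence."""
    n, d = X.shape
    eigvals = np.sort(np.linalg.eigvalsh(np.cov(X, rowvar=False)))[::-1]
    best_k, best_score = 1, -np.inf
    for k in range(1, d):
        m = d * k - k * (k - 1) // 2 + k + 1  # rough free-parameter count
        score = ppca_log_likelihood(eigvals, k, n) - 0.5 * m * np.log(n)
        if score > best_score:
            best_k, best_score = k, score
    return best_k

# Toy check: data with 3 strong latent directions in 10 dimensions.
rng = np.random.default_rng(0)
W = rng.normal(size=(10, 3))
X = rng.normal(size=(500, 3)) @ W.T + 0.1 * rng.normal(size=(500, 10))
print(select_dimension(X))  # → 3
```

On well-separated simulated data like this the BIC surrogate recovers the true rank; the abstract's point is that an exact marginal likelihood removes the approximation error such criteria carry in non-asymptotic regimes.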
Bayesian dimensionality reduction with PCA using penalized semi-integrated likelihood
We discuss the problem of estimating the number of principal components in
Principal Components Analysis (PCA). Despite the importance of the problem
and the multitude of solutions proposed in the literature, it comes as a
surprise that there does not exist a coherent asymptotic framework which would
justify different approaches depending on the actual size of the data set. In
this paper we address this issue by presenting an approximate Bayesian approach
based on the Laplace approximation and introducing a general method for
building model selection criteria, called PEnalized SEmi-integrated Likelihood
(PESEL). Our general framework encompasses a variety of existing approaches
based on probabilistic models, e.g. the Bayesian Information Criterion for
Probabilistic PCA (PPCA), and allows for the construction of new criteria,
depending on the size of the data set at hand. Specifically, we define PESEL
when the number of variables substantially exceeds the number of observations.
We also report results of extensive simulation studies and real data analysis,
which illustrate the good properties of our proposed criteria as compared to
state-of-the-art methods and very recent proposals. Specifically, these
simulations show that PESEL-based criteria can be quite robust against
deviations from the probabilistic model assumptions. Selected PESEL-based
criteria for the estimation of the number of principal components are
implemented in the R package varclust, which is available on GitHub
(https://github.com/psobczyk/varclust).
Comment: 31 pages, 7 figures
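A standard device behind criteria defined for the "variables exceed observations" regime (used illustratively here, not as the paper's exact construction) is that the p × p covariance XᵀX/n and the n × n Gram matrix XXᵀ/n share their nonzero eigenvalues, so any eigenvalue-based criterion can be evaluated on the much smaller matrix:

```python
import numpy as np

# With p >> n, compute spectra from the n x n Gram matrix instead of the
# p x p covariance: their nonzero eigenvalues coincide.
rng = np.random.default_rng(1)
n, p = 20, 1000
X = rng.normal(size=(n, p))
X -= X.mean(axis=0)  # center columns

cov_eigs = np.linalg.eigvalsh(X.T @ X / n)[::-1][:n]  # top n of the p x p matrix
gram_eigs = np.linalg.eigvalsh(X @ X.T / n)[::-1]     # all n of the n x n matrix

print(np.allclose(cov_eigs, gram_eigs))  # → True
```

The Gram-matrix route turns an O(p³) eigendecomposition into an O(n³) one, which is what makes eigenvalue-based selection criteria practical when p is in the thousands and n in the tens.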
A group model for stable multi-subject ICA on fMRI datasets
Spatial Independent Component Analysis (ICA) is an increasingly used
data-driven method to analyze functional Magnetic Resonance Imaging (fMRI)
data. To date, it has been used to extract sets of mutually correlated brain
regions without prior information on the time course of these regions. Some of
these sets of regions, interpreted as functional networks, have recently been
used to provide markers of brain diseases and open the road to paradigm-free
population comparisons. Such group studies raise the question of modeling
subject variability within ICA: how can the patterns representative of a group
be modeled and estimated via ICA for reliable inter-group comparisons? In this
paper, we propose a hierarchical model for patterns in multi-subject fMRI
datasets, akin to mixed-effect group models used in linear-model-based
analysis. We introduce an estimation procedure, CanICA (Canonical ICA), based
on i) probabilistic dimension reduction of the individual data, ii) canonical
correlation analysis to identify a data subspace common to the group, and iii)
ICA-based pattern extraction. In addition, we introduce a procedure based on
cross-validation to quantify the stability of ICA patterns at the level of the
group. We compare our method with state-of-the-art multi-subject fMRI ICA
methods and show that the features extracted using our procedure are more
reproducible at the group level on two datasets of 12 healthy controls: a
resting-state and a functional localizer study.
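The three-step pipeline above can be sketched on synthetic data. This is a simplified surrogate, not CanICA itself: plain PCA stands in for the probabilistic reduction, an SVD of the stacked subject loadings stands in for canonical correlation analysis, and the final ICA step is only indicated:

```python
import numpy as np

rng = np.random.default_rng(2)

def subject_pca(Y, n_comp):
    """Step i: per-subject dimension reduction (plain PCA as a stand-in
    for the probabilistic reduction used by CanICA)."""
    Y = Y - Y.mean(axis=0)
    _, _, Vt = np.linalg.svd(Y, full_matrices=False)
    return Vt[:n_comp]                    # (n_comp, n_voxels) spatial loadings

def common_subspace(subject_loadings, n_comp):
    """Step ii: subspace shared across subjects, via SVD of the stacked
    loadings (a simple surrogate for canonical correlation analysis)."""
    _, _, Vt = np.linalg.svd(np.vstack(subject_loadings), full_matrices=False)
    return Vt[:n_comp]

# Synthetic group data: 4 subjects share 2 spatial patterns.
n_subjects, n_time, n_voxels, k = 4, 100, 300, 2
patterns = rng.normal(size=(k, n_voxels))
subjects = [rng.normal(size=(n_time, k)) @ patterns
            + 0.2 * rng.normal(size=(n_time, n_voxels))
            for _ in range(n_subjects)]

loadings = [subject_pca(Y, n_comp=k) for Y in subjects]
group = common_subspace(loadings, n_comp=k)   # (k, n_voxels)

# Step iii would run spatial ICA within this subspace; here we only check
# that the group subspace captures the shared patterns.
proj = patterns @ group.T @ group             # project truth onto subspace
residual = np.linalg.norm(patterns - proj) / np.linalg.norm(patterns)
print(residual < 0.1)
```

The stability quantification step would then repeat this pipeline over cross-validation splits of the subjects and measure how reproducible the extracted patterns are.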
Determining Principal Component Cardinality through the Principle of Minimum Description Length
PCA (Principal Component Analysis) and its variants are ubiquitous techniques
for matrix dimension reduction and reduced-dimension latent-factor extraction.
One significant challenge in using PCA is the choice of the number of principal
components. The information-theoretic MDL (Minimum Description Length)
principle gives objective compression-based criteria for model selection, but
it is difficult to analytically apply its modern definition - NML (Normalized
Maximum Likelihood) - to the problem of PCA. This work shows a general
reduction of NML problems to lower-dimension problems. Applying this reduction,
it bounds the NML of PCA in terms of the NML of linear regression, which is
known.
Comment: LOD 201
Probabilistic classification of acute myocardial infarction from multiple cardiac markers
Logistic regression and Gaussian mixture model (GMM) classifiers have been trained to estimate the probability of acute myocardial infarction (AMI) in patients based upon the concentrations of a panel of cardiac markers. The panel consists of two new markers, fatty acid binding protein (FABP) and glycogen phosphorylase BB (GPBB), in addition to the traditional cardiac troponin I (cTnI), creatine kinase MB (CKMB) and myoglobin. The effect of using principal component analysis (PCA) and Fisher discriminant analysis (FDA) to preprocess the marker concentrations was also investigated. The need for classifiers to give an accurate estimate of the probability of AMI is argued and three categories of performance measure are described, namely discriminatory ability, sharpness, and reliability. Numerical performance measures for each category are given and applied. The optimum classifier, based solely upon the samples taken on admission, was the logistic regression classifier using FDA preprocessing. This gave an accuracy of 0.85 (95% confidence interval: 0.78–0.91) and a normalised Brier score of 0.89. When samples at both admission and a further time, 1–6 h later, were included, the performance increased significantly, showing that logistic regression classifiers can indeed use the information from the five cardiac markers to accurately and reliably estimate the probability of AMI.
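The calibration-oriented measures discussed above are easy to state concretely. A minimal sketch of the Brier score, together with a climatology-normalised skill score (a common convention; the paper's exact "normalised Brier score" may be defined differently), on hypothetical probabilities and outcomes:

```python
import numpy as np

def brier_score(p, y):
    """Mean squared error between predicted probability and 0/1 outcome;
    lower is better, 0 is perfect."""
    p, y = np.asarray(p, float), np.asarray(y, float)
    return np.mean((p - y) ** 2)

def brier_skill_score(p, y):
    """Brier score normalised against always predicting the base rate:
    1 is perfect, 0 matches the trivial base-rate classifier."""
    y = np.asarray(y, float)
    ref = brier_score(np.full_like(y, y.mean()), y)
    return 1.0 - brier_score(p, y) / ref

# Hypothetical AMI probabilities for 8 patients and their true outcomes.
p = [0.95, 0.80, 0.10, 0.05, 0.70, 0.20, 0.90, 0.15]
y = [1,    1,    0,    0,    1,    0,    1,    0]

print(round(brier_score(p, y), 3))        # → 0.027
print(round(brier_skill_score(p, y), 3))  # → 0.891
```

Discriminatory ability (e.g. accuracy or AUC) and sharpness would be computed alongside this; the point of the reliability measures is that a classifier used for probability estimates must be judged on calibration, not only on classification accuracy.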