Finding Exogenous Variables in Data with Many More Variables than Observations
Many statistical methods have been proposed to estimate causal models in
classical situations with fewer variables than observations (p<n, p: the number
of variables and n: the number of observations). However, modern datasets
including gene expression data need high-dimensional causal modeling in
challenging situations with orders of magnitude more variables than
observations (p>>n). In this paper, we propose a method to find exogenous
variables in a linear non-Gaussian causal model, which requires much smaller
sample sizes than conventional methods and works even when p>>n. The key idea
is to identify which variables are exogenous based on non-Gaussianity instead
of estimating the entire structure of the model. Exogenous variables work as
triggers that activate a causal chain in the model, and their identification
leads to more efficient experimental designs and better understanding of the
causal mechanism. We present experiments with artificial data and real-world
gene expression data to evaluate the method.
Comment: A revised version of this was published in Proc. ICANN201
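The abstract above relies on non-Gaussianity without saying how it is measured. As an illustration only, the sketch below uses excess kurtosis, one common proxy that is zero for a Gaussian and 3 for a Laplace distribution. This is an assumption made for the example, not the procedure of the paper; the names `excess_kurtosis` and `demo` are hypothetical.

```python
import random


def excess_kurtosis(samples):
    """Sample excess kurtosis: fourth standardized moment minus 3."""
    n = len(samples)
    mean = sum(samples) / n
    centered = [x - mean for x in samples]
    variance = sum(c * c for c in centered) / n
    fourth_moment = sum(c ** 4 for c in centered) / n
    return fourth_moment / variance ** 2 - 3.0


def demo(seed=0, n=20000):
    """Compare a Gaussian sample with a Laplace sample (hypothetical demo)."""
    rng = random.Random(seed)
    gaussian = [rng.gauss(0.0, 1.0) for _ in range(n)]
    # The difference of two independent Exp(1) draws is Laplace(0, 1).
    laplace = [rng.expovariate(1.0) - rng.expovariate(1.0) for _ in range(n)]
    return excess_kurtosis(gaussian), excess_kurtosis(laplace)


if __name__ == "__main__":
    k_gauss, k_laplace = demo()
    print("Gaussian sample:", k_gauss)
    print("Laplace sample: ", k_laplace)
```

A score like this can be computed one variable at a time, so it does not require estimating the full model structure. Whether it is reliable when p>>n depends on the sample size available per variable.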
Quantifying identifiability in independent component analysis
We are interested in consistent estimation of the mixing matrix in the ICA
model, when the error distribution is close to (but different from) Gaussian.
In particular, we consider independent samples from the ICA model , where we assume that the coordinates of are independent
and identically distributed according to a contaminated Gaussian distribution,
and the amount of contamination is allowed to depend on . We then
investigate how the ability to consistently estimate the mixing matrix depends
on the amount of contamination. Our results suggest that in an asymptotic
sense, if the amount of contamination decreases at rate or faster,
then the mixing matrix is only identifiable up to transpose products. These
results also have implications for causal inference from linear structural
equation models with near-Gaussian additive noise.
Comment: 22 pages, 2 figures
Modeling sparse connectivity between underlying brain sources for EEG/MEG
We propose a novel technique to assess functional brain connectivity in
EEG/MEG signals. Our method, called Sparsely-Connected Sources Analysis (SCSA),
can overcome the problem of volume conduction by modeling neural data
innovatively with the following ingredients: (a) the EEG is assumed to be a
linear mixture of correlated sources following a multivariate autoregressive
(MVAR) model, (b) the demixing is estimated jointly with the source MVAR
parameters, (c) overfitting is avoided by using the Group Lasso penalty. This
approach allows us to extract the appropriate level of cross-talk between the
extracted sources, and in this manner we obtain a sparse data-driven model of
functional connectivity. We demonstrate the usefulness of SCSA with simulated
data, and compare it to a number of existing algorithms, with excellent results.
Comment: 9 pages, 6 figures
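The Group Lasso penalty mentioned in (c) sums the Euclidean norms of groups of coefficients, which drives whole groups to zero together. The sketch below shows its proximal operator, block soft-thresholding, as a minimal illustration. The abstract does not describe how SCSA is optimized, so this is not the authors' solver; the function name and the grouping of coefficients are assumptions.

```python
import math


def group_soft_threshold(group, lam):
    """Proximal operator of lam * ||group||_2 (block soft-thresholding).

    Shrinks the whole group toward zero, and sets it exactly to zero
    when its Euclidean norm is at most lam.
    """
    norm = math.sqrt(sum(g * g for g in group))
    if norm <= lam:
        return [0.0 for _ in group]
    scale = 1.0 - lam / norm
    return [scale * g for g in group]


if __name__ == "__main__":
    # Hypothetical groups, e.g. all lagged coefficients linking one pair of sources.
    strong_link = [3.0, 4.0]
    weak_link = [0.3, 0.4]
    print(group_soft_threshold(strong_link, 1.0))
    print(group_soft_threshold(weak_link, 1.0))
```

Applied to groups that each collect the coefficients linking one pair of sources, this operator removes weak links entirely and only shrinks strong ones, which is how a group penalty can yield a sparse connectivity pattern.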
We Are Not Your Real Parents: Telling Causal from Confounded using MDL
Given data over variables we consider the problem of finding out whether jointly causes or whether they are all confounded by an unobserved latent variable . To do so, we take an information-theoretic approach based on Kolmogorov complexity. In a nutshell, we follow the postulate that first encoding the true cause, and then the effects given that cause, results in a shorter description than any other encoding of the observed variables. The ideal score is not computable, and hence we have to approximate it. We propose to do so using the Minimum Description Length (MDL) principle. We compare the MDL scores under the models where causes and where there exists a latent variable confounding both and and show our scores are consistent. To find potential confounders we propose using latent factor modeling, in particular, probabilistic PCA (PPCA). Empirical evaluation on both synthetic and real-world data shows that our method, CoCa, performs very well -- even when the true generating process of the data is far from the assumptions made by the models we use. Moreover, it is robust as its accuracy goes hand in hand with its confidence.