Feature-to-feature regression for a two-step conditional independence test
Algorithms for causal discovery, and more broadly for learning the structure of graphical models, require well-calibrated and consistent conditional independence (CI) tests. We revisit the CI tests that are based on two-step procedures, in which a regression is followed by an (unconditional) independence test on the regression residuals (RESIT), and investigate the assumptions under which these tests operate. In particular, we demonstrate that when going beyond simple functional relationships with additive noise, such tests can lead to an inflated number of false discoveries. We study the relationship of these tests with those based on dependence measures using reproducing kernel Hilbert spaces (RKHS) and propose an extension of RESIT which uses RKHS-valued regression. The resulting test inherits the simple two-step testing procedure of RESIT, while giving correct Type I error control and competitive power. When used as a component of the PC algorithm, the proposed test is more robust to the case where hidden variables induce a switching behaviour in the associations present in the data.
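The two-step procedure described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes scalar variables, uses ordinary least squares for the regression step, and uses a permutation-calibrated HSIC statistic with Gaussian kernels as the unconditional independence test. The function names (`residuals`, `gaussian_gram`, `two_step_ci_test`) are hypothetical. The linear regression is exactly the simple additive-noise setting the abstract warns about, and the proposed RKHS-valued regression is not shown.

```python
import numpy as np

def residuals(target, z):
    """Residuals of an ordinary least-squares fit of target on [1, z]."""
    design = np.column_stack([np.ones_like(z), z])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return target - design @ coef

def gaussian_gram(v):
    """Gaussian-kernel Gram matrix with a median-heuristic bandwidth."""
    dist = np.abs(v[:, None] - v[None, :])
    bandwidth = np.median(dist[dist > 0])
    return np.exp(-(dist ** 2) / (2.0 * bandwidth ** 2))

def two_step_ci_test(x, y, z, n_perm=200, seed=0):
    """Step 1: regress x and y on z. Step 2: permutation HSIC test on residuals."""
    rng = np.random.default_rng(seed)
    rx, ry = residuals(x, z), residuals(y, z)
    n = len(rx)
    centre = np.eye(n) - np.full((n, n), 1.0 / n)
    k_centred = centre @ gaussian_gram(rx) @ centre
    l_gram = gaussian_gram(ry)
    # Biased HSIC estimate: trace(K H L H) / n^2, written as an elementwise sum.
    observed = np.sum(k_centred * l_gram) / n ** 2
    exceed = 0
    for _ in range(n_perm):
        p = rng.permutation(n)
        # Permuting the second sample permutes rows and columns of its Gram matrix.
        exceed += np.sum(k_centred * l_gram[np.ix_(p, p)]) / n ** 2 >= observed
    return (1 + exceed) / (1 + n_perm)
```

A small p-value suggests that the residuals are dependent, which the two-step test reads as evidence against conditional independence. That reading is only justified when the regression model is adequate.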
Massively-Parallel Feature Selection for Big Data
We present the Parallel, Forward-Backward with Pruning (PFBP) algorithm for
feature selection (FS) in Big Data settings (high dimensionality and/or sample
size). To tackle the challenges of Big Data FS, PFBP partitions the data matrix
both in terms of rows (samples, training examples) as well as columns
(features). By employing the concepts of $p$-values of conditional independence
tests and meta-analysis techniques, PFBP manages to rely only on computations
local to a partition while minimizing communication costs. Then, it employs
powerful and safe (asymptotically sound) heuristics to make early, approximate
decisions, such as Early Dropping of features from consideration in subsequent
iterations, Early Stopping of consideration of features within the same
iteration, or Early Return of the winner in each iteration. PFBP provides
asymptotic guarantees of optimality for data distributions faithfully
representable by a causal network (Bayesian network or maximal ancestral
graph). Our empirical analysis confirms a super-linear speedup of the algorithm
with increasing sample size, linear scalability with respect to the number of
features and processing cores, while dominating other competitive algorithms in
its class.
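To illustrate how p-values computed locally on separate row partitions might be merged into one decision, here is a minimal sketch using Fisher's combined probability test. The abstract does not say which meta-analysis technique PFBP uses, so Fisher's method is an assumption here, and `fisher_combine` is a hypothetical helper name. The method also assumes the partition-level tests are independent.

```python
import math

def fisher_combine(p_values):
    """Combine independent p-values with Fisher's method.

    The statistic -2 * sum(log p) follows a chi-square distribution with
    2k degrees of freedom under the null. For even degrees of freedom the
    survival function has the closed form exp(-h) * sum_{i<k} h^i / i!,
    where h is half the statistic.
    """
    k = len(p_values)
    half_stat = -sum(math.log(p) for p in p_values)
    term, tail = 1.0, 0.0
    for i in range(k):
        tail += term                 # adds h^i / i!
        term *= half_stat / (i + 1)  # next term: h^(i+1) / (i+1)!
    return math.exp(-half_stat) * tail
```

For example, `fisher_combine([0.04, 0.20, 0.11])` merges three partition-level p-values for the same feature into a single p-value, so only the three numbers need to be communicated rather than the underlying rows.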
Concepts and a case study for a flexible class of graphical Markov models
With graphical Markov models, one can investigate complex dependences,
summarize some results of statistical analyses with graphs and use these graphs
to understand implications of well-fitting models. The models have a rich
history and form an area that has been intensively studied and developed in
recent years. We give a brief review of the main concepts and describe in more
detail a flexible subclass of models, called traceable regressions. These are
sequences of joint response regressions for which regression graphs permit one
to trace and thereby understand pathways of dependence. We use these methods to
reanalyze and interpret data from a prospective study of child development, now
known as the Mannheim Study of Children at Risk. The two related primary
features are the child's cognitive and motor development at the ages of 4.5
and 8 years. Deficits in these features form a sequence of joint responses.
Several possible risks are assessed at birth of the child and when the child
reached age 3 months and 2 years.
Comment: 21 pages, 7 figures, 7 tables; invited, refereed chapter in a boo
The conditional permutation test for independence while controlling for confounders
We propose a general new method, the conditional permutation test, for
testing the conditional independence of variables $X$ and $Y$ given a
potentially high-dimensional random vector $Z$ that may contain confounding
factors. The proposed test permutes entries of $X$ non-uniformly, so as to
respect the existing dependence between $X$ and $Z$ and thus account for the
presence of these confounders. Like the conditional randomization test of
Candès et al. (2018), our test relies on the availability of an approximation
to the distribution of $X \mid Z$. While Candès et al. (2018)'s test uses
this estimate to draw new $X$ values, for our test we use this approximation to
design an appropriate non-uniform distribution on permutations of the $X$
values already seen in the true data. We provide an efficient Markov Chain
Monte Carlo sampler for the implementation of our method, and establish bounds
on the Type I error in terms of the error in the approximation of the
conditional distribution of $X \mid Z$, finding that, for the worst case test
statistic, the inflation in Type I error of the conditional permutation test is
no larger than that of the conditional randomization test. We validate these
theoretical results with experiments on simulated data and on the Capital
Bikeshare data set.
Comment: 31 pages, 4 figure
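The idea of permuting the observed values non-uniformly, guided by an approximate conditional distribution, can be sketched with a simple pairwise-swap Markov chain. This is a simplified illustration under stated assumptions, not the sampler from the paper: it assumes scalar variables, an assumed Gaussian working model for the conditional distribution (`log_q`), absolute correlation as the test statistic, and a "hub" permutation from which all copies are started. All function names and parameters are hypothetical.

```python
import numpy as np

def log_q(x, z, sigma=1.0):
    """Assumed working model: X given Z=z is N(z, sigma^2), up to a constant."""
    return -0.5 * ((x - z) / sigma) ** 2

def cpt_step(perm, x, z, rng):
    """Propose swaps on random disjoint pairs; accept with odds / (1 + odds)."""
    n = len(x)
    order = rng.permutation(n)
    i, j = order[: n // 2], order[n // 2 : 2 * (n // 2)]
    xi, xj = x[perm[i]], x[perm[j]]
    # Log-odds of the swapped versus the current assignment under the working model.
    log_odds = (log_q(xj, z[i]) + log_q(xi, z[j])) - (log_q(xi, z[i]) + log_q(xj, z[j]))
    accept = rng.random(len(i)) < 1.0 / (1.0 + np.exp(-np.clip(log_odds, -50, 50)))
    new = perm.copy()
    new[i[accept]], new[j[accept]] = perm[j[accept]], perm[i[accept]]
    return new

def cpt_pvalue(x, y, z, n_copies=100, n_steps=50, seed=0):
    """Permutation p-value where permutations respect the X-Z dependence."""
    rng = np.random.default_rng(seed)

    def stat(xs):
        return abs(np.corrcoef(xs, y)[0, 1])

    def run(perm):
        for _ in range(n_steps):
            perm = cpt_step(perm, x, z, rng)
        return perm

    hub = run(np.arange(len(x)))  # shared starting point for all copies
    t_obs = stat(x)
    t_perm = [stat(x[run(hub)]) for _ in range(n_copies)]
    return (1 + sum(t >= t_obs for t in t_perm)) / (1 + n_copies)
```

Because every permuted copy reuses the observed values and keeps them roughly matched to the confounder, a large statistic on the original data relative to the copies points to dependence between the two variables beyond what the confounder explains. How well the error rate is controlled depends on how good the working model is, which is the quantity the abstract's bounds are stated in.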