    Feature-to-feature regression for a two-step conditional independence test

    Algorithms for causal discovery, and more broadly for learning the structure of graphical models, require well-calibrated and consistent conditional independence (CI) tests. We revisit CI tests based on two-step procedures, which involve a regression followed by an (unconditional) independence test on the regression residuals (RESIT), and investigate the assumptions under which these tests operate. In particular, we demonstrate that when going beyond simple functional relationships with additive noise, such tests can lead to an inflated number of false discoveries. We study the relationship of these tests with those based on dependence measures using reproducing kernel Hilbert spaces (RKHS) and propose an extension of RESIT which uses RKHS-valued regression. The resulting test inherits the simple two-step testing procedure of RESIT, while giving correct Type I error control and competitive power. When used as a component of the PC algorithm, the proposed test is more robust to the case where hidden variables induce a switching behaviour in the associations present in the data.
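
The two-step idea in this abstract (regress, then test the residuals for independence) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not the paper's method: the function names are hypothetical, both X and Y are regressed on a single scalar Z with a least-squares line, and the absolute Pearson correlation stands in for a proper dependence measure, so the sketch only detects linear dependence between residuals.

```python
import random
import statistics


def residuals(target, z):
    """Residuals of the least-squares line target ~ a + b * z (scalar z)."""
    mz, mt = statistics.fmean(z), statistics.fmean(target)
    szz = sum((zi - mz) ** 2 for zi in z)
    slope = sum((zi - mz) * (ti - mt) for zi, ti in zip(z, target)) / szz
    return [ti - mt - slope * (zi - mz) for zi, ti in zip(z, target)]


def abs_corr(u, v):
    """Absolute Pearson correlation; a stand-in dependence statistic."""
    mu, mv = statistics.fmean(u), statistics.fmean(v)
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return abs(suv) / (suu * svv) ** 0.5


def two_step_ci_pvalue(x, y, z, n_perm=500, seed=0):
    """Step 1: regress x and y on z. Step 2: permutation test on residuals."""
    rng = random.Random(seed)
    rx, ry = residuals(x, z), residuals(y, z)
    observed = abs_corr(rx, ry)
    shuffled = ry[:]
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any pairing between the residuals
        if abs_corr(rx, shuffled) >= observed:
            hits += 1
    # The +1 terms keep the permutation p-value strictly positive.
    return (1 + hits) / (1 + n_perm)
```

As the abstract notes, a procedure of this kind relies on the regression capturing the effect of Z; outside simple additive-noise settings the residuals can remain dependent on Z and the test may reject too often.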

    Massively-Parallel Feature Selection for Big Data

    We present the Parallel, Forward-Backward with Pruning (PFBP) algorithm for feature selection (FS) in Big Data settings (high dimensionality and/or sample size). To tackle the challenges of Big Data FS, PFBP partitions the data matrix both in terms of rows (samples, training examples) and columns (features). By employing the concepts of p-values of conditional independence tests and meta-analysis techniques, PFBP manages to rely only on computations local to a partition while minimizing communication costs. Then, it employs powerful and safe (asymptotically sound) heuristics to make early, approximate decisions, such as Early Dropping of features from consideration in subsequent iterations, Early Stopping of consideration of features within the same iteration, or Early Return of the winner in each iteration. PFBP provides asymptotic guarantees of optimality for data distributions faithfully representable by a causal network (Bayesian network or maximal ancestral graph). Our empirical analysis confirms a super-linear speedup of the algorithm with increasing sample size and linear scalability with respect to the number of features and processing cores, while dominating other competitive algorithms in its class.
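
To make the idea of combining partition-local p-values concrete, here is a minimal sketch using Fisher's combined probability test. The abstract only says "meta-analysis techniques", so the choice of Fisher's method is an assumption, as is the hypothetical function name `fisher_combine`. The method presumes the local p-values are independent, which is plausible when each comes from a disjoint block of rows.

```python
import math


def fisher_combine(p_values):
    """Combine independent p-values with Fisher's method.

    The statistic -2 * sum(log p_i) follows a chi-square distribution with
    2k degrees of freedom under the null. For even degrees of freedom the
    survival function has the closed form
        exp(-h) * sum_{j=0}^{k-1} h**j / j!,   with h = statistic / 2,
    so no external library is needed.
    """
    if not p_values:
        raise ValueError("need at least one p-value")
    if any(not 0.0 < p <= 1.0 for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")
    k = len(p_values)
    half_stat = -sum(math.log(p) for p in p_values)
    term, total = 1.0, 1.0
    for j in range(1, k):
        term *= half_stat / j  # builds h**j / j! incrementally
        total += term
    return math.exp(-half_stat) * total


# Hypothetical p-values for one feature, one per row partition.
local_p_values = [0.05, 0.05]
combined = fisher_combine(local_p_values)
```

With a single partition the combined value equals the input p-value, which is a quick sanity check on the formula.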

    Concepts and a case study for a flexible class of graphical Markov models

    With graphical Markov models, one can investigate complex dependences, summarize some results of statistical analyses with graphs and use these graphs to understand implications of well-fitting models. The models have a rich history and form an area that has been intensively studied and developed in recent years. We give a brief review of the main concepts and describe in more detail a flexible subclass of models, called traceable regressions. These are sequences of joint response regressions for which regression graphs permit one to trace and thereby understand pathways of dependence. We use these methods to reanalyze and interpret data from a prospective study of child development, now known as the Mannheim Study of Children at Risk. The two related primary features concern a child's cognitive and motor development at the ages of 4.5 and 8 years. Deficits in these features form a sequence of joint responses. Several possible risks are assessed at the birth of the child and when the child reached the ages of 3 months and 2 years. Comment: 21 pages, 7 figures, 7 tables; invited, refereed chapter in a boo

    The conditional permutation test for independence while controlling for confounders

    We propose a general new method, the conditional permutation test, for testing the conditional independence of variables X and Y given a potentially high-dimensional random vector Z that may contain confounding factors. The proposed test permutes entries of X non-uniformly, so as to respect the existing dependence between X and Z and thus account for the presence of these confounders. Like the conditional randomization test of Candès et al. (2018), our test relies on the availability of an approximation to the distribution of X | Z. While Candès et al. (2018)'s test uses this estimate to draw new X values, for our test we use this approximation to design an appropriate non-uniform distribution on permutations of the X values already seen in the true data. We provide an efficient Markov Chain Monte Carlo sampler for the implementation of our method, and establish bounds on the Type I error in terms of the error in the approximation of the conditional distribution of X | Z, finding that, for the worst case test statistic, the inflation in Type I error of the conditional permutation test is no larger than that of the conditional randomization test. We validate these theoretical results with experiments on simulated data and on the Capital Bikeshare data set. Comment: 31 pages, 4 figure