1,354 research outputs found

    Mining Pure, Strict Epistatic Interactions from High-Dimensional Datasets: Ameliorating the Curse of Dimensionality

    Background: The interaction between loci to affect phenotype is called epistasis. It is strict epistasis if no proper subset of the interacting loci exhibits a marginal effect. For many diseases, it is likely that unknown epistatic interactions affect disease susceptibility. A difficulty when mining epistatic interactions from high-dimensional datasets concerns the curse of dimensionality. There are too many combinations of SNPs to perform an exhaustive search. A method that could locate strict epistasis without an exhaustive search can be considered the brass ring of methods for analyzing high-dimensional datasets. Methodology/Findings: A SNP pattern is a Bayesian network representing SNP-disease relationships. The Bayesian score for a SNP pattern is the probability of the data given the pattern, and has been used to learn SNP patterns. We identified a bound for the score of a SNP pattern. The bound provides an upper limit on the Bayesian score of any pattern that could be obtained by expanding a given pattern. We felt that the bound might enable the data to say something about the promise of expanding a 1-SNP pattern even when there are no marginal effects. We tested the bound using simulated datasets and semi-synthetic high-dimensional datasets obtained from GWAS datasets. We found that the bound was able to dramatically reduce the search time for strict epistasis. Using an Alzheimer's dataset, we showed that it is possible to discover an interaction involving the APOE gene based on its score because of its large marginal effect, but that the bound is most effective at discovering interactions without marginal effects. Conclusions/Significance: We conclude that the bound appears to ameliorate the curse of dimensionality in high-dimensional datasets. This is a very consequential result and could be pivotal in our efforts to reveal the dark matter of genetic disease risk from high-dimensional datasets. © 2012 Jiang, Neapolitan

    Increment entropy as a measure of complexity for time series

    Entropy has been a common index to quantify the complexity of time series in a variety of fields. Here, we introduce increment entropy to measure the complexity of time series in which each increment is mapped into a word of two letters, one letter corresponding to direction and the other corresponding to magnitude. The Shannon entropy of the words is termed increment entropy (IncrEn). Simulations on synthetic data and tests on epileptic EEG signals have demonstrated its ability to detect abrupt changes, whether energetic (e.g. spikes or bursts) or structural. The computation of IncrEn does not make any assumptions about the time series, and it is applicable to arbitrary real-world data. Comment: 12 pages, 7 figures, 2 tables
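The mapping described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' reference implementation: the function name `increment_entropy`, the `resolution` parameter, the use of the standard deviation of the increments to scale magnitudes, and the optional grouping of `m` consecutive words into one pattern are all assumptions made for illustration.

```python
import math
from collections import Counter


def increment_entropy(series, m=1, resolution=4):
    """Shannon entropy (in bits) of sign/magnitude words built from increments.

    Assumed parameters (not taken from the abstract):
      m          -- number of consecutive two-letter words grouped into a pattern
      resolution -- number of magnitude levels above zero
    """
    increments = [b - a for a, b in zip(series, series[1:])]
    if m < 1 or len(increments) < m:
        raise ValueError("series too short for the chosen pattern length")

    # Assumed scaling: magnitudes are quantized relative to the std of increments.
    mean = sum(increments) / len(increments)
    std = math.sqrt(sum((d - mean) ** 2 for d in increments) / len(increments))

    words = []
    for d in increments:
        sign = (d > 0) - (d < 0)  # direction letter: -1, 0, or 1
        if std == 0:
            level = 0  # all increments identical: no magnitude information
        else:
            level = min(resolution, int(abs(d) * resolution / std))
        words.append((sign, level))  # one two-letter word per increment

    # Group m consecutive words into a pattern and count pattern frequencies.
    patterns = [tuple(words[i:i + m]) for i in range(len(words) - m + 1)]
    counts = Counter(patterns)
    total = len(patterns)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

Under these assumptions a constant or perfectly linear series yields a single repeated word and therefore zero entropy, while a series that alternates between two values yields two equally frequent words and an entropy of one bit when `m=1`.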

    Testing Serial Independence of Object-Valued Time Series

    We propose a novel method for testing serial independence of object-valued time series in metric spaces, which are more general than Euclidean or Hilbert spaces. The proposed method is fully nonparametric, free of tuning parameters, and can capture all nonlinear pairwise dependence. The key concept used in this paper is the distance covariance in metric spaces, which is extended to auto distance covariance for object-valued time series. Furthermore, we propose a generalized spectral density function to account for pairwise dependence at all lags and construct a Cramér-von Mises type test statistic. New theoretical arguments are developed to establish the asymptotic behavior of the test statistic. A wild bootstrap is also introduced to obtain the critical values of the non-pivotal limiting null distribution. Extensive numerical simulations and two real data applications are conducted to illustrate the effectiveness and versatility of our proposed method.
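To make the idea of an auto distance covariance concrete, the sketch below computes the standard sample (V-statistic) squared distance covariance between a series and its lagged copy, using a user-supplied metric. This is an assumption-laden illustration rather than the estimator or test statistic from the paper: the function names, the `metric` callable, and the choice of the V-statistic form are all hypothetical here.

```python
import numpy as np


def _double_center(d):
    """Subtract row and column means and add back the grand mean."""
    return (d - d.mean(axis=0, keepdims=True)
              - d.mean(axis=1, keepdims=True) + d.mean())


def auto_distance_covariance_sq(objects, lag, metric):
    """Sample squared distance covariance between X_t and X_{t+lag}.

    objects -- sequence of observations in some metric space
    lag     -- non-negative integer lag
    metric  -- callable metric(a, b) returning a distance (assumed interface)
    """
    n = len(objects) - lag
    if lag < 0 or n < 2:
        raise ValueError("need lag >= 0 and at least two paired observations")

    # Pairwise distances between all observations under the supplied metric.
    dist = np.array([[metric(a, b) for b in objects] for a in objects],
                    dtype=float)

    a = _double_center(dist[:n, :n])      # distances among X_1, ..., X_n
    b = _double_center(dist[lag:, lag:])  # distances among X_{1+lag}, ..., X_{n+lag}
    return float((a * b).mean())
```

For real-valued data the metric could be `lambda a, b: abs(a - b)`; for other object types (for example covariance matrices or networks) a suitable distance would have to be supplied by the user.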

    Two-Sample and Change-Point Inference for Non-Euclidean Valued Time Series

    Data objects taking values in a general metric space have become increasingly common in modern data analysis. In this paper, we study two important statistical inference problems, namely, two-sample testing and change-point detection, for such non-Euclidean data under temporal dependence. Typical examples of non-Euclidean valued time series include yearly mortality distributions, time-varying networks, and covariance matrix time series. To accommodate unknown temporal dependence, we advance the self-normalization (SN) technique (Shao, 2010) to the inference of non-Euclidean time series, which is substantially different from the existing SN-based inference for functional time series that reside in Hilbert space (Zhang et al., 2011). Theoretically, we propose new regularity conditions that could be easier to check than those in the recent literature, and derive the limiting distributions of the proposed test statistics under both null and local alternatives. For the change-point detection problem, we also establish the consistency of the change-point location estimator, and combine our proposed change-point test with wild binary segmentation to perform multiple change-point estimation. Numerical simulations demonstrate the effectiveness and robustness of our proposed tests compared with existing methods in the literature. Finally, we apply our tests to two-sample inference in mortality data and change-point detection in cryptocurrency data.

    Testing the martingale difference hypothesis in high dimension

    In this paper, we consider testing the martingale difference hypothesis for high-dimensional time series. Our test is built on the sum of squares of the element-wise max-norm of the proposed matrix-valued nonlinear dependence measure at different lags. To conduct the inference, we approximate the null distribution of our test statistic by Gaussian approximation and provide a simulation-based approach to generate critical values. The asymptotic behavior of the test statistic under the alternative is also studied. Our approach is nonparametric, as the null hypothesis only assumes that the time series concerned is a martingale difference sequence, without specifying any parametric form for its conditional moments. As an advantage of Gaussian approximation, our test is robust to cross-series dependence of unknown magnitude. To the best of our knowledge, this is the first valid test for the martingale difference hypothesis that not only allows for large dimension but also captures nonlinear serial dependence. The practical usefulness of our test is illustrated via simulation and a real data analysis. The test is implemented in a user-friendly R function.