
    Foundational principles for large scale inference: Illustrations through correlation mining

    When can reliable inference be drawn in the "Big Data" context? This paper presents a framework for answering this fundamental question in the context of correlation mining, with implications for general large scale inference. In large scale data applications like genomics, connectomics, and eco-informatics the dataset is often variable-rich but sample-starved: a regime where the number n of acquired samples (statistical replicates) is far smaller than the number p of observed variables (genes, neurons, voxels, or chemical constituents). Much of the recent work has focused on understanding the computational complexity of proposed methods for "Big Data." Sample complexity, however, has received relatively less attention, especially in the setting where the sample size n is fixed and the dimension p grows without bound. To address this gap, we develop a unified statistical framework that explicitly quantifies the sample complexity of various inferential tasks. Sampling regimes can be divided into several categories: 1) the classical asymptotic regime, where the variable dimension is fixed and the sample size goes to infinity; 2) the mixed asymptotic regime, where both variable dimension and sample size go to infinity at comparable rates; 3) the purely high dimensional asymptotic regime, where the variable dimension goes to infinity and the sample size is fixed. Each regime has its niche, but only the last regime applies to exa-scale data dimension. We illustrate this high dimensional framework for the problem of correlation mining, where it is the matrix of pairwise and partial correlations among the variables that is of interest. We demonstrate various regimes of correlation mining based on the unifying perspective of high dimensional learning rates and sample complexity for different structured covariance models and different inference tasks.
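
    As a toy illustration of the purely high dimensional regime discussed above (a minimal sketch on assumed synthetic data, not the paper's procedure), the Python snippet below screens pairwise sample correlations with a small fixed sample size n and a much larger variable dimension p; the sizes and the threshold 0.8 are purely illustrative choices.

```python
# Illustrative correlation-screening sketch (hypothetical data and threshold):
# the "sample-starved" regime where n is small and fixed while p is large.
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 2000                        # n << p: few replicates, many variables
X = rng.standard_normal((n, p))        # rows = replicates, columns = variables

# Sample correlation matrix of the p variables (p x p), estimated from only n samples.
R = np.corrcoef(X, rowvar=False)

# Correlation mining: report variable pairs whose sample correlation exceeds a
# screening threshold. With n fixed and p growing, spuriously large sample
# correlations occur by chance, so the threshold must grow with p; the value
# 0.8 used here is an arbitrary illustration, not the paper's critical threshold.
rho = 0.8
i, j = np.triu_indices(p, k=1)
mask = np.abs(R[i, j]) > rho
pairs = list(zip(i[mask], j[mask]))
print(f"{len(pairs)} variable pairs exceed |correlation| = {rho}")
```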

    Exponential Strong Converse for Successive Refinement with Causal Decoder Side Information

    We consider the k-user successive refinement problem with causal decoder side information and derive an exponential strong converse theorem. The rate-distortion region for the problem can be derived as a straightforward extension of the two-user case by Maor and Merhav (2008). We show that for any rate-distortion tuple outside the rate-distortion region of the k-user successive refinement problem with causal decoder side information, the joint excess-distortion probability approaches one exponentially fast. Our proof follows by judiciously adapting the recently proposed strong converse technique by Oohama, using the information spectrum method, the variational form of the rate-distortion region, and Hölder's inequality. The lossy source coding problem with causal decoder side information considered by El Gamal and Weissman is a special case (k = 1) of the current problem. Therefore, the exponential strong converse theorem for the El Gamal and Weissman problem follows as a corollary of our result.
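
    Schematically, and with notation and the exponent E assumed here rather than taken from the paper, an exponential strong converse of this type says that for any rate-distortion tuple outside the rate-distortion region there is a constant E > 0 such that, for all sufficiently large block lengths n,

```latex
% Schematic form of an exponential strong converse (assumed notation):
% X^n is the source block, \hat{X}_j^n the j-th decoder's reconstruction,
% d_j its distortion measure and D_j its distortion level.
\[
  \Pr\Bigl\{\, d_j\bigl(X^n, \hat{X}_j^n\bigr) > D_j \ \text{ for some } j \in \{1,\dots,k\} \,\Bigr\}
  \;\ge\; 1 - \mathrm{e}^{-nE},
\]
% i.e., the joint excess-distortion probability tends to one exponentially fast in n.
```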

    On Measure Transformed Canonical Correlation Analysis

    In this paper, linear canonical correlation analysis (LCCA) is generalized by applying a structured transform to the joint probability distribution of the considered pair of random vectors, i.e., a transformation of the joint probability measure defined on their joint observation space. This framework, called measure transformed canonical correlation analysis (MTCCA), applies LCCA to the data after transformation of the joint probability measure. We show that a judicious choice of the transform leads to a modified canonical correlation analysis which, in contrast to LCCA, is capable of detecting non-linear relationships between the considered pair of random vectors. Unlike kernel canonical correlation analysis, where the transformation is applied to the random vectors, in MTCCA the transformation is applied to their joint probability distribution. This results in performance advantages and reduced implementation complexity. The proposed approach is illustrated for graphical model selection in simulated data having non-linear dependencies, and for measuring long-term associations between companies traded in the NASDAQ and NYSE stock markets.
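
    The following Python sketch conveys the measure-transformation idea in a minimal form, under stated assumptions rather than as the paper's estimator: the empirical joint distribution of (x, y) is re-weighted by a non-negative weight function u(x, y) (a Gaussian weight here, an illustrative choice), and ordinary linear CCA is then applied to the resulting weighted covariance matrices. All function names, parameters, and the synthetic data are hypothetical.

```python
# Hedged sketch of measure-transformed CCA: re-weight the empirical joint
# distribution with a weight function u(x, y), then run standard linear CCA
# on the weighted covariance matrices. Illustrative only, not the paper's code.
import numpy as np

def weighted_cov(A, B, w):
    """Weighted cross-covariance of the rows of A and B; w sums to one."""
    mA, mB = w @ A, w @ B
    return (A - mA).T @ (w[:, None] * (B - mB))

def mt_cca(X, Y, weight_fn, reg=1e-6):
    """Toy measure-transformed CCA: weight the samples, then do linear CCA."""
    w = np.array([weight_fn(x, y) for x, y in zip(X, Y)], dtype=float)
    w /= w.sum()                                    # empirical measure transform
    Cxx = weighted_cov(X, X, w) + reg * np.eye(X.shape[1])
    Cyy = weighted_cov(Y, Y, w) + reg * np.eye(Y.shape[1])
    Cxy = weighted_cov(X, Y, w)
    # Canonical correlations = singular values of Lx^{-1} Cxy Ly^{-T},
    # where Cxx = Lx Lx^T and Cyy = Ly Ly^T (Cholesky factors).
    Lx, Ly = np.linalg.cholesky(Cxx), np.linalg.cholesky(Cyy)
    T = np.linalg.solve(Lx, np.linalg.solve(Ly, Cxy.T).T)
    return np.linalg.svd(T, compute_uv=False)       # sorted, largest first

# Illustrative Gaussian weight function (an assumed choice, not the paper's).
gauss_w = lambda x, y: np.exp(-0.5 * (x @ x + y @ y))

rng = np.random.default_rng(1)
X = rng.standard_normal((500, 3))
Y = np.sin(X) + 0.1 * rng.standard_normal((500, 3))   # non-linear dependence
print("leading measure-transformed canonical correlation:", mt_cca(X, Y, gauss_w)[0])
```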