Foundational principles for large scale inference: Illustrations through correlation mining
When can reliable inference be drawn in the "Big Data" context? This paper
presents a framework for answering this fundamental question in the context of
correlation mining, with implications for general large scale inference. In
large scale data applications like genomics, connectomics, and eco-informatics,
the dataset is often variable-rich but sample-starved: a regime where the
number of acquired samples (statistical replicates) is far fewer than the
number of observed variables (genes, neurons, voxels, or chemical
constituents). Much recent work has focused on understanding the
computational complexity of proposed methods for "Big Data." Sample complexity,
however, has received comparatively little attention, especially in the setting
where the sample size is fixed and the dimension grows without bound. To
address this gap, we develop a unified statistical framework that explicitly
quantifies the sample complexity of various inferential tasks. Sampling regimes
can be divided into three categories: 1) the classical asymptotic regime
where the variable dimension is fixed and the sample size goes to infinity; 2)
the mixed asymptotic regime where both variable dimension and sample size go to
infinity at comparable rates; 3) the purely high dimensional asymptotic regime
where the variable dimension goes to infinity and the sample size is fixed.
Each regime has its niche, but only the last applies to exascale data
dimensions. We illustrate this high dimensional framework for the problem of
correlation mining, where it is the matrix of pairwise and partial correlations
among the variables that is of interest. We demonstrate various regimes of
correlation mining from the unifying perspective of high dimensional
learning rates and sample complexity for different structured covariance models
and different inference tasks.
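As a toy illustration of the sample-starved regime this abstract describes (not code from the paper; the threshold rho and all names below are illustrative choices), the sketch screens a p x p matrix of pairwise sample correlations with n far smaller than p:

    import numpy as np

    # Correlation mining in the sample-starved regime: n samples, p >> n variables.
    rng = np.random.default_rng(0)
    n, p = 20, 1000                    # far fewer statistical replicates than variables
    X = rng.standard_normal((n, p))    # null data: no true correlations
    R = np.corrcoef(X, rowvar=False)   # p x p matrix of pairwise sample correlations
    rho = 0.8                          # screening threshold (illustrative)
    hits = np.argwhere(np.triu(np.abs(R) > rho, k=1))
    print(len(hits), "pairs exceed |r| >", rho, "purely by chance")

Even under the null, fixing n while p grows produces spurious large correlations, which is exactly why the purely high dimensional regime needs its own sample-complexity analysis.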
Scalable sparse covariance estimation via self-concordance
We consider the class of convex minimization problems composed of a
self-concordant function (such as the log-determinant metric), a convex data
fidelity term, and a regularizing, possibly non-smooth, function. Problems of
this type have recently attracted a great deal of interest, mainly due to their
omnipresence in prominent applications. Under this locally Lipschitz continuous
gradient setting, we analyze the convergence behavior of proximal Newton
schemes with the added twist of a possible presence of inexact evaluations. We
prove attractive convergence rate guarantees and enhance state-of-the-art
optimization schemes to accommodate such developments. Experimental results on
sparse covariance estimation show the merits of our algorithm, both in terms of
recovery efficiency and complexity.
Comment: 7 pages, 1 figure, accepted at AAAI-14
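The paper's inexact proximal Newton method is not reproduced here; the sketch below is a hedged first-order illustration of the same kind of composite objective (all parameter values and names are illustrative):

    import numpy as np

    def soft_threshold(A, tau):
        # Entrywise soft-thresholding: the proximal operator of tau * ||.||_1.
        return np.sign(A) * np.maximum(np.abs(A) - tau, 0.0)

    def sparse_precision_prox_grad(S, lam=0.1, step=0.05, iters=200):
        # Proximal *gradient* sketch of
        #   min_Theta  -log det(Theta) + trace(S @ Theta) + lam * ||Theta||_1,
        # the canonical self-concordant + non-smooth composite problem;
        # the paper instead analyzes (possibly inexact) proximal Newton steps.
        Theta = np.eye(S.shape[0])
        for _ in range(iters):
            grad = S - np.linalg.inv(Theta)           # gradient of the smooth part
            Theta = soft_threshold(Theta - step * grad, step * lam)
            Theta = (Theta + Theta.T) / 2             # re-symmetrize
            w, V = np.linalg.eigh(Theta)              # crude safeguard: stay
            Theta = (V * np.maximum(w, 1e-6)) @ V.T   # positive definite
        return Theta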
The Gaussian rank correlation estimator: Robustness properties.
The Gaussian rank correlation equals the usual correlation coefficient computed from the normal scores of the data. Although its influence function is unbounded, it still has attractive robustness properties. In particular, its breakdown point is above 12%. Moreover, the estimator is consistent and asymptotically efficient at the normal distribution. The correlation matrix based on the Gaussian rank correlation is always positive semidefinite and easy to compute, even in high dimensions. A simulation study confirms the good efficiency and robustness properties of the proposed estimator relative to the popular Kendall and Spearman correlation measures. In an empirical application, we show how it can be used for multivariate outlier detection based on robust principal component analysis.
Keywords: Breakdown; Correlation; Efficiency; Robustness; Van der Waerden
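The estimator is a direct transcription of the definition in the first sentence of the abstract (the helper names below are mine, not from the paper):

    import numpy as np
    from scipy.stats import norm, rankdata

    def gaussian_rank_corr(x, y):
        # Normal (van der Waerden) scores: Phi^{-1}(rank / (n + 1)).
        n = len(x)
        zx = norm.ppf(rankdata(x) / (n + 1))
        zy = norm.ppf(rankdata(y) / (n + 1))
        # The usual (Pearson) correlation coefficient of the normal scores.
        return np.corrcoef(zx, zy)[0, 1]

Because an outlier only shifts ranks, its effect on the scores is capped by the largest normal score for the given sample size, which is the intuition behind the estimator's robustness.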
Toeplitz Inverse Covariance-Based Clustering of Multivariate Time Series Data
Subsequence clustering of multivariate time series is a useful tool for
discovering repeated patterns in temporal data. Once these patterns have been
discovered, seemingly complicated datasets can be interpreted as a temporal
sequence of only a small number of states, or clusters. For example, raw sensor
data from a fitness-tracking application can be expressed as a timeline of a
select few actions (e.g., walking, sitting, running). However, discovering
these patterns is challenging because it requires simultaneous segmentation and
clustering of the time series. Furthermore, interpreting the resulting clusters
is difficult, especially when the data is high-dimensional. Here we propose a
new method of model-based clustering, which we call Toeplitz Inverse
Covariance-based Clustering (TICC). Each cluster in the TICC method is defined
by a correlation network, or Markov random field (MRF), characterizing the
interdependencies between different observations in a typical subsequence of
that cluster. Based on this graphical representation, TICC simultaneously
segments and clusters the time series data. We solve the TICC problem through
alternating minimization, using a variation of the expectation maximization
(EM) algorithm. We derive closed-form solutions to efficiently solve the two
resulting subproblems in a scalable way, through dynamic programming and the
alternating direction method of multipliers (ADMM), respectively. We validate
our approach by comparing TICC to several state-of-the-art baselines in a
series of synthetic experiments, and we then demonstrate on an automobile
sensor dataset how TICC can be used to learn interpretable clusters in
real-world scenarios.
Comment: This revised version fixes two small typos in the published version.
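TICC's reference implementation is not reproduced here; the sketch below shows only a Viterbi-style version of the dynamic-programming assignment step the abstract mentions (per-point cluster log-likelihoods in, switching-penalized segmentation out), with the Toeplitz-constrained ADMM update of the cluster MRFs omitted; the penalty beta and all names are illustrative:

    import numpy as np

    def assign_clusters_dp(loglik, beta):
        # Choose one cluster per time step, trading per-step
        # log-likelihood against a fixed switching penalty beta.
        T, K = loglik.shape
        cost = -loglik[0].copy()
        back = np.zeros((T, K), dtype=int)
        for t in range(1, T):
            stay = cost                           # remain in the same cluster
            switch = cost.min() + beta            # jump from the cheapest cluster
            back[t] = np.where(stay <= switch, np.arange(K), cost.argmin())
            cost = np.minimum(stay, switch) - loglik[t]
        path = np.empty(T, dtype=int)
        path[-1] = cost.argmin()
        for t in range(T - 1, 0, -1):             # backtrack the optimal path
            path[t - 1] = back[t, path[t]]
        return path

Alternating this assignment step with a per-cluster covariance refit gives the EM-like loop the abstract describes; the paper's closed-form ADMM solver for the Toeplitz-structured inverse covariances is what makes that refit scalable.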