Significance and recovery of block structures in binary and real-valued matrices with noise

Abstract

Biclustering algorithms have been of recent interest in the field of data mining, particularly in the analysis of high-dimensional data. Most biclustering problems can be stated in the following form: given a rectangular data matrix with real or categorical entries, find every submatrix satisfying a given criterion. In this dissertation, we study the statistical properties of several commonly used biclustering algorithms under appropriate random matrix models. For binary data, we establish a three-point concentration result, and several related probability bounds, for the size of the largest square submatrix of 1s in a square Bernoulli matrix, and extend these results to non-square matrices and submatrices with fixed aspect ratios. We then consider the noise sensitivity of frequent itemset mining (FIM) under a simple binary additive noise model, and show that, even at small noise levels, large blocks of 1s leave behind fragments of only logarithmic size. As a result, standard FIM algorithms that search only for submatrices of 1s cannot directly recover such blocks when noise is present. On the positive side, we show that an error-tolerant frequent itemset criterion can recover a submatrix of 1s against a background of 0s plus noise, even when the submatrix of 1s is very small. For data matrices with real-valued entries, we establish a concentration result for the size of the largest square submatrix with high average in a square Gaussian matrix. We also establish probability upper bounds on the size of the largest non-square high-average submatrix with a fixed row/column aspect ratio in a non-square real-valued matrix, itself of fixed row/column aspect ratio, when the entries of the matrix follow appropriate distributions. For biclustering algorithms targeting submatrices with low ANOVA residuals, we show how to assess the significance of the resulting submatrices. Lastly, we study the recoverability of submatrices with high average under an additive Gaussian noise model.
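
As a rough illustration of the scale involved in the binary case, the sketch below computes a first-moment estimate for the side length of the largest square submatrix of 1s in an n-by-n matrix with independent Bernoulli(p) entries. This is a standard counting heuristic, not the dissertation's argument or its three-point concentration result, and the function names `log_expected_count` and `first_moment_threshold` are hypothetical, introduced here only for illustration.

```python
from math import lgamma, log


def log_expected_count(n: int, k: int, p: float) -> float:
    """Log of the expected number of k-by-k all-ones submatrices
    in an n-by-n matrix of independent Bernoulli(p) entries.

    There are C(n, k)^2 ways to choose k rows and k columns, and each
    choice is all ones with probability p^(k^2).
    """
    # log C(n, k) via log-gamma, which avoids overflow for large n
    log_binom = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return 2.0 * log_binom + k * k * log(p)


def first_moment_threshold(n: int, p: float) -> int:
    """Largest k for which the expected count is still at least 1."""
    k = 0
    while k < n and log_expected_count(n, k + 1, p) >= 0.0:
        k += 1
    return k


if __name__ == "__main__":
    for n in (100, 1000, 10000):
        print(n, first_moment_threshold(n, 0.5))
```

Side lengths beyond the returned value have an expected count below one, so by Markov's inequality the probability that such a submatrix of 1s exists is bounded by that expected count. Since the expected count is at most n^(2k) p^(k^2), the estimate never exceeds 2 log n / log(1/p), which is consistent with the logarithmic sizes discussed in the abstract.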
