318 research outputs found

    Bilinear Probabilistic Canonical Correlation Analysis via Hybrid Concatenations

    Get PDF
    Canonical Correlation Analysis (CCA) is a classical technique for two-view correlation analysis, while Probabilistic CCA (PCCA) provides a generative and more general viewpoint for this task. Recently, PCCA has been extended to bilinear cases for dealing with two-view matrices in order to preserve and exploit the matrix structures in PCCA. However, existing bilinear PCCAs impose restrictive model assumptions for matrix structure preservation, sacrificing generative correctness or model flexibility. To overcome these drawbacks, we propose BPCCA, a new bilinear extension of PCCA, by introducing a hybrid joint model. Our new model preserves matrix structures indirectly via hybrid vector-based and matrix-based concatenations. This enables BPCCA to gain more model flexibility in capturing two-view correlations and obtain close-form solutions in parameter estimation. Experimental results on two real-world applications demonstrate the superior performance of BPCCA over competing methods

    Using SVD on Clusters to Improve Precision of Interdocument Similarity Measure

    Get PDF
    Recently, LSI (Latent Semantic Indexing) based on SVD (Singular Value Decomposition) is proposed to overcome the problems of polysemy and homonym in traditional lexical matching. However, it is usually criticized as with low discriminative power for representing documents although it has been validated as with good representative quality. In this paper, SVD on clusters is proposed to improve the discriminative power of LSI. The contribution of this paper is three manifolds. Firstly, we make a survey of existing linear algebra methods for LSI, including both SVD based methods and non-SVD based methods. Secondly, we propose SVD on clusters for LSI and theoretically explain that dimension expansion of document vectors and dimension projection using SVD are the two manipulations involved in SVD on clusters. Moreover, we develop updating processes to fold in new documents and terms in a decomposed matrix by SVD on clusters. Thirdly, two corpora, a Chinese corpus and an English corpus, are used to evaluate the performances of the proposed methods. Experiments demonstrate that, to some extent, SVD on clusters can improve the precision of interdocument similarity measure in comparison with other SVD based LSI methods

    Significance and recovery of blocks structures in binary and real-valued matrices with noise

    Get PDF
    Biclustering algorithms have been of recent interest in the field of Data Mining, particularly in the analysis of high dimensional data. Most biclustering problems can be stated in the following form: given a rectangular data matrix with real or categorical entries, find every submatrix satisfying a given criterion. In this dissertation, we study the statistical properties of several commonly used biclustering algorithms under appropriate random matrix models. For binary data, we establish a three-point concentration result, and several related probability bounds, for the size of the largest square submatrix of 1s in a square Bernoulli matrix, and extend these results to non-square matrices and submatrices with fixed aspect ratios. We then consider the noise sensitivity of frequent itemset mining under a simple binary additive noise model, and show that, even at small noise levels, large blocks of 1s leave behind fragments of only logarithmic size. As a result, standard FIM algorithms that search only for submatrices of 1s cannot directly recover such blocks when noise is present. On the positive side, we show that an error-tolerant frequent itemset criterion can recover a submatrix of 1s against a background of 0s plus noise, even when the size of the submatrix of 1s is very small. For data matrices with real-valued entries, we establish a concentration result for the size of the largest square submatrix with high average in a square Gaussian matrix. Probability upper bounds on the size of the largest non-square high average submatrix with a fixed row/column aspect ratio in a non-square real-valued matrix with fixed row/column aspect ratio are also established when the entries of the matrix follow appropriate distributions. For biclustering algorithms targeting submatrices with low ANOVA residuals, we show how to assess the significance of the resulting submatrices. Lastly, we study the recoverability of submatrices with high average under an additive Gaussian noise model

    Biclustering on expression data: A review

    Get PDF
    Biclustering has become a popular technique for the study of gene expression data, especially for discovering functionally related gene sets under different subsets of experimental conditions. Most of biclustering approaches use a measure or cost function that determines the quality of biclusters. In such cases, the development of both a suitable heuristics and a good measure for guiding the search are essential for discovering interesting biclusters in an expression matrix. Nevertheless, not all existing biclustering approaches base their search on evaluation measures for biclusters. There exists a diverse set of biclustering tools that follow different strategies and algorithmic concepts which guide the search towards meaningful results. In this paper we present a extensive survey of biclustering approaches, classifying them into two categories according to whether or not use evaluation metrics within the search method: biclustering algorithms based on evaluation measures and non metric-based biclustering algorithms. In both cases, they have been classified according to the type of meta-heuristics which they are based on.Ministerio de Economía y Competitividad TIN2011-2895
    corecore