88 research outputs found

    The Multivariate Watson Distribution: Maximum-Likelihood Estimation and other Aspects

    This paper studies fundamental aspects of modelling data using multivariate Watson distributions. Although these distributions are natural for modelling axially symmetric data (i.e., unit vectors where ±x are equivalent), using them in high dimensions can be difficult. Why so? Largely because for Watson distributions even basic tasks such as maximum-likelihood estimation are numerically challenging. To tackle the numerical difficulties some approximations have been derived, but these are either grossly inaccurate in high dimensions (Directional Statistics, Mardia & Jupp, 2000) or, when reasonably accurate (J. Machine Learning Research, W. & C.P., v2, Bijral et al., 2007, pp. 35-42), lack theoretical justification. We derive new approximations to the maximum-likelihood estimates; our approximations are theoretically well-defined, numerically accurate, and easy to compute. We build on our parameter estimation and discuss mixture-modelling with Watson distributions; here we uncover a hitherto unknown connection to the "diametrical clustering" algorithm of Dhillon et al. (Bioinformatics, 19(13), 2003, pp. 1612-1619). Comment: 24 pages; extensively updated numerical results
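
To make the idea of clustering axially symmetric data concrete, here is a minimal sketch in the spirit of diametrical clustering. It is not the paper's estimator or the published algorithm's reference code: the function names, the explicit `init_centers` argument, and the fixed iteration count are assumptions made for illustration. Each unit vector is assigned to the centre with the largest squared inner product (so x and -x behave identically), and each centre is then moved by one power-iteration step on its cluster's scatter matrix.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    n = math.sqrt(sum(c * c for c in v))
    # Leave a zero vector untouched rather than dividing by zero.
    return [c / n for c in v] if n > 0 else list(v)


def diametrical_clustering(points, init_centers, n_iter=20):
    """Cluster unit vectors while treating x and -x as the same point.

    `init_centers` (hypothetical parameter) supplies the starting axes.
    """
    centers = [normalize(c) for c in init_centers]
    k = len(centers)
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: the squared inner product ignores the sign of x.
        labels = [
            max(range(k), key=lambda j: dot(x, centers[j]) ** 2)
            for x in points
        ]
        # Update step: sum_x (x . mu) x equals (sum_x x x^T) mu,
        # i.e. one power-iteration step towards the leading eigenvector.
        new_centers = []
        for j in range(k):
            acc = [0.0] * len(centers[j])
            for x, lab in zip(points, labels):
                if lab == j:
                    w = dot(x, centers[j])
                    acc = [a + w * c for a, c in zip(acc, x)]
            moved = normalize(acc)
            # Keep the old centre if the cluster is empty or degenerate.
            new_centers.append(moved if any(moved) else centers[j])
        centers = new_centers
    return labels, centers
```

Like other alternating schemes of this kind, the result depends on the starting centres, so in practice one would try several initialisations.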

    Clustering of categorical variables around latent variables

    In the framework of clustering, the usual aim is to cluster observations and not variables. However, the issue of variable clustering clearly appears for dimension reduction, selection of variables or in some case studies (sensory analysis, biochemistry, marketing, etc.). Clustering of variables is then studied as a way to arrange variables into homogeneous clusters, thereby organizing data into meaningful structures. Once the variables are clustered into groups such that variables are similar to the other variables belonging to their cluster, the selection of a subset of variables is possible. Several specific methods have been developed for the clustering of numerical variables. However, for categorical variables, far fewer methods have been proposed. In this paper we extend the criterion used by Vigneau and Qannari (2003) in their Clustering around Latent Variables approach for numerical variables to the case of categorical data. The homogeneity criterion of a cluster of categorical variables is defined as the sum of the correlation ratios between the categorical variables and a latent variable, which is in this case a numerical variable. We show that the latent variable maximizing the homogeneity of a cluster can be obtained with Multiple Correspondence Analysis. Different algorithms for the clustering of categorical variables are proposed: iterative relocation algorithm, ascendant and divisive hierarchical clustering. The proposed methodology is illustrated by a real data application to satisfaction of pleasure craft operators. Keywords: clustering of categorical variables, correlation ratio, iterative relocation algorithm, hierarchical clustering
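
The homogeneity criterion described in this abstract can be sketched as follows. This only evaluates the criterion for a latent variable that is already given; it does not perform the Multiple Correspondence Analysis step that the abstract says yields the maximizing latent variable. The function names `correlation_ratio` and `cluster_homogeneity` are hypothetical, and the correlation ratio is taken in its usual form as between-group sum of squares over total sum of squares.

```python
def correlation_ratio(categories, values):
    """Share of the variance of `values` explained by the groups in `categories`."""
    n = len(values)
    grand_mean = sum(values) / n
    total_ss = sum((v - grand_mean) ** 2 for v in values)
    if total_ss == 0:
        return 0.0  # a constant latent variable explains nothing
    # Collect the latent values observed for each category level.
    groups = {}
    for c, v in zip(categories, values):
        groups.setdefault(c, []).append(v)
    between_ss = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values()
    )
    return between_ss / total_ss


def cluster_homogeneity(categorical_vars, latent):
    """Sum of correlation ratios between each categorical variable and `latent`."""
    return sum(correlation_ratio(var, latent) for var in categorical_vars)
```

A variable whose levels separate the latent values perfectly contributes 1 to the sum, and a variable unrelated to the latent variable contributes 0.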

    ClustOfVar: An R Package for the Clustering of Variables

    Clustering of variables is a way to arrange variables into homogeneous clusters, i.e., groups of variables which are strongly related to each other and thus bring the same information. These approaches can then be useful for dimension reduction and variable selection. Several specific methods have been developed for the clustering of numerical variables. However, for qualitative variables or mixtures of quantitative and qualitative variables, far fewer methods have been proposed. The R package ClustOfVar was specifically developed for this purpose. The homogeneity criterion of a cluster is defined as the sum of correlation ratios (for qualitative variables) and squared correlations (for quantitative variables) to a synthetic quantitative variable, summarizing "as good as possible" the variables in the cluster. This synthetic variable is the first principal component obtained with the PCAMIX method. Two algorithms for the clustering of variables are proposed: iterative relocation algorithm and ascendant hierarchical clustering. We also propose a bootstrap approach in order to determine suitable numbers of clusters. We illustrate the methodologies and the associated package on small datasets.

    Dimensionality reduction by clustering of variables while setting aside atypical variables

    Clustering of variables is one possible approach for reducing the dimensionality of a dataset. However, all the variables are usually assigned to one of the clusters, even the scattered variables associated with atypical or noise information. The presence of this type of information could obscure the interpretation of the latent variables associated with the clusters, or even give rise to artificial clusters. We propose two strategies to address this problem. The first is a "K+1" strategy, which consists of introducing an additional group of variables, called the "noise cluster" for simplicity. The second is based on the definition of sparse latent variables. Both strategies result in refined clusters for the identification of more relevant latent variables.

    Systematic gene function prediction from gene expression data by using a fuzzy nearest-cluster method

    BACKGROUND: Quantitative simultaneous monitoring of the expression levels of thousands of genes under various experimental conditions is now possible using microarray experiments. However, there are still gaps toward whole-genome functional annotation of genes using the gene expression data. RESULTS: In this paper, we propose a novel technique called Fuzzy Nearest Clusters for genome-wide functional annotation of unclassified genes. The technique consists of two steps: an initial hierarchical clustering step to detect homogeneous co-expressed gene subgroups or clusters in each possibly heterogeneous functional class, followed by a classification step to predict the functional roles of the unclassified genes based on their corresponding similarities to the detected functional clusters. CONCLUSION: Our experimental results with yeast gene expression data showed that the proposed method can accurately predict the genes' functions, even those with multiple functional roles, and the prediction performance is most independent of the underlying heterogeneity of the complex functional classes, as compared to the other conventional gene function prediction approaches.

    Identification of the Proliferation/Differentiation Switch in the Cellular Network of Multicellular Organisms

    The protein–protein interaction networks, or interactome networks, have been shown to have dynamic modular structures, yet the functional connections between and among the modules are less well understood. Here, using a new pipeline to integrate the interactome and the transcriptome, we identified a pair of transcriptionally anticorrelated modules, each consisting of hundreds of genes in multicellular interactome networks across different individuals and populations. The two modules are associated with cellular proliferation and differentiation, respectively. The proliferation module is conserved among eukaryotic organisms, whereas the differentiation module is specific to multicellular organisms. Upon differentiation of various tissues and cell lines from different organisms, the expression of the proliferation module is more uniformly suppressed, while the differentiation module is upregulated in a tissue- and species-specific manner. Our results indicate that even at the tissue and organism levels, proliferation and differentiation modules may correspond to two alternative states of the molecular network and may reflect a universal symbiotic relationship in a multicellular organism. Our analyses further predict that the proteins mediating the interactions between these modules may serve as modulators at the proliferation/differentiation switch.

    Negative Correlation Aided Network Module Extraction

    In this paper, we propose a method to construct an unweighted co-expression network that considers both positive and negative correlation among gene expressions. A measure named NCNMRS (Negative Correlation aided Normalized Mean Residue Similarity) is introduced. The measure can detect both of these correlations and is used to determine whether a pair of genes is highly correlated, either positively or negatively. A greedy technique is also proposed to extract modules from the unweighted network. The technique picks, one at a time, the pair of genes with the next highest NCNMRS score such that neither gene in the pair has been included in any network module extracted so far, and extends this partial module to a complete network module by including genes with high connectivity to the partial module. The technique was applied to a number of real-life gene expression datasets and the results have high biological relevance.
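
The greedy extraction described in this abstract can be illustrated with a rough sketch. The abstract does not define NCNMRS, so the absolute Pearson correlation is swapped in here as a stand-in score that also treats strong positive and strong negative correlation alike. The `threshold` and `min_conn` parameters, the rule for "high connectivity" (a candidate must be linked to at least a fraction `min_conn` of the current module), and all function names are assumptions, not the authors' definitions.

```python
import math
from itertools import combinations


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant profile is treated as uncorrelated
    return sxy / math.sqrt(sxx * syy)


def build_network(expr, threshold=0.9):
    """expr maps gene name -> expression profile. Returns (scores, edges)."""
    # Stand-in score: |r| is high for both positive and negative correlation.
    scores = {
        (g, h): abs(pearson(expr[g], expr[h]))
        for g, h in combinations(sorted(expr), 2)
    }
    # Unweighted network: keep an edge only when the score passes the threshold.
    edges = {pair for pair, s in scores.items() if s >= threshold}
    return scores, edges


def extract_modules(expr, threshold=0.9, min_conn=0.5):
    scores, edges = build_network(expr, threshold)
    used, modules = set(), []
    # Visit pairs from the highest score downwards.
    for (g, h), s in sorted(scores.items(), key=lambda kv: -kv[1]):
        if s < threshold:
            break
        if g in used or h in used:
            continue  # both seed genes must be unassigned
        module = [g, h]
        used.update(module)
        grown = True
        while grown:  # keep extending until no candidate qualifies
            grown = False
            for c in sorted(expr):
                if c in used:
                    continue
                links = sum((c, m) in edges or (m, c) in edges for m in module)
                if links / len(module) >= min_conn:
                    module.append(c)
                    used.add(c)
                    grown = True
        modules.append(module)
    return modules
```

With this stand-in score, a gene whose profile mirrors another's (rising where the other falls) lands in the same module as a gene that tracks it directly, which is the behaviour the abstract attributes to considering negative correlation.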