43,544 research outputs found

    Phase transitions in spiked matrix estimation: information-theoretic analysis

    We study the so-called spiked Wigner and Wishart models, in which one observes a low-rank matrix perturbed by Gaussian noise. These models encompass many classical statistical tasks such as sparse PCA, submatrix localization, community detection and Gaussian mixture clustering. The goal of these notes is to present in a unified manner recent results (as well as new developments) on the information-theoretic limits of these spiked matrix models. We compute the minimal mean squared error for the estimation of the low-rank signal and compare it to the performance of spectral estimators and message passing algorithms. Phase transition phenomena are observed: depending on the noise level it is either impossible, easy (i.e. achievable with polynomial-time estimators) or hard (information-theoretically possible, but no efficient algorithm is known to succeed) to recover the signal. Comment: These notes present in a unified manner recent results (as well as new developments) on the information-theoretic limits in spiked matrix estimation.
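    To make the phase transition concrete, here is a minimal sketch (not taken from the notes) that simulates a rank-one spiked Wigner matrix under an assumed normalisation Y = (snr/n) x xᵀ + W/√n and measures how well the leading eigenvector, i.e. the spectral estimator mentioned above, recovers the spike. The Rademacher spike, the normalisation and the parameter values are illustrative assumptions; the overlap stays near zero below snr ≈ 1 and becomes macroscopic above it.

```python
# Sketch only: spiked Wigner simulation with a spectral estimator.
import numpy as np

def spiked_wigner(n, snr, rng):
    """Rank-one spiked Wigner matrix: Y = (snr/n) x x^T + W/sqrt(n) (assumed scaling)."""
    x = rng.choice([-1.0, 1.0], size=n)          # Rademacher spike, ||x||^2 = n
    g = rng.normal(size=(n, n))
    w = (g + g.T) / np.sqrt(2.0)                 # symmetric Gaussian noise
    return (snr / n) * np.outer(x, x) + w / np.sqrt(n), x

def spectral_overlap(n=1500, snr=2.0, seed=0):
    """Overlap between the leading eigenvector and the planted spike, in [0, 1]."""
    rng = np.random.default_rng(seed)
    y, x = spiked_wigner(n, snr, rng)
    top = np.linalg.eigh(y)[1][:, -1]            # eigenvector of the largest eigenvalue
    return abs(top @ x) / np.sqrt(n)

for snr in (0.5, 1.0, 2.0, 4.0):
    print(f"snr={snr}: overlap ~ {spectral_overlap(snr=snr):.2f}")
```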

    Empirical analysis of rough set categorical clustering techniques based on rough purity and value set

    Clustering a set of objects into homogeneous groups is a fundamental operation in data mining. Recently, attention has turned to categorical data clustering, where data objects are made up of non-numerical attributes. Implementing several existing categorical clustering techniques is challenging, as some are unable to handle uncertainty and others have stability issues. For dealing with categorical data and handling uncertainty, rough set theory has become a well-established mechanism in a wide variety of applications, including databases. Recent techniques such as Information-Theoretic Dependency Roughness (ITDR), Maximum Dependency Attribute (MDA) and Maximum Significance Attribute (MSA) outperform their predecessors, such as Bi-Clustering (BC), Total Roughness (TR), Min-Min Roughness (MMR) and Standard-Deviation Roughness (SDR). This work explores the limitations and issues of ITDR, MDA and MSA on data sets where these techniques fail to select, or face difficulty in selecting, their best clustering attribute. Accordingly, two alternative techniques, the Rough Purity Approach (RPA) and Maximum Value Attribute (MVA), are proposed. The novelty of the proposed approaches is that RPA introduces a new uncertainty definition based on the purity of the rough relational database, whereas MVA, unlike other rough set techniques, uses domain knowledge in the form of the value set combined with the number of clusters (NoC). Several propositions are presented to establish the significance and the mathematical and theoretical basis of the proposed approaches. Moreover, the recent rough categorical techniques (MDA, MSA, ITDR) and the classical K-means clustering algorithm are used for comparison, with results presented in tabular and graphical form. For experiments, data sets from previously published research cases, a real supply base management (SBM) data set and the UCI repository are utilized. The results reveal significant improvements by the proposed techniques for categorical clustering in terms of purity (21%), entropy (9%), accuracy (16%), rough accuracy (11%), iterations (99%) and time (93%).
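    As a point of reference for the rough-set vocabulary used above, the sketch below implements only the textbook building blocks: indiscernibility classes of a categorical attribute, lower/upper approximations of a target set, and the resulting accuracy-of-approximation measure. The RPA purity and MVA value-set criteria from the thesis are not reproduced, and the toy data are invented for illustration.

```python
# Sketch only: standard rough-set building blocks for categorical data.
from collections import defaultdict

def equivalence_classes(objects, attribute):
    """Group object ids by their value on one categorical attribute."""
    classes = defaultdict(set)
    for oid, row in objects.items():
        classes[row[attribute]].add(oid)
    return list(classes.values())

def approximations(classes, target):
    """Lower/upper approximation of `target` w.r.t. the partition `classes`."""
    lower = set().union(*[c for c in classes if c <= target])
    upper = set().union(*[c for c in classes if c & target])
    return lower, upper

def accuracy_of_approximation(classes, target):
    lower, upper = approximations(classes, target)
    return len(lower) / len(upper) if upper else 1.0

toy = {
    1: {"colour": "red",  "size": "small"},
    2: {"colour": "red",  "size": "large"},
    3: {"colour": "blue", "size": "small"},
    4: {"colour": "blue", "size": "large"},
}
classes = equivalence_classes(toy, "colour")
# lower = {1, 2}, upper = {1, 2, 3, 4}, so accuracy = 0.5
print(accuracy_of_approximation(classes, {1, 2, 3}))
```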

    Visualising the structure of document search results: A comparison of graph theoretic approaches

    This is the post-print of the article. Copyright @ 2010 Sage Publications. Previous work has shown that distance-similarity visualisation or ‘spatialisation’ can provide a potentially useful context in which to browse the results of a query search, enabling the user to adopt a simple local foraging or ‘cluster growing’ strategy to navigate through the retrieved document set. However, faithfully mapping feature-space models to visual space can be problematic owing to their inherent high dimensionality and non-linearity. Conventional linear approaches to dimension reduction tend to fail at this kind of task, sacrificing local structural detail in order to preserve a globally optimal mapping. In this paper the clustering performance of a recently proposed algorithm called isometric feature mapping (Isomap), which deals with non-linearity by transforming dissimilarities into geodesic distances, is compared to that of non-metric multidimensional scaling (MDS). Various graph pruning methods for geodesic distance estimation are also compared. Results show that Isomap is significantly better at preserving local structural detail than MDS, suggesting it is better suited to cluster growing and other semantic navigation tasks. Moreover, it is shown that applying a minimum-cost graph pruning criterion can provide a parameter-free alternative to the traditional K-neighbour method, resulting in spatial clustering that is equivalent to or better than that achieved using an optimal-K criterion.
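    Below is a hedged sketch of the kind of comparison described above, using scikit-learn's Isomap and non-metric MDS on TF-IDF document vectors. The 20 Newsgroups corpus (downloaded on first use), the neighbourhood size, the document subset and the silhouette score as a proxy for "local structure preserved" are placeholder assumptions, not the paper's experimental protocol.

```python
# Sketch only: Isomap vs non-metric MDS on a small stand-in document collection.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import Isomap, MDS
from sklearn.metrics import silhouette_score

# Placeholder corpus: two topics stand in for a retrieved document set.
news = fetch_20newsgroups(subset="train", categories=["sci.space", "rec.autos"],
                          remove=("headers", "footers", "quotes"))
docs, labels = news.data[:400], news.target[:400]
X = TfidfVectorizer(max_features=2000, stop_words="english").fit_transform(docs).toarray()

iso_xy = Isomap(n_neighbors=10, n_components=2).fit_transform(X)
mds_xy = MDS(n_components=2, metric=False, random_state=0).fit_transform(X)

# Crude proxy for preserved local structure: how cleanly the known topics
# separate in each 2-D layout (higher silhouette = tighter, better-separated clusters).
print("Isomap silhouette:", silhouette_score(iso_xy, labels))
print("MDS silhouette:   ", silhouette_score(mds_xy, labels))
```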

    Effect of Number of Users in Multi-level Coded Caching

    It has recently been established that joint design of content delivery and storage (coded caching) can significantly improve performance over conventional caching. This has also been extended, through several models, to the case where content has non-uniform popularity. In this paper we focus on a multi-level popularity model, where content is divided into levels based on popularity. We consider two extreme cases of user distribution across caches for the multi-level popularity model: a single user per cache (single-user setup) versus a large number of users per cache (multi-user setup). When the capacity approximation is universal (independent of the number of popularity levels, as well as the number of users, files and caches), we demonstrate a dichotomy in the order-optimal strategies for these two extreme cases. In the multi-user case, sharing memory among the levels is order-optimal, whereas in the single-user case, clustering popularity levels and allocating all the memory to them is the order-optimal scheme. In proving these results, we develop new information-theoretic lower bounds for the problem. Comment: 13 pages; 2 figures. A shorter version is to appear in IEEE ISIT 201
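    The abstract above concerns information-theoretic bounds rather than a concrete algorithm, but the size of the gains at stake is easy to see from the standard single-level coded caching rate expressions, on top of which multi-level schemes allocate memory across popularity levels. The sketch below evaluates the well-known uncoded and coded delivery loads at illustrative parameters; it is not the paper's multi-level scheme.

```python
# Sketch only: uncoded vs coded delivery load for N files, K caches, cache size M files.
def uncoded_rate(N, K, M):
    return K * (1 - M / N)                    # each user's missing content served separately

def coded_rate(N, K, M):
    return K * (1 - M / N) / (1 + K * M / N)  # multicast coding gain of 1 + KM/N

N, K = 100, 20
for M in (5, 10, 25, 50):
    print(f"M={M:>2}  uncoded={uncoded_rate(N, K, M):6.2f}  coded={coded_rate(N, K, M):6.2f}")
```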

    Efficient Information Theoretic Clustering on Discrete Lattices

    We consider the problem of clustering data that reside on discrete, low-dimensional lattices. Canonical examples of this setting are found in image segmentation and key point extraction. Our solution is based on a recent approach to information-theoretic clustering in which clusters result from an iterative procedure that minimizes a divergence measure. We replace costly processing steps of the original algorithm with convolutions. These allow for highly efficient implementations and thus significantly reduce runtime. This paper therefore bridges a gap between machine learning and signal processing. Comment: This paper has been presented at the workshop LWA 201
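    A small sketch of the general trick being exploited: on a regular lattice, per-point kernel sums (for example, a Parzen density estimate over all pixels) collapse into a single separable convolution over the grid. The Gaussian kernel, the image size and the SciPy-based implementation are illustrative assumptions, not the authors' full clustering procedure.

```python
# Sketch only: replacing an O(P^2) kernel sum on a lattice with one convolution.
import numpy as np
from scipy.ndimage import gaussian_filter

rng = np.random.default_rng(0)
weights = rng.random((256, 256))     # e.g. per-pixel cluster-membership weights

# Naive form: density(p) = sum_q weights(q) * k(p - q), i.e. O(P^2) kernel
# evaluations for P lattice sites. With a Gaussian kernel on a lattice this is
# exactly one separable convolution, linear in P per axis.
density = gaussian_filter(weights, sigma=3.0, mode="nearest")

print(density.shape, round(float(density.mean()), 3))
```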
