Phase transitions in spiked matrix estimation: information-theoretic analysis
We study here the so-called spiked Wigner and Wishart models, where one
observes a low-rank matrix perturbed by some Gaussian noise. These models
encompass many classical statistical tasks such as sparse PCA, submatrix
localization, community detection or Gaussian mixture clustering. The goal of
these notes is to present in a unified manner recent results (as well as new
developments) on the information-theoretic limits of these spiked matrix
models. We compute the minimal mean squared error for the estimation of the
low-rank signal and compare it to the performance of spectral estimators and
message passing algorithms. Phase transition phenomena are observed: depending
on the noise level it is either impossible, easy (i.e. using polynomial-time
estimators) or hard (information-theoretically possible, but no efficient
algorithm is known to succeed) to recover the signal.
Comment: These notes present in a unified manner recent results (as well as
new developments) on the information-theoretic limits in spiked matrix
estimatio
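For orientation, the spiked Wigner model and the minimal mean squared error are often written as follows. The abstract gives no formulas, so the symbols (signal X, noise Z, signal-to-noise ratio \lambda, dimension n) and the normalisation are assumptions based on one common convention.

```latex
% Assumed convention: X \in \mathbb{R}^n has i.i.d. entries drawn from a prior P_0,
% and Z is a symmetric matrix with independent standard Gaussian entries above the diagonal.
Y = \sqrt{\frac{\lambda}{n}}\, X X^{\top} + Z,
\qquad
\mathrm{MMSE}_n(\lambda)
  = \min_{\hat\theta}\ \frac{1}{n^{2}}\,
    \mathbb{E}\,\bigl\| X X^{\top} - \hat\theta(Y) \bigr\|_F^{2}
  = \frac{1}{n^{2}}\,
    \mathbb{E}\,\bigl\| X X^{\top} - \mathbb{E}\bigl[ X X^{\top} \mid Y \bigr] \bigr\|_F^{2}
```

The second equality holds because the posterior mean minimises the mean squared error; the phase transitions mentioned above are statements about how this quantity behaves as \lambda varies in the limit of large n.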
Empirical analysis of rough set categorical clustering techniques based on rough purity and value set
Clustering a set of objects into homogeneous groups is a fundamental operation
in data mining. Recently, attention has turned to categorical data clustering,
where data objects are made up of non-numerical attributes. Implementing
several existing categorical clustering techniques is challenging, as some are unable
to handle uncertainty and others have stability issues. For dealing
with categorical data and handling uncertainty, rough set theory has become
a well-established mechanism in a wide variety of applications, including databases.
Recent techniques such as Information-Theoretic Dependency Roughness (ITDR),
Maximum Dependency Attribute (MDA) and Maximum Significance Attribute (MSA)
have outperformed predecessor approaches such as Bi-Clustering (BC), Total Roughness
(TR), Min-Min Roughness (MMR) and Standard-Deviation Roughness (SDR). This
work explores the limitations of the ITDR, MDA and MSA techniques on
data sets where they fail to select, or have difficulty selecting, their
best clustering attribute. Accordingly, two alternative techniques, the Rough Purity
Approach (RPA) and Maximum Value Attribute (MVA), are proposed. The novelty
of the proposed approaches is that RPA introduces a new definition of uncertainty
based on the purity of a rough relational database, whereas MVA, unlike other rough
set theory techniques, uses domain knowledge such as the value set combined with
the number of clusters (NoC). To establish the significance and the mathematical and
theoretical basis of the proposed approaches, several propositions are presented. Moreover,
recent rough categorical techniques such as MDA, MSA and ITDR, and a classical clustering
technique, simple K-means, are used for comparison, and the results are presented
in tabular and graphical form. The experiments use data sets from previously published
research cases, a real supply base management (SBM) data set and the UCI repository.
The results reveal significant improvement by the proposed techniques for
categorical clustering in terms of purity (21%), entropy (9%), accuracy (16%), rough
accuracy (11%), iterations (99%) and time (93%).
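Two of the evaluation measures named above, purity and entropy, have widely used textbook definitions that can be sketched in a few lines. The function names and the toy data below are hypothetical, and the definitions used in the paper itself may differ in detail; rough accuracy is not covered here.

```python
from collections import Counter
from math import log2


def _class_counts_per_cluster(clusters, labels):
    """Map each cluster id to a Counter of the true class labels it contains."""
    counts = {}
    for cluster_id, label in zip(clusters, labels):
        counts.setdefault(cluster_id, Counter())[label] += 1
    return counts


def purity(clusters, labels):
    """Fraction of objects that carry the majority class label of their cluster."""
    counts = _class_counts_per_cluster(clusters, labels)
    majority_total = sum(max(c.values()) for c in counts.values())
    return majority_total / len(labels)


def weighted_entropy(clusters, labels):
    """Size-weighted average of the class-label entropy (in bits) of each cluster."""
    counts = _class_counts_per_cluster(clusters, labels)
    total = len(labels)
    result = 0.0
    for c in counts.values():
        size = sum(c.values())
        cluster_entropy = -sum((k / size) * log2(k / size) for k in c.values())
        result += (size / total) * cluster_entropy
    return result


# Toy example (hypothetical data): two clusters of three objects each.
example_clusters = [0, 0, 0, 1, 1, 1]
example_labels = ["a", "a", "b", "b", "b", "b"]
example_purity = purity(example_clusters, example_labels)
example_entropy = weighted_entropy(example_clusters, example_labels)
```

Higher purity and lower entropy both indicate clusters that align more closely with the known classes, which is the sense in which an improvement in each would be read.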
Visualising the structure of document search results: A comparison of graph theoretic approaches
This is the post-print of the article - Copyright @ 2010 Sage Publications
Previous work has shown that distance-similarity visualisation or 'spatialisation' can provide a potentially useful context in which to browse the results of a query search, enabling the user to adopt a simple local foraging or 'cluster growing' strategy to navigate through the retrieved document set. However, faithfully mapping feature-space models to visual space can be problematic owing to their inherent high dimensionality and non-linearity. Conventional linear approaches to dimension reduction tend to fail at this kind of task, sacrificing local structure in order to preserve a globally optimal mapping. In this paper the clustering performance of a recently proposed algorithm called isometric feature mapping (Isomap), which deals with non-linearity by transforming dissimilarities into geodesic distances, is compared to that of non-metric multidimensional scaling (MDS). Various graph pruning methods for geodesic distance estimation are also compared. Results show that Isomap is significantly better at preserving local structural detail than MDS, suggesting it is better suited to cluster growing and other semantic navigation tasks. Moreover, it is shown that applying a minimum-cost graph pruning criterion can provide a parameter-free alternative to the traditional K-neighbour method, resulting in spatial clustering that is equivalent to or better than that achieved using an optimal-K criterion.
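The geodesic distance estimation that Isomap relies on can be sketched as follows, assuming the traditional K-neighbour graph followed by all-pairs shortest paths (Floyd-Warshall). This is not the minimum-cost pruning criterion the article proposes, and the function name and sample points are hypothetical.

```python
from math import dist, inf


def geodesic_distances(points, k):
    """Approximate geodesic distances: K-neighbour graph plus Floyd-Warshall."""
    n = len(points)
    d = [[0.0 if i == j else inf for j in range(n)] for i in range(n)]

    # Connect every point to its k nearest neighbours (edges are made symmetric).
    for i in range(n):
        nearest = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: dist(points[i], points[j]),
        )[:k]
        for j in nearest:
            d[i][j] = d[j][i] = dist(points[i], points[j])

    # Shortest paths through the graph approximate distances along the manifold.
    for m in range(n):
        for i in range(n):
            for j in range(n):
                if d[i][m] + d[m][j] < d[i][j]:
                    d[i][j] = d[i][m] + d[m][j]
    return d


# Points along an L-shaped curve: the end points are close in straight-line
# terms but far apart when measured along the curve.
curve = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2)]
geo = geodesic_distances(curve, k=2)
straight_line = dist(curve[0], curve[4])
```

Isomap would then apply classical MDS to the resulting distance matrix; the sketch stops before that step. If the neighbour graph is disconnected, some entries remain infinite, which is one reason the choice of K (or of a pruning criterion) matters.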
Effect of Number of Users in Multi-level Coded Caching
It has been recently established that joint design of content delivery and
storage (coded caching) can significantly improve performance over conventional
caching. This has also been extended to the case when content has non-uniform
popularity through several models. In this paper we focus on a multi-level
popularity model, where content is divided into levels based on popularity. We
consider two extreme cases of user distribution across caches for the
multi-level popularity model: a single user per cache (single-user setup)
versus a large number of users per cache (multi-user setup). When the capacity
approximation is universal (independent of number of popularity levels as well
as number of users, files and caches), we demonstrate a dichotomy in the
order-optimal strategies for these two extreme cases. In the multi-user case,
sharing memory among the levels is order-optimal, whereas for the single-user
case clustering popularity levels and allocating all the memory to them is the
order-optimal scheme. In proving these results, we develop new
information-theoretic lower bounds for the problem.
Comment: 13 pages; 2 figures. A shorter version is to appear in IEEE ISIT 201
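As background for the claim that coded caching improves on conventional caching, the comparison below uses the well-known single-level, uniform-popularity formulas for the centralized setting (one user per cache, at least as many files as users). It does not model the multi-level scheme studied in this paper, and the function names and parameter values are hypothetical.

```python
from fractions import Fraction


def uncoded_rate(num_users, num_files, cache_size):
    """Delivery rate (in files) when each user caches a fraction M/N of every file
    and the missing parts are sent separately to each user."""
    return num_users * (1 - Fraction(cache_size, num_files))


def coded_rate(num_users, num_files, cache_size):
    """Delivery rate of the classical centralized coded caching scheme.

    Assumes num_files >= num_users and that num_users * cache_size / num_files
    is an integer; other cache sizes are handled by memory sharing.
    """
    local_gain = 1 - Fraction(cache_size, num_files)
    global_gain = 1 / (1 + Fraction(num_users * cache_size, num_files))
    return num_users * local_gain * global_gain


# Hypothetical setting: 4 users, 4 files, each cache holds 2 files.
conventional = uncoded_rate(4, 4, 2)
coded = coded_rate(4, 4, 2)
```

The extra factor 1 / (1 + KM/N) is the gain from coded multicasting, which grows with the number of users K; with multi-level popularity the question becomes how to divide each cache among the levels, which is what the abstract above addresses.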
Efficient Information Theoretic Clustering on Discrete Lattices
We consider the problem of clustering data that reside on discrete, low
dimensional lattices. Canonical examples for this setting are found in image
segmentation and key point extraction. Our solution is based on a recent
approach to information theoretic clustering where clusters result from an
iterative procedure that minimizes a divergence measure. We replace costly
processing steps in the original algorithm by means of convolutions. These
allow for highly efficient implementations and thus significantly reduce
runtime. This paper therefore bridges a gap between machine learning and signal
processing.
Comment: This paper has been presented at the workshop LWA 201