Combining Multiple Clusterings via Crowd Agreement Estimation and Multi-Granularity Link Analysis
The clustering ensemble technique aims to combine multiple clusterings into a
probably better and more robust clustering and has received increasing
attention in recent years. Existing clustering ensemble approaches have two
main limitations. Firstly, many approaches lack the
ability to weight the base clusterings without access to the original data and
can be affected significantly by low-quality, or even ill, clusterings.
Secondly, they generally focus on the instance level or cluster level in the
ensemble system and fail to integrate multi-granularity cues into a unified
model. To address these two limitations, this paper proposes to solve the
clustering ensemble problem via crowd agreement estimation and
multi-granularity link analysis. We present the normalized crowd agreement
index (NCAI) to evaluate the quality of base clusterings in an unsupervised
manner and thus weight the base clusterings in accordance with their clustering
validity. To explore the relationship between clusters, the source aware
connected triple (SACT) similarity is introduced with regard to their common
neighbors and the source reliability. Based on NCAI and multi-granularity
information collected among base clusterings, clusters, and data instances, we
further propose two novel consensus functions, termed weighted evidence
accumulation clustering (WEAC) and graph partitioning with multi-granularity
link analysis (GP-MGLA) respectively. The experiments are conducted on eight
real-world datasets. The experimental results demonstrate the effectiveness and
robustness of the proposed methods.
Comment: The MATLAB source code of this work is available at:
https://www.researchgate.net/publication/28197031
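The idea of weighting base clusterings before accumulating evidence can be illustrated with a minimal sketch. This is not the authors' WEAC implementation: the function name `weighted_co_association` is hypothetical, and the weights are simply passed in by hand, whereas the abstract derives them from NCAI, whose formula is not given here. The sketch only shows how per-clustering reliabilities could scale each clustering's vote in a co-association matrix.

```python
import numpy as np

def weighted_co_association(base_labels, weights):
    """Accumulate pairwise co-membership votes, scaled by clustering weights.

    base_labels: (M, N) array; row m holds the cluster label that base
                 clustering m assigns to each of the N instances.
    weights:     length-M nonnegative reliabilities (assumed given; in the
                 paper they would come from NCAI, which is not reproduced here).
    """
    base_labels = np.asarray(base_labels)
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()  # normalize so entries lie in [0, 1]
    n = base_labels.shape[1]
    co = np.zeros((n, n))
    for labels, w in zip(base_labels, weights):
        # 1 where two instances share a cluster in this base clustering
        co += w * (labels[:, None] == labels[None, :])
    return co

# Three hypothetical base clusterings of four instances, with made-up weights.
base = [[0, 0, 1, 1],
        [0, 0, 0, 1],
        [1, 1, 0, 0]]
w = [0.5, 0.2, 0.3]
C = weighted_co_association(base, w)
```

In this toy input, instances 0 and 1 share a cluster in every base clustering, so their entry is 1.0, while instances 0 and 3 never do, so their entry is 0.0. A consensus clustering could then be obtained by running any similarity-based method, for example agglomerative clustering, on `C`.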
From patterned response dependency to structured covariate dependency: categorical-pattern-matching
Data generated from a system of interest typically consists of measurements
from an ensemble of subjects across multiple response and covariate features,
and is naturally represented by one response-matrix against one
covariate-matrix. Each of these two matrices is likely to contain
heterogeneous data types simultaneously: continuous, discrete and categorical. Here a matrix
is used as a practical platform to ideally keep hidden dependency among/between
subjects and features intact on its lattice. Response and covariate dependency
is individually computed and expressed through multiscale blocks via a newly
developed computing paradigm named Data Mechanics. We propose a categorical
pattern matching approach to establish causal linkages in the form of information
flows from patterned response dependency to structured covariate dependency.
The strength of an information flow is evaluated by applying combinatorial
information theory. This unified platform for system knowledge discovery is
illustrated through five data sets. In each illustrative case, an information
flow is demonstrated as an organization of discovered knowledge loci via
emergent visible and readable heterogeneity. This unified approach
fundamentally resolves many long-standing issues in data analysis, including
statistical modeling, multiple response, renormalization and feature
selection, without involving man-made structures and distribution
assumptions. The results reported here reinforce the idea that linking patterns
of response dependency to structures of covariate dependency is the true
philosophical foundation underlying data-driven computing and learning in
sciences.
Comment: 32 pages, 10 figures, 3 box picture
ACCAMS: Additive Co-Clustering to Approximate Matrices Succinctly
Matrix completion and approximation are popular tools to capture a user's
preferences for recommendation and to approximate missing data. Instead of
using low-rank factorization we take a drastically different approach, based on
the simple insight that an additive model of co-clusterings allows one to
approximate matrices efficiently. This allows us to build a concise model that,
per bit of model learned, significantly beats all factorization approaches to
matrix approximation. Even more surprisingly, we find that summing over small
co-clusterings is more effective in modeling matrices than classic
co-clustering, which uses just one large partitioning of the matrix.
Occam's razor suggests that the simple structure induced
by our model better captures the latent preferences and decision making
processes present in the real world than classic co-clustering or matrix
factorization. We provide an iterative minimization algorithm, a collapsed
Gibbs sampler, theoretical guarantees for matrix approximation, and excellent
empirical evidence for the efficacy of our approach. We achieve
state-of-the-art results on the Netflix problem with a fraction of the model
complexity.
Comment: 22 pages, under review for conference publicatio
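The notion of approximating a matrix by summing several small co-clusterings can be sketched as follows. This is an illustrative simplification, not the ACCAMS algorithm: the row and column assignments are fixed by hand here, whereas the abstract describes learning them with an iterative minimization algorithm or a collapsed Gibbs sampler. The names `block_means` and `additive_co_clustering` are hypothetical. Each co-clustering is fitted to the residual left by the previous ones, so every added term can only keep or reduce the squared error.

```python
import numpy as np

def block_means(residual, row_ids, col_ids, k_rows, k_cols):
    """Mean of the residual inside each (row cluster, column cluster) block."""
    means = np.zeros((k_rows, k_cols))
    for a in range(k_rows):
        for b in range(k_cols):
            block = residual[np.ix_(row_ids == a, col_ids == b)]
            if block.size:  # leave empty blocks at zero
                means[a, b] = block.mean()
    return means

def additive_co_clustering(matrix, stencils):
    """Sum of co-clusterings, each fitted to the current residual.

    stencils: list of (row_ids, col_ids, k_rows, k_cols) tuples with fixed,
              hand-chosen assignments (an assumption of this sketch).
    """
    approx = np.zeros(matrix.shape, dtype=float)
    for row_ids, col_ids, k_rows, k_cols in stencils:
        residual = matrix - approx
        means = block_means(residual, row_ids, col_ids, k_rows, k_cols)
        # Expand the small table of block means back to full matrix size.
        approx += means[np.ix_(row_ids, col_ids)]
    return approx

# Toy data: a random 4x4 matrix and two hypothetical 2x2 co-clusterings.
rng = np.random.default_rng(0)
matrix = rng.normal(size=(4, 4))
stencil_a = (np.array([0, 0, 1, 1]), np.array([0, 0, 1, 1]), 2, 2)
stencil_b = (np.array([0, 1, 0, 1]), np.array([0, 1, 0, 1]), 2, 2)

one_term = additive_co_clustering(matrix, [stencil_a])
two_terms = additive_co_clustering(matrix, [stencil_a, stencil_b])
err_one = np.linalg.norm(matrix - one_term)
err_two = np.linalg.norm(matrix - two_terms)
```

Each term stores only a `k_rows` by `k_cols` table plus one cluster index per row and column, which is the sense in which such a model stays compact as terms are added.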