Semi-supervised cross-entropy clustering with information bottleneck constraint
In this paper, we propose a semi-supervised clustering method, CEC-IB, that
models data with a set of Gaussian distributions and that retrieves clusters
based on a partial labeling provided by the user (partition-level side
information). By combining the ideas from cross-entropy clustering (CEC) with
those from the information bottleneck method (IB), our method trades between
three conflicting goals: the accuracy with which the data set is modeled, the
simplicity of the model, and the consistency of the clustering with side
information. Experiments demonstrate that CEC-IB has a performance comparable
to Gaussian mixture models (GMM) in a classical semi-supervised scenario, but
is faster, more robust to noisy labels, automatically determines the optimal
number of clusters, and performs well when not all classes are present in the
side information. Moreover, in contrast to other semi-supervised models, it can
be successfully applied in discovering natural subgroups if the partition-level
side information is derived from the top levels of a hierarchical clustering.
From patterned response dependency to structured covariate dependency: categorical-pattern-matching
Data generated from a system of interest typically consists of measurements
from an ensemble of subjects across multiple response and covariate features,
and is naturally represented by one response-matrix against one
covariate-matrix. Each of these two matrices likely contains
heterogeneous data types: continuous, discrete and categorical. Here a matrix
is used as a practical platform that keeps hidden dependency among and between
subjects and features intact on its lattice. Response and covariate dependency
is individually computed and expressed through multiscale blocks via a newly
developed computing paradigm named Data Mechanics. We propose a categorical
pattern matching approach to establish causal linkages in the form of information
flows from patterned response dependency to structured covariate dependency.
The strength of an information flow is evaluated by applying combinatorial
information theory. This unified platform for system knowledge discovery is
illustrated through five data sets. In each illustrative case, an information
flow is demonstrated as an organization of discovered knowledge loci via
emergent visible and readable heterogeneity. This unified approach
fundamentally resolves many long-standing issues, including statistical
modeling, multiple response, renormalization and feature selections, in data
analysis, but without involving man-made structures and distribution
assumptions. The results reported here enhance the idea that linking patterns
of response dependency to structures of covariate dependency is the true
philosophical foundation underlying data-driven computing and learning in
sciences.
Comment: 32 pages, 10 figures, 3 box pictures