Multi-facet determination for clustering with Bayesian networks
Real-world applications in sectors such as industry, healthcare, or finance usually generate data of
high complexity that can be interpreted from different viewpoints. When clustering this type of
data, a single set of clusters may not suffice, hence the necessity of methods that generate multiple
clusterings that represent different perspectives. In this paper, we present a novel multi-partition
clustering method that returns several interesting and non-redundant solutions, where each of them
is a data partition with an associated facet of data. Each of these facets represents a subset of the
original attributes that is selected using our information-theoretic criterion UMRMR. Our approach
is based on an optimization procedure that takes advantage of the Bayesian network factorization
to provide high-quality solutions in a fraction of the time.
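The UMRMR criterion itself is not specified in the abstract. As a rough illustration of the family it belongs to, the sketch below implements classic greedy mRMR (minimum-redundancy maximum-relevance) selection over discrete attributes; all names and the data layout are hypothetical, and UMRMR will differ in its exact scoring.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Empirical mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

def mrmr_select(columns, target, k):
    """Greedy mRMR: at each step pick the attribute whose relevance to the
    target, minus its average redundancy with already-selected attributes,
    is highest. columns: dict name -> discrete values; target: labels."""
    selected, remaining = [], set(columns)
    while remaining and len(selected) < k:
        def score(name):
            relevance = mutual_information(columns[name], target)
            redundancy = (sum(mutual_information(columns[name], columns[s])
                              for s in selected) / len(selected)) if selected else 0.0
            return relevance - redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

An information-theoretic score of this shape is what lets such methods pick attribute subsets (facets) that are relevant yet mutually non-redundant.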
AugDMC: Data Augmentation Guided Deep Multiple Clustering
Clustering aims to group similar objects together while separating dissimilar
ones. Structures hidden in the data can then be identified to help
understand data in an unsupervised manner. Traditional clustering methods such
as k-means provide only a single clustering for one data set. Deep clustering
methods such as auto-encoder based clustering methods have shown a better
performance, but still provide a single clustering. However, a given dataset
might have multiple clustering structures and each represents a unique
perspective of the data. Therefore, some multiple clustering methods have been
developed to discover multiple independent structures hidden in data. Although
deep multiple clustering methods provide better performance, how to efficiently
capture the alternative perspectives in data is still a problem. In this paper,
we propose AugDMC, a novel data Augmentation guided Deep Multiple Clustering
method, to tackle the challenge. Specifically, AugDMC leverages data
augmentations to automatically extract features related to a certain aspect of
the data using a self-supervised prototype-based representation learning, where
different aspects of the data can be preserved under different data
augmentations. Moreover, a stable optimization strategy is proposed to
alleviate the instability arising from different augmentations. Thereafter,
multiple clusterings based on different aspects of the data can be obtained.
Experimental results on three real-world datasets, in comparison with
state-of-the-art methods, validate the effectiveness of the proposed method.
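AugDMC itself learns aspect-specific features with self-supervised, prototype-based deep networks; the toy sketch below only illustrates the premise that motivates it, namely that the same objects admit several valid partitions, one per aspect (all names are hypothetical):

```python
# Toy objects described by two independent aspects: color and shape.
objects = [
    {"id": 0, "color": "red",  "shape": "circle"},
    {"id": 1, "color": "red",  "shape": "square"},
    {"id": 2, "color": "blue", "shape": "circle"},
    {"id": 3, "color": "blue", "shape": "square"},
]

def partition_by(objs, key):
    """Group objects by one aspect; each aspect induces its own clustering."""
    clusters = {}
    for o in objs:
        clusters.setdefault(o[key], []).append(o["id"])
    return clusters

by_color = partition_by(objects, "color")  # {'red': [0, 1], 'blue': [2, 3]}
by_shape = partition_by(objects, "shape")  # {'circle': [0, 2], 'square': [1, 3]}
```

A single clustering would have to commit to one of these views; multiple clustering methods aim to recover both.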
Clustering with a Reject Option: Interactive Clustering as Bayesian Prior Elicitation
A good clustering can help a data analyst to explore and understand a data
set, but what constitutes a good clustering may depend on domain-specific and
application-specific criteria. These criteria can be difficult to formalize,
even when it is easy for an analyst to know a good clustering when she sees
one. We present a new approach to interactive clustering for data exploration,
called \ciif, based on a particularly simple feedback mechanism, in which an
analyst can choose to reject individual clusters and request new ones. The new
clusters should be different from previously rejected clusters while still
fitting the data well. We formalize this interaction in a novel Bayesian prior
elicitation framework. In each iteration, the prior is adapted to account for
all the previous feedback, and a new clustering is then produced from the
posterior distribution. To achieve the computational efficiency necessary for
an interactive setting, we propose an incremental optimization method over data
minibatches using Lagrangian relaxation. Experiments demonstrate that \ciif can
produce accurate and diverse clusterings.
Feedback-Driven Data Clustering
The acquisition of data and its analysis have become a common yet critical task in many areas of modern economy and research. Unfortunately, the ever-increasing scale of datasets has long outgrown the capacities and abilities humans can muster to extract information from them and gain new knowledge. For this reason, research areas like data mining and knowledge discovery steadily gain importance. The algorithms they provide for the extraction of knowledge are mandatory prerequisites that enable people to analyze large amounts of information. Among the approaches offered by these areas, clustering is one of the most fundamental. By finding groups of similar objects inside the data, it aims to identify meaningful structures that constitute new knowledge. Clustering results are also often used as input for other analysis techniques like classification or forecasting.
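The grouping idea described above can be made concrete with the textbook algorithm most readers will know; a minimal, purely illustrative Lloyd's k-means over 2-D points (all names hypothetical, no relation to the thesis's own methods):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's algorithm for 2-D points; returns labels and centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialize from k distinct data points
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k),
                      key=lambda j: (p[0] - centroids[j][0]) ** 2 +
                                    (p[1] - centroids[j][1]) ** 2)
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # keep the old centroid if the cluster went empty
                centroids[j] = (sum(p[0] for p in members) / len(members),
                                sum(p[1] for p in members) / len(members))
    return labels, centroids
```

Even this simple method already exhibits the difficulties the thesis addresses: the user must choose k, a distance notion, and an initialization, and must judge the result without ground truth.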
As clustering extracts new and unknown knowledge, it obviously has no access to any form of ground truth. For this reason, clustering results have a hypothetical character and must be interpreted with respect to the application domain. This makes clustering very challenging and leads to an extensive and diverse landscape of available algorithms. Most of these are expert tools that are tailored to a single, narrowly defined application scenario. Over the years, this specialization has become a major trend, one that arose to counter the inherent uncertainty of clustering by including as many domain specifics as possible in the algorithms. While customized methods often improve result quality, they become more and more complicated to handle and lose versatility. This creates a dilemma, especially for amateur users, whose numbers are increasing as clustering is applied in more and more domains. While an abundance of tools is offered, guidance is severely lacking, and users are left alone with critical tasks like algorithm selection, parameter configuration, and the interpretation and adjustment of results.
This thesis aims to solve this dilemma by structuring and integrating the necessary steps of clustering into a guided and feedback-driven process. In doing so, users are provided with a default modus operandi for the application of clustering. Two main components constitute the core of said process: the algorithm management and the visual-interactive interface. Algorithm management handles all aspects of actual clustering creation and the involved methods. It employs a modular approach for algorithm description that allows users to understand, design, and compare clustering techniques with the help of building blocks. In addition, algorithm management offers facilities for the integration of multiple clusterings of the same dataset into an improved solution. New approaches based on ensemble clustering not only allow the utilization of different clustering techniques, but also ease their application by acting as an abstraction layer that unifies individual parameters. Finally, this component provides a multi-level interface that structures all available control options and provides the docking points for user interaction.
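The ensemble-clustering integration mentioned above is not detailed in the abstract; a common textbook realization of the idea is evidence accumulation via a co-association matrix, sketched below under the assumption that base clusterings are hard label vectors (all names hypothetical):

```python
def co_association(labelings, n):
    """Fraction of base clusterings in which each pair of objects co-occurs."""
    m = [[0.0] * n for _ in range(n)]
    for labels in labelings:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    m[i][j] += 1.0 / len(labelings)
    return m

def consensus(labelings, n, threshold=0.5):
    """Merge objects whose co-association exceeds the threshold
    (single-link style, via union-find)."""
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    m = co_association(labelings, n)
    for i in range(n):
        for j in range(i + 1, n):
            if m[i][j] > threshold:
                parent[find(i)] = find(j)
    return [find(i) for i in range(n)]
```

The appeal for non-expert users is exactly what the thesis describes: the ensemble acts as an abstraction layer, so individual base algorithms and their parameters need not be tuned one by one.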
The visual-interactive interface supports users during result interpretation and adjustment. For this, the defining characteristics of a clustering are communicated via a hybrid visualization. In contrast to traditional data-driven visualizations, which tend to become overloaded and unusable with increasing volume and dimensionality of data, this novel approach communicates the abstract aspects of cluster composition and the relations between clusters. This aspect orientation allows the use of easy-to-understand visual components and makes the visualization immune to scale-related effects of the underlying data. This visual communication is attuned to a compact and universally valid set of high-level feedback operations that allow the modification of clustering results. Instead of technical parameters that indirectly cause changes in the whole clustering by influencing its creation process, users can employ simple commands like merge or split to directly adjust clusters.
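The merge and split commands can be pictured as direct edits of a label vector; a minimal hypothetical sketch (the thesis's actual operations act through its visual interface, not on raw labels):

```python
def merge(labels, a, b):
    """Merge cluster b into cluster a by relabeling b's members."""
    return [a if l == b else l for l in labels]

def split(labels, target, goes_to_new):
    """Split cluster `target` in two: members whose index satisfies the
    caller-supplied predicate move to a fresh cluster label."""
    new_label = max(labels) + 1
    return [new_label if l == target and goes_to_new(i) else l
            for i, l in enumerate(labels)]
```

The point of such commands is that their effect is local and predictable, unlike re-running an algorithm with changed technical parameters.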
The orchestrated cooperation of these two main components creates a modus operandi in which clusterings are no longer created and discarded as a whole until a satisfying result is obtained. Instead, users apply the feedback-driven process to iteratively refine an initial solution. Performance and usability of the proposed approach were evaluated with a user study. Its results show that the feedback-driven process enabled amateur users to easily create satisfying clustering results even from varied and suboptimal starting situations.