20,726 research outputs found
Clustering documents with active learning using Wikipedia
Wikipedia has been applied as a background knowledge base to various text mining problems, but very few attempts have been made to utilize it for document clustering. In this paper we propose to exploit the semantic knowledge in Wikipedia for clustering, enabling the automatic grouping of documents with similar themes. Although clustering is intrinsically unsupervised, recent research has shown that incorporating supervision improves clustering performance, even when limited supervision is provided. The approach presented in this paper applies supervision using active learning. We first utilize Wikipedia to create a concept-based representation of a text document, with each concept associated to a Wikipedia article. We then exploit the semantic relatedness between Wikipedia concepts to find pair-wise instance-level constraints for supervised clustering, guiding clustering towards the direction indicated by the constraints. We test our approach on three standard text document datasets. Empirical results show that our basic document representation strategy yields comparable performance to previous attempts; and adding constraints improves clustering performance further by up to 20%
Multi-view constrained clustering with an incomplete mapping between views
Multi-view learning algorithms typically assume a complete bipartite mapping
between the different views in order to exchange information during the
learning process. However, many applications provide only a partial mapping
between the views, creating a challenge for current methods. To address this
problem, we propose a multi-view algorithm based on constrained clustering that
can operate with an incomplete mapping. Given a set of pairwise constraints in
each view, our approach propagates these constraints using a local similarity
measure to those instances that can be mapped to the other views, allowing the
propagated constraints to be transferred across views via the partial mapping.
It uses co-EM to iteratively estimate the propagation within each view based on
the current clustering model, transfer the constraints across views, and then
update the clustering model. By alternating the learning process between views,
this approach produces a unified clustering model that is consistent with all
views. We show that this approach significantly improves clustering performance
over several other methods for transferring constraints and allows multi-view
clustering to be reliably applied when given a limited mapping between the
views. Our evaluation reveals that the propagated constraints have high
precision with respect to the true clusters in the data, explaining their
benefit to clustering performance in both single- and multi-view learning
scenarios
Semi-supervised model-based clustering with controlled clusters leakage
In this paper, we focus on finding clusters in partially categorized data
sets. We propose a semi-supervised version of Gaussian mixture model, called
C3L, which retrieves natural subgroups of given categories. In contrast to
other semi-supervised models, C3L is parametrized by user-defined leakage
level, which controls maximal inconsistency between initial categorization and
resulting clustering. Our method can be implemented as a module in practical
expert systems to detect clusters, which combine expert knowledge with true
distribution of data. Moreover, it can be used for improving the results of
less flexible clustering techniques, such as projection pursuit clustering. The
paper presents extensive theoretical analysis of the model and fast algorithm
for its efficient optimization. Experimental results show that C3L finds high
quality clustering model, which can be applied in discovering meaningful groups
in partially classified data
Dynamic Metric Learning from Pairwise Comparisons
Recent work in distance metric learning has focused on learning
transformations of data that best align with specified pairwise similarity and
dissimilarity constraints, often supplied by a human observer. The learned
transformations lead to improved retrieval, classification, and clustering
algorithms due to the better adapted distance or similarity measures. Here, we
address the problem of learning these transformations when the underlying
constraint generation process is nonstationary. This nonstationarity can be due
to changes in either the ground-truth clustering used to generate constraints
or changes in the feature subspaces in which the class structure is apparent.
We propose Online Convex Ensemble StrongLy Adaptive Dynamic Learning (OCELAD),
a general adaptive, online approach for learning and tracking optimal metrics
as they change over time that is highly robust to a variety of nonstationary
behaviors in the changing metric. We apply the OCELAD framework to an ensemble
of online learners. Specifically, we create a retro-initialized composite
objective mirror descent (COMID) ensemble (RICE) consisting of a set of
parallel COMID learners with different learning rates, demonstrate RICE-OCELAD
on both real and synthetic data sets and show significant performance
improvements relative to previously proposed batch and online distance metric
learning algorithms.Comment: to appear Allerton 2016. arXiv admin note: substantial text overlap
with arXiv:1603.0367
- ā¦