Clustering documents with active learning using Wikipedia
Wikipedia has been applied as a background knowledge base to various text mining problems, but very few attempts have been made to utilize it for document clustering. In this paper we propose to exploit the semantic knowledge in Wikipedia for clustering, enabling the automatic grouping of documents with similar themes. Although clustering is intrinsically unsupervised, recent research has shown that incorporating supervision improves clustering performance, even when limited supervision is provided. The approach presented in this paper applies supervision using active learning. We first utilize Wikipedia to create a concept-based representation of a text document, with each concept associated with a Wikipedia article. We then exploit the semantic relatedness between Wikipedia concepts to find pair-wise instance-level constraints for supervised clustering, guiding clustering towards the direction indicated by the constraints. We test our approach on three standard text document datasets. Empirical results show that our basic document representation strategy yields comparable performance to previous attempts, and adding constraints improves clustering performance further by up to 20%.
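The idea of pair-wise instance-level constraints can be illustrated with a minimal sketch. It is not the paper's method: the Jaccard overlap of concept sets stands in for a Wikipedia-based relatedness measure, and the function names, thresholds, and toy documents are all hypothetical.

```python
from itertools import combinations

def jaccard(a, b):
    """Overlap of two concept sets; an assumed stand-in for semantic relatedness."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def derive_constraints(docs, must_threshold=0.5, cannot_threshold=0.0):
    """Turn pairwise similarity into must-link and cannot-link pairs.

    Both thresholds are illustrative values, not taken from the paper.
    """
    must_link, cannot_link = [], []
    for i, j in combinations(range(len(docs)), 2):
        sim = jaccard(docs[i], docs[j])
        if sim >= must_threshold:
            must_link.append((i, j))
        elif sim <= cannot_threshold:
            cannot_link.append((i, j))
    return must_link, cannot_link

def count_violations(labels, must_link, cannot_link):
    """Count the constraints that a cluster assignment breaks."""
    broken = sum(labels[i] != labels[j] for i, j in must_link)
    broken += sum(labels[i] == labels[j] for i, j in cannot_link)
    return broken

# Toy documents, each represented as a set of concept (article) titles.
docs = [
    {"Machine learning", "Cluster analysis"},
    {"Cluster analysis", "Machine learning", "Data mining"},
    {"Football", "FIFA World Cup"},
]
must_link, cannot_link = derive_constraints(docs)
```

A constrained clustering algorithm would prefer assignments with few violations; here `count_violations([0, 0, 1], must_link, cannot_link)` is zero, while splitting the first two documents is penalized.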
Discovering New Intents with Deep Aligned Clustering
Discovering new intents is a crucial task in dialogue systems. Most existing
methods are limited in transferring the prior knowledge from known intents to
new intents. They also have difficulties in providing high-quality supervised
signals to learn clustering-friendly features for grouping unlabeled intents.
In this work, we propose an effective method, Deep Aligned Clustering, to
discover new intents with the aid of the limited known intent data. Firstly, we
leverage a few labeled known intent samples as prior knowledge to pre-train the
model. Then, we perform k-means to produce cluster assignments as
pseudo-labels. Moreover, we propose an alignment strategy to tackle the label
inconsistency problem during clustering assignments. Finally, we learn the
intent representations under the supervision of the aligned pseudo-labels. With
an unknown number of new intents, we predict the number of intent categories by
eliminating low-confidence intent-wise clusters. Extensive experiments on two
benchmark datasets show that our method is more robust and achieves substantial
improvements over the state-of-the-art methods. The code is released at
https://github.com/thuiar/DeepAligned-Clustering.
Comment: Accepted by AAAI 2021 (Main Track, Long Paper)
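The label inconsistency problem mentioned in the abstract arises because k-means may number the same cluster differently on each run, so pseudo-labels from successive runs do not line up. The sketch below shows one plausible way to align them, by matching new centroids to old ones at minimum total distance. It is an assumption-laden illustration, not the released implementation: the function names and toy centroids are hypothetical, and a brute-force search over permutations replaces the assignment solver (for example `scipy.optimize.linear_sum_assignment`) that would be the usual choice for larger k.

```python
from itertools import permutations

def sq_dist(u, v):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def align_clusters(old_centroids, new_centroids):
    """Map each new cluster index to the old index it most plausibly continues.

    Brute force over all permutations, so only suitable for small k.
    """
    k = len(old_centroids)
    best_perm, best_cost = None, float("inf")
    for perm in permutations(range(k)):
        cost = sum(sq_dist(new_centroids[n], old_centroids[perm[n]])
                   for n in range(k))
        if cost < best_cost:
            best_perm, best_cost = perm, cost
    return {n: best_perm[n] for n in range(k)}

def relabel(new_labels, mapping):
    """Rewrite pseudo-labels so they use the old cluster numbering."""
    return [mapping[label] for label in new_labels]

# Toy centroids: the second run finds the same clusters in a different order.
old_centroids = [(0.0, 0.0), (5.0, 5.0), (10.0, 0.0)]
new_centroids = [(9.8, 0.1), (0.2, -0.1), (5.1, 4.9)]
mapping = align_clusters(old_centroids, new_centroids)
aligned = relabel([0, 1, 2, 1], mapping)
```

After alignment, a sample keeps the same pseudo-label across runs as long as it stays in the same region of the feature space, so the pseudo-labels can serve as a consistent training target.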
Discovering New Intents via Constrained Deep Adaptive Clustering with Cluster Refinement
Identifying new user intents is an essential task in dialogue systems.
However, it is hard to get satisfying clustering results since the definition
of intents is strongly guided by prior knowledge. Existing methods incorporate
prior knowledge by intensive feature engineering, which not only leads to
overfitting but also makes it sensitive to the number of clusters. In this
paper, we propose constrained deep adaptive clustering with cluster refinement
(CDAC+), an end-to-end clustering method that can naturally incorporate
pairwise constraints as prior knowledge to guide the clustering process.
Moreover, we refine the clusters by forcing the model to learn from the high
confidence assignments. After eliminating low confidence assignments, our
approach is surprisingly insensitive to the number of clusters. Experimental
results on three benchmark datasets show that our method can yield
significant improvements over strong baselines.
Comment: Accepted by AAAI202
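Learning only from high-confidence assignments can be pictured as a simple filter over soft cluster assignments. The following is a minimal sketch under stated assumptions rather than the CDAC+ procedure itself: the function name, the 0.9 threshold, and the probability rows are hypothetical.

```python
def high_confidence(probs, threshold=0.9):
    """Return (sample_index, cluster) pairs whose top probability meets the threshold.

    `probs` holds one row of cluster probabilities per sample; the threshold
    is an illustrative value.
    """
    kept = []
    for idx, row in enumerate(probs):
        label = max(range(len(row)), key=row.__getitem__)
        if row[label] >= threshold:
            kept.append((idx, label))
    return kept

# Toy soft assignments for three samples over three clusters.
probs = [
    [0.95, 0.03, 0.02],  # confident: cluster 0
    [0.40, 0.35, 0.25],  # ambiguous: dropped
    [0.05, 0.92, 0.03],  # confident: cluster 1
]
kept = high_confidence(probs)
```

Only the retained pairs would be used as targets in a refinement step; ambiguous samples are left out until the model becomes more certain about them.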