The core of clustering is incorporating prior knowledge to construct
supervision signals. From classic k-means based on data compactness to recent
contrastive clustering guided by self-supervision, the evolution of clustering
methods intrinsically corresponds to the progression of supervision signals. At
present, substantial efforts have been devoted to mining internal supervision
signals from data. Nevertheless, abundant external knowledge such as semantic
descriptions, which is naturally conducive to clustering, has been largely
overlooked. In this work, we propose leveraging external knowledge as a new
supervision signal to guide clustering, even though it seems irrelevant to the
given data. To implement and validate our idea, we design an externally guided
clustering method (Text-Aided Clustering, TAC), which leverages the textual
semantics of WordNet to facilitate image clustering. Specifically, TAC first
selects and retrieves WordNet nouns that best distinguish images to enhance the
feature discriminability. Then, to improve image clustering performance, TAC
makes the text and image modalities collaborate by mutually distilling
cross-modal neighborhood information. Experiments demonstrate that TAC achieves
state-of-the-art performance on five widely used and three more challenging
image clustering benchmarks, including the full ImageNet-1K dataset.
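
As a rough illustration of the first step (selecting and retrieving
discriminative nouns), the following is a minimal sketch, not the actual TAC
implementation. It assumes image and noun embeddings already live in a shared
space (for example, produced by a pretrained vision-language model), and that
some set of image "centers" is available; the function names, the random
stand-in data, the choice of centers, and the temperature value are all
hypothetical.

```python
import numpy as np

def l2_normalize(x):
    # Scale each row to unit length so dot products are cosine similarities.
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def select_nouns(centers, noun_emb, top_k):
    # For each image center, keep the top_k most similar nouns (assumed
    # selection rule), then merge the per-center picks into one index set.
    sim = l2_normalize(centers) @ l2_normalize(noun_emb).T
    top = np.argsort(-sim, axis=1)[:, :top_k]
    return np.unique(top)

def text_counterparts(image_emb, noun_emb, selected, temperature=0.05):
    # Each image retrieves the selected nouns with softmax weights and uses
    # their weighted average as its text-side representation.
    img = l2_normalize(image_emb)
    nouns = l2_normalize(noun_emb[selected])
    logits = img @ nouns.T / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)
    return l2_normalize(weights @ nouns)

# Stand-in data: 20 images, 50 candidate nouns, 8-dimensional embeddings.
rng = np.random.default_rng(0)
image_emb = rng.normal(size=(20, 8))
noun_emb = rng.normal(size=(50, 8))
centers = image_emb[:3]  # placeholder for centers from, e.g., k-means

selected = select_nouns(centers, noun_emb, top_k=2)
text_emb = text_counterparts(image_emb, noun_emb, selected)
```

Here `text_emb` holds one text-side vector per image, which could then be
paired with the image features for the cross-modal distillation step.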