Acoustic scene classification has received recent attention in the audio
signal processing community. In contrast, few studies have addressed
acoustic scene clustering, which is a newly emerging problem. Acoustic scene
clustering aims to group audio recordings of the same acoustic scene class
into a single cluster without using prior information or training
classifiers. In this study, we propose a method for acoustic scene clustering
that jointly optimizes the procedures of feature learning and clustering
iteration. In the proposed method, the learned feature is a deep embedding that
is extracted from a deep convolutional neural network (CNN), while the
clustering algorithm is agglomerative hierarchical clustering (AHC). We
formulate a unified loss function that integrates and optimizes these two
procedures. Various features and methods are compared. The experimental results
demonstrate that the proposed method outperforms other unsupervised methods in
terms of normalized mutual information and clustering accuracy. In
addition, the deep embedding outperforms many state-of-the-art features.

Comment: 9 pages, 6 figures, 11 tables. Accepted for publication in IEEE TM
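
The two evaluation measures named in the abstract, normalized mutual information (NMI) and clustering accuracy, can be illustrated with a minimal sketch. The abstract does not state how either measure is computed, so everything below is an assumption: NMI is normalized here by the arithmetic mean of the two entropies, and clustering accuracy is taken under the best one-to-one mapping between cluster ids and class labels, found by brute force (practical only for a handful of classes). The function names and the toy labels are hypothetical.

```python
import math
from collections import Counter
from itertools import permutations


def normalized_mutual_information(labels_true, labels_pred):
    """NMI = 2 * I(T; P) / (H(T) + H(P)); the normalization is an assumption."""
    n = len(labels_true)
    count_true = Counter(labels_true)
    count_pred = Counter(labels_pred)
    count_joint = Counter(zip(labels_true, labels_pred))

    def entropy(counts):
        return -sum((c / n) * math.log(c / n) for c in counts.values())

    # I(T; P) = sum over co-occurring pairs of p(t, p) * log(p(t, p) / (p(t) p(p)))
    mutual_info = sum(
        (c / n) * math.log(c * n / (count_true[t] * count_pred[p]))
        for (t, p), c in count_joint.items()
    )
    denom = entropy(count_true) + entropy(count_pred)
    # If both partitions consist of a single group, treat them as identical.
    return 1.0 if denom == 0 else 2.0 * mutual_info / denom


def clustering_accuracy(labels_true, labels_pred):
    """Fraction of items matched under the best one-to-one cluster/class mapping."""
    classes = sorted(set(labels_true))
    clusters = sorted(set(labels_pred))
    # Pad the shorter side with None so every permutation is a full mapping;
    # padded entries never occur in the data and contribute zero matches.
    size = max(len(classes), len(clusters))
    classes += [None] * (size - len(classes))
    clusters += [None] * (size - len(clusters))
    joint = Counter(zip(labels_true, labels_pred))
    best = max(
        sum(joint[(t, p)] for t, p in zip(perm, clusters))
        for perm in permutations(classes)
    )
    return best / len(labels_true)


# Hypothetical toy example: six recordings, three scene classes.
scene_labels = ["park", "park", "bus", "bus", "cafe", "cafe"]
cluster_ids = [1, 1, 0, 0, 2, 0]
nmi = normalized_mutual_information(scene_labels, cluster_ids)
acc = clustering_accuracy(scene_labels, cluster_ids)
print(f"NMI = {nmi:.3f}, accuracy = {acc:.3f}")
```

Both measures are invariant to how cluster ids are numbered, which is why they suit unsupervised evaluation: a clustering is scored on how well its groups align with the scene classes, not on the ids it happens to assign.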