Existing contrastive learning methods for anomalous sound detection refine
the audio representation of each audio sample by using the contrast between the
samples' augmentations (e.g., with time or frequency masking). However, they
may be biased by the augmented data, because the augmentations do not reflect
the physical properties of machine sound, which limits detection performance.
This paper uses
contrastive learning to refine audio representations for each machine ID,
rather than for each audio sample. The proposed two-stage method first
pretrains the audio representation model with contrastive learning that
incorporates the machine ID, and then fine-tunes the learnt model with a
self-supervised ID classifier, while enhancing the relation between audio
features from the same
ID. Experiments show that our method outperforms the state-of-the-art methods
using contrastive learning or self-supervised classification in overall anomaly
detection performance and stability on the DCASE 2020 Challenge Task 2 dataset.

Comment: To appear in IEEE International Conference on Acoustics, Speech, and
Signal Processing (ICASSP 2023)
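
The idea of contrasting per machine ID rather than per audio sample can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes a supervised-contrastive-style loss in which clips sharing a machine ID are treated as positives and all other clips in the batch as negatives. The function name id_contrastive_loss, the temperature value, and the toy embeddings are hypothetical, chosen only for illustration.

```python
import math


def id_contrastive_loss(embeddings, machine_ids, temperature=0.1):
    """Contrastive loss in which positives are clips with the same machine ID.

    Assumption: a supervised-contrastive-style formulation, used here only
    to illustrate the idea; the paper's exact loss may differ.

    embeddings:  list of equal-length float vectors, one per audio clip.
    machine_ids: list of hashable ID labels, same length as embeddings.
    """
    # L2-normalise so that the dot product equals cosine similarity.
    normed = []
    for vec in embeddings:
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        normed.append([x / norm for x in vec])

    n = len(normed)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n)
                     if j != i and machine_ids[j] == machine_ids[i]]
        if not positives:
            continue  # an anchor without a same-ID partner is skipped

        # Temperature-scaled similarity of the anchor to every other clip.
        sims = {j: sum(a * b for a, b in zip(normed[i], normed[j])) / temperature
                for j in range(n) if j != i}

        # Numerically stabilised log-sum-exp over all other clips.
        top = max(sims.values())
        log_denom = top + math.log(sum(math.exp(s - top) for s in sims.values()))

        # Average negative log-probability of the same-ID clips.
        total += -sum(sims[j] - log_denom for j in positives) / len(positives)
        anchors += 1

    return total / anchors if anchors else 0.0


# Toy usage: two clips per (hypothetical) machine ID.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
loss_matched = id_contrastive_loss(
    toy_embeddings, ["id_00", "id_00", "id_01", "id_01"])
loss_mismatched = id_contrastive_loss(
    toy_embeddings, ["id_00", "id_01", "id_00", "id_01"])
```

In this toy setting the loss is smaller when the ID labels agree with the clusters in embedding space, which is the behaviour a pretraining stage of this kind would encourage. The second stage (fine-tuning with an ID classifier) is not sketched here.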