We present Multiscale Audio Spectrogram Transformer (MAST) for audio
classification, which brings the concept of multiscale feature hierarchies to
the Audio Spectrogram Transformer (AST). Given an input audio spectrogram we
first patchify and project it into an initial temporal resolution and embedding
dimension, post which the multiple stages in MAST progressively expand the
embedding dimension while reducing the temporal resolution of the input. We use
a pyramid structure that allows early layers of MAST operating at a high
temporal resolution but low embedding space to model simple low-level acoustic
information and deeper temporally coarse layers to model high-level acoustic
information with high-dimensional embeddings. We also extend our approach to
present a new Self-Supervised Learning (SSL) method called SS-MAST, which
calculates a symmetric contrastive loss between latent representations from a
student and a teacher encoder. In practice, MAST significantly outperforms AST
by an average accuracy of 3.4% across 8 speech and non-speech tasks from the
LAPE Benchmark. Moreover, SS-MAST achieves an absolute average improvement of
2.6% over SSAST for both AST and MAST encoders. We make all our code available
on GitHub at the time of publication.
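
To make the multiscale idea concrete, the following is a minimal sketch (not the authors' implementation) of a pyramid of stages in which each stage halves the temporal resolution and doubles the embedding dimension; the stage count, dimensions, attention settings, and the use of a strided convolution for temporal pooling are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class PoolStage(nn.Module):
    """One hypothetical MAST-like stage: a transformer block followed by temporal pooling."""
    def __init__(self, dim_in, dim_out, num_heads=4):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(
            d_model=dim_in, nhead=num_heads, batch_first=True)
        # Strided conv halves the number of temporal tokens and expands the embedding dim.
        self.pool = nn.Conv1d(dim_in, dim_out, kernel_size=3, stride=2, padding=1)

    def forward(self, x):                      # x: (batch, time, dim_in)
        x = self.block(x)
        x = self.pool(x.transpose(1, 2)).transpose(1, 2)
        return x                               # (batch, time // 2, dim_out)

# Example: 4 stages applied to a patchified spectrogram of 256 temporal tokens x 96 dims.
stages = nn.Sequential(*[PoolStage(96 * 2**i, 96 * 2**(i + 1)) for i in range(4)])
tokens = torch.randn(2, 256, 96)               # (batch, temporal tokens, embedding dim)
out = stages(tokens)                           # -> (2, 16, 1536): coarser time, richer embeddings
```

Early stages thus see many fine-grained, low-dimensional tokens, while later stages see few coarse, high-dimensional tokens, mirroring the low-level-to-high-level progression described above.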
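A symmetric contrastive objective between student and teacher latents could be sketched as below; the InfoNCE-style form, temperature, and in-batch negatives are assumptions for illustration, and the exact SS-MAST formulation may differ.

```python
import torch
import torch.nn.functional as F

def symmetric_contrastive_loss(z_student, z_teacher, temperature=0.1):
    """z_student, z_teacher: (batch, dim) latent representations of the same audio clips."""
    zs = F.normalize(z_student, dim=-1)
    zt = F.normalize(z_teacher, dim=-1)
    logits = zs @ zt.t() / temperature                        # (batch, batch) similarities
    targets = torch.arange(zs.size(0), device=zs.device)      # matching pairs on the diagonal
    # Cross-entropy in both directions, averaged, makes the objective symmetric.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```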