MAST: Multiscale Audio Spectrogram Transformers

Ghosh, Sreyan; Manocha, Dinesh; Seth, Ashish; Umesh, S.

Authors: Sreyan Ghosh, Dinesh Manocha, Ashish Seth, S. Umesh
Publication date: 2 November 2022

Abstract

We present the Multiscale Audio Spectrogram Transformer (MAST) for audio classification, which brings the concept of multiscale feature hierarchies to the Audio Spectrogram Transformer (AST). Given an input audio spectrogram, we first patchify and project it into an initial temporal resolution and embedding dimension, after which the multiple stages in MAST progressively expand the embedding dimension while reducing the temporal resolution of the input. This pyramid structure allows the early layers of MAST, which operate at a high temporal resolution but with low-dimensional embeddings, to model simple low-level acoustic information, and the deeper, temporally coarse layers to model high-level acoustic information with high-dimensional embeddings. We also extend our approach to present a new Self-Supervised Learning (SSL) method called SS-MAST, which calculates a symmetric contrastive loss between latent representations from a student and a teacher encoder. In practice, MAST significantly outperforms AST by an average accuracy of 3.4% across 8 speech and non-speech tasks from the LAPE Benchmark. Moreover, SS-MAST achieves an absolute average improvement of 2.6% over SSAST for both AST and MAST encoders. We make all our code available on GitHub at the time of publication.

Comment: Submitted ICASSP 202
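As a rough illustration of the pyramid structure described in the abstract, the sketch below tabulates how temporal resolution and embedding dimension might evolve from stage to stage. The function name `pyramid_schedule`, the number of stages, the halving of temporal resolution, the doubling of embedding dimension, and the starting values are all assumptions chosen for illustration; the abstract does not state the actual configuration used in MAST.

```python
from typing import List, Tuple


def pyramid_schedule(
    num_frames: int,
    embed_dim: int,
    num_stages: int = 4,   # assumed; not given in the abstract
    time_stride: int = 2,  # assumed temporal downsampling factor per stage
    dim_mult: int = 2,     # assumed embedding expansion factor per stage
) -> List[Tuple[int, int]]:
    """Return (temporal_resolution, embedding_dim) for each stage.

    Early stages keep many time steps with narrow embeddings; later
    stages keep fewer time steps with wider embeddings.
    """
    shapes = []
    t, d = num_frames, embed_dim
    for _ in range(num_stages):
        shapes.append((t, d))
        # Moving to the next stage: coarsen time, widen the embedding.
        t = max(1, t // time_stride)
        d = d * dim_mult
    return shapes


if __name__ == "__main__":
    # Hypothetical starting point: 256 time steps, 96-dim embeddings.
    for stage, (t, d) in enumerate(pyramid_schedule(256, 96)):
        print(f"stage {stage}: time steps = {t}, embedding dim = {d}")
```

Under these assumed factors, each stage trades temporal resolution for embedding width, which is the trade-off the abstract attributes to the multiscale hierarchy.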

Available Versions

arXiv.org e-Print Archive

oai:arXiv.org:2211.01515

Last time updated on 08/12/2022