CST-former: Transformer with Channel-Spectro-Temporal Attention for Sound Event Localization and Detection
Sound event localization and detection (SELD) is the task of classifying sound events and localizing their direction of arrival (DoA) from multichannel acoustic signals. Prior studies employ spectral and channel information as the embedding for temporal attention. However, this usage limits the deep neural network's ability to extract meaningful features from the spectral or spatial domains. Therefore, this paper presents a novel framework termed the Channel-Spectro-Temporal Transformer (CST-former), which strengthens SELD performance by applying attention mechanisms independently to distinct domains: channel, spectral, and temporal information are each processed by a separate attention mechanism. In addition, we propose an unfolded local embedding (ULE) technique for channel attention (CA) that generates informative embedding vectors containing local spectral and temporal information. Empirical
validation on the 2022 and 2023 DCASE Challenge Task 3 datasets confirms the efficacy of applying attention separately across each domain and the benefit of ULE in enhancing SELD performance.

Comment: Accepted to ICASSP 2024