In recent years, dynamic parameterization of acoustic environments has attracted
increasing attention in the field of audio processing. One of the key parameters
characterizing local room acoustics, independently of the orientation and
directivity of sources and receivers, is the geometric room
volume. Convolutional neural networks (CNNs) have been widely adopted for blind
room acoustic parameter estimation, which aims to learn a direct mapping from
audio spectrograms to the corresponding parameter labels. Motivated by the recent
success of self-attention mechanisms, this paper introduces a purely
attention-based model that blindly estimates room volume from single-channel
noisy speech signals. We demonstrate the feasibility of eliminating the reliance
on CNNs for this task: the proposed Transformer architecture takes Gammatone
magnitude spectral coefficients and phase spectrograms as inputs.
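To make this front end concrete, below is a minimal sketch of how such inputs could be computed from a mono waveform: log-energy Gammatone band coefficients from an ERB-spaced filterbank (via scipy.signal.gammatone) alongside the phase of a plain STFT. The band count, ERB range, frame length, and hop size are illustrative assumptions, not the paper's reported settings.

```python
import numpy as np
from scipy.signal import gammatone, lfilter

def erb_space(low_freq, high_freq, num_bands):
    # Centre frequencies equally spaced on the ERB-rate scale (Glasberg & Moore).
    ear_q, min_bw = 9.26449, 24.7
    i = np.arange(1, num_bands + 1)
    return -(ear_q * min_bw) + np.exp(
        i * (np.log(low_freq + ear_q * min_bw)
             - np.log(high_freq + ear_q * min_bw)) / num_bands
    ) * (high_freq + ear_q * min_bw)

def gammatone_phase_features(x, fs=16000, n_bands=32, frame_len=512, hop=256):
    """Per-frame log Gammatone band energies plus the STFT phase spectrogram."""
    cfs = erb_space(50.0, 0.9 * fs / 2, n_bands)     # assumed ERB range
    n_frames = 1 + (len(x) - frame_len) // hop
    gtm = np.empty((n_bands, n_frames))
    for k, cf in enumerate(cfs):
        b, a = gammatone(cf, 'fir', fs=fs)           # Gammatone FIR filter at cf
        y = lfilter(b, a, x)                         # band-limited signal
        frames = np.lib.stride_tricks.sliding_window_view(
            y, frame_len)[::hop][:n_frames]
        gtm[k] = np.log(np.mean(frames ** 2, axis=1) + 1e-10)
    win = np.hanning(frame_len)
    spec = np.stack([np.fft.rfft(win * x[i * hop:i * hop + frame_len])
                     for i in range(n_frames)], axis=1)
    return gtm, np.angle(spec)   # shapes: (n_bands, T), (frame_len//2+1, T)
```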
To enhance the model's performance on the task-specific dataset, cross-modality
transfer learning is also applied. Experimental results demonstrate that the
proposed model outperforms traditional CNN models across a wide range of
real-world acoustic spaces, especially with the help of the dedicated
pretraining and data augmentation schemes.
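On the model side, the following is a minimal sketch of a purely attention-based regressor in this spirit: a standard PyTorch Transformer encoder over feature frames, with a learnable classification token read out to a scalar log-volume. For brevity only the Gammatone channel is shown as input; the phase spectrogram would enter as additional feature dimensions. The embedding size, depth, head count, and CLS readout are assumptions rather than the paper's reported configuration, and the cross-modality pretraining step is omitted because the abstract does not specify the source modality or weights.

```python
import torch
import torch.nn as nn

class VolumeTransformer(nn.Module):
    """Minimal attention-only regressor: feature frames -> scalar log volume."""
    def __init__(self, n_feats=32, d_model=128, n_heads=4, n_layers=4, max_len=512):
        super().__init__()
        self.proj = nn.Linear(n_feats, d_model)              # frame embedding
        self.cls = nn.Parameter(torch.zeros(1, 1, d_model))  # learnable [CLS] token
        self.pos = nn.Parameter(torch.zeros(1, max_len + 1, d_model))
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, 1)                    # scalar log-volume

    def forward(self, x):                            # x: (batch, time, n_feats)
        h = self.proj(x)                             # frame embeddings
        cls = self.cls.expand(h.size(0), -1, -1)     # prepend [CLS] per sample
        h = torch.cat([cls, h], dim=1)
        h = h + self.pos[:, : h.size(1)]             # learned positional encoding
        h = self.encoder(h)
        return self.head(h[:, 0]).squeeze(-1)        # [CLS] readout -> log volume

model = VolumeTransformer()
feats = torch.randn(8, 250, 32)    # e.g. 8 utterances x 250 frames x 32 bands
print(model(feats).shape)          # torch.Size([8])
```

In practice, the cross-modality transfer step would typically amount to initializing the encoder weights from a model pretrained on another domain before fine-tuning on room-volume labels.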