Basic Filters for Convolutional Neural Networks Applied to Music: Training or Design?
When convolutional neural networks are used to tackle learning problems based
on music or, more generally, time series data, raw one-dimensional data are
commonly pre-processed to obtain spectrogram or mel-spectrogram coefficients,
which are then used as input to the actual neural network. In this
contribution, we investigate, both theoretically and experimentally, the
influence of this pre-processing step on the network's performance and ask
whether replacing it with adaptive or learned filters applied directly to the
raw data can improve learning success. The theoretical results show
that approximately reproducing mel-spectrogram coefficients by applying
adaptive filters and subsequent time-averaging is in principle possible. We
also conducted extensive experimental work on the task of singing voice
detection in music. The results of these experiments show that for
classification based on Convolutional Neural Networks the features obtained
from adaptive filter banks followed by time-averaging perform better than the
canonical Fourier-transform-based mel-spectrogram coefficients. Alternative
adaptive approaches with center frequencies or time-averaging lengths learned
from training data perform equally well.
Comment: Completely revised version; 21 pages, 4 figures
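The pipeline described above, where raw audio passes through a bank of band-pass filters whose rectified outputs are time-averaged into mel-spectrogram-like coefficients, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the Gaussian-modulated filters, mel-spaced center frequencies, and window sizes are all hypothetical stand-ins for what would in practice be designed or learned from data.

```python
import numpy as np

def gaussian_filterbank(num_filters, filter_len, sr):
    """Band-pass filters with mel-spaced center frequencies
    (illustrative stand-in for filters that would be learned)."""
    # mel-spaced center frequencies up to the Nyquist rate
    mel_max = 2595 * np.log10(1 + (sr / 2) / 700)
    mels = np.linspace(mel_max / num_filters, mel_max, num_filters)
    centers = 700 * (10 ** (mels / 2595) - 1)
    t = np.arange(filter_len) / sr
    # Gaussian envelope modulated by a cosine at each center frequency
    env = np.exp(-0.5 * ((t - t.mean()) / (filter_len / (6 * sr))) ** 2)
    return np.stack([env * np.cos(2 * np.pi * fc * t) for fc in centers])

def filterbank_features(x, filters, hop):
    """Filter raw audio, rectify, then time-average frame by frame."""
    responses = np.stack([np.convolve(x, h, mode="same") for h in filters])
    energy = responses ** 2
    n_frames = energy.shape[1] // hop
    frames = energy[:, : n_frames * hop].reshape(len(filters), n_frames, hop)
    # log of the per-frame mean energy, analogous to log-mel coefficients
    return np.log(frames.mean(axis=2) + 1e-8)

sr = 16000
x = np.random.randn(sr)  # 1 s of noise as a stand-in for a music signal
feats = filterbank_features(x, gaussian_filterbank(40, 401, sr), hop=160)
print(feats.shape)  # (40, 100): 40 bands, 100 frames at 10 ms hop
```

In a learned variant, the filter coefficients (or just their center frequencies and averaging lengths, as in the abstract's alternative approaches) would be trainable parameters optimized jointly with the downstream CNN.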
Does Single-channel Speech Enhancement Improve Keyword Spotting Accuracy? A Case Study
Noise robustness is a key aspect of successful speech applications. Speech
enhancement (SE) has been investigated to improve automatic speech recognition
accuracy; however, its effectiveness for keyword spotting (KWS) is still
under-investigated. In this paper, we conduct a comprehensive study on
single-channel speech enhancement for keyword spotting on the Google Speech
Command (GSC) dataset. To investigate robustness to noise, the GSC dataset is
augmented with noise signals from the WSJ0 Hipster Ambient Mixtures (WHAM!)
noise dataset. Our investigation includes not only applying SE before KWS but
also performing joint training of the SE frontend and KWS backend models.
Moreover, we explore audio injection, a common approach to reduce distortions
by using a weighted average of the enhanced and original signals. Audio
injection is then further optimized by using another model that predicts the
weight for each utterance. Our investigation reveals that SE can improve KWS
accuracy on noisy speech when the backend model is trained on clean speech;
however, despite our extensive exploration, it is difficult to improve the KWS
accuracy with SE when the backend is trained on noisy speech.
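The audio-injection step described above is just a convex combination of the enhanced and original waveforms. A minimal sketch, assuming a fixed blend weight in place of the per-utterance weight predictor mentioned in the abstract:

```python
import numpy as np

def audio_injection(original, enhanced, w):
    """Blend the enhanced and original signals to limit SE distortion.
    w = 1.0 keeps only the enhanced signal; w = 0.0 keeps the original."""
    return w * enhanced + (1.0 - w) * original

sr = 16000
noisy = np.random.randn(sr)      # stand-in for a noisy utterance
enhanced = 0.5 * noisy           # stand-in for an SE frontend's output
blended = audio_injection(noisy, enhanced, w=0.7)
print(blended.shape)  # (16000,)
```

In the optimized variant, a separate model would predict `w` per utterance from the audio itself, so that utterances where enhancement introduces artifacts lean more on the original signal.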