Explaining deep learning models for speech enhancement
We consider the problem of explaining the robustness to mismatched noise conditions of neural networks used to compute time-frequency masks for speech enhancement. We employ the Deep SHapley Additive exPlanations (DeepSHAP) feature attribution method to quantify the contribution of every time-frequency bin in the input noisy speech signal to every time-frequency bin in the output time-frequency mask. We define an objective metric, referred to as the speech relevance score, that summarizes the obtained SHAP values, and show that it correlates with the enhancement performance, as measured by the word error rate on the CHiME-4 real evaluation dataset. We use the speech relevance score to explain the generalization ability of three speech enhancement models trained using synthetically generated speech-shaped noise, noise from a professional sound effects library, or real CHiME-4 noise. To the best of our knowledge, this is the first study on neural network explainability in the context of speech enhancement.
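For readers unfamiliar with the Shapley attributions that DeepSHAP approximates, the following is a minimal sketch of the exact computation on a toy function. It is not the method or code from the paper: the function `toy_mask`, the three-value input, and the all-zero baseline are hypothetical, and the brute-force enumeration shown here is only feasible for a handful of features, which is why approximations such as DeepSHAP are used on real networks.

```python
from itertools import combinations
from math import factorial


def shapley_values(f, x, baseline):
    """Exact Shapley attributions of f(x) relative to f(baseline).

    Features outside a coalition are replaced by their baseline value.
    The cost is exponential in len(x), so this is only for toy inputs.
    """
    n = len(x)

    def value(coalition):
        # Coalition features come from x, all others from the baseline.
        mixed = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return f(mixed)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Standard Shapley weight for a coalition of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


# Hypothetical toy "mask" output: a nonlinear function of three input bins.
def toy_mask(bins):
    return bins[0] * bins[1] + 2.0 * bins[2]


x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]  # assumed reference input
phi = shapley_values(toy_mask, x, baseline)
```

The attributions sum to `toy_mask(x) - toy_mask(baseline)`, which is the additivity property that makes it meaningful to aggregate SHAP values into a single summary score.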
DeepVOX: Discovering Features from Raw Audio for Speaker Recognition in Degraded Audio Signals
Automatic speaker recognition algorithms typically use pre-defined
filterbanks, such as Mel-Frequency and Gammatone filterbanks, for
characterizing speech audio. The design of these filterbanks is based on
domain-knowledge and limited empirical observations. The resultant features,
therefore, may not generalize well to different types of audio degradation. In
this work, we propose a deep learning-based technique to induce the filterbank
design from vast amounts of speech audio. The purpose of such a filterbank is
to extract features robust to degradations in the input audio. To this effect,
a 1D convolutional neural network is designed to learn a time-domain filterbank
called DeepVOX directly from raw speech audio. Secondly, an adaptive triplet
mining technique is developed to efficiently mine the data samples best suited
to train the filterbank. Thirdly, a detailed ablation study of the DeepVOX
filterbanks reveals the presence of both vocal source and vocal tract
characteristics in the extracted features. Experimental results on VOXCeleb2,
NIST SRE 2008 and 2010, and Fisher speech datasets demonstrate the efficacy of
the DeepVOX features across a variety of audio degradations, multi-lingual
speech data, and varying-duration speech audio. The DeepVOX features also
improve the performance of existing speaker recognition algorithms, such as the
xVector-PLDA and the iVector-PLDA.
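As background for the triplet mining mentioned above, here is a minimal sketch of a triplet loss with generic hardest-negative selection. This is a standard textbook strategy used for illustration only; it is not the adaptive triplet mining technique proposed in the paper, whose details the abstract does not give. The two-dimensional embeddings and the margin of 1.0 are hypothetical.

```python
from math import sqrt


def euclidean(a, b):
    """Euclidean distance between two equal-length embeddings."""
    return sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge loss: zero once the negative is `margin` farther than the positive."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


def hardest_negative(anchor, negatives):
    """Pick the negative closest to the anchor (generic hard-negative mining)."""
    return min(negatives, key=lambda n: euclidean(anchor, n))


# Hypothetical embeddings: the anchor and positive share a speaker, negatives do not.
anchor = [0.0, 0.0]
positive = [1.0, 0.0]
negatives = [[3.0, 0.0], [0.0, 1.5], [5.0, 5.0]]

hard = hardest_negative(anchor, negatives)
hard_loss = triplet_loss(anchor, positive, hard)
easy_loss = triplet_loss(anchor, positive, [5.0, 5.0])
```

An easy negative already satisfies the margin and contributes zero loss, so it carries no training signal; this is the general motivation for mining informative triplets rather than sampling them uniformly.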
Understanding and Visualizing Raw Waveform-based CNNs
Modeling raw waveforms directly with neural networks for speech processing is gaining more and more attention. Despite its varied success, a question remains: what kind of information do such neural networks capture or learn from the speech signal for different tasks? Such insight is of interest not only for advancing these techniques but also for better understanding speech signal characteristics. This paper takes a step in that direction: we develop a gradient-based approach to estimate the relevance of each input speech sample to the output score. We show that analyzing the resulting "relevance signal" with conventional speech signal processing techniques can reveal the information modeled by the whole network. We demonstrate the potential of the proposed approach by analyzing raw waveform CNN-based phone recognition and speaker identification systems.
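The idea of a per-sample relevance signal can be sketched on a toy model as follows. This is an illustration under stated assumptions, not the paper's implementation: the "network" is a hypothetical single 1D convolution followed by tanh and a sum, and the gradient is estimated with central finite differences as a simple stand-in for the backpropagated gradient a real system would use.

```python
from math import tanh


def score(signal, kernel):
    """Toy stand-in for a raw-waveform CNN: 1D convolution, tanh, then sum."""
    k = len(kernel)
    total = 0.0
    for t in range(len(signal) - k + 1):
        pre = sum(kernel[j] * signal[t + j] for j in range(k))
        total += tanh(pre)
    return total


def relevance_signal(signal, kernel, eps=1e-5):
    """Central-difference estimate of d(score)/d(sample) for every input sample."""
    grads = []
    for i in range(len(signal)):
        up = list(signal)
        up[i] += eps
        down = list(signal)
        down[i] -= eps
        grads.append((score(up, kernel) - score(down, kernel)) / (2 * eps))
    return grads


# Hypothetical waveform samples and filter taps.
signal = [0.1, -0.2, 0.3, 0.0, -0.1]
kernel = [0.5, -1.0, 0.5]
relevance = relevance_signal(signal, kernel)
```

The result has one value per input sample, so it can itself be treated as a signal and examined with ordinary tools such as a spectrum estimate, which is the kind of analysis the abstract describes.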