Towards More Efficient DNN-Based Speech Enhancement Using Quantized Correlation Mask
Many studies on deep learning-based speech enhancement (SE) that follow the computational auditory scene analysis method typically employ the ideal binary mask or the ideal ratio mask to reconstruct the enhanced speech signal. However, many SE applications in real scenarios demand a desirable balance between denoising capability and computational cost. In this study, first, an improvement over the ideal ratio mask to attain superior SE performance is proposed by introducing an efficient adaptive correlation-based factor for adjusting the ratio mask. The proposed method exploits the correlation coefficients among the noisy speech, noise, and clean speech to effectively re-distribute the power ratio of the speech and noise during the ratio mask construction phase. Second, to make the supervised SE system more computationally efficient, quantization techniques are considered to reduce the number of bits needed to represent floating-point numbers, leading to a more compact SE model. The proposed quantized correlation mask is utilized in conjunction with a 4-layer deep neural network (DNN-QCM) comprising dropout regularization, pre-training, and noise-aware training to derive a robust and high-order mapping for enhancement and to improve generalization capability in unseen conditions. Results show that the quantized correlation mask outperforms the conventional ratio mask representation and the other SE algorithms used for comparison. When compared to a DNN with the ideal ratio mask as its learning target, the DNN-QCM provided an improvement of approximately 6.5% in the short-time objective intelligibility score and 11.0% in the perceptual evaluation of speech quality score. The introduction of the quantization method can reduce the neural network weights from a 32-bit to a 5-bit representation while effectively suppressing stationary and non-stationary noise. Timing analyses also show that with the techniques incorporated in the proposed DNN-QCM system to increase its compac..
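The abstract does not specify the exact form of the correlation-based factor, so the sketch below only illustrates the general recipe it describes: build a conventional ideal ratio mask from clean-speech and noise magnitudes, rescale it with a correlation coefficient, and uniformly quantize the result. The function names, the adjustment-by-exponentiation step, and the placeholder data are assumptions, not the paper's definitions.

```python
# Minimal sketch of a ratio-mask pipeline with quantization, assuming
# STFT magnitude arrays S (clean speech) and N (noise) of equal shape.
# The correlation-based adjustment is a simplified stand-in for the
# paper's factor, which the abstract does not fully specify.
import numpy as np

def ideal_ratio_mask(S, N, beta=0.5):
    """Conventional IRM: per-bin speech-to-mixture power ratio."""
    return (S**2 / (S**2 + N**2 + 1e-12)) ** beta

def correlation_factor(noisy, clean):
    """Illustrative global correlation coefficient between two signals."""
    noisy = noisy - noisy.mean()
    clean = clean - clean.mean()
    denom = np.sqrt((noisy**2).sum() * (clean**2).sum()) + 1e-12
    return (noisy * clean).sum() / denom

def quantize(values, bits=5):
    """Uniform quantization of values in [0, 1] to 2**bits levels."""
    levels = 2**bits - 1
    return np.round(np.clip(values, 0.0, 1.0) * levels) / levels

# Usage: adjust the IRM by a correlation factor, then quantize it.
S = np.abs(np.random.randn(257, 100))   # placeholder clean magnitudes
N = np.abs(np.random.randn(257, 100))   # placeholder noise magnitudes
rho = correlation_factor(S + N, S)
mask = quantize(ideal_ratio_mask(S, N) ** rho, bits=5)
```

The same uniform quantizer could, in spirit, be applied to network weights to obtain the 5-bit representation the abstract mentions, though the paper's actual weight-quantization scheme is not given here.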
Subjective intelligibility of speech sounds enhanced by ideal ratio mask via crowdsourced remote experiments with effective data screening
It is essential to perform speech intelligibility (SI) experiments with human listeners to evaluate the effectiveness of objective intelligibility measures. Recently, crowdsourced remote testing has become popular for collecting a large amount and variety of data at relatively low cost and in a short time. However, careful data screening is essential for obtaining reliable SI data. We compared the results of laboratory and crowdsourced remote experiments to establish an effective data-screening technique. We evaluated the SI of noisy speech sounds enhanced by a single-channel ideal ratio mask (IRM) and by multi-channel mask-based beamformers. The results demonstrated that SI scores were improved by these enhancement methods. In particular, the IRM-enhanced sounds were rated much higher than the unprocessed and other enhanced sounds, indicating that IRM enhancement may represent the upper limit of speech enhancement performance. Moreover, tone pip tests, in which participants were asked to report the number of audible tone pips, reduced the variability of the crowdsourced remote results so that they became similar to the laboratory results. Tone pip tests could be useful for future crowdsourced experiments because of their simplicity and effectiveness for data screening.
Comment: This paper was submitted to Interspeech 2022 (http://www.interspeech2022.org)
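As a concrete illustration of the screening step, here is a minimal sketch of a tone pip test: each trial concatenates a known number of short sinusoidal pips, and a participant is retained only if the reported counts match the true counts. The stimulus parameters (frequency, pip duration, gap) and the all-trials-correct pass criterion are assumptions; the abstract does not specify them.

```python
# Hedged sketch of tone-pip screening for crowdsourced listening tests.
import numpy as np

def tone_pip_trial(n_pips, fs=16000, f0=1000.0, pip_dur=0.05, gap=0.2):
    """Concatenate n_pips short Hann-windowed sinusoidal pips with silent gaps."""
    t = np.arange(int(fs * pip_dur)) / fs
    pip = np.sin(2 * np.pi * f0 * t) * np.hanning(t.size)
    silence = np.zeros(int(fs * gap))
    return np.concatenate([np.concatenate([pip, silence]) for _ in range(n_pips)])

def passes_screening(reported, actual):
    """Assumed criterion: every reported pip count matches the true count."""
    return all(r == a for r, a in zip(reported, actual))

# Usage: a participant who counts all trials correctly is retained.
actual = [3, 5, 2]
reported = [3, 5, 2]
keep_participant = passes_screening(reported, actual)
```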
Seeing Through Noise: Visually Driven Speaker Separation and Enhancement
Isolating the voice of a specific person while filtering out other voices or background noises is challenging when video is shot in noisy environments. We propose audio-visual methods to isolate the voice of a single speaker and eliminate unrelated sounds. First, face motions captured in the video are used to estimate the speaker's voice by passing the silent video frames through a neural-network-based video-to-speech model. Then the speech predictions are applied as a filter on the noisy input audio. This approach avoids using mixtures of sounds in the learning process, as the number of such possible mixtures is huge and would inevitably bias the trained model. We evaluate our method on two audio-visual datasets, GRID and TCD-TIMIT, and show that our method attains significant SDR and PESQ improvements over the raw video-to-speech predictions and over a well-known audio-only method.
Comment: Supplementary video: https://www.youtube.com/watch?v=qmsyj7vAzo
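A minimal sketch of the filtering stage described above, assuming a video-to-speech model has already produced an estimated waveform: the predicted magnitude spectrogram defines a ratio-style mask that is applied to the noisy mixture's STFT. The STFT settings and the clipped-ratio mask form are illustrative assumptions, not the paper's exact design.

```python
# Apply a video-to-speech prediction as a spectral filter on noisy audio.
# Assumes `noisy` and `predicted` are 1-D waveforms of the same length.
import numpy as np
from scipy.signal import stft, istft

def filter_with_prediction(noisy, predicted, fs=16000, nperseg=512):
    _, _, Y = stft(noisy, fs=fs, nperseg=nperseg)      # noisy spectrogram
    _, _, P = stft(predicted, fs=fs, nperseg=nperseg)  # predicted speech
    mask = np.abs(P) / (np.abs(Y) + 1e-12)             # ratio-style mask
    mask = np.clip(mask, 0.0, 1.0)                     # attenuation only
    _, enhanced = istft(mask * Y, fs=fs, nperseg=nperseg)
    return enhanced
```

Clipping the mask to [0, 1] ensures the filter can only attenuate, never amplify, time-frequency bins; this is one common design choice, not necessarily the authors'.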