Collaborative Deep Learning for Speech Enhancement: A Run-Time Model Selection Method Using Autoencoders
We show that a Modular Neural Network (MNN) can combine various speech enhancement modules, each of which is a Deep Neural Network (DNN) specialized in a particular enhancement job. Unlike an ordinary ensemble technique that averages variations across models, the proposed MNN selects the best module for the unseen test signal to produce a greedy ensemble. We see this as Collaborative Deep Learning (CDL), because it can reuse various already-trained DNN models without any further refinement. In the proposed MNN, selecting the best module at run time is challenging. To this end, we employ a speech AutoEncoder (AE) as an arbitrator, whose input and output are trained to be as similar as possible when its input is clean speech. Therefore, the AE can gauge the quality of each module-specific denoised result from its AE reconstruction error: a low error means that the module output is similar to clean speech. We propose an MNN structure with various modules, each specialized in dealing with a specific noise type, gender, and input Signal-to-Noise Ratio (SNR) value, and empirically show that it almost always works better than an arbitrarily chosen DNN module and is sometimes as good as an oracle result.
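
To make the run-time selection concrete, the following is a minimal PyTorch sketch of the AE-as-arbitrator idea, not the authors' implementation: every specialized module denoises the input, and the output with the lowest AE reconstruction error is chosen. The architectures, feature dimensions, and names (DenoiserModule, SpeechAutoEncoder, select_best_module) are illustrative assumptions.

import torch
import torch.nn as nn


class DenoiserModule(nn.Module):
    """Stand-in for one specialized enhancement DNN (per noise type / gender / SNR)."""
    def __init__(self, dim=257):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 512), nn.ReLU(), nn.Linear(512, dim))

    def forward(self, x):                     # x: (frames, dim) magnitude features
        return self.net(x)


class SpeechAutoEncoder(nn.Module):
    """Arbitrator AE, assumed pre-trained to reconstruct clean speech only."""
    def __init__(self, dim=257, bottleneck=64):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(dim, bottleneck), nn.ReLU())
        self.dec = nn.Linear(bottleneck, dim)

    def forward(self, x):
        return self.dec(self.enc(x))


def select_best_module(noisy, modules, arbitrator_ae):
    """Run every module, score each output by AE reconstruction error,
    and return the output with the lowest error (closest to clean speech)."""
    best_out, best_err = None, float("inf")
    with torch.no_grad():
        for m in modules:
            denoised = m(noisy)
            err = torch.mean((arbitrator_ae(denoised) - denoised) ** 2).item()
            if err < best_err:
                best_out, best_err = denoised, err
    return best_out, best_err


# Toy usage with untrained stand-in networks.
modules = [DenoiserModule() for _ in range(3)]
arbitrator = SpeechAutoEncoder()
noisy_frames = torch.randn(100, 257)
enhanced, score = select_best_module(noisy_frames, modules, arbitrator)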
Voicing classification of visual speech using convolutional neural networks
The application of neural network and convolutional neural network (CNN) architectures is explored for the tasks of voicing classification (classifying frames as being either non-speech, unvoiced, or voiced) and voice activity detection (VAD) of visual speech. Experiments are conducted for both speaker dependent and speaker independent scenarios. A Gaussian mixture model (GMM) baseline system is developed using standard image-based two-dimensional discrete cosine transform (2D-DCT) visual speech features, achieving speaker dependent accuracies of 79% and 94% for voicing classification and VAD respectively. Additionally, a single-layer neural network system trained using the same visual features achieves accuracies of 86% and 97%. A novel technique using convolutional neural networks for visual speech feature extraction and classification is presented. The voicing classification and VAD results using this system are further improved to 88% and 98% respectively. The speaker independent results show the neural network system to outperform both the GMM and CNN systems, achieving accuracies of 63% for voicing classification and 79% for voice activity detection.
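
As an illustration of the CNN-based classification stage, here is a minimal PyTorch sketch, not the paper's architecture: a small CNN maps a grayscale mouth-region frame to one of three voicing classes, and VAD is obtained by collapsing unvoiced and voiced into a single speech class. The input size, layer widths, and class ordering are assumptions.

import torch
import torch.nn as nn


class VoicingCNN(nn.Module):
    """Three-way voicing classifier; assumed class indices: 0 = non-speech, 1 = unvoiced, 2 = voiced."""
    def __init__(self, n_classes=3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 16 * 16, 128), nn.ReLU(),
            nn.Linear(128, n_classes),        # logits for non-speech / unvoiced / voiced
        )

    def forward(self, x):                     # x: (batch, 1, 64, 64) grayscale mouth ROI
        return self.classifier(self.features(x))


# Toy usage: voicing logits per frame, with VAD derived from the same output.
model = VoicingCNN()
frames = torch.randn(8, 1, 64, 64)
voicing_logits = model(frames)                # (8, 3)
is_speech = voicing_logits.argmax(dim=1) != 0 # True for unvoiced or voiced frames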
Similarity-and-Independence-Aware Beamformer: Method for Target Source Extraction using Magnitude Spectrogram as Reference
This study presents a novel method for source extraction, referred to as the similarity-and-independence-aware beamformer (SIBF). The SIBF extracts the target signal using a rough magnitude spectrogram as the reference signal. The advantage of the SIBF is that it can obtain a more accurate target signal than the reference spectrogram generated by target-enhancing methods such as speech enhancement based on deep neural networks (DNNs). For the extraction, we extend the framework of deflationary independent component analysis by considering the similarity between the reference and the extracted target, as well as the mutual independence of all potential sources. To solve the extraction problem by maximum-likelihood estimation, we introduce two source model types that can reflect the similarity. Experimental results on the CHiME3 dataset show that the target signal extracted by the SIBF is more accurate than the reference signal generated by the DNN.
Index Terms: semiblind source separation, similarity-and-independence-aware beamformer, deflationary independent component analysis, source model
Comment: Accepted in INTERSPEECH 202
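
The SIBF derivation itself rests on deflationary ICA with dedicated source models; as a heavily simplified sketch of the broader reference-guided extraction idea only, the code below implements a standard mask-based max-SNR (GEV) beamformer in NumPy/SciPy, in which a rough reference magnitude weights the spatial covariance estimates. This is a stand-in technique, not the SIBF, and all array shapes and names are assumptions.

import numpy as np
from scipy.linalg import eigh


def reference_guided_beamformer(X, ref_mag, eps=1e-8):
    """X: multichannel STFT, shape (freq, time, channels).
    ref_mag: rough reference magnitude spectrogram, shape (freq, time).
    Returns a single-channel STFT estimate of the target, shape (freq, time)."""
    F, T, C = X.shape
    # Soft time-frequency mask from the reference: close to 1 where the
    # reference says the target dominates the observed mixture.
    mask = np.clip(ref_mag / (np.abs(X).mean(axis=2) + eps), 0.0, 1.0)
    Y = np.zeros((F, T), dtype=complex)
    for f in range(F):
        Xf = X[f]                                   # (time, channels)
        w_t = mask[f][:, None]                      # (time, 1) mask weights
        # Mask-weighted target and noise spatial covariance matrices.
        Rt = (w_t * Xf).T @ Xf.conj() / (w_t.sum() + eps)
        Rn = ((1.0 - w_t) * Xf).T @ Xf.conj() / ((1.0 - w_t).sum() + eps)
        Rn += eps * np.eye(C)
        # Max-SNR (GEV) filter: principal generalized eigenvector of (Rt, Rn).
        _, vecs = eigh(Rt, Rn)
        w = vecs[:, -1]
        # Per-bin scaling ambiguity is ignored in this sketch.
        Y[f] = Xf @ w.conj()
    return Y


# Toy usage with random data: 257 frequency bins, 50 frames, 4 channels.
X = np.random.randn(257, 50, 4) + 1j * np.random.randn(257, 50, 4)
ref = np.abs(X[:, :, 0])          # pretend the first channel's magnitude is the rough reference
Y = reference_guided_beamformer(X, ref)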
- …