Collaborative Deep Learning for Speech Enhancement: A Run-Time Model Selection Method Using Autoencoders
We show that a Modular Neural Network (MNN) can combine various speech enhancement modules, each of which is a Deep Neural Network (DNN) specialized in a particular enhancement job. Unlike an ordinary ensemble technique that averages variations across models, the proposed MNN selects the best module for the unseen test signal to produce a greedy ensemble. We see this as Collaborative Deep Learning (CDL), because it can reuse various already-trained DNN models without any further refinement. In the proposed MNN, selecting the best module at run time is challenging. To this end, we employ a speech AutoEncoder (AE) as an arbitrator, whose input and output are trained to be as similar as possible when its input is clean speech. Therefore, the AE can gauge the quality of each module-specific denoised result from its AE reconstruction error: a low error means that the module output is similar to clean speech. We propose an MNN structure with various modules, each specialized in dealing with a specific noise type, gender, and input Signal-to-Noise Ratio (SNR) value, and empirically show that it almost always works better than an arbitrarily chosen DNN module and is sometimes as good as an oracle result.
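
To make the run-time selection concrete, the following is a minimal PyTorch sketch of the AE-as-arbitrator idea, not the authors' implementation: every specialized module denoises the input, and the output with the lowest AE reconstruction error is chosen. The architectures, feature dimensions, and names (DenoiserModule, SpeechAutoEncoder, select_best_module) are illustrative assumptions.

import torch
import torch.nn as nn


class DenoiserModule(nn.Module):
    """Stand-in for one specialized enhancement DNN (per noise type / gender / SNR)."""
    def __init__(self, dim=257):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 512), nn.ReLU(), nn.Linear(512, dim))

    def forward(self, x):                     # x: (frames, dim) magnitude features
        return self.net(x)


class SpeechAutoEncoder(nn.Module):
    """Arbitrator AE, assumed pre-trained to reconstruct clean speech only."""
    def __init__(self, dim=257, bottleneck=64):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(dim, bottleneck), nn.ReLU())
        self.dec = nn.Linear(bottleneck, dim)

    def forward(self, x):
        return self.dec(self.enc(x))


def select_best_module(noisy, modules, arbitrator_ae):
    """Run every module, score each output by AE reconstruction error,
    and return the output with the lowest error (closest to clean speech)."""
    best_out, best_err = None, float("inf")
    with torch.no_grad():
        for m in modules:
            denoised = m(noisy)
            err = torch.mean((arbitrator_ae(denoised) - denoised) ** 2).item()
            if err < best_err:
                best_out, best_err = denoised, err
    return best_out, best_err


# Toy usage with untrained stand-in networks.
modules = [DenoiserModule() for _ in range(3)]
arbitrator = SpeechAutoEncoder()
noisy_frames = torch.randn(100, 257)
enhanced, score = select_best_module(noisy_frames, modules, arbitrator)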
Voicing classification of visual speech using convolutional neural networks
The application of neural network and convolutional neural network (CNN) architectures is explored for the tasks of voicing classification (classifying frames as being either non-speech, unvoiced, or voiced) and voice activity detection (VAD) of visual speech. Experiments are conducted for both speaker dependent and speaker independent scenarios. A Gaussian mixture model (GMM) baseline system is developed using standard image-based two-dimensional discrete cosine transform (2D-DCT) visual speech features, achieving speaker dependent accuracies of 79% and 94% for voicing classification and VAD respectively. Additionally, a single-layer neural network system trained using the same visual features achieves accuracies of 86% and 97%. A novel technique using convolutional neural networks for visual speech feature extraction and classification is presented. The voicing classification and VAD results using this system are further improved to 88% and 98% respectively. The speaker independent results show the neural network system to outperform both the GMM and CNN systems, achieving accuracies of 63% for voicing classification and 79% for voice activity detection.
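
As an illustration of the CNN-based classification stage, here is a minimal PyTorch sketch, not the paper's architecture: a small CNN maps a grayscale mouth-region frame to one of three voicing classes, and VAD is obtained by collapsing unvoiced and voiced into a single speech class. The input size, layer widths, and class ordering are assumptions.

import torch
import torch.nn as nn


class VoicingCNN(nn.Module):
    """Three-way voicing classifier; assumed class indices: 0 = non-speech, 1 = unvoiced, 2 = voiced."""
    def __init__(self, n_classes=3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 16 * 16, 128), nn.ReLU(),
            nn.Linear(128, n_classes),        # logits for non-speech / unvoiced / voiced
        )

    def forward(self, x):                     # x: (batch, 1, 64, 64) grayscale mouth ROI
        return self.classifier(self.features(x))


# Toy usage: voicing logits per frame, with VAD derived from the same output.
model = VoicingCNN()
frames = torch.randn(8, 1, 64, 64)
voicing_logits = model(frames)                # (8, 3)
is_speech = voicing_logits.argmax(dim=1) != 0 # True for unvoiced or voiced frames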
Similarity-and-Independence-Aware Beamformer: Method for Target Source Extraction using Magnitude Spectrogram as Reference
This study presents a novel method for source extraction, referred to as the similarity-and-independence-aware beamformer (SIBF). The SIBF extracts the target signal using a rough magnitude spectrogram as the reference signal. The advantage of the SIBF is that it can obtain a more accurate target signal than the reference spectrogram generated by target-enhancing methods such as speech enhancement based on deep neural networks (DNNs). For the extraction, we extend the framework of deflationary independent component analysis by considering the similarity between the reference and the extracted target, as well as the mutual independence of all potential sources. To solve the extraction problem by maximum-likelihood estimation, we introduce two source model types that can reflect the similarity. Experimental results on the CHiME3 dataset show that the target signal extracted by the SIBF is more accurate than the reference signal generated by the DNN.
Index Terms: semiblind source separation, similarity-and-independence-aware beamformer, deflationary independent component analysis, source model
Comment: Accepted in INTERSPEECH 202
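
The SIBF derivation itself rests on deflationary ICA with dedicated source models; as a heavily simplified sketch of the broader reference-guided extraction idea only, the code below implements a standard mask-based max-SNR (GEV) beamformer in NumPy/SciPy, in which a rough reference magnitude weights the spatial covariance estimates. This is a stand-in technique, not the SIBF, and all array shapes and names are assumptions.

import numpy as np
from scipy.linalg import eigh


def reference_guided_beamformer(X, ref_mag, eps=1e-8):
    """X: multichannel STFT, shape (freq, time, channels).
    ref_mag: rough reference magnitude spectrogram, shape (freq, time).
    Returns a single-channel STFT estimate of the target, shape (freq, time)."""
    F, T, C = X.shape
    # Soft time-frequency mask from the reference: close to 1 where the
    # reference says the target dominates the observed mixture.
    mask = np.clip(ref_mag / (np.abs(X).mean(axis=2) + eps), 0.0, 1.0)
    Y = np.zeros((F, T), dtype=complex)
    for f in range(F):
        Xf = X[f]                                   # (time, channels)
        w_t = mask[f][:, None]                      # (time, 1) mask weights
        # Mask-weighted target and noise spatial covariance matrices.
        Rt = (w_t * Xf).T @ Xf.conj() / (w_t.sum() + eps)
        Rn = ((1.0 - w_t) * Xf).T @ Xf.conj() / ((1.0 - w_t).sum() + eps)
        Rn += eps * np.eye(C)
        # Max-SNR (GEV) filter: principal generalized eigenvector of (Rt, Rn).
        _, vecs = eigh(Rt, Rn)
        w = vecs[:, -1]
        # Per-bin scaling ambiguity is ignored in this sketch.
        Y[f] = Xf @ w.conj()
    return Y


# Toy usage with random data: 257 frequency bins, 50 frames, 4 channels.
X = np.random.randn(257, 50, 4) + 1j * np.random.randn(257, 50, 4)
ref = np.abs(X[:, :, 0])          # pretend the first channel's magnitude is the rough reference
Y = reference_guided_beamformer(X, ref)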
- …