1,000 research outputs found
Collaborative Deep Learning for Speech Enhancement: A Run-Time Model Selection Method Using Autoencoders
We show that a Modular Neural Network (MNN) can combine various speech
enhancement modules, each of which is a Deep Neural Network (DNN) specialized
on a particular enhancement job. Differently from an ordinary ensemble
technique that averages variations in models, the propose MNN selects the best
module for the unseen test signal to produce a greedy ensemble. We see this as
Collaborative Deep Learning (CDL), because it can reuse various already-trained
DNN models without any further refining. In the proposed MNN selecting the best
module during run time is challenging. To this end, we employ a speech
AutoEncoder (AE) as an arbitrator, whose input and output are trained to be as
similar as possible if its input is clean speech. Therefore, the AE can gauge
the quality of the module-specific denoised result by seeing its AE
reconstruction error, e.g. low error means that the module output is similar to
clean speech. We propose an MNN structure with various modules that are
specialized on dealing with a specific noise type, gender, and input
Signal-to-Noise Ratio (SNR) value, and empirically prove that it almost always
works better than an arbitrarily chosen DNN module and sometimes as good as an
oracle result
Semi-Supervised Acoustic Model Training by Discriminative Data Selection from Multiple ASR Systems' Hypotheses
While the performance of ASR systems depends on the size of the training data, it is very costly to prepare accurate and faithful transcripts. In this paper, we investigate a semisupervised training scheme, which takes the advantage of huge quantities of unlabeled video lecture archive, particularly for the deep neural network (DNN) acoustic model. In the proposed method, we obtain ASR hypotheses by complementary GMM-and DNN-based ASR systems. Then, a set of CRF-based classifiers is trained to select the correct hypotheses and verify the selected data. The proposed hypothesis combination shows higher quality compared with the conventional system combination method (ROVER). Moreover, compared with the conventional data selection based on confidence measure score, our method is demonstrated more effective for filtering usable data. Significant improvement in the ASR accuracy is achieved over the baseline system and in comparison with the models trained with the conventional system combination and data selection methods
Mixture of Expert/Imitator Networks: Scalable Semi-supervised Learning Framework
The current success of deep neural networks (DNNs) in an increasingly broad
range of tasks involving artificial intelligence strongly depends on the
quality and quantity of labeled training data. In general, the scarcity of
labeled data, which is often observed in many natural language processing
tasks, is one of the most important issues to be addressed. Semi-supervised
learning (SSL) is a promising approach to overcoming this issue by
incorporating a large amount of unlabeled data. In this paper, we propose a
novel scalable method of SSL for text classification tasks. The unique property
of our method, Mixture of Expert/Imitator Networks, is that imitator networks
learn to "imitate" the estimated label distribution of the expert network over
the unlabeled data, which potentially contributes a set of features for the
classification. Our experiments demonstrate that the proposed method
consistently improves the performance of several types of baseline DNNs. We
also demonstrate that our method has the more data, better performance property
with promising scalability to the amount of unlabeled data.Comment: Accepted by AAAI 201
Sequence Teacher-Student Training of Acoustic Models for Automatic Free Speaking Language Assessment
A high performance automatic speech recognition (ASR) system is
an important constituent component of an automatic language assessment system for free speaking language tests. The ASR system
is required to be capable of recognising non-native spontaneous English
speech and to be deployable under real-time conditions. The
performance of ASR systems can often be significantly improved by
leveraging upon multiple systems that are complementary, such as an
ensemble. Ensemble methods, however, can be computationally expensive,
often requiring multiple decoding runs, which makes them
impractical for deployment. In this paper, a lattice-free implementation
of sequence-level teacher-student training is used to reduce this
computational cost, thereby allowing for real-time applications. This
method allows a single student model to emulate the performance of
an ensemble of teachers, but without the need for multiple decoding
runs. Adaptations of the student model to speakers from different
first languages (L1s) and grades are also explored.Cambridge Assessment Englis
- …