2 research outputs found
Adversarial Training for Multilingual Acoustic Modeling
Multilingual training has been shown to improve acoustic modeling performance
by sharing and transferring knowledge in modeling different languages.
Knowledge sharing is usually achieved by using common lower-level layers for
different languages in a deep neural network. Recently, the domain adversarial
network was proposed to reduce domain mismatch of training data and learn
domain-invariant features. It is thus worth exploring whether adversarial
training can further promote knowledge sharing in multilingual models. In this
work, we apply the domain adversarial network to encourage the shared layers of
a multilingual model to learn language-invariant features. Bidirectional Long
Short-Term Memory (LSTM) recurrent neural networks (RNN) are used as building
blocks. We show that shared layers learned this way contain less language
identification information and lead to better performance. In an automatic
speech recognition task for seven languages, the resultant acoustic model
improves the word error rate (WER) of the multilingual model by 4% relative on
average, and that of the monolingual models by 10%.
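The abstract does not say how the adversarial signal reaches the shared layers. A common choice in domain-adversarial networks is a gradient reversal layer, and the sketch below assumes that mechanism. It uses a hypothetical one-weight "shared layer" `w`, a one-weight language classifier `v`, and a reversal strength `lam`; none of these names or values come from the paper, and the gradients are written out by hand so the example needs no deep learning library.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def adversarial_grads(x: float, lang: int, w: float, v: float, lam: float):
    """Gradients of a binary language-classification loss with gradient reversal.

    Forward pass (assumed toy model, not the paper's architecture):
        h = w * x              shared feature
        z = v * GRL(h)         GRL is the identity in the forward pass
        loss = cross-entropy(sigmoid(z), lang)
    Backward pass: GRL multiplies the gradient flowing into h by -lam.
    """
    h = w * x
    p = sigmoid(v * h)
    dloss_dz = p - lang                 # derivative of the loss w.r.t. the logit
    grad_v = dloss_dz * h               # classifier receives the ordinary gradient
    dloss_dh = dloss_dz * v             # gradient arriving at the reversal layer
    grad_w = (-lam * dloss_dh) * x      # reversed gradient reaches the shared layer
    return grad_w, grad_v


# Hypothetical values chosen only for illustration.
grad_w, grad_v = adversarial_grads(x=1.0, lang=1, w=0.5, v=2.0, lam=1.0)
print(grad_w, grad_v)
```

With these values a gradient-descent step on `v` makes the language classifier more accurate, while the same step on `w` uses the sign-flipped gradient and so pushes the shared feature toward carrying less language-identification information.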
End-to-end Domain-Adversarial Voice Activity Detection
Voice activity detection is the task of detecting speech regions in a given
audio stream or recording. First, we design a neural network combining
trainable filters and recurrent layers to tackle voice activity detection
directly from the waveform. Experiments on the challenging DIHARD dataset show
that the proposed end-to-end model reaches state-of-the-art performance and
outperforms a variant where trainable filters are replaced by standard cepstral
coefficients. Our second contribution aims at making the proposed voice
activity detection model robust to domain mismatch. To that end, a domain
classification branch is added to the network and trained in an adversarial
manner. The same DIHARD dataset, drawn from 11 different domains, is used for
evaluation under two scenarios. In the in-domain scenario where the training
and test sets cover the exact same domains, we show that the domain-adversarial
approach does not degrade performance of the proposed end-to-end model. In the
out-domain scenario where the test domain is different from training domains,
it brings a relative improvement of more than 10%. Finally, we provide a fully
reproducible open-source pipeline that can be easily adapted to other
datasets.
Comment: submitted to Interspeech 202
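Both abstracts report gains as relative improvements of an error metric. As a reminder of what that means, the sketch below computes a relative reduction from a baseline and an improved error value. The function name and the numbers are hypothetical and are not taken from either paper.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Relative reduction of an error metric, as a fraction of the baseline."""
    if baseline <= 0:
        raise ValueError("baseline error must be positive")
    return (baseline - improved) / baseline


# Hypothetical error rates (in percent), for illustration only.
baseline_error = 20.0
adapted_error = 18.0
print(f"{relative_reduction(baseline_error, adapted_error):.1%}")  # prints 10.0%
```

In this made-up case an absolute drop of 2 points from a 20% baseline corresponds to a 10% relative improvement.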