1,351 research outputs found
Improved Conditional Generative Adversarial Net Classification For Spoken Language Recognition
Recent research on generative adversarial nets (GAN) for language identification (LID) has shown promising results. In this paper, we further exploit the latent abilities of GAN networks to firstly combine them with deep neural network (DNN)-based i-vector approaches and then to improve the LID model using conditional generative adversarial net (cGAN) classification. First, phoneme dependent deep bottleneck features (DBF) combined with output posteriors of a pre-trained DNN for automatic speech recognition (ASR) are used to extract i-vectors in the normal way. These i-vectors are then classified using cGAN, and we show an effective method within the cGAN to optimize parameters by combining both language identification and verification signals as supervision. Results show firstly that cGAN methods can significantly outperform DBF DNN i-vector methods where 49-dimensional i-vectors are used, but not where 600-dimensional vectors are used. Secondly, training a cGAN discriminator network for direct classification has further benefit for low dimensional i-vectors as well as short utterances with high dimensional i-vectors. However, incorporating a dedicated discriminator network output layer for classification and optimizing both classification and verification loss brings benefits in all test cases
Self-imitating Feedback Generation Using GAN for Computer-Assisted Pronunciation Training
Self-imitating feedback is an effective and learner-friendly method for
non-native learners in Computer-Assisted Pronunciation Training. Acoustic
characteristics in native utterances are extracted and transplanted onto
learner's own speech input, and given back to the learner as a corrective
feedback. Previous works focused on speech conversion using prosodic
transplantation techniques based on PSOLA algorithm. Motivated by the visual
differences found in spectrograms of native and non-native speeches, we
investigated applying GAN to generate self-imitating feedback by utilizing
generator's ability through adversarial training. Because this mapping is
highly under-constrained, we also adopt cycle consistency loss to encourage the
output to preserve the global structure, which is shared by native and
non-native utterances. Trained on 97,200 spectrogram images of short utterances
produced by native and non-native speakers of Korean, the generator is able to
successfully transform the non-native spectrogram input to a spectrogram with
properties of self-imitating feedback. Furthermore, the transformed spectrogram
shows segmental corrections that cannot be obtained by prosodic
transplantation. Perceptual test comparing the self-imitating and correcting
abilities of our method with the baseline PSOLA method shows that the
generative approach with cycle consistency loss is promising
Reimagining Speech: A Scoping Review of Deep Learning-Powered Voice Conversion
Research on deep learning-powered voice conversion (VC) in speech-to-speech
scenarios is getting increasingly popular. Although many of the works in the
field of voice conversion share a common global pipeline, there is a
considerable diversity in the underlying structures, methods, and neural
sub-blocks used across research efforts. Thus, obtaining a comprehensive
understanding of the reasons behind the choice of the different methods in the
voice conversion pipeline can be challenging, and the actual hurdles in the
proposed solutions are often unclear. To shed light on these aspects, this
paper presents a scoping review that explores the use of deep learning in
speech analysis, synthesis, and disentangled speech representation learning
within modern voice conversion systems. We screened 621 publications from more
than 38 different venues between the years 2017 and 2023, followed by an
in-depth review of a final database consisting of 123 eligible studies. Based
on the review, we summarise the most frequently used approaches to voice
conversion based on deep learning and highlight common pitfalls within the
community. Lastly, we condense the knowledge gathered, identify main challenges
and provide recommendations for future research directions
M2H-GAN: A GAN-based Mapping from Machine to Human Transcripts for Speech Understanding
International audienceDeep learning is at the core of recent spoken language understanding (SLU) related tasks. More precisely, deep neu-ral networks (DNNs) drastically increased the performances of SLU systems, and numerous architectures have been proposed. In the real-life context of theme identification of telephone conversations , it is common to hold both a human, manual (TRS) and an automatically transcribed (ASR) versions of the conversations. Nonetheless, and due to production constraints, only the ASR transcripts are considered to build automatic classi-fiers. TRS transcripts are only used to measure the performances of ASR systems. Moreover, the recent performances in term of classification accuracy, obtained by DNN related systems are close to the performances reached by humans, and it becomes difficult to further increase the performances by only considering the ASR transcripts. This paper proposes to dis-tillates the TRS knowledge available during the training phase within the ASR representation, by using a new generative adver-sarial network called M2H-GAN to generate a TRS-like version of an ASR document, to improve the theme identification performances
SEGAN: Speech Enhancement Generative Adversarial Network
Current speech enhancement techniques operate on the spectral domain and/or
exploit some higher-level feature. The majority of them tackle a limited number
of noise conditions and rely on first-order statistics. To circumvent these
issues, deep networks are being increasingly used, thanks to their ability to
learn complex functions from large example sets. In this work, we propose the
use of generative adversarial networks for speech enhancement. In contrast to
current techniques, we operate at the waveform level, training the model
end-to-end, and incorporate 28 speakers and 40 different noise conditions into
the same model, such that model parameters are shared across them. We evaluate
the proposed model using an independent, unseen test set with two speakers and
20 alternative noise conditions. The enhanced samples confirm the viability of
the proposed model, and both objective and subjective evaluations confirm the
effectiveness of it. With that, we open the exploration of generative
architectures for speech enhancement, which may progressively incorporate
further speech-centric design choices to improve their performance.Comment: 5 pages, 4 figures, accepted in INTERSPEECH 201
- …