7 research outputs found

Quaternion Denoising Encoder-Decoder for Theme Identification of Telephone Conversations

Authors: Linarès Georges, Morchid Mohamed, Parcollet Titouan
Publication venue: International Speech Communication Association
Publication date: 20/08/2017
In recent decades, encoder-decoders, or autoencoders (AE), have attracted great interest from researchers because of their ability to construct robust representations of documents in a low-dimensional subspace. Nonetheless, autoencoders reveal little of the internal structure of a spoken document, since they treat the words or topics contained in the document as isolated basic elements, and they tend to overfit on small corpora. Quaternion Multi-layer Perceptrons (QMLP) have therefore been introduced to capture such internal latent dependencies, whereas denoising autoencoders (DAE) apply different stochastic noises to better process small sets of documents. This paper presents a novel autoencoder, called the "Quaternion denoising encoder-decoder" (QDAE), based on both the previously proposed DAE (to manage small corpora) and the QMLP (to consider internal latent structures). Moreover, the paper defines an original angular Gaussian noise adapted to the specificity of hyper-complex algebra. The experiments, conducted on a theme identification task of spoken dialogues from the DECODA framework, show that the QDAE obtains promising gains of 3% and 1.5% compared to the standard real-valued denoising autoencoder and the QMLP, respectively.
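For readers unfamiliar with the hyper-complex algebra the abstract refers to, the sketch below shows the Hamilton product, the standard multiplication rule for quaternions. It is only an illustration of the algebra: the function name `hamilton_product` and the tuple layout `(real, i, j, k)` are assumptions made here, and the abstract does not describe how the QDAE or its angular Gaussian noise is implemented.

```python
def hamilton_product(p, q):
    """Multiply two quaternions given as (real, i, j, k) tuples.

    Illustrative only: the tuple layout is an assumption of this sketch,
    not something specified in the abstract.
    """
    a1, b1, c1, d1 = p
    a2, b2, c2, d2 = q
    return (
        a1 * a2 - b1 * b2 - c1 * c2 - d1 * d2,  # real part
        a1 * b2 + b1 * a2 + c1 * d2 - d1 * c2,  # i component
        a1 * c2 - b1 * d2 + c1 * a2 + d1 * b2,  # j component
        a1 * d2 + b1 * c2 - c1 * b2 + d1 * a2,  # k component
    )


# The product is not commutative: i * j = k, but j * i = -k.
i = (0, 1, 0, 0)
j = (0, 0, 1, 0)
print(hamilton_product(i, j))
print(hamilton_product(j, i))
```

Because each output component mixes all four input components, a quaternion-valued layer ties together the four features grouped in one quaternion, which is one common way to motivate the "internal latent dependencies" mentioned above.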

Crossref

M2H-GAN: A GAN-based Mapping from Machine to Human Transcripts for Speech Understanding

Authors: Bost Xavier, Linarès Georges, Morchid Mohamed, Parcollet Titouan
Publication venue: HAL CCSD
Publication date: 15/09/2019
Deep learning is at the core of recent spoken language understanding (SLU) tasks. More precisely, deep neural networks (DNNs) have drastically increased the performance of SLU systems, and numerous architectures have been proposed. In the real-life context of theme identification of telephone conversations, it is common to hold both a human, manual (TRS) and an automatically transcribed (ASR) version of the conversations. Nonetheless, due to production constraints, only the ASR transcripts are considered when building automatic classifiers. TRS transcripts are only used to measure the performance of ASR systems. Moreover, the classification accuracy recently obtained by DNN-based systems is close to that reached by humans, and it becomes difficult to further increase performance by considering only the ASR transcripts. This paper proposes to distill the TRS knowledge available during the training phase into the ASR representation, using a new generative adversarial network called M2H-GAN to generate a TRS-like version of an ASR document and thereby improve theme identification performance.

Enhancing the Generalization of Convolutional Neural Networks for Speech Emotion Recognition

Author: Guizzo E.

Human-machine interaction is rapidly gaining significance in our daily lives. While speech recognition has achieved near-human performance in recent years, the intricate details embedded in speech extend beyond the mere arrangement of words. Speech Emotion Recognition (SER) is therefore acquiring a growing role in this field by decoding not only the linguistic content but also the emotional nuances of human spoken communication, thereby enabling a more exhaustive comprehension of the information conveyed by speech signals. Despite the success that neural networks have already achieved in this task, SER is still challenging due to the variability of emotional expression, especially in real-world scenarios where generalization to unseen speakers and contexts is required. In addition, the high resource demand of SER models, combined with the scarcity of emotion-labelled data, hinders the development and application of effective solutions in this field. In this thesis, we present multiple approaches to overcome the aforementioned difficulties. We first introduce a multiple-time-scale (MTS) convolutional neural network architecture to create flexibility towards temporal variations when analyzing time-frequency representations of audio data. We show that resilience to speed fluctuations is relevant in SER tasks, since emotion is expressed through complex spectral patterns that can exhibit significant local dilation and compression on the time axis depending on speaker and context. The results indicate that the use of MTS consistently improves the generalization of networks of different capacity and depth, compared to standard convolution. In a second stage, we propose a more general approach to discourage unwanted sensitivity towards specific target properties in CNNs, introducing the novel concept of anti-transfer learning.
While transfer learning assumes that the learning process for a target task will benefit from re-using representations learned for another task, anti-transfer avoids the learning of representations that have been learned for an orthogonal task, i.e., one that is not relevant and potentially confounding for the target task, such as speaker identity and speech content for emotion recognition. In anti-transfer learning we penalize similarity between activations of a network being trained and another network previously trained on an orthogonal task. This leads to better generalization and provides a degree of control over correlations that are spurious or undesirable. We show that anti-transfer actually leads to the intended invariance to the orthogonal task and to more appropriate feature maps for the target task at hand. Anti-transfer creates a computation and memory cost at training time, but it enables the reuse of pre-trained models. In order to avoid the high resource demand of SER models in general and anti-transfer learning specifically, we propose RH-emo, a novel semi-supervised architecture aimed at extracting quaternion embeddings from real-valued monaural spectrograms, enabling the use of quaternion-valued networks for SER tasks. RH-emo is a hybrid real/quaternion autoencoder network that consists of a real-valued encoder in parallel to a real-valued emotion classifier and a quaternion-valued decoder. We show that the use of RH-emo, combined with quaternion convolutional neural networks, provides a consistent improvement in SER tasks, while requiring far fewer trainable parameters and therefore substantially reducing the resource demand of SER models. Finally, we apply anti-transfer learning to quaternion-valued neural networks fed with RH-emo embeddings.
We demonstrate that the combination of the two approaches maintains the disentanglement properties of anti-transfer, while using a reduced amount of memory, computation, and training time, making this a suitable approach for SER scenarios with limited resources and where context and speaker independence are needed.
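The anti-transfer idea described in this abstract, penalizing similarity between the activations of the network being trained and those of a network pre-trained on an orthogonal task, can be sketched as an extra term added to the task loss. The sketch below is a minimal illustration under stated assumptions: the abstract does not say which similarity measure is used, so squared cosine similarity is swapped in here, and the names `cosine_similarity`, `anti_transfer_loss`, and the weight `beta` are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length activation vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat an all-zero vector as dissimilar to anything
    return dot / (norm_u * norm_v)


def anti_transfer_loss(task_loss, trained_acts, frozen_acts, beta=1.0):
    """Task loss plus a penalty on activation similarity.

    trained_acts: activations of the network being trained.
    frozen_acts:  activations of the network pre-trained on the
                  orthogonal task (kept fixed during training).
    beta:         hypothetical weight of the penalty term.
    Squared cosine similarity is an assumption of this sketch, not
    necessarily the measure used in the thesis.
    """
    penalty = cosine_similarity(trained_acts, frozen_acts) ** 2
    return task_loss + beta * penalty


# Orthogonal activations add no penalty; aligned ones add the full weight.
print(anti_transfer_loss(0.5, [1.0, 0.0], [0.0, 1.0]))
print(anti_transfer_loss(0.5, [1.0, 2.0], [2.0, 4.0], beta=2.0))
```

Evaluating the frozen network on every training batch is what accounts for the extra computation and memory cost at training time that the abstract mentions.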

City Research Online