
65 research outputs found

Quaternion Denoising Encoder-Decoder for Theme Identification of Telephone Conversations

Authors: Linarès Georges; Morchid Mohamed; Parcollet Titouan
Publication venue: 'International Speech Communication Association'
Publication date: 20/08/2017

In recent decades, encoder-decoders, or autoencoders (AE), have attracted great interest from researchers because of their ability to construct robust representations of documents in a low-dimensional subspace. Nonetheless, autoencoders reveal little of the internal structure of a spoken document, since they treat the words or topics contained in the document as isolated basic elements, and they tend to overfit on small corpora. Quaternion multi-layer perceptrons (QMLP) have therefore been introduced to capture such internal latent dependencies, whereas denoising autoencoders (DAE) apply different stochastic noises to better process small sets of documents. This paper presents a novel autoencoder, called the "Quaternion denoising encoder-decoder" (QDAE), based on both the previously proposed DAE (to manage small corpora) and the QMLP (to consider internal latent structures). Moreover, the paper defines an original angular Gaussian noise adapted to the specificity of hyper-complex algebra. The experiments, conducted on a theme identification task of spoken dialogues from the DECODA framework, show that the QDAE obtains promising gains of 3% and 1.5% compared to the standard real-valued denoising autoencoder and the QMLP, respectively.

Crossref

Deep quaternion neural networks for spoken language understanding

Authors: Linarès Georges; Morchid Mohamed; Parcollet Titouan
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 16/12/2017

The availability of open-source software is playing a remarkable role in the popularization of speech recognition and deep learning. Kaldi, for instance, is nowadays an established framework used to develop state-of-the-art speech recognizers. PyTorch is used to build neural networks with the Python language and has recently spawned tremendous interest within the machine learning community thanks to its simplicity and flexibility. The PyTorch-Kaldi project aims to bridge the gap between these popular toolkits, trying to inherit the efficiency of Kaldi and the flexibility of PyTorch. PyTorch-Kaldi is not only a simple interface between these toolkits; it also embeds several useful features for developing modern speech recognizers. For instance, the code is specifically designed so that user-defined acoustic models can be plugged in naturally. As an alternative, users can exploit several pre-implemented neural networks that can be customized using intuitive configuration files. PyTorch-Kaldi supports multiple feature and label streams as well as combinations of neural networks, enabling the use of complex neural architectures. The toolkit is publicly released along with rich documentation and is designed to work properly locally or on HPC clusters. Experiments conducted on several datasets and tasks show that PyTorch-Kaldi can effectively be used to develop modern state-of-the-art speech recognizers.

M2H-GAN: A GAN-based Mapping from Machine to Human Transcripts for Speech Understanding

Authors: Bost Xavier; Linarès Georges; Morchid Mohamed; Parcollet Titouan
Publication venue: HAL CCSD
Publication date: 15/09/2019

Deep learning is at the core of recent spoken language understanding (SLU) tasks. More precisely, deep neural networks (DNNs) have drastically increased the performance of SLU systems, and numerous architectures have been proposed. In the real-life context of theme identification of telephone conversations, it is common to hold both a manual, human-produced transcript (TRS) and an automatic transcript (ASR) of each conversation. Nonetheless, due to production constraints, only the ASR transcripts are considered when building automatic classifiers. TRS transcripts are used only to measure the performance of ASR systems. Moreover, the classification accuracy recently obtained by DNN-based systems is close to that reached by humans, and it becomes difficult to increase performance further by considering the ASR transcripts alone. This paper proposes to distill the TRS knowledge available during the training phase into the ASR representation, by using a new generative adversarial network called M2H-GAN to generate a TRS-like version of an ASR document, in order to improve theme identification performance.

Recommended from our members

Learning Speech Emotion Representations in the Quaternion Domain

Authors: Comminiello D.; Guizzo E.; Scardapane S.; Weyde T.
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/03/2023

The modeling of human emotion expression in speech signals is an important, yet challenging task. The high resource demand of speech emotion recognition models, combined with the general scarcity of emotion-labelled data, are obstacles to the development and application of effective solutions in this field. In this paper, we present an approach to jointly circumvent these difficulties. Our method, named RH-emo, is a novel semi-supervised architecture aimed at extracting quaternion embeddings from real-valued monaural spectrograms, enabling the use of quaternion-valued networks for speech emotion recognition tasks. RH-emo is a hybrid real/quaternion autoencoder network that consists of a real-valued encoder in parallel to a real-valued emotion classifier and a quaternion-valued decoder. On the one hand, the classifier permits the optimization of each latent axis of the embeddings for the classification of a specific emotion-related characteristic: valence, arousal, dominance, and overall emotion. On the other hand, quaternion reconstruction enables the latent dimension to develop intra-channel correlations that are required for an effective representation as a quaternion entity. We test our approach on speech emotion recognition tasks using four popular datasets: IEMOCAP, RAVDESS, EmoDB, and TESS, comparing the performance of three well-established real-valued CNN architectures (AlexNet, ResNet-50, VGG) and their quaternion-valued equivalents fed with the embeddings created with RH-emo. We obtain a consistent improvement in the test accuracy for all datasets, while drastically reducing the resource demand of the models. Moreover, we performed additional experiments and ablation studies that confirm the effectiveness of our approach.

City Research Online

A study into automatic speaker verification with aspects of deep learning

Author: Jellyman Keith Andrew
Publication date: 01/07/2018

Advancements in automatic speaker verification (ASV) can be considered to be primarily limited to improvements in modelling and classification techniques, capable of capturing ever larger amounts of speech data. This thesis begins by presenting a fairly extensive review of developments in ASV, up to the current state of the art with i-vectors and PLDA. A series of practical tuning experiments then follows. It is found, somewhat surprisingly, that even the training of the total variability matrix required for i-vector extraction is potentially susceptible to unwanted variabilities. The thesis then explores the use of deep learning in ASV. A literature review is first made, with two training methodologies appearing evident: indirectly, using a deep neural network trained for automatic speech recognition, and directly, with speaker-related output classes. The review finds that interest in direct training appears to be increasing, underpinned by the intent to discover new robust 'speaker embedding' representations. Lastly, a preliminary experiment is presented, investigating the use of a deep convolutional network for speaker identification. The small set of results shows that the network successfully identifies two test speakers out of 84 possible enrolled speakers. It is hoped that subsequent research might lead to new robust speaker representations or features.

University of Birmingham Research Archive, E-theses Repository

Breaking Down the Barriers To Operator Workload Estimation: Advancing Algorithmic Handling of Temporal Non-Stationarity and Cross-Participant Differences for EEG Analysis Using Deep Learning

Author: Hefron Ryan G.
Publication venue: AFIT Scholar
Publication date: 01/09/2018

This research focuses on two barriers to using EEG data for workload assessment: day-to-day variability, and cross-participant applicability. Several signal processing techniques and deep learning approaches are evaluated in multi-task environments. These methods account for temporal, spatial, and frequential data dependencies. Variance of frequency-domain power distributions for cross-day workload classification is statistically significant. Skewness and kurtosis are not significant in an environment absent workload transitions, but are salient with transitions present. LSTMs improve day-to-day feature stationarity, decreasing error by 59% compared to previous best results. A multi-path convolutional recurrent model using bi-directional, residual recurrent layers significantly increases predictive accuracy and decreases cross-participant variance. Deep learning regression approaches are applied to a multi-task environment with workload transitions. Accounting for temporal dependence significantly reduces error and increases correlation compared to baselines. Visualization techniques for LSTM feature saliency are developed to understand EEG analysis model biases.

AFIT Scholar (Air Force Institute of Technology)

eVisits in the digital era of Swedish primary care

Author: Entezarjou Artin
Publication venue: Lund University, Faculty of Medicine
Publication date: 01/01/2022

Objective: To evaluate asynchronous digital visits (eVisits) with regard to digital communication, clinical decision-making, and subsequent care utilization in the digital era of primary care in Sweden.
Methods: A mixed-methods approach was adopted across the various papers in the thesis, with all studies evaluating the eVisit platform Flow in various clinical contexts.
- Paper I was a comparative study of digital triage decisions when presented with automated patient history reports generated by the platform. Inter-rater reliability of triage decisions by majority vote in a panel of five physicians was compared to triage decisions by a machine learning model trained using data labelled by an expert primary care physician.
- Paper II was a qualitative focus group study of nurse and physician experiences of digital communication at three primary health care centers using the platform. Themes were generated using qualitative content analysis as described by Graneheim and Lundman.
- Papers III and IV were observational studies comparing office visits in the Skåne Region from Capio, a large private health care provider, to eVisit patients from Capio Go, a national eVisit service. Adult patients with a chief complaint of sore throat, dysuria, or cough/common cold/influenza were recruited. eVisit patients were recruited prospectively digitally prior to their eVisit, while the office visit control group was recruited retrospectively using letters. Paper III primarily compared antibiotic prescription rates per sore throat visit, while paper IV primarily compared subsequent physical health care utilization within two weeks for patients in the Skåne Region.
Results: Inter-rater reliability was low (Cohen κ 0.17) between the panel majority vote and the machine learning model. Physicians and nurses experienced digitally filtered primary care, adjusting to a novel medium of communication highlighting challenges in interpreting symptoms through text as well as alterations in practice workflow using asynchronous communication. Antibiotics prescription rate within three days was not higher after eVisits compared to office visits (169/798 (21.2%) vs. 124/312 (39.7%) for sore throat, respectively; P<.001). No significant differences in subsequent physical visits within two weeks (excluding the first 48 h of expected “digi-physical” care) were noted following eVisits compared to office visits (179 (18.0%) vs. 102 (17.6%); P = .854).
Conclusions: eVisits do not seem to be associated with over-prescription of antibiotics, or over-utilization of physical health care when assessing common infectious symptoms. Given staff experiencing uncertainties in interpretation of symptoms and triage decisions being inconsistent, eVisits may be best used as one of many modalities to access primary care, with focus placed on facilitating patient-centered professional judgement by staff, rather than automation of complex decisions.

Lund University Publications