468 research outputs found

Unsupervised crosslingual adaptation of tokenisers for spoken language recognition

Author: Raymond W.M. Ng
Mauro Nicolao
Thomas Hain
Publication venue: 'Elsevier BV'
Publication date: 01/11/2017

Phone tokenisers are used in spoken language recognition (SLR) to obtain elementary phonetic information. We present a study on the use of deep neural network tokenisers. Unsupervised crosslingual adaptation was performed to adapt the baseline tokeniser, trained on English conversational telephone speech data, to different languages. Two training and adaptation approaches, namely cross-entropy adaptation and state-level minimum Bayes risk adaptation, were tested in a bottleneck i-vector system and a phonotactic SLR system. The SLR systems using the tokenisers adapted to different languages were combined using score fusion, giving a 7-18% reduction in minimum detection cost function (minDCF) compared with the baseline configurations without adapted tokenisers. Analysis of the results showed that the ensemble tokenisers gave diverse representations of phonemes, thus bringing complementary effects when SLR systems with different tokenisers were combined. SLR performance was also shown to be related to the quality of the adapted tokenisers.
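The abstract does not say how the score fusion was carried out. As a rough illustration only, the sketch below assumes a simple weighted linear fusion of per-language scores from several SLR systems; the function name `fuse_scores`, the equal default weights, and the example scores are all hypothetical and are not taken from the paper.

```python
from typing import List, Optional, Sequence


def fuse_scores(
    system_scores: Sequence[Sequence[float]],
    weights: Optional[Sequence[float]] = None,
) -> List[float]:
    """Linearly fuse per-language scores from several systems.

    system_scores[k][i] is the score system k gives to language i.
    This is an assumed, generic fusion rule, not the paper's method.
    """
    n_systems = len(system_scores)
    if n_systems == 0:
        raise ValueError("at least one system is required")
    if weights is None:
        # Assumption: equal weights when none are supplied.
        weights = [1.0 / n_systems] * n_systems
    if len(weights) != n_systems:
        raise ValueError("one weight per system is required")
    n_languages = len(system_scores[0])
    if any(len(scores) != n_languages for scores in system_scores):
        raise ValueError("all systems must score the same languages")
    # Weighted sum over systems, computed separately for each language.
    return [
        sum(w * scores[i] for w, scores in zip(weights, system_scores))
        for i in range(n_languages)
    ]


# Hypothetical scores from two systems over two candidate languages.
fused = fuse_scores([[2.0, -1.0], [0.0, 1.0]])
```

In practice the fusion weights would usually be estimated on held-out data rather than fixed by hand.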

Crossref

Biblioteca Digital de la Comunidad de Madrid

White Rose Research Online

Error Correction based on Error Signatures applied to automatic speech recognition

Author: Telaar Dominic
Publication venue: KIT-Bibliothek, Karlsruhe
Publication date: 01/01/2015

KITopen

Language modelling for speaker diarization in telephonic interviews

Author: Hernando Pericás Francisco Javier
India Massana Miquel Àngel
Rodríguez Fonollosa José Adrián
Publication venue: 'Elsevier BV'
Publication date: 01/03/2023

The aim of this paper is to investigate the benefit of combining language and acoustic modelling for speaker diarization. Although conventional systems use only acoustic features, in some scenarios linguistic data contain highly discriminative speaker information that is even more reliable than the acoustic data. In this study we analyze how an appropriate fusion of both kinds of features can obtain good results in these cases. The proposed system is based on an iterative algorithm in which an LSTM network is used as a speaker classifier. The network is fed with character-level word embeddings and a GMM-based acoustic score created with the output labels from previous iterations. The presented algorithm has been evaluated on a Call-Center database, which is composed of audio recordings of telephone interviews. The combination of acoustic features and linguistic content shows an 84.29% improvement in terms of word-level DER compared to an HMM/VB baseline system. The results of this study confirm that linguistic content can be efficiently used for some speaker recognition tasks. This work was partially supported by the Spanish Project DeepVoice (TEC2015-69266-P) and by the project PID2019-107579RBI00/ AEI /10.13039/501100011033. Peer Reviewed. Postprint (published version).

UPCommons. Portal del coneixement obert de la UPC

A Review of Deep Learning Techniques for Speech Processing

Author: Bhardwaj Rishabh
Majumder Navonil
Mehrish Ambuj
Mihalcea Rada
Poria Soujanya
Publication date: 01/05/2023

The field of speech processing has undergone a transformative shift with the advent of deep learning. The use of multiple processing layers has enabled the creation of models capable of extracting intricate features from speech data. This development has paved the way for unparalleled advancements in speech recognition, text-to-speech synthesis, automatic speech recognition, and emotion recognition, propelling the performance of these tasks to unprecedented heights. The power of deep learning techniques has opened up new avenues for research and innovation in the field of speech processing, with far-reaching implications for a range of industries and applications. This review paper provides a comprehensive overview of the key deep learning models and their applications in speech-processing tasks. We begin by tracing the evolution of speech processing research, from early approaches, such as MFCC and HMM, to more recent advances in deep learning architectures, such as CNNs, RNNs, transformers, conformers, and diffusion models. We categorize the approaches and compare their strengths and weaknesses for solving speech-processing tasks. Furthermore, we extensively cover various speech-processing tasks, datasets, and benchmarks used in the literature and describe how different deep-learning networks have been utilized to tackle these tasks. Additionally, we discuss the challenges and future directions of deep learning in speech processing, including the need for more parameter-efficient, interpretable models and the potential of deep learning for multimodal speech processing. By examining the field's evolution, comparing and contrasting different approaches, and highlighting future directions and challenges, we hope to inspire further research in this exciting and rapidly advancing field.

arXiv.org e-Print Archive