Search CORE

236 research outputs found

Context-Dependent Acoustic Modeling without Explicit Phone Clustering

Author: Beck Eugen
Ney Hermann
Raissi Tina
Schlüter Ralf
Publication venue
Publication date: 15/05/2020
Field of study

Phoneme-based acoustic modeling of large vocabulary automatic speech recognition takes advantage of phoneme context. The large number of context-dependent (CD) phonemes and their highly varying statistics require tying or smoothing to enable robust training. Usually, Classification and Regression Trees are used for phonetic clustering, which is standard in Hidden Markov Model (HMM)-based systems. However, this solution introduces a secondary training objective and does not allow for end-to-end training. In this work, we address a direct phonetic context modeling for the hybrid Deep Neural Network (DNN)/HMM, that does not build on any phone clustering algorithm for the determination of the HMM state inventory. By performing different decompositions of the joint probability of the center phoneme state and its left and right contexts, we obtain a factorized network consisting of different components, trained jointly. Moreover, the representation of the phonetic context for the network relies on phoneme embeddings. The recognition accuracy of our proposed models on the Switchboard task is comparable and outperforms slightly the hybrid model using the standard state-tying decision trees.Comment: Submitted to Interspeech 202

arXiv.org e-Print Archive

Crossref

Speech vocoding for laboratory phonology

Author: Benus Stefan
Cernak Milos
Lazaridis Alexandros
Publication venue: 'Elsevier BV'
Publication date: 19/05/2015
Field of study

Using phonological speech vocoding, we propose a platform for exploring relations between phonology and speech processing, and in broader terms, for exploring relations between the abstract and physical structures of a speech signal. Our goal is to make a step towards bridging phonology and speech processing and to contribute to the program of Laboratory Phonology. We show three application examples for laboratory phonology: compositional phonological speech modelling, a comparison of phonological systems and an experimental phonological parametric text-to-speech (TTS) system. The featural representations of the following three phonological systems are considered in this work: (i) Government Phonology (GP), (ii) the Sound Pattern of English (SPE), and (iii) the extended SPE (eSPE). Comparing GP- and eSPE-based vocoded speech, we conclude that the latter achieves slightly better results than the former. However, GP - the most compact phonological speech representation - performs comparably to the systems with a higher number of phonological features. The parametric TTS based on phonological speech representation, and trained from an unlabelled audiobook in an unsupervised manner, achieves intelligibility of 85% of the state-of-the-art parametric speech synthesis. We envision that the presented approach paves the way for researchers in both fields to form meaningful hypotheses that are explicitly testable using the concepts developed and exemplified in this paper. On the one hand, laboratory phonologists might test the applied concepts of their theoretical models, and on the other hand, the speech processing community may utilize the concepts developed for the theoretical phonological models for improvements of the current state-of-the-art applications

arXiv.org e-Print Archive

Infoscience - École polytechnique fédérale de Lausanne

Combination of Acoustic Classifiers Based on Dempster-Shafer Theory of Evidence

Author
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2007
Field of study

Crossref

Graphical Models for Multi-dialect Arabic Isolated Words Recognition

Author: BenAyed Yassine
Gargouri Faiez
Zarrouk Elyes
Publication venue: The Authors. Published by Elsevier B.V.
Publication date: 31/12/2015
Field of study

AbstractThis paper presents the use of multiple hybrid systems for the recognition of isolated words from a large multi-dialect Arabic vocabulary. Such as the Hidden Markov models (HMM), Dynamic Bayesian networks (DBN) lack a discriminatory ability especially on speech recognition even if their progress is huge. Multi-Layer perceptrons (MLP) was applied in literature as an estimator of emission probabilities in HMM and proves it effectiveness. In order to ameliorate the results of recognition systems, we apply Support Vectors Machine (SVM) as an estimator of posterior probabilities since they are characterized by a high predictive power and discrimination. Moreover, they are based on a structural risk minimization (SRM) where the aim is to set up a classifier that minimizes a bound on the expected risk, rather than the empirical risk. In this work we have done a comparative study between three hybrid systems MLP/HMM, SVM/HMM and SVM/DBN and the standards models of HMM and DBN. In this paper, we describe the use of the hybrid model SVM/DBN for multi-dialect Arabic isolated words recognition. So, by using 67,132 speech files of Arabic isolated words, this work arises a comparative study of our acknowledgment system of it as the following: the use of especially the HMM standards leads to a recognition rate of 74.18%.as the average rate of 8 domains for everyone of the 4 dialects. Also, with the hybrid systems MLP/HMM and SVM/HMM we succeed in achieving the value of 77.74%.and 7806% respectively. Moreover, our proposed system SVM/DBN realizes the best performances, whereby, we achieve 87.67% as a recognition rate more than 83.01% obtained by GMM/DBN

Elsevier - Publisher Connector

Towards using hierarchical posteriors for flexible automatic speech recognition systems

Author: Bengio Samy
Bourlard Hervé
Magimai.-Doss Mathew
Mesot Bertrand
Morgan Nelson
Zhu Qifeng
Publication venue: IDIAP
Publication date: 10/03/2006
Field of study

Local state (or phone) posterior probabilities are often investigated as local classifiers (e.g., hybrid HMM/ANN systems) or as transformed acoustic features (e.g., ``Tandem'') towards improved speech recognition systems. In this paper, we present initial results towards boosting these approaches by improving the local state, phone, or word posterior estimates, using all possible acoustic information (as available in the whole utterance), as well as possible prior information (such as topological constraints). Furthermore, this approach results in a family of new HMM based systems, where only (local and global) posterior probabilities are used, while also providing a new, principled, approach towards a hierarchical use/integration of these posteriors, from the frame level up to the sentence level. Initial results on several speech (as well as other multimodal) tasks resulted in significant improvements. In this paper, we present recognition results on Numbers'95 and on a reduced vocabulary version (1000 words) of the DARPA Conversational Telephone Speech-to-text (CTS) task

Infoscience - École polytechnique fédérale de Lausanne

Incremental construction of LSTM recurrent neural network

Author: Alquézar Mancho René
Ribeiro Evandsa Sabrine Lopes-Lima
Publication venue
Publication date: 01/01/2002
Field of study

Long Short--Term Memory (LSTM) is a recurrent neural network that uses structures called memory blocks to allow the net remember significant events distant in the past input sequence in order to solve long time lag tasks, where other RNN approaches fail. Throughout this work we have performed experiments using LSTM networks extended with growing abilities, which we call GLSTM. Four methods of training growing LSTM has been compared. These methods include cascade and fully connected hidden layers as well as two different levels of freezing previous weights in the cascade case. GLSTM has been applied to a forecasting problem in a biomedical domain, where the input/output behavior of five controllers of the Central Nervous System control has to be modelled. We have compared growing LSTM results against other neural networks approaches, and our work applying conventional LSTM to the task at hand.Postprint (published version

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

UPCommons. Portal del coneixement obert de la UPC

Automatic Speech Recognition for Low-resource Languages and Accents Using Multilingual and Crosslingual Information

Author: Vu Ngoc Thang
Publication venue: KIT-Bibliothek, Karlsruhe
Publication date: 01/01/2014
Field of study

This thesis explores methods to rapidly bootstrap automatic speech recognition systems for languages, which lack resources for speech and language processing. We focus on finding approaches which allow using data from multiple languages to improve the performance for those languages on different levels, such as feature extraction, acoustic modeling and language modeling. Under application aspects, this thesis also includes research work on non-native and Code-Switching speech

KITopen