
    Consonant Recognition by Modular Construction of Large Phonemic Time-Delay Neural Networks

    Abstract: In this paper we show that neural networks for speech recognition can be constructed in a modular fashion by exploiting the hidden structure of previously trained phonetic subcategory networks. The performance of the resulting larger phonetic nets was found to be as good as the performance of the subcomponent nets by themselves. This approach avoids the excessive learning times that would be necessary to train larger networks and allows for incremental learning. Large time-delay neural networks constructed incrementally by applying these modular training techniques achieved a recognition performance of 96.0% for all consonants and 94.7% for all phonemes.

    Introduction: Recently we have demonstrated that connectionist architectures capable of capturing some critical aspects of the dynamic nature of speech can achieve superior recognition performance for difficult but small phonemic discrimination tasks, such as discrimination of the voiced consonants B, D and G [1, 2]. Encouraged by these results, we wanted to explore the question of how we might expand on these models to make them useful for the design of speech recognition systems. A problem that emerges as we attempt to apply neural network models to the full speech recognition problem is the problem of scaling. Simply extending neural networks to ever larger structures and retraining them as one monolithic net quickly exceeds the capabilities of the fastest and largest supercomputers. The search complexity of finding good solutions in a huge space of possible network configurations also soon assumes unmanageable proportions. Moreover, having to decide on all possible classes for recognition ahead of time, as well as collecting sufficient data to train such a large monolithic network, is impractical to say the least. In an effort to extend our models from small recognition tasks to large-scale speech recognition systems, we must therefore explore modularity and incremental learning as design strategies to break up a large learning task into smaller subtasks. Breaking up a large task into subtasks to be tackled by individual black boxes interconnected in ad hoc arrangements, on the other hand, would mean abandoning one of the most attractive aspects of connectionism: the ability to perform complex constraint satisfaction in a massively parallel and interconnected fashion, in view of an overall optimal performance goal. In this paper we demonstrate, based on a set of experiments aimed at phoneme recognition, that it is indeed possible to construct large neural networks incrementally by exploiting the hidden structure of smaller pretrained subcomponent networks.

    Small Phonemic Classes by Time-Delay Neural Networks: In our previous work, we proposed a Time-Delay Neural Network (TDNN) architecture. Its multilayer architecture, its shift-invariance and the time-delayed connections of its units all contributed to its performance, by allowing the net to develop complex, non-linear decision surfaces and insensitivity to misalignments, and by incorporating contextual information into decision making (see [1, 2] for detailed analysis and discussion). It is trained by the back-propagation procedure [3] using shared weights for different time-shifted positions of the net [1, 2]. In spirit it has similarities to other models recently proposed [4, 5]. This network, however, had only been trained for the voiced stops B, D and G, and we began our extensions by training similar networks for the other phonemic classes in our database.
    All phoneme tokens in our experiments were extracted using phonetic hand-labels from a large-vocabulary database of 5240 common Japanese words. Each word in the database was spoken in isolation by one male native Japanese speaker. All utterances were recorded in a soundproof booth and digitized at a 12 kHz sampling rate. The database was then split into a training set and a testing set of 2620 utterances each. A 150 msec range around a phoneme boundary was excised for each phoneme token.
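    The key mechanism in this abstract, time-delayed connections with weights shared across time-shifted positions of the net, is what a 1-D convolution over the frame axis computes. Below is a minimal PyTorch sketch of that idea, not the authors' implementation; the feature dimension, layer sizes and delay widths are illustrative assumptions.

    # Minimal sketch of a time-delay neural network (TDNN).
    # Each Conv1d layer applies the same weights at every time shift,
    # which is exactly the shared-weight, time-delayed connectivity
    # described above. Sizes here are assumptions for illustration.
    import torch
    import torch.nn as nn

    class TDNN(nn.Module):
        def __init__(self, n_features=16, n_classes=3):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv1d(n_features, 8, kernel_size=3),  # delays of 0..2 frames
                nn.Tanh(),
                nn.Conv1d(8, n_classes, kernel_size=5),   # delays of 0..4 frames
                nn.Tanh(),
            )

        def forward(self, x):            # x: (batch, n_features, n_frames)
            y = self.net(x)              # per-position class evidence
            return y.sum(dim=2)          # integrate evidence over time

    model = TDNN()
    scores = model(torch.randn(1, 16, 15))   # one token, 15 frames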

    Incremental construction of LSTM recurrent neural network

    Long Short-Term Memory (LSTM) is a recurrent neural network that uses structures called memory blocks to allow the net to remember significant events distant in the past of the input sequence, in order to solve long-time-lag tasks where other RNN approaches fail. Throughout this work we have performed experiments using LSTM networks extended with growing abilities, which we call GLSTM. Four methods of training growing LSTMs have been compared. These methods include cascade and fully connected hidden layers, as well as two different levels of freezing previous weights in the cascade case. GLSTM has been applied to a forecasting problem in a biomedical domain, where the input/output behavior of five controllers of the Central Nervous System has to be modelled. We have compared growing LSTM results against other neural network approaches and against our earlier work applying conventional LSTM to the task at hand.
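    As a rough illustration of one of the growing strategies compared above, the sketch below adds a new LSTM block in cascade to a pretrained block whose weights are frozen, so only the new block and the readout are updated. This is a hypothetical PyTorch sketch; the paper's GLSTM sizes, connectivity and freezing levels are not given here, and all names are assumptions.

    # Cascade growth with frozen pretrained weights (one of the
    # freezing levels mentioned above): only the new block learns.
    import torch
    import torch.nn as nn

    old_block = nn.LSTM(input_size=4, hidden_size=8, batch_first=True)
    # ... assume old_block has already been trained ...
    for p in old_block.parameters():
        p.requires_grad = False          # freeze the pretrained block

    new_block = nn.LSTM(input_size=8, hidden_size=8, batch_first=True)  # cascade
    readout = nn.Linear(8, 1)

    params = list(new_block.parameters()) + list(readout.parameters())
    opt = torch.optim.Adam(params)       # only the new parts are updated

    x = torch.randn(2, 20, 4)            # (batch, time, features)
    h_old, _ = old_block(x)
    h_new, _ = new_block(h_old)
    y = readout(h_new)                   # forecast at every time step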

    The Microsoft 2017 Conversational Speech Recognition System

    We describe the 2017 version of Microsoft's conversational speech recognition system, in which we update our 2016 system with recent developments in neural-network-based acoustic and language modeling to further advance the state of the art on the Switchboard speech recognition task. The system adds a CNN-BLSTM acoustic model to the set of model architectures we combined previously, and includes character-based and dialog-session-aware LSTM language models in rescoring. For system combination we adopt a two-stage approach, whereby subsets of acoustic models are first combined at the senone/frame level, followed by word-level voting via confusion networks. We also added a confusion network rescoring step after system combination. The resulting system yields a 5.1% word error rate on the 2000 Switchboard evaluation set.
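    The first of the two combination stages described above can be pictured as a weighted average of per-frame senone posteriors across acoustic models. The sketch below shows that stage only; the second, word-level stage uses confusion-network voting, which needs lattice machinery and is omitted. Array shapes and the uniform weighting are illustrative assumptions, not the system's actual configuration.

    # Frame-level combination: weighted average of senone posteriors
    # from several acoustic models, renormalized per frame.
    import numpy as np

    def combine_frame_level(posteriors, weights=None):
        """posteriors: list of (n_frames, n_senones) arrays, one per model."""
        stack = np.stack(posteriors)              # (n_models, n_frames, n_senones)
        if weights is None:
            weights = np.full(len(posteriors), 1.0 / len(posteriors))
        combined = np.tensordot(weights, stack, axes=1)         # weighted average
        return combined / combined.sum(axis=1, keepdims=True)   # renormalize

    models = [np.random.dirichlet(np.ones(100), size=50) for _ in range(3)]
    frame_scores = combine_frame_level(models)    # (50 frames, 100 senones)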

    The recognition of New Zealand English closing diphthongs using time-delay neural networks

    As a step towards the development of a modular time-delay neural network (TDNN) for recognizing phonemes realized with a New Zealand English accent, this thesis focuses on the development of an expert module for closing diphthong recognition. The performances of traditional and squad-based expert modules are compared speaker-dependently for two New Zealand English speakers (one male and one female). Examples of each kind of expert module are formed from one of three types of TDNN, referred to as basic-token TDNN, extended-token TDNN and sequence-token TDNN. Of the traditional expert modules tested, those comprising extended-token TDNNs are found to afford the best performance compromises and are, therefore, preferable if a traditional expert module is to be used. Comparing the traditional and squad-based expert modules tested, the latter afford significantly better recognition and/or false-positive error performances than the former, irrespective of the type of TDNN used. Consequently, it is concluded that squad-based expert modules are preferable to their traditional counterparts for closing diphthong recognition. Of the squad-based expert modules tested, those comprising sequence-token TDNNs are found to afford consistently better false-positive error performances than those comprising basic- or extended-token TDNNs, while similar recognition performances are afforded by all. Consequently, squad-based expert modules comprising sequence-token TDNNs are recommended as the preferred method of recognizing closing diphthongs realized with a New Zealand accent. This thesis also presents results demonstrating that squad-based expert modules comprising sequence-token TDNNs may be trained to accommodate multiple speakers and in a manner capable of handling both uncorrupted and highly corrupted speech utterances.

    Vowel synthesis using feed-forward neural networks


    Why not model spoken word recognition instead of phoneme monitoring?

    Norris, McQueen & Cutler present a detailed account of the decision stage of the phoneme monitoring task. However, we question whether this contributes to our understanding of the speech recognition process itself, and we fail to see why phonotactic knowledge should play a role in phoneme recognition.

    Whole Word Phonetic Displays for Speech Articulation Training

    The main objective of this dissertation is to investigate and develop speech recognition technologies for speech training for people with hearing impairments. During the course of this work, a computer-aided speech training system for articulation speech training was also designed and implemented. The speech training system places emphasis on displays to improve children's pronunciation of isolated Consonant-Vowel-Consonant (CVC) words, with displays at both the phonetic level and the whole-word level. This dissertation presents two hybrid methods for combining Hidden Markov Models (HMMs) and Neural Networks (NNs) for speech recognition. The first method uses NN outputs as posterior probability estimators for HMMs. The second method uses NNs to transform the original speech features to normalized features with reduced correlation. Based on experimental testing, both of the hybrid methods give higher accuracy than standard HMM methods. The second method, using the NN to create normalized features, outperforms the first method in terms of accuracy. Several graphical displays were developed to provide real-time visual feedback to users, to help them to improve and correct their pronunciations.
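    The first hybrid method named above, using NN outputs as posterior probability estimators for HMMs, rests on Bayes' rule: a network trained to output p(state | frame) can stand in for the HMM's emission likelihoods after dividing by the state priors, since p(frame | state) is proportional to p(state | frame) / p(state). A minimal sketch follows; the arrays are toy stand-ins for real NN outputs and prior counts, not the dissertation's data.

    # Convert NN state posteriors to scaled log-likelihoods for HMM decoding.
    import numpy as np

    def scaled_log_likelihoods(posteriors, priors, eps=1e-10):
        """posteriors: (n_frames, n_states) NN outputs; priors: (n_states,)."""
        return np.log(posteriors + eps) - np.log(priors + eps)

    posteriors = np.random.dirichlet(np.ones(5), size=30)  # 30 frames, 5 states
    priors = np.full(5, 0.2)                 # state priors from training counts
    emissions = scaled_log_likelihoods(posteriors, priors)  # fed to HMM decoder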

    Tone classification of syllable -segmented Thai speech based on multilayer perceptron

    Thai is a monosyllabic and tonal language. Thai makes use of tone to convey lexical information about the meaning of a syllable. Thai has five distinctive tones, and each tone is well represented by a single F0 contour pattern. In general, a Thai syllable with a different tone has a different lexical meaning. Thus, to completely recognize a spoken Thai syllable, a speech recognition system has not only to recognize the base syllable but also to correctly identify the tone. Hence, tone classification of Thai speech is an essential part of a Thai speech recognition system. In this study, a tone classification of syllable-segmented Thai speech which incorporates the effects of tonal coarticulation, stress and intonation was developed. Automatic syllable segmentation, which performs the segmentation of the training and test utterances into syllable units, was also developed. The acoustical features, including fundamental frequency (F0), duration and energy, extracted from the syllable being processed and its neighboring syllables were used as the main discriminating features. A multilayer perceptron (MLP) trained by the backpropagation method was employed to classify these features. The proposed system was evaluated on 920 test utterances spoken by five male and three female Thai speakers who also uttered the training speech. The proposed system achieved an average accuracy rate of 91.36%.
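    The classifier stage described above, an MLP trained by backpropagation mapping per-syllable F0, duration and energy features to one of the five tones, is sketched below in PyTorch. The feature layout, layer sizes and activation are illustrative assumptions, not the study's exact configuration.

    # MLP tone classifier trained by backpropagation: features in, 5 tones out.
    import torch
    import torch.nn as nn

    N_FEATURES = 12   # e.g. F0 contour samples + duration + energy, incl. neighbors
    N_TONES = 5       # mid, low, falling, high, rising

    mlp = nn.Sequential(
        nn.Linear(N_FEATURES, 32),
        nn.Sigmoid(),
        nn.Linear(32, N_TONES),
    )

    loss_fn = nn.CrossEntropyLoss()
    opt = torch.optim.SGD(mlp.parameters(), lr=0.1)

    x = torch.randn(8, N_FEATURES)          # batch of 8 syllables
    y = torch.randint(0, N_TONES, (8,))     # tone labels
    opt.zero_grad()
    loss = loss_fn(mlp(x), y)
    loss.backward()                         # backpropagation step
    opt.step()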