
    Dynamic time warping applied to detection of confusable word pairs in automatic speech recognition

    In this paper we present a method to predict whether two words are likely to be confused by an Automatic Speech Recognition (ASR) system. The method is based on the classical Dynamic Time Warping (DTW) technique. This technique, which is usually used in ASR to measure the distance between two speech signals, is used here to calculate the distance between two words. With this distance the words are classified as confusable or not confusable using a threshold. We have tested the method in a classical false acceptance/false rejection framework, and the Equal Error Rate (EER) was measured to be less than 3%. Peer reviewed.
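    The DTW-based word distance described above can be sketched as follows. This is a minimal illustration, assuming words are represented as sequences of acoustic feature vectors (e.g. MFCC frames); the threshold value is illustrative, not the one tuned in the paper:

    ```python
    import numpy as np

    def dtw_distance(a, b):
        """Classic DTW distance between two feature sequences (frames x dims),
        normalised by the total sequence length."""
        n, m = len(a), len(b)
        # Local Euclidean distance between every pair of frames.
        local = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
        acc = np.full((n + 1, m + 1), np.inf)
        acc[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                acc[i, j] = local[i - 1, j - 1] + min(
                    acc[i - 1, j],      # insertion
                    acc[i, j - 1],      # deletion
                    acc[i - 1, j - 1],  # match
                )
        return acc[n, m] / (n + m)

    def confusable(a, b, threshold=1.0):
        # Illustrative threshold; the paper tunes it on a
        # false acceptance / false rejection trade-off.
        return dtw_distance(a, b) < threshold
    ```

    With identical feature sequences the normalised path cost is zero, so any positive threshold labels the pair confusable.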

    Spatial features of reverberant speech: estimation and application to recognition and diarization

    Distant talking scenarios, such as hands-free calling or teleconference meetings, are essential for natural and comfortable human-machine interaction and are being increasingly used in multiple contexts. The speech signal acquired in such scenarios is reverberant and affected by additive noise. This signal distortion degrades the performance of speech recognition and diarization systems, creating troublesome human-machine interactions. This thesis proposes a method to non-intrusively estimate room acoustic parameters, paying special attention to a room acoustic parameter highly correlated with speech recognition degradation: the clarity index. In addition, a method to provide information regarding the estimation accuracy is proposed. An analysis of phoneme recognition performance for multiple reverberant environments is presented, from which a confusability metric for each phoneme is derived. This confusability metric is then employed to improve reverberant speech recognition performance. Additionally, room acoustic parameters can be used in speech recognition to provide robustness against reverberation. A method to exploit clarity index estimates in order to perform reverberant speech recognition is introduced. Finally, room acoustic parameters can also be used to diarize reverberant speech. A room acoustic parameter is proposed as an additional source of information for single-channel diarization in reverberant environments. In multi-channel environments, the time delay of arrival is a feature commonly used to diarize the input speech; however, the computation of this feature is affected by reverberation. A method is presented to model the time delay of arrival in a robust manner so that speaker diarization is performed more accurately. Open Access.
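    The clarity index mentioned above (commonly C50) is defined as the ratio, in dB, of early to late energy in the room impulse response, split at 50 ms. The thesis estimates it non-intrusively from reverberant speech; the sketch below instead computes the reference definition directly from a measured impulse response:

    ```python
    import numpy as np

    def clarity_index(rir, fs, boundary_ms=50.0):
        """Clarity index (C50 by default): early-to-late energy ratio of a
        room impulse response, in dB. boundary_ms=80.0 gives C80."""
        split = int(fs * boundary_ms / 1000.0)  # boundary sample index
        energy = np.asarray(rir, dtype=float) ** 2
        return 10.0 * np.log10(energy[:split].sum() / energy[split:].sum())
    ```

    Higher values indicate that most of the impulse-response energy arrives early, which correlates with better speech recognition performance.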

    Speech vocoding for laboratory phonology

    Using phonological speech vocoding, we propose a platform for exploring relations between phonology and speech processing and, in broader terms, between the abstract and physical structures of a speech signal. Our goal is to take a step towards bridging phonology and speech processing and to contribute to the program of Laboratory Phonology. We show three application examples for laboratory phonology: compositional phonological speech modelling, a comparison of phonological systems, and an experimental phonological parametric text-to-speech (TTS) system. The featural representations of the following three phonological systems are considered in this work: (i) Government Phonology (GP), (ii) the Sound Pattern of English (SPE), and (iii) the extended SPE (eSPE). Comparing GP- and eSPE-based vocoded speech, we conclude that the latter achieves slightly better results than the former. However, GP, the most compact phonological speech representation, performs comparably to the systems with a higher number of phonological features. The parametric TTS based on phonological speech representation, and trained from an unlabelled audiobook in an unsupervised manner, achieves 85% of the intelligibility of state-of-the-art parametric speech synthesis. We envision that the presented approach paves the way for researchers in both fields to form meaningful hypotheses that are explicitly testable using the concepts developed and exemplified in this paper. On the one hand, laboratory phonologists might test the applied concepts of their theoretical models; on the other hand, the speech processing community may utilize the concepts developed for the theoretical phonological models to improve current state-of-the-art applications.
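    As a rough illustration of the featural representations compared above, each phone can be encoded as a vector of binary phonological features, and phones can then be compared in feature space. The feature inventory and values below are a hypothetical toy fragment, not the actual GP, SPE, or eSPE feature sets used in the paper:

    ```python
    # Toy fragment of a binary phonological feature table. Features and
    # values are illustrative only, not the real GP/SPE/eSPE inventories.
    FEATURES = ("voice", "nasal", "continuant", "labial", "coronal")

    PHONES = {
        "p": (0, 0, 0, 1, 0),
        "b": (1, 0, 0, 1, 0),
        "m": (1, 1, 0, 1, 0),
        "s": (0, 0, 1, 0, 1),
        "z": (1, 0, 1, 0, 1),
    }

    def feature_distance(p, q):
        """Hamming distance between two phones' feature vectors."""
        return sum(a != b for a, b in zip(PHONES[p], PHONES[q]))
    ```

    Under such an encoding, /p/ and /b/ differ only in voicing, while /p/ and /z/ differ in several features, matching the intuition that the former pair is phonologically closer.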

    Robust automatic transcription of lectures

    Automatic transcription of lectures is becoming an important task. Possible applications can be found in the fields of automatic translation or summarization, information retrieval, digital libraries, education and communication research. Ideally those systems would operate on distant recordings, freeing the presenter from wearing body-mounted microphones. This task, however, is exceptionally difficult, given that the speech signal is severely degraded by background noise and reverberation.

    Predicting the performance of a speech recognition task.

    Yau Pui Yuk. Thesis (M.Phil.), Chinese University of Hong Kong, 2002. Includes bibliographical references (leaves 147-152). Abstracts in English and Chinese.

    Contents:
    Chapter 1: Introduction
        1.1 Overview
        1.2 Speech Recognition
            1.2.1 How Speech Recognition Works
            1.2.2 Types of Speech Recognition Tasks
            1.2.3 Variabilities in Speech - a Challenge for Speech Recognition
        1.3 Performance Prediction of Speech Recognition Task
        1.4 Thesis Goals
        1.5 Thesis Organization
    Chapter 2: Background
        2.1 The Acoustic-phonetic Approach
            2.1.1 Prediction based on the Degree of Mismatch
            2.1.2 Prediction based on Acoustic Similarity
            2.1.3 Prediction based on Between-Word Distance
        2.2 The Lexical Approach
            2.2.1 Perplexity
            2.2.2 SMR-perplexity
        2.3 The Combined Acoustic-phonetic and Lexical Approach
            2.3.1 Speech Decoder Entropy (SDE)
            2.3.2 Ideal Speech Decoding Difficulty (ISDD)
        2.4 Chapter Summary
    Chapter 3: Components for Predicting the Performance of Speech Recognition Task
        3.1 Components of Speech Recognizer
        3.2 Word Similarity Measure
            3.2.1 Universal Phoneme Symbol (UPS)
            3.2.2 Definition of Phonetic Distance
            3.2.3 Definition of Word Pair Phonetic Distance
            3.2.4 Definition of Word Similarity Measure
        3.3 Word Occurrence Measure
        3.4 Chapter Summary
    Chapter 4: Formulation of Recognition Error Predictive Index (REPI)
        4.1 Formulation of Recognition Error Predictive Index (REPI)
        4.2 Characteristics of Recognition Error Predictive Index (REPI)
            4.2.1 Weakness of Ideal Speech Decoding Difficulty (ISDD)
            4.2.2 Advantages of Recognition Error Predictive Index (REPI)
        4.3 Chapter Summary
    Chapter 5: Experimental Design and Setup
        5.1 Objectives
        5.2 Experiments Preparation
            5.2.1 Speech Corpus and Speech Recognizers
            5.2.2 Speech Recognition Tasks
            5.2.3 Evaluation Criterion
        5.3 Experiment Categories and their Setup
            5.3.1 Experiment Category 1 - Investigating and comparing the overall prediction performance of the two predictive indices
            5.3.2 Experiment Category 2 - Comparing the applicability of the word similarity measures of the two predictive indices on predicting the recognition performance
            5.3.3 Experiment Category 3 - Comparing the applicability of the formulation method of the two predictive indices on predicting the recognition performance
            5.3.4 Experiment Category 4 - Comparing the performance of different phonetic distance definitions
        5.4 Chapter Summary
    Chapter 6: Experimental Results and Analysis
        6.1 Experimental Results and Analysis
            6.1.1 Experiment Category 1 - Investigating and comparing the overall prediction performance of the two predictive indices
            6.1.2 Experiment Category 2 - Comparing the applicability of the word similarity measures of the two predictive indices on predicting the recognition performance
            6.1.3 Experiment Category 3 - Comparing the applicability of the formulation method of the two predictive indices on predicting the recognition performance
            6.1.4 Experiment Category 4 - Comparing the performance of different phonetic distance definitions
        6.2 Experimental Summary
        6.3 Chapter Summary
    Chapter 7: Conclusions
        7.1 Contributions
        7.2 Future Directions
    Bibliography
    Appendix A: Table of Universal Phoneme Symbol
    Appendix B: Vocabulary Lists
    Appendix C: Experimental Results of Two-words Speech Recognition Tasks
    Appendix D: Experimental Results of Three-words Speech Recognition Tasks
    Appendix E: Significance Testing
        E.1 Procedures of Significance Testing
        E.2 Results of the Significance Testing
            E.2.1 Experiment Category 1
            E.2.2 Experiment Category 2
            E.2.3 Experiment Category 3
            E.2.4 Experiment Category 4
    Appendix F: Linear Regression Models
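    The word-pair phonetic distance developed in Chapter 3 builds a word-level distance from per-phoneme phonetic distances. One common way to realise this, sketched here under the assumption of a weighted edit distance (the thesis's exact cost definitions may differ), is:

    ```python
    def word_pair_distance(w1, w2, sub_cost, indel=1.0):
        """Weighted edit distance between two words given as phoneme lists.

        sub_cost(p, q) supplies a phoneme-level phonetic distance;
        insertions and deletions carry a flat `indel` cost.
        """
        n, m = len(w1), len(w2)
        d = [[0.0] * (m + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):
            d[i][0] = i * indel
        for j in range(1, m + 1):
            d[0][j] = j * indel
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                d[i][j] = min(
                    d[i - 1][j] + indel,  # deletion
                    d[i][j - 1] + indel,  # insertion
                    d[i - 1][j - 1] + sub_cost(w1[i - 1], w2[j - 1]),
                )
        return d[n][m]
    ```

    With a 0/1 substitution cost this reduces to plain Levenshtein distance over phoneme strings; a graded phonetic distance makes acoustically similar phonemes cheaper to substitute, so confusable word pairs score lower.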

    Robust Automatic Transcription of Lectures

    Automatic transcription of talks, lectures and presentations is becoming ever more important; it is what first enables applications such as automatic speech translation, automatic speech summarization and targeted information retrieval in audio data, and thus easier access in digital libraries. Ideally, such a system operates with a distant microphone that frees the presenter from wearing a microphone, which is the focus of this work.

    Intelligibility enhancement of synthetic speech in noise

    EC Seventh Framework Programme (FP7/2007-2013). Speech technology can facilitate human-machine interaction and create new communication interfaces. Text-To-Speech (TTS) systems provide speech output for dialogue, notification and reading applications as well as personalized voices for people who have lost the use of their own. TTS systems are built to produce synthetic voices that should sound as natural, expressive and intelligible as possible and, if necessary, be similar to a particular speaker. Although naturalness is an important requirement, providing the correct information in adverse conditions can be crucial to certain applications. Speech that adapts or reacts to different listening conditions can in turn be more expressive and natural. In this work we focus on enhancing the intelligibility of TTS voices in additive noise. For that we adopt the statistical parametric paradigm for TTS in the shape of a hidden Markov model (HMM) based speech synthesis system that allows for flexible enhancement strategies. Little is known about which human speech production mechanisms actually increase intelligibility in noise and how the choice of mechanism relates to noise type, so we approached the problem from another perspective: using mathematical models for hearing speech in noise. To find which models are better at predicting intelligibility of TTS in noise we performed listening evaluations to collect subjective intelligibility scores, which we then compared to the models' predictions. In these evaluations we observed that modifications performed on the spectral envelope of speech can increase intelligibility significantly, particularly if the strength of the modification depends on the noise and its level. We used these findings to inform the decision of which of the models to use when automatically modifying the spectral envelope of the speech according to the noise. We devised two methods, both involving cepstral coefficient modifications.
    The first was applied during extraction while training the acoustic models, and the other when generating a voice using pre-trained TTS models. The latter has the advantage of being able to address fluctuating noise. To increase intelligibility of synthetic speech at generation time we proposed a method for Mel cepstral coefficient modification based on the glimpse proportion measure, the most promising of the models of speech intelligibility that we evaluated. An extensive series of listening experiments demonstrated that this method brings significant intelligibility gains to TTS voices while not requiring additional recordings of clear or Lombard speech. To further improve intelligibility we combined our method with noise-independent enhancement approaches based on the acoustics of highly intelligible speech. This combined solution was as effective for stationary noise as for the challenging competing speaker scenario, obtaining up to 4 dB of equivalent intensity gain. Finally, we proposed an extension to the speech enhancement paradigm to account not only for energetic masking of signals but also for linguistic confusability of words in sentences. We found that word-level confusability, a challenging value to predict, can be used as an additional prior to increase intelligibility even for simple enhancement methods like energy reallocation between words. These findings motivate further research into solutions that can tackle the effect of energetic masking on the auditory system as well as on higher levels of processing.
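    The glimpse proportion measure referred to above scores intelligibility as the fraction of spectro-temporal regions where the speech exceeds the noise by a local SNR threshold (commonly 3 dB). The sketch below is a simplified version operating on magnitude spectrograms, omitting the auditory (gammatone) front-end of the full model:

    ```python
    import numpy as np

    def glimpse_proportion(speech_mag, noise_mag, threshold_db=3.0):
        """Fraction of spectro-temporal cells where the speech magnitude
        exceeds the noise magnitude by threshold_db (the 'glimpses')."""
        local_snr = 20.0 * np.log10(speech_mag / noise_mag)
        return float(np.mean(local_snr > threshold_db))
    ```

    A value near 1.0 means the speech is audible almost everywhere in the time-frequency plane; a value near 0.0 means the noise masks it almost completely.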

    The effects of child language development on the performance of automatic speech recognition

    In comparison to adults', children's ASR is more challenging and yields inferior results. It has been suggested that to address this issue, linguistic understanding of children's speech development needs to be employed to provide either a solution or an explanation. The present work aims to explore the influence of phonological effects associated with language acquisition (PEALA) on children's ASR and to investigate whether they can be detected in systematic patterns of ASR phone confusion errors or evidenced in systematic patterns of acoustic feature structure. Findings from speech development research are used as the framework within which a set of predictable error patterns is defined and guides the analysis of the experimental results reported. Several ASR experiments are conducted involving both children's and adults' speech. ASR phone confusion matrices are extracted and analysed according to a statistical significance test proposed for the purposes of this work. A mathematical model is introduced to interpret the emerging results. Additionally, bottleneck features and i-vectors representing the acoustic features in one of the systems developed are extracted and visualised using linear discriminant analysis (LDA). A qualitative analysis is conducted with reference to patterns that can be predicted through PEALA.
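    Phone confusion analyses like the one described typically start from a confusion matrix built over aligned reference/hypothesis phone pairs. A minimal sketch of that first step (the work's own significance test is not reproduced here):

    ```python
    from collections import Counter

    def confusion_matrix(pairs):
        """Phone confusion counts from (reference, hypothesis) phone pairs,
        e.g. obtained by aligning ASR output against the transcript."""
        counts = Counter(pairs)
        phones = sorted({p for pair in pairs for p in pair})
        matrix = {r: {h: counts[(r, h)] for h in phones} for r in phones}
        return matrix, phones

    def confusion_rate(matrix, ref, hyp):
        """Proportion of occurrences of `ref` recognised as `hyp`."""
        row = matrix[ref]
        total = sum(row.values())
        return row[hyp] / total if total else 0.0
    ```

    Row-normalising the counts in this way gives per-phone confusion rates, which can then be tested against the error patterns predicted by PEALA.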