104 research outputs found

    Spoken content retrieval: A survey of techniques and technologies

    Speech media, that is, digital audio and video containing spoken content, has blossomed in recent years. Large collections are accruing on the Internet as well as in private and enterprise settings. This growth has motivated extensive research on techniques and technologies that facilitate reliable indexing and retrieval. Spoken content retrieval (SCR) requires the combination of audio and speech processing technologies with methods from information retrieval (IR). SCR research initially investigated planned speech structured in document-like units, but has subsequently shifted focus to more informal spoken content produced spontaneously, outside of the studio and in conversational settings. This survey provides an overview of the field of SCR, encompassing component technologies, the relationship of SCR to text IR and automatic speech recognition, and user-interaction issues. It is aimed at researchers with backgrounds in speech technology or IR who are seeking deeper insight into how these fields are integrated to support research and development, thus addressing the core challenges of SCR.

    Enhancing posterior based speech recognition systems

    The use of local phoneme posterior probabilities has been increasingly explored for improving speech recognition systems. Hybrid hidden Markov model / artificial neural network (HMM/ANN) and Tandem systems are the most successful examples. In this thesis, we present a principled framework for enhancing the estimation of local posteriors by integrating phonetic and lexical knowledge, as well as long contextual information. This framework allows for hierarchical estimation, integration and use of local posteriors from the phoneme up to the word level. We propose two approaches for enhancing the posteriors. In the first approach, phoneme posteriors estimated with an ANN (specifically a multi-layer perceptron, MLP) are used as emission probabilities in HMM forward-backward recursions. This yields new enhanced posterior estimates integrating HMM topological constraints (encoding specific phonetic and lexical knowledge) and long context. In the second approach, a temporal context of the regular MLP posteriors is post-processed by a secondary MLP in order to learn inter- and intra-dependencies among the phoneme posteriors. The learned knowledge is integrated into the posterior estimation during the inference (forward pass) of the second MLP, resulting in enhanced posteriors. The use of the resulting enhanced local posteriors is investigated in a wide range of posterior-based speech recognition systems (e.g. Tandem and hybrid HMM/ANN), as a replacement for, or in combination with, the regular MLP posteriors. The enhanced posteriors consistently outperform the regular posteriors in different applications over small- and large-vocabulary databases.
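    The first approach can be written down compactly. The following is a minimal sketch, assuming the MLP posteriors are already mapped to HMM states and that state priors and a transition matrix are given; it illustrates the idea rather than the thesis implementation:

```python
# Sketch: enhance MLP phoneme posteriors via HMM forward-backward recursions.
# Assumed inputs (not from the thesis code): per-state posteriors, state
# priors, and a transition matrix; per-frame normalization avoids underflow.
import numpy as np

def enhance_posteriors(mlp_post, priors, trans):
    """mlp_post: (T, S) MLP posteriors per HMM state; priors: (S,) state
    priors; trans: (S, S) transition matrix. Returns (T, S) enhanced
    state posteriors (the forward-backward gammas)."""
    T, S = mlp_post.shape
    lik = mlp_post / priors          # scaled likelihoods (hybrid HMM/ANN trick)
    alpha = np.zeros((T, S))
    beta = np.zeros((T, S))
    alpha[0] = priors * lik[0]
    alpha[0] /= alpha[0].sum()
    for t in range(1, T):            # forward recursion
        alpha[t] = lik[t] * (alpha[t - 1] @ trans)
        alpha[t] /= alpha[t].sum()
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):   # backward recursion
        beta[t] = trans @ (lik[t + 1] * beta[t + 1])
        beta[t] /= beta[t].sum()
    gamma = alpha * beta             # per-frame state posteriors
    return gamma / gamma.sum(axis=1, keepdims=True)
```

    The returned gammas integrate the topological constraints carried by the transition matrix, which is precisely what makes them "enhanced" relative to the raw frame-wise MLP outputs.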

    Holistic Vocabulary Independent Spoken Term Detection

    Within this thesis, we aim at designing a loosely coupled holistic system for Spoken Term Detection (STD) on heterogeneous German broadcast data in selected application scenarios. Starting from STD on the 1-best output of a word-based speech recognizer, we study the performance of several subword units for vocabulary-independent STD on a linguistically and acoustically challenging German corpus. We explore the typical error sources in subword STD, and find that they differ from the error sources in word-based speech search. We select, extend and combine a set of state-of-the-art methods for error compensation in STD in order to explicitly merge the corresponding STD error spaces through anchor-based approximate lattice retrieval. Novel methods for STD result verification are proposed in order to increase retrieval precision by exploiting external knowledge at search time. Since error-compensating methods for STD typically suffer from high response times on large-scale databases, we also propose scalable approaches suitable for large corpora. All proposed methods are evaluated on an extensive set of German broadcast data with a focus on selected, representative application scenarios. The highest STD accuracy is obtained by combining anchor-based approximate retrieval from both syllable-lattice ASR and syllabified word ASR into a hybrid STD system, and pruning the result list using external knowledge with hybrid contextual and anti-query verification.
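    For intuition, the subword STD step can be sketched as approximate string search over recognized unit sequences. The snippet below is illustrative only and far simpler than the anchor-based approximate lattice retrieval used in the thesis (it searches a 1-best subword transcript rather than a lattice):

```python
# Sketch: vocabulary-independent spoken term detection on a 1-best subword
# transcript, tolerating recognition errors up to an edit-distance threshold.
def edit_distance(a, b):
    """Levenshtein distance between two unit (phoneme/syllable) sequences."""
    dp = list(range(len(b) + 1))
    for i, ua in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, ub in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,            # delete ua
                                     dp[j - 1] + 1,        # insert ub
                                     prev + (ua != ub))    # substitute
    return dp[-1]

def std_search(query, transcript, max_cost=1):
    """Return (position, cost) pairs where the query unit sequence
    approximately matches the transcript. For simplicity the window length
    is fixed at len(query); a real system would also consider windows of
    length len(query) +/- max_cost."""
    n, hits = len(query), []
    for s in range(len(transcript) - n + 1):
        cost = edit_distance(query, transcript[s:s + n])
        if cost <= max_cost:
            hits.append((s, cost))
    return hits

# One substitution error ("liin" for "lin") still yields a hit at position 1.
print(std_search(["ber", "lin"], ["der", "ber", "liin", "heu", "te"]))
```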

    Phoneme-based Video Indexing Using Phonetic Disparity Search

    This dissertation presents and evaluates an approach to the video indexing problem, investigating a categorization method that transcribes audio content through Automatic Speech Recognition (ASR) combined with Dynamic Contextualization (DC), Phonetic Disparity Search (PDS) and Metaphone indexation. The suggested approach applies genome pattern matching algorithms with computational summarization to build a database infrastructure that provides an indexed summary of the original audio content. PDS complements the contextual phoneme indexing approach by optimizing topic search performance and accuracy in large video content structures. A prototype was established to translate news broadcast video into text and phonemes automatically by using ASR utterance conversions. Each extracted phonetic utterance was then categorized, converted to Metaphones, and stored in a repository with contextual topical information attached and indexed for posterior search analysis. Following the original design strategy, a custom parallel interface was built to measure the capabilities of dissimilar phonetic queries and provide an interface for result analysis. The postulated solution provides evidence of superior topic matching when compared to traditional word and phoneme search methods. Experimental results demonstrate that PDS can be 3.7% better than the equivalent phoneme query, while Metaphone search proved to be 154.6% better than the equivalent phoneme search and 68.1% better than the equivalent word search.
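    The role of Metaphone indexation can be made concrete with a toy example. The key function below is a crude consonant-skeleton stand-in for the real Metaphone algorithm, so this is a sketch of the indexing idea rather than the dissertation's implementation:

```python
# Sketch: phonetic indexing of transcribed utterances so that queries with
# divergent spellings but similar pronunciation retrieve the same segment.
from collections import defaultdict

def phonetic_key(word):
    """Crude consonant skeleton standing in for Metaphone."""
    out = []
    for ch in word.upper():
        if ch.isalpha() and ch not in "AEIOU":
            if not out or out[-1] != ch:    # collapse doubled consonants
                out.append(ch)
    return "".join(out) or word[:1].upper()

index = defaultdict(list)

def index_utterance(video_id, start_time, words):
    """Store each transcribed word under its phonetic key."""
    for w in words:
        index[phonetic_key(w)].append((video_id, start_time, w))

index_utterance("news01", 12.4, ["hurricane", "warning"])
print(index[phonetic_key("huricane")])   # misspelled query still matches
```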

    Online Speech Recognition Using Recurrent Neural Networks

    Thesis (Ph.D.) -- Seoul National University Graduate School, Dept. of Electrical and Computer Engineering, February 2017. Advisor: Wonyong Sung. Recurrent neural networks (RNNs) have shown outstanding sequence-to-sequence modeling performance. Thanks to recent advances in end-to-end training approaches for automatic speech recognition (ASR), RNNs can learn direct mapping functions from the sequence of audio features to the sequence of output characters or words without any intermediate phoneme or lexicon layers. So far, the majority of studies on end-to-end ASR have focused on increasing the accuracy of speech recognition to the level of traditional state-of-the-art models. However, although end-to-end ASR models have reached the accuracy of traditional systems, their application has usually been limited to utterance-level speech recognition with pre-segmented audio rather than online speech recognition with continuous audio. This is because RNNs cannot easily be generalized to very long streams of audio when they are trained with segmented audio. To address this problem, we propose an RNN training approach for training sequences of virtually infinite length. Specifically, we describe an efficient GPU-based RNN training framework for the truncated backpropagation through time (BPTT) algorithm, which is suitable for online (continuous) training. Then, we present an online version of the connectionist temporal classification (CTC) loss computation algorithm, where the original CTC loss is estimated with a partial sliding window. This modified CTC algorithm can be directly employed for truncated-BPTT-based RNN training. In addition, a fully RNN-based end-to-end online ASR model is proposed. The model is composed of an acoustic RNN with CTC output and a character-level RNN language model (LM) that is augmented with a hierarchical structure. Prefix-tree-based beam search decoding is employed with a new beam pruning algorithm to prevent exponential growth of the tree. The model is free from phoneme or lexicon models, and can be used for decoding infinitely long audio sequences. This model also has a very small memory footprint compared to other end-to-end systems while showing competitive accuracy. Furthermore, we propose an improved character-level RNN LM with a hierarchical structure. This character-level RNN LM shows improved perplexity compared to a lightweight word-level RNN LM of comparable size. When this RNN LM is applied to the proposed character-level online ASR, better speech recognition accuracy can be achieved with a reduced amount of computation. Contents: 1 Introduction; 2 Flexible and Efficient RNN Training on GPUs; 3 Online Sequence Training with Connectionist Temporal Classification; 4 Character-Level Incremental Speech Recognition; 5 Character-Level Language Modeling with Hierarchical RNNs; 6 Conclusion.
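    The chunk-wise training mechanics behind the "virtually infinite length" claim can be sketched briefly. The PyTorch snippet below (assumed dimensions; illustrative, not the thesis's GPU framework) shows only the state-carrying truncation of truncated BPTT; handling CTC labels inside each window is exactly what the proposed CTC-TR and CTC-EM algorithms address:

```python
# Sketch: truncated BPTT over a continuous audio stream. Hidden state is
# carried across fixed-length chunks and detached at each boundary, so
# gradients flow only within the truncation window.
import torch
import torch.nn as nn

rnn = nn.LSTM(input_size=40, hidden_size=256, num_layers=2)   # assumed sizes
head = nn.Linear(256, 30)            # assumed: 29 labels + CTC blank (index 0)
opt = torch.optim.Adam(list(rnn.parameters()) + list(head.parameters()))
ctc = nn.CTCLoss(blank=0)

def tbptt_step(chunk, targets, target_lens, state):
    """chunk: (T, B, 40) feature chunk; targets: (B, S) padded labels for
    this window (assigning labels to windows is the hard part that CTC-TR /
    CTC-EM solve); state: carried LSTM state, or None at stream start."""
    out, state = rnn(chunk, state)
    log_probs = head(out).log_softmax(-1)              # (T, B, 30)
    T, B = chunk.size(0), chunk.size(1)
    loss = ctc(log_probs, targets,
               torch.full((B,), T, dtype=torch.long), target_lens)
    opt.zero_grad()
    loss.backward()
    opt.step()
    # Keep the state values but cut the autograd graph at the boundary.
    return tuple(s.detach() for s in state), loss.item()
```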

    Identifying unexpected words using in-context and out-of-context phoneme posteriors

    The paper proposes and discusses a machine approach for the identification of unexpected (zero- or low-probability) words. The approach is based on the use of two parallel recognition channels: one channel employs sensory information from the speech signal together with prior context information provided by the pronunciation dictionary and grammatical constraints to estimate `in-context' posterior probabilities of phonemes; the other channel is independent of the context information and entirely driven by the sensory data to deliver estimates of `out-of-context' posterior probabilities of phonemes. A significant mismatch between the information from these two channels indicates an unexpected word. The viability of this concept is demonstrated on the identification of out-of-vocabulary digits in continuous digit streams. The comparison of the two channels provides a confidence measure on the output of the recognizer. Unlike conventional confidence measures, this measure does not rely on phone and word segmentation (boundary detection), and thus is not affected by possibly imperfect segment boundary detection. In addition, being a relative measure, it is more discriminative than conventional posterior-based measures.
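    The two-channel comparison can be sketched with a simple divergence measure. The snippet below assumes KL divergence between the per-frame posterior distributions as the mismatch score; the paper's exact measure may differ, so treat this as an illustration of the idea:

```python
# Sketch: flag unexpected (e.g. OOV) words by comparing `in-context' and
# `out-of-context' phoneme posterior streams frame by frame.
import numpy as np

def mismatch_scores(in_ctx, out_ctx, eps=1e-10):
    """in_ctx, out_ctx: (T, K) posterior matrices from the two channels.
    Returns per-frame KL(out_ctx || in_ctx)."""
    p = out_ctx + eps
    q = in_ctx + eps
    return np.sum(p * np.log(p / q), axis=1)

def is_unexpected(in_ctx, out_ctx, threshold=2.0):
    # Average the divergence over the frames of a hypothesized word; a large
    # value means the context-constrained channel is forcing phonemes the
    # acoustics do not support, i.e. a likely unexpected word.
    return mismatch_scores(in_ctx, out_ctx).mean() > threshold
```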

    Searching Spontaneous Conversational Speech: Proceedings of the ACM SIGIR Workshop (SSCS2008)


    Linguistically-motivated sub-word modeling with applications to speech recognition

    Thesis (Ph.D.) -- Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2009. Includes bibliographical references (p. 173-185). Despite the proliferation of speech-enabled applications and devices, speech-driven human-machine interaction still faces several challenges. One of these issues is the new word or out-of-vocabulary (OOV) problem, which occurs when the underlying automatic speech recognizer (ASR) encounters a word it does not "know". With ASR being deployed in constantly evolving domains such as restaurant ratings or music querying, as well as on handheld devices, the new word problem continues to arise. This thesis is concerned with the OOV problem, and in particular with the process of modeling and learning the lexical properties of an OOV word through a linguistically-motivated sub-syllabic model. The linguistic model is designed using a context-free grammar which describes the sub-syllabic structure of English words and encapsulates phonotactic and phonological constraints. The context-free grammar is supported by a probability model, which captures the statistics of the parses generated by the grammar and encodes spatio-temporal context. The two main outcomes of the grammar design are: (1) sub-word units, which encode pronunciation information and can be viewed as clusters of phonemes; and (2) a high-quality alignment between graphemic and sub-word units, which results in hybrid entities denoted as spellnemes. The spellneme units are used in the design of a statistical bi-directional letter-to-sound (L2S) model, which plays a significant role in automatically learning the spelling and pronunciation of a new word. The sub-word units and the L2S model are assessed on the task of automatic lexicon generation. In a first set of experiments, knowledge of the spelling of the lexicon is assumed. It is shown that the phonemic pronunciations associated with the lexicon can be successfully learned using the L2S model as well as a sub-word recognizer. In a second set of experiments, the assumption of perfect spelling knowledge is relaxed, and an iterative and unsupervised algorithm, denoted as Turbo-style, makes use of spoken instances of both spellings and words to learn the lexical entries in a dictionary. Sub-word speech recognition is also embedded in a parallel fashion as a backoff mechanism for a word recognizer. The resulting hybrid model is evaluated in a lexical access application, whereby a word recognizer first attempts to recognize an isolated word. Upon failure of the word recognizer, the sub-word recognizer is manually triggered. Preliminary results show that such a hybrid set-up outperforms a large-vocabulary recognizer. Finally, the sub-word units are embedded in a flat hybrid OOV model for continuous ASR. The hybrid ASR is deployed as a front-end to a song retrieval application, which is queried via spoken lyrics. Vocabulary compression and open-ended query recognition are achieved by designing a hybrid ASR. The performance of the front-end recognition system is reported in terms of sentence, word, and sub-word error rates. The hybrid ASR is shown to outperform a word-only system over a range of out-of-vocabulary rates (1%-50%). The retrieval performance is thoroughly assessed as a function of ASR N-best size, language model order, and index size. Moreover, it is shown that the sub-words outperform alternative linguistically-motivated sub-lexical units such as phonemes. Finally, it is observed that a dramatic vocabulary compression, by more than a factor of 10, is accompanied by only a minor loss in song retrieval performance. by Ghinwa F. Choueiter. Ph.D.
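    The hybrid lexical-access flow lends itself to a short schematic. In the sketch below the recognizers and the L2S model are hypothetical stand-ins passed in as callables, and the confidence threshold is an assumed substitute for the manual triggering described above:

```python
# Schematic sketch of word recognition with sub-word back-off for OOV words.
def lexical_access(audio, word_rec, subword_rec, l2s, confidence_floor=0.5):
    """word_rec(audio) -> (word, confidence); subword_rec(audio) -> list of
    sub-syllabic units; l2s.units_to_spelling(units) -> spelling.
    All three are hypothetical stand-ins for illustration."""
    word, conf = word_rec(audio)
    if conf >= confidence_floor:
        return word, "in-vocabulary"
    # Back off: decode sub-word units, then let the bi-directional
    # letter-to-sound model propose a spelling for the unseen word.
    units = subword_rec(audio)
    spelling = l2s.units_to_spelling(units)
    return spelling, "oov-hypothesis"
```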

    Wake-Up-Word Speech Recognition


    Modularity and Neural Integration in Large-Vocabulary Continuous Speech Recognition

    This thesis tackles the problem of modularity in large-vocabulary continuous speech recognition with the use of neural networks.