
    Tutorial: End-to-End Speech Translation

    Speech translation is the translation of speech in one language, typically into text in another, traditionally accomplished through a combination of automatic speech recognition and machine translation. Speech translation has attracted interest for many years, but the recent successful applications of deep learning to both individual tasks have enabled new opportunities through joint modeling, in what we today call 'end-to-end speech translation.' In this tutorial we will introduce the techniques used in cutting-edge research on speech translation. Starting from the traditional cascaded approach, we will give an overview of data sources and model architectures used to achieve state-of-the-art performance with end-to-end speech translation for both high- and low-resource languages. In addition, we will discuss methods to evaluate and analyze the proposed solutions, as well as the challenges faced when applying speech translation models in real-world applications.
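    To make the contrast concrete, the sketch below shows a toy direct (end-to-end) speech-to-translation model trained on speech paired with target-language text; a cascade would instead decode source-language text with an ASR model and pass it to a separately trained MT model. All module names, shapes, and layer sizes are assumptions made for this sketch, not the tutorial's reference implementation or any specific toolkit.

        # Toy direct speech translation model: speech features in, target-language tokens out.
        # Everything here is illustrative; real systems use deeper encoders and subword vocabularies.
        import torch
        import torch.nn as nn

        class SpeechEncoder(nn.Module):
            def __init__(self, n_mels=80, d=256):
                super().__init__()
                self.rnn = nn.LSTM(n_mels, d, num_layers=2, batch_first=True)
            def forward(self, x):                  # x: (batch, frames, n_mels) log-Mel features
                h, _ = self.rnn(x)
                return h                           # (batch, frames, d)

        class TextDecoder(nn.Module):
            def __init__(self, vocab=1000, d=256):
                super().__init__()
                self.emb = nn.Embedding(vocab, d)
                self.rnn = nn.LSTM(d, d, batch_first=True)
                self.attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
                self.out = nn.Linear(d, vocab)
            def forward(self, enc, prev_tokens):   # teacher forcing on target-language tokens
                q, _ = self.rnn(self.emb(prev_tokens))
                ctx, _ = self.attn(q, enc, enc)    # attend over the speech encoder states
                return self.out(ctx)               # (batch, tgt_len, vocab)

        enc, dec = SpeechEncoder(), TextDecoder()
        feats = torch.randn(2, 120, 80)            # two utterances of log-Mel features
        tgt = torch.randint(0, 1000, (2, 12))      # target-language token ids
        logits = dec(enc(feats), tgt[:, :-1])      # predict the next target token
        loss = nn.CrossEntropyLoss()(logits.reshape(-1, 1000), tgt[:, 1:].reshape(-1))
        # A cascade optimizes ASR and MT separately, so recognition errors propagate into
        # translation; the direct model is trained on the translation objective alone.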

    Speech and translation technologies for voice-over and audio description: final results of the ALST project

    The ALST project (FFI-201231024, funded by the Spanish Ministry of Economy) started in January 2013 with the aim of researching the implementation of speech technologies (speech recognition and speech synthesis) and translation technologies (machine translation) in two audiovisual transfer modes: audio description and voice-over. The project will end in December 2015, and the presentation will give an overview of its main objectives as well as its main results. Although limited in its funding and scope, the project has carried out innovative research in the following aspects: a) Implementation of automatic speech recognition and respeaking to generate faster transcripts: experiments have been carried out with professional transcribers to compare various working scenarios (manual transcription / automatic speech recognition / respeaking). b) Implementation of machine translation plus post-editing as an alternative to traditional audiovisual translation: experiments have been carried out using audio descriptions (English into Catalan) and also wildlife documentaries to be voiced over (English into Spanish). c) Implementation of text-to-speech systems instead of human voices, both in audio descriptions and documentaries. All experiments have in common the aim of obtaining objective data (generally linked to productivity gains in terms of time) but also subjective data on user experience and perceived quality. The presentation will not go into a detailed analysis of all experiments due to obvious time constraints, but will offer a general structured overview that will allow the audience to understand the main results of the project as well as possible future research avenues.

    Online Speech Recognition Using Recurrent Neural Networks

    Thesis (Ph.D.) -- Seoul National University Graduate School, Department of Electrical and Computer Engineering, February 2017. Advisor: Wonyong Sung. Recurrent neural networks (RNNs) have shown outstanding sequence-to-sequence modeling performance.
Thanks to recent advances in end-to-end training approaches for automatic speech recognition (ASR), RNNs can learn direct mapping functions from a sequence of audio features to a sequence of output characters or words without any intermediate phoneme or lexicon layers. So far, the majority of studies on end-to-end ASR have focused on raising the accuracy of speech recognition to the level of traditional state-of-the-art models. However, although end-to-end ASR models have reached the accuracy of traditional systems, their application has usually been limited to utterance-level speech recognition with pre-segmented audio rather than online speech recognition with continuous audio. This is because RNNs trained on segmented audio do not easily generalize to very long streams of audio. To address this problem, we propose an RNN training approach that operates on training sequences of virtually infinite length. Specifically, we describe an efficient GPU-based RNN training framework for the truncated backpropagation through time (BPTT) algorithm, which is suitable for online (continuous) training. Then, we present an online version of the connectionist temporal classification (CTC) loss computation algorithm, where the original CTC loss is estimated with a partial sliding window. This modified CTC algorithm can be directly employed for truncated-BPTT-based RNN training. In addition, a fully RNN-based end-to-end online ASR model is proposed. The model is composed of an acoustic RNN with CTC output and a character-level RNN language model that is augmented with a hierarchical structure. Prefix-tree-based beam search decoding is employed with a new beam pruning algorithm to prevent exponential growth of the tree. The model is free from phoneme or lexicon models and can be used for decoding infinitely long audio sequences. Also, this model has a very small memory footprint compared to other end-to-end systems while showing competitive accuracy. Furthermore, we propose an improved character-level RNN LM with a hierarchical structure. This character-level RNN LM shows improved perplexity compared to a lightweight word-level RNN LM of comparable size.
When this RNN LM is applied to the proposed character-level online ASR, better speech recognition accuracy can be achieved with a reduced amount of computation.
    Contents:
    1 Introduction: 1.1 Automatic Speech Recognition (1.1.1 Traditional ASR; 1.1.2 End-to-End ASR with Recurrent Neural Networks; 1.1.3 Offline and Online ASR); 1.2 Scope of the Dissertation (1.2.1 End-to-End Online ASR with RNNs; 1.2.2 Challenges and Contributions)
    2 Flexible and Efficient RNN Training on GPUs: 2.1 Introduction; 2.2 Generalization (2.2.1 Generalized RNN Structure; 2.2.2 Training); 2.3 Parallelization (2.3.1 Intra-Stream Parallelism; 2.3.2 Inter-Stream Parallelism); 2.4 Experiments; 2.5 Concluding Remarks
    3 Online Sequence Training with Connectionist Temporal Classification: 3.1 Introduction; 3.2 Connectionist Temporal Classification; 3.3 Online Sequence Training (3.3.1 Problem Definition; 3.3.2 Overview of the Proposed Approach; 3.3.3 CTC-TR: Standard CTC with Truncation; 3.3.4 CTC-EM: EM-Based Online CTC); 3.4 Training Continuously Running RNNs; 3.5 Parallel Training; 3.6 Experiments (3.6.1 End-to-End Speech Recognition with RNNs; 3.6.2 Phoneme Recognition on TIMIT); 3.7 Concluding Remarks
    4 Character-Level Incremental Speech Recognition: 4.1 Introduction; 4.2 Models (4.2.1 Acoustic Model; 4.2.2 Language Model); 4.3 Character-Level Beam Search (4.3.1 Prefix-Tree-Based CTC Beam Search; 4.3.2 Pruning); 4.4 Experiments; 4.5 Concluding Remarks
    5 Character-Level Language Modeling with Hierarchical RNNs: 5.1 Introduction; 5.2 Related Work (5.2.1 Character-Level Language Modeling with RNNs; 5.2.2 Character-Aware Word-Level Language Modeling); 5.3 RNNs with External Clock and Reset Signals; 5.4 Character-Level Language Modeling with a Hierarchical RNN; 5.5 Experiments (5.5.1 Perplexity; 5.5.2 End-to-End Automatic Speech Recognition); 5.6 Concluding Remarks
    6 Conclusion
    Bibliography
    Abstract in Korean
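    The streaming training idea in the dissertation can be illustrated with a minimal truncated-BPTT loop: the RNN state is carried forward across consecutive chunks of a long feature stream, while gradients are cut at chunk boundaries. The frame-level cross-entropy target below is only a stand-in so the sketch stays self-contained; the dissertation instead computes a modified (online) CTC loss over each truncated window. Feature sizes and hyperparameters are illustrative assumptions.

        # Truncated BPTT over a continuous stream: forward state flows through the whole
        # stream, but backpropagation never crosses a chunk boundary.
        import torch
        import torch.nn as nn

        n_feat, n_out, chunk = 40, 30, 100
        rnn = nn.LSTM(n_feat, 256, num_layers=2, batch_first=True)
        clf = nn.Linear(256, n_out)
        opt = torch.optim.Adam(list(rnn.parameters()) + list(clf.parameters()), lr=1e-3)

        stream = torch.randn(1, 10_000, n_feat)          # a very long (virtually unbounded) feature stream
        targets = torch.randint(0, n_out, (1, 10_000))   # stand-in frame-level targets
        state = None
        for t in range(0, stream.size(1), chunk):
            x, y = stream[:, t:t + chunk], targets[:, t:t + chunk]
            out, state = rnn(x, state)
            loss = nn.functional.cross_entropy(clf(out).transpose(1, 2), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
            # Detach so gradients stop here (truncation) while the hidden state still
            # carries context into the next chunk of the stream.
            state = tuple(s.detach() for s in state)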

    Robust speech recognition under noisy environments.

    Lee Siu Wa. Thesis (M.Phil.)--Chinese University of Hong Kong, 2004. Includes bibliographical references (leaves 116-121). Abstracts in English and Chinese.
    Contents:
    Abstract
    1 Introduction: 1.1 An Overview on Automatic Speech Recognition; 1.2 Thesis Outline
    2 Baseline Speech Recognition System: 2.1 Baseline Speech Recognition Framework; 2.2 Acoustic Feature Extraction (2.2.1 Speech Production and Source-Filter Model; 2.2.2 Review of Feature Representations; 2.2.3 Mel-frequency Cepstral Coefficients; 2.2.4 Energy and Dynamic Features); 2.3 Back-end Decoder; 2.4 English Digit String Corpus (AURORA2); 2.5 Baseline Recognition Experiment
    3 A Simple Recognition Framework with Model Selection: 3.1 Mismatch between Training and Testing Conditions; 3.2 Matched Training and Testing Conditions (3.2.1 Noise Type-Matching; 3.2.2 SNR-Matching; 3.2.3 Noise Type and SNR-Matching); 3.3 Recognition Framework with Model Selection
    4 Noise Spectral Estimation: 4.1 Introduction to Statistical Estimation Methods (4.1.1 Conventional Estimation Methods; 4.1.2 Histogram Technique); 4.2 Quantile-based Noise Estimation (4.2.1 Overview of Quantile-based Noise Estimation (QBNE); 4.2.2 Time-Frequency Quantile-based Noise Estimation (T-F QBNE); 4.2.3 Mainlobe-Resilient Time-Frequency Quantile-based Noise Estimation (M-R T-F QBNE)); 4.3 Estimation Performance Analysis; 4.4 Recognition Experiment with Model Selection
    5 Feature Compensation: Algorithm and Experiment: 5.1 Feature Deviation from Clean Speech (5.1.1 Deviation in MFCC Features; 5.1.2 Implications for Feature Compensation); 5.2 Overview of Conventional Compensation Methods; 5.3 Feature Compensation by In-phase Feature Induction (5.3.1 Motivation; 5.3.2 Methodology); 5.4 Compensation Framework for Magnitude Spectrum and Segmental Energy; 5.5 Recognition Experiments
    6 Conclusions: 6.1 Summary and Discussions; 6.2 Future Directions
    Bibliography
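    Of the techniques listed above, plain quantile-based noise estimation is simple enough to sketch: for each frequency bin, the noise power is estimated as a low-to-middle quantile of that bin's short-time power values over time, exploiting the fact that speech is absent in most bins for a large fraction of frames. The sketch below is a generic baseline under assumed frame and quantile settings; the thesis's time-frequency and mainlobe-resilient variants (T-F QBNE, M-R T-F QBNE) refine this idea and are not shown.

        # Generic quantile-based noise estimation over a power spectrogram.
        import numpy as np

        def stft_power(signal, frame_len=512, hop=256):
            """Hann-windowed framing followed by |rFFT|^2, returning (frames, bins)."""
            window = np.hanning(frame_len)
            n_frames = 1 + (len(signal) - frame_len) // hop
            frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                               for i in range(n_frames)])
            return np.abs(np.fft.rfft(frames, axis=1)) ** 2

        def qbne(power_spec, q=0.5):
            """Noise power per bin = the q-quantile of that bin's values over time."""
            return np.quantile(power_spec, q, axis=0)

        rng = np.random.default_rng(0)
        noisy = rng.normal(size=16000)            # one second of illustrative 16 kHz audio
        noise_psd = qbne(stft_power(noisy))       # per-bin noise estimate for later compensation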

    Deep Learning for Environmentally Robust Speech Recognition: An Overview of Recent Developments

    Eliminating the negative effect of non-stationary environmental noise is a long-standing research topic for automatic speech recognition that still remains an important challenge. Data-driven supervised approaches, including ones based on deep neural networks, have recently emerged as potential alternatives to traditional unsupervised approaches and, with sufficient training, can alleviate the shortcomings of the unsupervised methods in various real-life acoustic environments. In this light, we review recently developed, representative deep learning approaches for tackling non-stationary additive and convolutional degradation of speech, with the aim of providing guidelines for those involved in the development of environmentally robust speech recognition systems. We separately discuss single- and multi-channel techniques developed for the front-end and back-end of speech recognition systems, as well as joint front-end and back-end training frameworks.
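    As one concrete example of the single-channel front-end techniques surveyed, the sketch below trains a small network to predict a time-frequency mask from noisy magnitude features; the masked spectrum would then feed the ASR back-end, and in a joint framework the recognition loss would also be back-propagated through the mask estimator. Layer sizes, shapes, and the signal-approximation objective are assumptions for illustration, not a specific system from the survey.

        # Mask-based single-channel enhancement front-end (illustrative).
        import torch
        import torch.nn as nn

        n_bins = 257
        masker = nn.Sequential(
            nn.Linear(n_bins, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, n_bins), nn.Sigmoid(),    # mask in [0, 1] per time-frequency cell
        )

        noisy = torch.rand(8, 100, n_bins)           # |STFT| of noisy speech (batch, frames, bins)
        clean = torch.rand(8, 100, n_bins)           # parallel clean reference for supervised training
        mask = masker(torch.log1p(noisy))            # predict the mask from log-compressed input
        enhanced = mask * noisy                      # apply the mask to the noisy magnitudes
        loss = nn.functional.mse_loss(enhanced, clean)
        loss.backward()
        # Joint front-end/back-end training would replace (or combine) this loss with the
        # acoustic model's objective computed on the enhanced features.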

    Towards Affordable Disclosure of Spoken Word Archives

    This paper presents and discusses ongoing work aiming at affordable disclosure of real-world spoken word archives in general, and in particular of a collection of recorded interviews with Dutch survivors of World War II concentration camp Buchenwald. Given such collections, the least we want to be able to provide is search at different levels and a flexible way of presenting results. Strategies for automatic annotation based on speech recognition – supporting, e.g., within-document search – are outlined and discussed with respect to the Buchenwald interview collection. In addition, usability aspects of the spoken word search are discussed on the basis of our experiences with the online Buchenwald web portal. It is concluded that, although user feedback is generally fairly positive, automatic annotation performance is still far from satisfactory and requires additional research.
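    A minimal illustration of how time-aligned speech recognition output enables within-document search: each recognized word keeps its start time, so a query maps directly to playback positions inside an interview. The data layout and field names below are assumptions for the sketch, not the Buchenwald portal's actual implementation.

        # Index time-stamped ASR hypotheses so a word query returns jump-in points.
        from collections import defaultdict

        def build_index(asr_words):
            """asr_words: iterable of (word, start_seconds) pairs from a recognizer."""
            index = defaultdict(list)
            for word, start in asr_words:
                index[word.lower()].append(start)
            return index

        transcript = [("Buchenwald", 12.4), ("camp", 13.1), ("winter", 40.2), ("camp", 41.0)]
        index = build_index(transcript)
        print(index["camp"])    # -> [13.1, 41.0]: positions where playback can start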