1,148 research outputs found

    Online Speech Recognition Using Recurrent Neural Networks

    Doctoral dissertation, Seoul National University Graduate School, Department of Electrical and Computer Engineering, February 2017. Advisor: Wonyong Sung. Recurrent neural networks (RNNs) have shown outstanding sequence-to-sequence modeling performance.
Thanks to recent advances in end-to-end training approaches for automatic speech recognition (ASR), RNNs can learn direct mapping functions from a sequence of audio features to a sequence of output characters or words without any intermediate phoneme or lexicon layers. So far, the majority of studies on end-to-end ASR have focused on raising recognition accuracy to the level of traditional state-of-the-art systems. However, even though end-to-end ASR models have reached the accuracy of traditional systems, their application has usually been limited to utterance-level recognition of pre-segmented audio rather than online recognition of continuous audio. This is because RNNs trained on pre-segmented audio do not easily generalize to very long audio streams. To address this problem, we propose an approach for training RNNs on sequences of virtually infinite length. Specifically, we describe an efficient GPU-based RNN training framework for the truncated backpropagation through time (BPTT) algorithm, which is suitable for online (continuous) training. Then, we present an online version of the connectionist temporal classification (CTC) loss computation algorithm, in which the original CTC loss is estimated within a partial sliding window. This modified CTC algorithm can be directly employed for truncated-BPTT-based RNN training. In addition, a fully RNN-based end-to-end online ASR model is proposed. The model is composed of an acoustic RNN with a CTC output layer and a character-level RNN language model (LM) augmented with a hierarchical structure. Prefix-tree-based beam search decoding is employed with a new beam pruning algorithm that prevents exponential growth of the tree. The model is free of phoneme and lexicon models and can decode infinitely long audio sequences. It also has a very small memory footprint compared to other end-to-end systems while showing competitive accuracy. Furthermore, we propose an improved character-level RNN LM with a hierarchical structure, which achieves better perplexity than a lightweight word-level RNN LM of comparable size.
When this RNN LM is applied to the proposed character-level online ASR system, better recognition accuracy can be achieved with a reduced amount of computation.

    Contents:
    1 Introduction
      1.1 Automatic Speech Recognition
        1.1.1 Traditional ASR
        1.1.2 End-to-End ASR with Recurrent Neural Networks
        1.1.3 Offline and Online ASR
      1.2 Scope of the Dissertation
        1.2.1 End-to-End Online ASR with RNNs
        1.2.2 Challenges and Contributions
    2 Flexible and Efficient RNN Training on GPUs
      2.1 Introduction
      2.2 Generalization
        2.2.1 Generalized RNN Structure
        2.2.2 Training
      2.3 Parallelization
        2.3.1 Intra-Stream Parallelism
        2.3.2 Inter-Stream Parallelism
      2.4 Experiments
      2.5 Concluding Remarks
    3 Online Sequence Training with Connectionist Temporal Classification
      3.1 Introduction
      3.2 Connectionist Temporal Classification
      3.3 Online Sequence Training
        3.3.1 Problem Definition
        3.3.2 Overview of the Proposed Approach
        3.3.3 CTC-TR: Standard CTC with Truncation
        3.3.4 CTC-EM: EM-Based Online CTC
      3.4 Training Continuously Running RNNs
      3.5 Parallel Training
      3.6 Experiments
        3.6.1 End-to-End Speech Recognition with RNNs
        3.6.2 Phoneme Recognition on TIMIT
      3.7 Concluding Remarks
    4 Character-Level Incremental Speech Recognition
      4.1 Introduction
      4.2 Models
        4.2.1 Acoustic Model
        4.2.2 Language Model
      4.3 Character-Level Beam Search
        4.3.1 Prefix-Tree-Based CTC Beam Search
        4.3.2 Pruning
      4.4 Experiments
      4.5 Concluding Remarks
    5 Character-Level Language Modeling with Hierarchical RNNs
      5.1 Introduction
      5.2 Related Work
        5.2.1 Character-Level Language Modeling with RNNs
        5.2.2 Character-Aware Word-Level Language Modeling
      5.3 RNNs with External Clock and Reset Signals
      5.4 Character-Level Language Modeling with a Hierarchical RNN
      5.5 Experiments
        5.5.1 Perplexity
        5.5.2 End-to-End Automatic Speech Recognition (ASR)
      5.6 Concluding Remarks
    6 Conclusion
    Bibliography
    Abstract in Korean
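
    To make the training scheme concrete, below is a minimal PyTorch sketch of truncated-BPTT training on a continuous input stream, the core mechanism the abstract describes. The model sizes, the random stream source, and the framewise cross-entropy loss are illustrative placeholders; in particular, the dissertation's online CTC (CTC-EM) loss is not reproduced here.

```python
import torch
import torch.nn as nn

# Minimal truncated-BPTT loop over a continuous input stream: the hidden
# state is carried across truncation windows but detached at every window
# boundary, so gradient computation (and memory) stays bounded while the
# network still conditions on the entire past.

class StreamRNN(nn.Module):
    def __init__(self, n_features=40, n_hidden=256, n_classes=30):
        super().__init__()
        self.rnn = nn.GRU(n_features, n_hidden, num_layers=2, batch_first=True)
        self.out = nn.Linear(n_hidden, n_classes)

    def forward(self, x, h=None):
        y, h = self.rnn(x, h)
        return self.out(y), h

model = StreamRNN()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

h = None        # hidden state persists across windows
window = 100    # truncation length in frames

for step in range(100):                      # stand-in for an endless stream
    # Placeholder stream source: one window of features and framewise labels.
    x = torch.randn(8, window, 40)           # (batch, time, features)
    t = torch.randint(0, 30, (8, window))    # framewise targets (illustrative)

    logits, h = model(x, h)
    h = h.detach()                           # truncate the gradient path here

    loss = criterion(logits.reshape(-1, 30), t.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

    The detach at the window boundary is what makes training on unsegmented, indefinitely long audio feasible: each update only backpropagates through the most recent window.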

    An autoencoder compression approach for accelerating large-scale inverse problems

    PDE-constrained inverse problems are some of the most challenging and computationally demanding problems in computational science today. The fine meshes required to accurately compute the PDE solution introduce an enormous number of parameters and demand large-scale computing resources, such as more processors and more memory, to solve such systems in a reasonable time. For inverse problems constrained by time-dependent PDEs, the adjoint method, often employed to efficiently compute gradients and higher-order derivatives, requires solving a time-reversed, so-called adjoint PDE that depends on the forward PDE solution at each timestep. This necessitates storing a high-dimensional forward solution vector at every timestep, a procedure that quickly exhausts the available memory. Several approaches that trade additional computation for a reduced memory footprint have been proposed to mitigate this bottleneck, including checkpointing and compression strategies. In this work, we propose a close-to-ideal, scalable compression approach using autoencoders to eliminate the need for checkpointing and substantial memory storage, thereby reducing both the time-to-solution and the memory requirements. We compare our approach with checkpointing and an off-the-shelf compression approach on an earth-scale, ill-posed seismic inverse problem. The results verify the expected close-to-ideal speedup for both the gradient and the Hessian-vector product using the proposed autoencoder compression approach. To highlight the usefulness of the proposed approach, we combine the autoencoder compression with the data-informed active subspace (DIAS) prior, showing how the DIAS method can be affordably extended to large-scale problems without the need for checkpointing or large memory.
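
    As a sketch of the compression workflow described above (not the paper's actual architecture), the snippet below stores only low-dimensional autoencoder codes of the forward states during the forward sweep and decodes them during the time-reversed adjoint sweep. The network shapes, the toy dynamics, and the function names are assumptions for illustration.

```python
import torch
import torch.nn as nn

# Illustrative snapshot compression for adjoint computations: the forward
# sweep stores only low-dimensional autoencoder codes of each state, and the
# time-reversed adjoint sweep decodes them on demand, so the full forward
# trajectory never has to be held in memory.

n_state, n_latent = 4096, 64  # illustrative dimensions

encoder = nn.Sequential(nn.Linear(n_state, 512), nn.ReLU(), nn.Linear(512, n_latent))
decoder = nn.Sequential(nn.Linear(n_latent, 512), nn.ReLU(), nn.Linear(512, n_state))

def forward_sweep(u0, n_steps, step_fn):
    """Time-step the PDE, keeping compressed snapshots instead of full states."""
    codes, u = [], u0
    for _ in range(n_steps):
        u = step_fn(u)
        with torch.no_grad():
            codes.append(encoder(u))          # store the latent code only
    return codes

def adjoint_sweep(codes, adjoint_step_fn, lam_T):
    """Run the adjoint PDE backward in time, decoding snapshots as needed."""
    lam = lam_T
    for code in reversed(codes):
        with torch.no_grad():
            u_approx = decoder(code)          # reconstructed forward state
        lam = adjoint_step_fn(lam, u_approx)
    return lam

# Toy demo with placeholder dynamics standing in for real PDE solvers.
codes = forward_sweep(torch.randn(n_state), n_steps=5, step_fn=lambda u: 0.99 * u)
lam0 = adjoint_sweep(codes, lambda lam, u: lam + 0.01 * u, torch.zeros(n_state))
```

    Unlike checkpointing, no forward states are recomputed during the backward sweep; the cost shifts to one encode per stored step and one decode per adjoint step.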

    Tiny Deep Learning Architectures Enabling Sensor-Near Acoustic Data Processing and Defect Localization

    The timely diagnosis of defects at their incipient stage of formation is crucial to extending the life cycle of technical appliances. This is the case for mechanically induced stress, whether due to long-term aging processes (e.g., corrosion) or in-operation forces (e.g., impact events), which may provoke detrimental damage such as cracks, disbonding, or delamination, most commonly accompanied by the release of acoustic energy. Such sources can be localized with acoustic emission (AE)-based inspection techniques through computation of the time of arrival (ToA), namely the time at which the mechanical wave released by the acoustic event arrives at the acquisition unit. However, accurate estimation of the ToA may be hampered by poor signal-to-noise ratios (SNRs), conditions under which standard statistical methods typically fail. In this work, two alternative deep learning methods are proposed for ToA retrieval from AE signals, namely a dilated convolutional neural network (DilCNN) and a capsule neural network for ToA (CapsToA). These methods have the additional benefit of being portable to resource-constrained microprocessors. Their performance has been extensively studied on both synthetic and experimental data, focusing on the problem of ToA identification for the case of a metallic plate. Results show that the two methods achieve localization up to 70% more precise than that yielded by conventional strategies, even when the SNR is severely compromised (i.e., down to 2 dB). Moreover, DilCNN and CapsToA have been implemented in a tiny machine learning environment and then deployed on microcontroller units, showing a negligible loss of performance with respect to offline realizations.
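
    A hedged sketch of the dilated-CNN idea for ToA picking follows: exponentially growing dilations give a 1-D network a large receptive field, and the ToA is read off as the sample with the highest onset score. The layer sizes and the synthetic waveforms are illustrative, not the paper's configuration.

```python
import torch
import torch.nn as nn

# Hypothetical dilated 1-D CNN for time-of-arrival (ToA) picking: the
# network scores every sample of the AE waveform, and the ToA estimate is
# the index of the highest onset score.

class DilCNN(nn.Module):
    def __init__(self, channels=16, n_layers=6):
        super().__init__()
        layers, in_ch = [], 1
        for i in range(n_layers):
            d = 2 ** i  # exponentially growing dilation -> large receptive field
            layers += [nn.Conv1d(in_ch, channels, kernel_size=3,
                                 dilation=d, padding=d), nn.ReLU()]
            in_ch = channels
        self.body = nn.Sequential(*layers)
        self.head = nn.Conv1d(channels, 1, kernel_size=1)  # per-sample onset score

    def forward(self, x):                        # x: (batch, 1, samples)
        return self.head(self.body(x)).squeeze(1)  # (batch, samples)

model = DilCNN()
wave = torch.randn(4, 1, 2048)       # synthetic AE waveforms (placeholder)
scores = model(wave)
toa_idx = scores.argmax(dim=1)       # estimated ToA (sample index) per waveform
```

    Dilated convolutions are a natural fit for microcontroller deployment here because they widen the receptive field without adding parameters or pooling stages.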

    A Review of Deep Learning Techniques for Speech Processing

    The field of speech processing has undergone a transformative shift with the advent of deep learning. The use of multiple processing layers has enabled models capable of extracting intricate features from speech data, paving the way for unprecedented advances in automatic speech recognition, text-to-speech synthesis, and emotion recognition. The power of deep learning techniques has opened up new avenues for research and innovation in speech processing, with far-reaching implications for a range of industries and applications. This review paper provides a comprehensive overview of the key deep learning models and their applications in speech-processing tasks. We begin by tracing the evolution of speech processing research, from early approaches, such as MFCC features and HMMs, to more recent advances in deep learning architectures, such as CNNs, RNNs, transformers, conformers, and diffusion models. We categorize the approaches and compare their strengths and weaknesses for solving speech-processing tasks. Furthermore, we extensively cover the speech-processing tasks, datasets, and benchmarks used in the literature and describe how different deep-learning networks have been applied to them. Additionally, we discuss the challenges and future directions of deep learning in speech processing, including the need for more parameter-efficient, interpretable models and the potential of deep learning for multimodal speech processing. By examining the field's evolution, comparing and contrasting different approaches, and highlighting future directions and challenges, we hope to inspire further research in this exciting and rapidly advancing field.
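
    As a small illustration of the classic front end the review starts from, the snippet below computes MFCC features with librosa; the 25 ms window and 10 ms hop are conventional choices, and the random signal is a stand-in for real speech.

```python
import numpy as np
import librosa

# Mel-frequency cepstral coefficients (MFCCs), the standard front end of
# HMM-era recognizers.  A random signal stands in for real speech here.

sr = 16_000
y = np.random.randn(sr * 2).astype(np.float32)   # 2 s of placeholder audio

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=400, hop_length=160)  # 25 ms window, 10 ms hop
print(mfcc.shape)  # (13, n_frames)
```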

    LeBenchmark 2.0: a Standardized, Replicable and Enhanced Framework for Self-supervised Representations of French Speech

    Self-supervised learning (SSL) is at the origin of unprecedented improvements in many domains, including computer vision and natural language processing. Speech processing has drastically benefited from SSL, as most current tasks in the domain are now approached with pre-trained models. This work introduces LeBenchmark 2.0, an open-source framework for assessing and building SSL-equipped French speech technologies. It includes documented, large-scale corpora with up to 14,000 hours of heterogeneous speech; ten pre-trained SSL wav2vec 2.0 models, ranging from 26 million to one billion learnable parameters, shared with the community; and an evaluation protocol made of six downstream tasks to complement existing benchmarks. LeBenchmark 2.0 also presents unique perspectives on pre-trained SSL models for speech, including an investigation of frozen versus fine-tuned downstream models and task-agnostic versus task-specific pre-trained models, as well as a discussion of the carbon footprint of large-scale model training.
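
    A minimal usage sketch, assuming the models are available on the Hugging Face Hub under the LeBenchmark namespace (the exact checkpoint id below is an assumption): load a pre-trained wav2vec 2.0 model with transformers and use it as a frozen feature extractor, one of the two downstream regimes the paper investigates.

```python
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

# Load a pre-trained French wav2vec 2.0 model and use it as a frozen encoder.
# The checkpoint id below is an assumption; substitute any LeBenchmark model.
MODEL_ID = "LeBenchmark/wav2vec2-FR-7K-large"  # hypothetical checkpoint id

extractor = Wav2Vec2FeatureExtractor.from_pretrained(MODEL_ID)
model = Wav2Vec2Model.from_pretrained(MODEL_ID).eval()

for p in model.parameters():                   # frozen regime: no fine-tuning
    p.requires_grad = False

speech = torch.randn(16_000).numpy()           # 1 s of placeholder audio, 16 kHz
inputs = extractor(speech, sampling_rate=16_000, return_tensors="pt")

with torch.no_grad():
    features = model(**inputs).last_hidden_state   # (1, frames, hidden_size)
```

    In the fine-tuned regime, the `requires_grad = False` loop would simply be omitted and a task head trained jointly with the encoder.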