655 research outputs found

    A Study of All-Convolutional Encoders for Connectionist Temporal Classification

    Connectionist temporal classification (CTC) is a popular sequence prediction approach for automatic speech recognition that is typically used with models based on recurrent neural networks (RNNs). We explore whether deep convolutional neural networks (CNNs) can be used effectively instead of RNNs as the "encoder" in CTC. CNNs lack an explicit representation of the entire sequence, but have the advantage that they are much faster to train. We present an exploration of CNNs as encoders for CTC models, in the context of character-based (lexicon-free) automatic speech recognition. In particular, we explore a range of one-dimensional convolutional layers, which are particularly efficient. We compare the performance of our CNN-based models against typical RNN-based models in terms of training time, decoding time, model size and word error rate (WER) on the Switchboard Eval2000 corpus. We find that our CNN-based models are close in performance to LSTMs, while not matching them, and are much faster to train and decode. Comment: Accepted to ICASSP-201
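The encoder described in the abstract maps filterbank frames to per-frame character posteriors with stacked 1-D convolutions and is trained with the CTC loss. Below is a minimal PyTorch sketch of that idea; the layer count, widths, kernel size, and alphabet size are illustrative assumptions rather than the configuration evaluated in the paper.

```python
# Minimal sketch (PyTorch) of a 1-D all-convolutional CTC encoder.
# Layer widths, depth, and feature sizes are illustrative assumptions,
# not the configuration used in the paper.
import torch
import torch.nn as nn

class ConvCTCEncoder(nn.Module):
    def __init__(self, n_mels=40, hidden=256, n_chars=29, n_layers=8):
        super().__init__()
        layers, in_ch = [], n_mels
        for _ in range(n_layers):
            layers += [nn.Conv1d(in_ch, hidden, kernel_size=5, padding=2),
                       nn.BatchNorm1d(hidden), nn.ReLU()]
            in_ch = hidden
        self.conv = nn.Sequential(*layers)
        self.out = nn.Conv1d(hidden, n_chars, kernel_size=1)  # per-frame character logits

    def forward(self, feats):               # feats: (batch, n_mels, time)
        return self.out(self.conv(feats))   # (batch, n_chars, time)

# CTC training step on dummy data: the encoder emits frame-level character
# posteriors and nn.CTCLoss marginalizes over all alignments.
model = ConvCTCEncoder()
feats = torch.randn(2, 40, 300)                       # two utterances, 300 frames each
log_probs = model(feats).permute(2, 0, 1).log_softmax(dim=-1)   # (time, batch, n_chars)
targets = torch.randint(1, 29, (2, 50))               # character indices, 0 reserved for blank
loss = nn.CTCLoss(blank=0)(log_probs, targets,
                           torch.tensor([300, 300]),  # input lengths
                           torch.tensor([50, 50]))    # target lengths
loss.backward()
```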

    Making Good on LSTMs' Unfulfilled Promise

    LSTMs promise much to financial time-series analysis, temporal and cross-sectional inference, but we find that they do not deliver in a real-world financial management task. We examine an alternative called Continual Learning (CL), a memory-augmented approach, which can provide transparent explanations, i.e., which memory did what and when. This work has implications for many financial applications including credit, time-varying fairness in decision making, and more. We make three important new observations. Firstly, as well as being more explainable, time-series CL approaches outperform LSTMs as well as a simple sliding window learner using feed-forward neural networks (FFNN). Secondly, we show that CL based on a sliding window learner (FFNN) is more effective than CL based on a sequential learner (LSTM). Thirdly, we examine how real-world, time-series noise impacts several similarity approaches used in CL memory addressing. We provide these insights using an approach called Continual Learning Augmentation (CLA) tested on a complex real-world problem, emerging market equities investment decision making. CLA provides a test-bed as it can be based on different types of time-series learners, allowing testing of LSTM and FFNN learners side by side. CLA is also used to test several distance approaches used in a memory recall-gate: Euclidean distance (ED), dynamic time warping (DTW), auto-encoders (AE) and a novel hybrid approach, warp-AE. We find that ED underperforms DTW and AE, but warp-AE shows the best overall performance in a real-world financial task.
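To make the memory recall-gate idea concrete, here is an illustrative NumPy sketch of scoring stored time-series keys against a query with either Euclidean distance or dynamic time warping; the function names and memory layout are hypothetical and not taken from the CLA implementation.

```python
# Illustrative sketch (NumPy) of the kind of similarity scoring a memory
# recall-gate can use: plain Euclidean distance vs. dynamic time warping.
import numpy as np

def euclidean(a, b):
    """Distance between two equal-length 1-D time-series windows."""
    return float(np.linalg.norm(a - b))

def dtw(a, b):
    """Classic O(len(a)*len(b)) dynamic-time-warping distance."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[n, m])

def recall(memory, query, metric=dtw):
    """Return the stored (key, value) pair whose key is closest to the query."""
    return min(memory, key=lambda kv: metric(kv[0], query))

# The query is a noisy, slightly time-shifted version of one stored pattern;
# each metric picks whichever memory entry it judges closest.
base = np.sin(np.linspace(0, 3, 32))
memory = [(base, "regime A"), (np.random.randn(32), "regime B")]
query = np.sin(np.linspace(0.3, 3.3, 32)) + 0.05 * np.random.randn(32)
print(recall(memory, query, metric=euclidean)[1], recall(memory, query, metric=dtw)[1])
```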

    Representation Learning: A Review and New Perspectives

    The success of machine learning algorithms generally depends on data representation, and we hypothesize that this is because different representations can entangle and hide more or less the different explanatory factors of variation behind the data. Although specific domain knowledge can be used to help design representations, learning with generic priors can also be used, and the quest for AI is motivating the design of more powerful representation-learning algorithms implementing such priors. This paper reviews recent work in the area of unsupervised feature learning and deep learning, covering advances in probabilistic models, auto-encoders, manifold learning, and deep networks. This motivates longer-term unanswered questions about the appropriate objectives for learning good representations, for computing representations (i.e., inference), and the geometrical connections between representation learning, density estimation, and manifold learning.

    End-to-End Neural Network-based Speech Recognition for Mobile and Embedded Devices

    Thesis (Ph.D.) -- Seoul National University Graduate School, College of Engineering, Department of Electrical and Computer Engineering, August 2020. Advisor: Wonyong Sung.

    Real-time automatic speech recognition (ASR) on mobile and embedded devices has been of great interest in recent years. Deep neural network-based automatic speech recognition demands a large amount of computation, while the memory bandwidth and battery capacity of mobile devices are limited. A server-based implementation is often employed instead, but this increases latency and raises privacy concerns, so the need for on-device ASR systems is growing. Recurrent neural networks (RNNs) are often used as the ASR model. An RNN implementation on embedded devices can suffer from excessive DRAM accesses, because the parameter size of a neural network usually exceeds the cache size; moreover, the parameters of an RNN cannot be reused across multiple time steps because of its feedback structure. To solve this problem, multi-time-step parallelizable models are applied to speech recognition. The multi-time-step parallelization approach computes multiple output samples at a time with parameters fetched once from DRAM. Since the number of DRAM accesses is reduced in proportion to the number of parallelized steps, a high processing speed can be achieved for such models. In this thesis, a connectionist temporal classification (CTC) model is constructed by combining simple recurrent units (SRUs) and depthwise 1-dimensional convolution layers for multi-time-step parallelization. Both character and word-piece models are developed for the CTC model, and the corresponding RNN-based language models are used for beam-search decoding. A competitive WER on the WSJ corpus is achieved with a total model size of approximately 15 MB, and the system runs in real time on a single ARM CPU core without a GPU or special hardware. A low-latency on-device speech recognition system with a simple gated convolutional network (SGCN) is also proposed. The SGCN shows competitive recognition accuracy even with 1M parameters. 8-bit quantization is applied to reduce memory size and computation time. The proposed system performs online recognition with a 0.4 s latency limit and runs at a real-time factor (RTF) of 0.2 on a single 900 MHz CPU core. In addition, an attention-based model with a depthwise convolutional encoder is proposed. Convolutional encoders enable faster training and inference of attention models than recurrent neural network-based ones. However, convolutional models often require a very large receptive field to achieve high recognition accuracy, which increases not only the parameter size but also the computational cost and run-time memory footprint. A convolutional encoder with a short receptive field often suffers from looping or skipping problems, which we attribute to the time-invariance of convolutions. We attempt to remedy this issue by adding positional information to the convolution-based encoder, and show that the word error rate (WER) of a convolutional encoder with a short receptive field can be reduced significantly in this way. Visualization results demonstrate the effectiveness of incorporating positional information. A streaming end-to-end ASR model is also developed by applying monotonic chunkwise attention.

    Korean abstract (translated): Developing speech recognition systems that run in real time on mobile and embedded devices has recently attracted great interest. Deep neural network speech recognition requires a large amount of computation, whereas the memory bandwidth and power of mobile devices are limited. Because of these limitations, server-based implementations are commonly used, but they cause latency and privacy problems, so the demand for on-device speech recognition systems is growing. The models mainly used for speech recognition are recurrent neural networks; their parameters are usually larger than the cache and, because of the feedback structure, are hard to reuse, so many DRAM accesses are required. To solve this problem, we propose a speech recognition system based on models that can be parallelized over multiple time steps. Multi-time-step parallelization computes the outputs of several time steps simultaneously with a single memory fetch, and since the number of DRAM accesses shrinks in proportion to the degree of parallelization, fast computation is possible for parallelizable models. We present a CTC model built from simple recurrent units and 1-dimensional convolutions, with both character-level and word-piece-level variants, decoded with RNN-based language models matching each output unit. With a total memory size of 15 MB, high recognition performance is obtained on WSJ, and real-time processing is achieved on a single ARM CPU core without a GPU or other hardware. A low-latency speech recognition system using a simple gated convolutional network (SGCN) is also developed; the SGCN shows competitive recognition accuracy with as few as 1M parameters, and 8-bit quantization further reduces memory size and computation time. The system has a theoretical latency of 0.4 s and runs at an RTF of 0.2 on a 900 MHz CPU. Furthermore, an attention-based model with a depthwise convolutional encoder is developed. Convolution-based encoders are faster than recurrent neural network-based models, but convolutional models need a large receptive field for high performance, which increases the model size, computation, and run-time memory consumption. A convolutional encoder with a small receptive field suffers from repeated or skipped outputs, which we attribute to the time-invariance of convolution; this problem is addressed with positional encoding vectors. We show that positional information improves the performance of convolutional models with small filters, visualize its effect, and apply the method to a monotonic-attention model to build a convolution-based streaming speech recognition system.
    Table of contents:
    1 Introduction
      1.1 End-to-End Automatic Speech Recognition with Neural Networks
      1.2 Challenges on On-device Implementation of Neural Network-based ASR
      1.3 Parallelizable Neural Network Architecture
      1.4 Scope of Dissertation
    2 Simple Recurrent Units for CTC-based End-to-End Speech Recognition
      2.1 Introduction
      2.2 Related Works
      2.3 Speech Recognition Algorithm
        2.3.1 Acoustic modeling
        2.3.2 Character-based model
        2.3.3 Word piece-based model
        2.3.4 Decoding
      2.4 Experimental Results
        2.4.1 Acoustic models
        2.4.2 Word piece based speech recognition
        2.4.3 Execution time analysis
      2.5 Concluding Remarks
    3 Low-Latency Lightweight Streaming Speech Recognition with 8-bit Quantized Depthwise Gated Convolutional Neural Networks
      3.1 Introduction
      3.2 Simple Gated Convolutional Networks
        3.2.1 Model structure
        3.2.2 Multi-time-step parallelization
      3.3 Training CTC AM with SGCN
        3.3.1 Regularization with symmetrical weight noise injection
        3.3.2 8-bit quantization
      3.4 Experimental Results
        3.4.1 Experimental setting
        3.4.2 Results on WSJ eval92
        3.4.3 Implementation on the embedded system
      3.5 Concluding Remarks
    4 Effect of Adding Positional Information on Convolutional Neural Networks for End-to-End Speech Recognition
      4.1 Introduction
      4.2 Related Works
      4.3 Model Description
      4.4 Experimental Results
        4.4.1 Effect of receptive field size
        4.4.2 Visualization
        4.4.3 Comparison with other models
      4.5 Concluding Remarks
    5 Convolution-based Attention Model with Positional Encoding for Streaming Speech Recognition
      5.1 Introduction
      5.2 Related Works
      5.3 End-to-End Model for Speech Recognition
        5.3.1 Model description
        5.3.2 Monotonic chunkwise attention
        5.3.3 Positional encoding
      5.4 Experimental Results
        5.4.1 Effect of positional encoding
        5.4.2 Comparison with other models
        5.4.3 Execution time analysis
      5.5 Concluding Remarks
    6 Conclusion
    Abstract (In Korean)
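The positional-information remedy mentioned in the thesis abstract can be as simple as adding position embeddings to the convolutional encoder's input. A sketch of that idea with sinusoidal encodings follows; the encoding form and injection point are assumptions for illustration and may differ from the thesis's actual design.

```python
# Sketch (PyTorch): add sinusoidal position embeddings to the frame features
# so that a convolution stack with a short receptive field is no longer
# purely time-invariant. The encoding form is an illustrative assumption.
import torch

def sinusoidal_positions(t_max, dim):
    pos = torch.arange(t_max, dtype=torch.float32).unsqueeze(1)    # (t_max, 1)
    idx = torch.arange(0, dim, 2, dtype=torch.float32)             # (dim/2,)
    angle = pos / torch.pow(torch.tensor(10000.0), idx / dim)      # (t_max, dim/2)
    pe = torch.zeros(t_max, dim)
    pe[:, 0::2] = torch.sin(angle)
    pe[:, 1::2] = torch.cos(angle)
    return pe                                                      # (t_max, dim)

feats = torch.randn(4, 500, 256)                 # (batch, time, dim) encoder inputs
feats = feats + sinusoidal_positions(500, 256)   # broadcast over the batch
# `feats` then passes through the convolutional encoder as usual.
```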

    Deep Contextualized Acoustic Representations For Semi-Supervised Speech Recognition

    We propose a novel approach to semi-supervised automatic speech recognition (ASR). We first exploit a large amount of unlabeled audio data via representation learning, where we reconstruct a temporal slice of filterbank features from past and future context frames. The resulting deep contextualized acoustic representations (DeCoAR) are then used to train a CTC-based end-to-end ASR system using a smaller amount of labeled audio data. In our experiments, we show that systems trained on DeCoAR consistently outperform ones trained on conventional filterbank features, giving 42% and 19% relative improvement over the baseline on WSJ eval92 and LibriSpeech test-clean, respectively. Our approach can drastically reduce the amount of labeled data required; unsupervised training on LibriSpeech followed by supervision with 100 hours of labeled data achieves performance on par with training directly on all 960 hours. Pre-trained models and code will be released online. Comment: Accepted to ICASSP 2020 (oral)
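A simplified sketch of the DeCoAR-style pretext task, reconstructing a masked temporal slice of filterbank features from surrounding frames, is given below; the slice length, model size, and L1 reconstruction loss used here are illustrative assumptions rather than the paper's exact setup.

```python
# Simplified sketch (PyTorch): predict a hidden temporal slice of filterbank
# features from the surrounding frames with a bidirectional recurrent encoder.
# Slice length, model size, and the L1 loss are illustrative assumptions.
import torch
import torch.nn as nn

class SliceReconstructor(nn.Module):
    def __init__(self, n_mels=80, hidden=512):
        super().__init__()
        self.encoder = nn.LSTM(n_mels, hidden, num_layers=2,
                               batch_first=True, bidirectional=True)
        self.decoder = nn.Linear(2 * hidden, n_mels)

    def forward(self, feats):                  # feats: (batch, time, n_mels)
        ctx, _ = self.encoder(feats)           # contextualized representations
        return self.decoder(ctx), ctx          # reconstruction + features for downstream ASR

feats = torch.randn(8, 200, 80)                # unlabeled filterbank frames
masked = feats.clone()
masked[:, 90:110] = 0.0                        # hide a temporal slice
recon, ctx = SliceReconstructor()(masked)
loss = nn.functional.l1_loss(recon[:, 90:110], feats[:, 90:110])
loss.backward()
# After pretraining, the contextualized features (ctx) replace raw filterbanks
# as input to a CTC-based ASR model trained on the smaller labeled subset.
```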

    Recurrent Attention Models for Depth-Based Person Identification

    We present an attention-based model that reasons on human body shape and motion dynamics to identify individuals in the absence of RGB information, hence in the dark. Our approach leverages unique 4D spatio-temporal signatures to address the identification problem across days. Formulated as a reinforcement learning task, our model is based on a combination of convolutional and recurrent neural networks with the goal of identifying small, discriminative regions indicative of human identity. We demonstrate that our model produces state-of-the-art results on several published datasets given only depth images. We further study the robustness of our model towards viewpoint, appearance, and volumetric changes. Finally, we share insights gleaned from interpretable 2D, 3D, and 4D visualizations of our model's spatio-temporal attention. Comment: Computer Vision and Pattern Recognition (CVPR) 201
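A rough PyTorch sketch of the glimpse-style recurrent attention loop behind such models follows: a small CNN encodes a cropped region of a depth frame, an LSTM cell accumulates the glimpses, and a location head proposes where to look next. The crop size, network sizes, and deterministic location updates are illustrative assumptions; the paper's model is trained with reinforcement learning over 4D spatio-temporal volumes.

```python
# Rough sketch (PyTorch) of a recurrent attention ("glimpse") loop over a
# single depth frame. All sizes and the identity count are hypothetical.
import torch
import torch.nn as nn

class GlimpseAttention(nn.Module):
    def __init__(self, crop=24, hidden=128, n_ids=50):
        super().__init__()
        self.crop = crop
        self.cnn = nn.Sequential(nn.Conv2d(1, 16, 5, stride=2), nn.ReLU(),
                                 nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
                                 nn.Flatten(), nn.LazyLinear(hidden))
        self.rnn = nn.LSTMCell(hidden, hidden)
        self.where = nn.Linear(hidden, 2)     # next (row, col) glimpse centre in [0, 1]
        self.who = nn.Linear(hidden, n_ids)   # identity logits

    def forward(self, depth, n_glimpses=4):   # depth: (batch, 1, H, W)
        b, _, H, W = depth.shape
        h = c = torch.zeros(b, self.rnn.hidden_size)
        loc = torch.full((b, 2), 0.5)         # start by looking at the centre
        for _ in range(n_glimpses):
            rows = (loc[:, 0] * (H - self.crop)).long()
            cols = (loc[:, 1] * (W - self.crop)).long()
            crops = torch.stack([depth[i, :, r:r + self.crop, k:k + self.crop]
                                 for i, (r, k) in enumerate(zip(rows.tolist(), cols.tolist()))])
            h, c = self.rnn(self.cnn(crops), (h, c))   # accumulate glimpse evidence
            loc = torch.sigmoid(self.where(h))         # propose the next region to attend to
        return self.who(h)                    # identity prediction after the last glimpse

logits = GlimpseAttention()(torch.randn(2, 1, 120, 64))
```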