518 research outputs found

    End-to-End Neural Network-based Speech Recognition for Mobile and Embedded Devices

    Get PDF
    Doctoral dissertation (Ph.D.) -- Seoul National University Graduate School: College of Engineering, Department of Electrical and Computer Engineering, August 2020. Advisor: Wonyong Sung.

    Real-time automatic speech recognition (ASR) on mobile and embedded devices has attracted great interest in recent years. Deep neural network-based ASR demands a large number of computations, while the memory bandwidth and battery power of mobile devices are limited. A server-based implementation is often employed instead, but this increases latency and raises privacy concerns, so the need for on-device ASR systems is growing. Recurrent neural networks (RNNs) are often used as ASR models, but an RNN implementation on embedded devices can suffer from excessive DRAM accesses: the parameter size of a neural network usually exceeds the cache size, and the parameters of an RNN cannot be reused across multiple time steps because of its feedback structure. To solve this problem, multi-time-step parallelizable models are applied to speech recognition. The multi-time-step parallelization approach computes multiple output samples at a time with the parameters fetched from DRAM; since the number of DRAM accesses is reduced in proportion to the number of parallelized steps, a high processing speed can be achieved for a parallelizable model. In this thesis, a connectionist temporal classification (CTC) model is constructed by combining simple recurrent units (SRUs) and depthwise 1-dimensional convolution layers for multi-time-step parallelization. Both character and word-piece models are developed for the CTC model, and the corresponding RNN-based language models are used for beam-search decoding. A competitive word error rate (WER) on the WSJ corpus is achieved with a total model size of approximately 15 MB, and the system operates at real-time speed on a single ARM core without a GPU or special hardware. A low-latency on-device speech recognition system with a simple gated convolutional network (SGCN) is also proposed.
The SGCN shows competitive recognition accuracy even with only 1M parameters. 8-bit quantization is applied to reduce the memory size and computation time. The proposed system supports online recognition with a 0.4 s latency limit and operates at a 0.2 real-time factor (RTF) on a single 900 MHz CPU core. In addition, an attention-based model with a depthwise convolutional encoder is proposed. Convolutional encoders enable faster training and inference of attention models than recurrent neural network-based ones. However, convolutional models often require a very large receptive field to achieve high recognition accuracy, which increases not only the parameter size but also the computational cost and run-time memory footprint. A convolutional encoder with a short receptive field often suffers from looping or skipping problems, which we attribute to the time invariance of convolution. We attempt to remedy this issue by adding positional information to the convolution-based encoder, and show that the WER of a convolutional encoder with a short receptive field can be reduced significantly by augmenting it with positional information. Visualization results are presented to demonstrate the effectiveness of incorporating positional information. A streaming end-to-end ASR model is also developed by applying monotonic chunkwise attention.
์Œ์„ฑ ์ธ์‹ ์‹œ์Šคํ…œ์— ์ฃผ๋กœ ์‚ฌ์šฉ๋˜๋Š” ๋ชจ๋ธ์€ ์žฌ๊ท€ํ˜• ์ธ๊ณต ์‹ ๊ฒฝ๋ง์ด๋‹ค. ์žฌ๊ท€ํ˜• ์ธ๊ณต ์‹ ๊ฒฝ๋ง์˜ ๋ชจ๋ธ ํฌ๊ธฐ๋Š” ๋ณดํ†ต ์บ์‹œ์˜ ํฌ๊ธฐ๋ณด๋‹ค ํฌ๊ณ  ํ”ผ๋“œ๋ฐฑ ๊ตฌ์กฐ ๋•Œ๋ฌธ์— ์žฌ์‚ฌ์šฉ์ด ์–ด๋ ต๊ธฐ ๋•Œ๋ฌธ์— ๋งŽ์€ DRAM ์ ‘๊ทผ์„ ํ•„์š”๋กœ ํ•œ๋‹ค. ์ด๋Ÿฌํ•œ ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด ๋‹ค์ค‘ ์‹œ๊ฐ„์˜ ์ž…๋ ฅ์—๋Œ€ํ•ด ๋ณ‘๋ ฌํ™” ๊ฐ€๋Šฅํ•œ ๋ชจ๋ธ์„ ์ด์šฉํ•œ ์Œ์„ฑ ์ธ์‹ ์‹œ์Šคํ…œ์„ ์ œ์•ˆํ•œ๋‹ค. ๋‹ค์ค‘ ์‹œ๊ฐ„ ๋ณ‘๋ ฌํ™” ๊ธฐ๋ฒ•์€ ํ•œ ๋ฒˆ์˜ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ์œผ๋กœ ์—ฌ๋Ÿฌ ์‹œ๊ฐ„์˜ ์ถœ๋ ฅ์„ ๋™์‹œ์— ๊ณ„์‚ฐํ•˜๋Š” ๋ฐฉ๋ฒ•์ด๋‹ค. ๋ณ‘๋ ฌํ™” ์ˆ˜์— ๋”ฐ๋ผ DRAM ์ ‘๊ทผ ํšŸ์ˆ˜๋ฅผ ์ค„์ผ ์ˆ˜ ์žˆ๊ธฐ ๋•Œ๋ฌธ์—, ๋ณ‘๋ ฌํ™” ๊ฐ€๋Šฅํ•œ ๋ชจ๋ธ์— ๋Œ€ํ•˜์—ฌ ๋น ๋ฅธ ์—ฐ์‚ฐ์ด ๊ฐ€๋Šฅํ•˜๋‹ค. ๋‹จ์ˆœ ์žฌ๊ท€ ์œ ๋‹›๊ณผ 1์ฐจ์› ์ปจ๋ฒŒ๋ฃจ์…˜์„ ์ด์šฉํ•œ CTC ๋ชจ๋ธ์„ ์ œ์‹œํ•˜์˜€๋‹ค. ๋ฌธ์ž์™€ ๋‹จ์–ด ์กฐ๊ฐ ์ˆ˜์ค€์˜ ๋ชจ๋ธ์ด ๊ฐœ๋ฐœ๋˜์—ˆ๋‹ค. ๊ฐ ์ถœ๋ ฅ ๋‹จ์œ„์— ํ•ด๋‹นํ•˜๋Š” ์žฌ๊ท€ํ˜• ์‹ ๊ฒฝ๋ง ๊ธฐ๋ฐ˜ ์–ธ์–ด ๋ชจ๋ธ์„ ์ด์šฉํ•˜์—ฌ ๋””์ฝ”๋”ฉ์— ์‚ฌ์šฉ๋˜์—ˆ๋‹ค. ์ „์ฒด 15MB์˜ ๋ฉ”๋ชจ๋ฆฌ ํฌ๊ธฐ๋กœ WSJ ์—์„œ ๋†’์€ ์ˆ˜์ค€์˜ ์ธ์‹ ์„ฑ๋Šฅ์„ ์–ป์—ˆ์œผ๋ฉฐ GPU๋‚˜ ๊ธฐํƒ€ ํ•˜๋“œ์›จ์–ด ์—†์ด 1๊ฐœ์˜ ARM CPU ์ฝ”์–ด๋กœ ์‹ค์‹œ๊ฐ„ ์ฒ˜๋ฆฌ๋ฅผ ๋‹ฌ์„ฑํ•˜์˜€๋‹ค. ๋˜ํ•œ ๋‹จ์ˆœ ์ปจ๋ฒŒ๋ฃจ์…˜ ์ธ๊ณต ์‹ ๊ฒฝ๋ง (SGCN)์„ ์ด์šฉํ•œ ๋‚ฎ์€ ์ง€์—ฐ์‹œ๊ฐ„์„ ๊ฐ€์ง€๋Š” ์Œ์„ฑ์ธ์‹ ์‹œ์Šคํ…œ์„ ๊ฐœ๋ฐœํ•˜์˜€๋‹ค. SGCN์€ 1M์˜ ๋งค์šฐ ๋‚ฎ์€ ๋ณ€์ˆ˜ ๊ฐฏ์ˆ˜๋กœ๋„ ๊ฒฝ์Ÿ๋ ฅ ์žˆ๋Š” ์ธ์‹ ์ •ํ™•๋„๋ฅผ ๋ณด์—ฌ์ค€๋‹ค. ์ถ”๊ฐ€์ ์œผ๋กœ 8-bit ์–‘์žํ™”๋ฅผ ์ ์šฉํ•˜์—ฌ ๋ฉ”๋ชจ๋ฆฌ ํฌ๊ธฐ์™€ ์—ฐ์‚ฐ ์‹œ๊ฐ„์„ ๊ฐ์†Œ ์‹œ์ผฐ๋‹ค. ํ•ด๋‹น ์‹œ์Šคํ…œ์€ 0.4์ดˆ์˜ ์ด๋ก ์  ์ง€์—ฐ์‹œ๊ฐ„์„ ๊ฐ€์ง€๋ฉฐ 900MHz์˜ CPU ์ƒ์—์„œ 0.2์˜ RTF๋กœ ๋™์ž‘ํ•˜์˜€๋‹ค. ์ถ”๊ฐ€์ ์œผ๋กœ, ๊นŠ์ด๋ณ„ ์ปจ๋ฒŒ๋ฃจ์…˜ ์ธ์ฝ”๋”๋ฅผ ์ด์šฉํ•œ ์–ดํ…์…˜ ๊ธฐ๋ฐ˜ ๋ชจ๋ธ์ด ๊ฐœ๋ฐœ๋˜์—ˆ๋‹ค. ์ปจ๋ฒŒ๋ฃจ์…˜ ๊ธฐ๋ฐ˜์˜ ์ธ์ฝ”๋”๋Š” ์žฌ๊ท€ํ˜• ์ธ๊ณต ์‹ ๊ฒฝ๋ง ๊ธฐ๋ฐ˜ ๋ชจ๋ธ๋ณด๋‹ค ๋น ๋ฅธ ์ฒ˜๋ฆฌ ์†๋„๋ฅผ ๊ฐ€์ง„๋‹ค. ํ•˜์ง€๋งŒ ์ปจ๋ฒŒ๋ฃจ์…˜ ๋ชจ๋ธ์€ ๋†’์€ ์„ฑ๋Šฅ์„ ์œ„ํ•ด์„œ ํฐ ์ž…๋ ฅ ๋ฒ”์œ„๋ฅผ ํ•„์š”๋กœ ํ•œ๋‹ค. 
์ด๋Š” ๋ชจ๋ธ ํฌ๊ธฐ ๋ฐ ์—ฐ์‚ฐ๋Ÿ‰, ๊ทธ๋ฆฌ๊ณ  ๋™์ž‘ ์‹œ ๋ฉ”๋ชจ๋ฆฌ ์†Œ๋ชจ๋ฅผ ์ฆ๊ฐ€ ์‹œํ‚จ๋‹ค. ์ž‘์€ ํฌ๊ธฐ์˜ ์ž…๋ ฅ ๋ฒ”์œ„๋ฅผ ๊ฐ€์ง€๋Š” ์ปจ๋ฒŒ๋ฃจ์…˜ ์ธ์ฝ”๋”๋Š” ์ถœ๋ ฅ์˜ ๋ฐ˜๋ณต์ด๋‚˜ ์ƒ๋žต์œผ๋กœ ์ธํ•˜์—ฌ ๋†’์€ ์˜ค์ฐจ์œจ์„ ๊ฐ€์ง„๋‹ค. ์ด๊ฒƒ์€ ์ปจ๋ฒŒ๋ฃจ์…˜์˜ ์‹œ๊ฐ„ ๋ถˆ๋ณ€์„ฑ ๋•Œ๋ฌธ์œผ๋กœ ์—ฌ๊ฒจ์ง€๋ฉฐ, ์ด ๋ฌธ์ œ๋ฅผ ์œ„์น˜ ์ธ์ฝ”๋”ฉ ๋ฒกํ„ฐ๋ฅผ ์ด์šฉํ•˜์—ฌ ํ•ด๊ฒฐํ•˜์˜€๋‹ค. ์œ„์น˜ ์ •๋ณด๋ฅผ ์ด์šฉํ•˜์—ฌ ์ž‘์€ ํฌ๊ธฐ์˜ ํ•„ํ„ฐ๋ฅผ ๊ฐ€์ง€๋Š” ์ปจ๋ฒŒ๋ฃจ์…˜ ๋ชจ๋ธ์˜ ์„ฑ๋Šฅ์„ ๋†’์ผ ์ˆ˜ ์žˆ์Œ์„ ๋ณด์˜€๋‹ค. ๋˜ํ•œ ์œ„์น˜ ์ •๋ณด๊ฐ€ ๊ฐ€์ง€๋Š” ์˜ํ–ฅ์„ ์‹œ๊ฐํ™” ํ•˜์˜€๋‹ค. ํ•ด๋‹น ๋ฐฉ๋ฒ•์„ ๋‹จ์กฐ ์–ดํ…์…˜์„ ์ด์šฉํ•œ ๋ชจ๋ธ์— ํ™œ์šฉํ•˜์—ฌ ์ปจ๋ฒŒ๋ฃจ์…˜ ๊ธฐ๋ฐ˜์˜ ์ŠคํŠธ๋ฆฌ๋ฐ ๊ฐ€๋Šฅํ•œ ์Œ์„ฑ ์ธ์‹ ์‹œ์Šคํ…œ์„ ๊ฐœ๋ฐœํ•˜์˜€๋‹ค.1 Introduction 1 1.1 End-to-End Automatic Speech Recognition with Neural Networks . . 1 1.2 Challenges on On-device Implementation of Neural Network-based ASR 2 1.3 Parallelizable Neural Network Architecture 3 1.4 Scope of Dissertation 3 2 Simple Recurrent Units for CTC-based End-to-End Speech Recognition 6 2.1 Introduction 6 2.2 Related Works 8 2.3 Speech Recognition Algorithm 9 2.3.1 Acoustic modeling 10 2.3.2 Character-based model 12 2.3.3 Word piece-based model 14 2.3.4 Decoding 14 2.4 Experimental Results 15 2.4.1 Acoustic models 15 2.4.2 Word piece based speech recognition 22 2.4.3 Execution time analysis 25 2.5 Concluding Remarks 27 3 Low-Latency Lightweight Streaming Speech Recognition with 8-bit Quantized Depthwise Gated Convolutional Neural Networks 28 3.1 Introduction 28 3.2 Simple Gated Convolutional Networks 30 3.2.1 Model structure 30 3.2.2 Multi-time-step parallelization 31 3.3 Training CTC AM with SGCN 34 3.3.1 Regularization with symmetrical weight noise injection 34 3.3.2 8-bit quantization 34 3.4 Experimental Results 36 3.4.1 Experimental setting 36 3.4.2 Results on WSJ eval92 38 3.4.3 Implementation on the embedded system 38 3.5 Concluding Remarks 39 4 Effect of Adding Positional Information on Convolutional Neural 
Networks for End-to-End Speech Recognition 41 4.1 Introduction 41 4.2 Related Works 43 4.3 Model Description 45 4.4 Experimental Results 46 4.4.1 Effect of receptive field size 46 4.4.2 Visualization 49 4.4.3 Comparison with other models 53 4.5 Concluding Remarks 53 5 Convolution-based Attention Model with Positional Encoding for Streaming Speech Recognition 55 5.1 Introduction 55 5.2 Related Works 58 5.3 End-to-End Model for Speech Recognition 61 5.3.1 Model description 61 5.3.2 Monotonic chunkwise attention 62 5.3.3 Positional encoding 63 5.4 Experimental Results 64 5.4.1 Effect of positional encoding 66 5.4.2 Comparison with other models 68 5.4.3 Execution time analysis 70 5.5 Concluding Remarks 71 6 Conclusion 72 Abstract (In Korean) 86Docto
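The multi-time-step parallelization in the thesis abstract hinges on the SRU's structure: all matrix products depend only on the input, so they can be batched across every time step with a single fetch of the weights from DRAM, leaving only a cheap elementwise recurrence that must run sequentially. A minimal NumPy sketch of this idea (the gate formulation follows the published SRU; variable names and shapes here are illustrative, not taken from the thesis implementation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sru_forward(X, W, Wf, Wr, bf, br, c0):
    """Simple Recurrent Unit over a whole utterance.

    X: (T, d) input frames; W, Wf, Wr are (d, d) weight matrices.
    The three matrix products below touch the parameters once for the
    entire sequence, so weight-related DRAM traffic does not grow with T;
    only the elementwise recurrence over c is inherently sequential.
    """
    U = X @ W.T                   # candidate activations, all T steps at once
    F = sigmoid(X @ Wf.T + bf)    # forget gates, all T steps at once
    R = sigmoid(X @ Wr.T + br)    # reset gates, all T steps at once
    T_steps, d = X.shape
    H = np.empty((T_steps, d))
    c = c0
    for t in range(T_steps):      # elementwise-only recurrence
        c = F[t] * c + (1.0 - F[t]) * U[t]
        H[t] = R[t] * np.tanh(c) + (1.0 - R[t]) * X[t]
    return H
```

In contrast, an LSTM's gates depend on the previous hidden state, so its matrix products cannot be hoisted out of the time loop this way.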

    A rocket-borne data-manipulation experiment using a microprocessor

    Get PDF
    The development of a data-manipulation experiment using a Z-80 microprocessor is described. The instrumentation is included in the payloads of two Nike Apache sounding rockets used in an investigation of energetic particle fluxes. The data from an array of solid-state detectors and an electrostatic analyzer are processed to give the energy spectrum as a function of pitch angle. The experiment performed well in its first flight test: Nike Apache 14.543 was launched from Wallops Island at 2315 EST on 19 June 1978. The system was designed to be easily adaptable to other data-manipulation requirements, and some suggestions for further development are included.
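The onboard reduction described above amounts to accumulating per-particle events into a two-dimensional energy-versus-pitch-angle histogram. A rough sketch of that reduction in NumPy (all names, units, and bin choices are illustrative assumptions, not taken from the flight software):

```python
import numpy as np

def energy_spectrum_by_pitch(energies, pitch_angles, e_edges, a_edges):
    """Bin particle events into an energy x pitch-angle count array.

    energies and pitch_angles are per-event arrays (e.g., keV and
    degrees); e_edges and a_edges are the bin boundaries. Returns a
    2-D array of counts: rows index energy bins, columns index
    pitch-angle bins.
    """
    spectrum, _, _ = np.histogram2d(energies, pitch_angles,
                                    bins=[e_edges, a_edges])
    return spectrum
```

On the actual Z-80, a reduction like this would be an in-place increment of a counter table rather than a vectorized call, but the data product is the same.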

    Turbo decoder VLSI implementations for multi-standards wireless communication systems

    Get PDF

    ESPnet-ONNX: Bridging a Gap Between Research and Production

    Full text link
    In the field of deep learning, researchers often focus on inventing novel neural network models and improving benchmarks. In contrast, application developers are interested in making models suitable for actual products, which involves optimizing a model for faster inference and adapting it to various platforms (e.g., C++ and Python). In this work, to fill the gap between the two, we establish an effective procedure for optimizing a PyTorch-based research-oriented model for deployment, taking ESPnet, a widely used toolkit for speech processing, as an instance. We introduce different techniques to ESPnet, including converting a model into the ONNX format, fusing nodes in a graph, and quantizing parameters, which lead to an approximately 1.3-2x speedup in various tasks (i.e., ASR, TTS, speech translation, and spoken language understanding) while keeping performance without any additional training. Our ESPnet-ONNX will be publicly available at https://github.com/espnet/espnet_onnx
    Comment: Accepted to APSIPA ASC 202
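Of the techniques listed, parameter quantization is the simplest to illustrate in isolation: weights are stored as 8-bit integers plus a per-tensor scale and dequantized on the fly, cutting memory size roughly fourfold versus float32. A minimal sketch of symmetric per-tensor quantization (illustrative only; the actual ESPnet-ONNX pipeline relies on ONNX Runtime's quantization tooling rather than hand-rolled code like this):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor 8-bit quantization.

    Stores int8 weights plus one float scale; the dequantized value is
    w_q * scale. The largest-magnitude weight maps to +/-127, and the
    rounding error of every element is at most scale / 2.
    """
    scale = np.abs(w).max() / 127.0
    w_q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return w_q, scale

def dequantize(w_q, scale):
    return w_q.astype(np.float32) * scale
```

Dynamic quantization of this kind needs no calibration data, which matches the paper's claim of speedups "without any additional training."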

    Context- and Template-Based Compression for Efficient Management of Data Models in Resource-Constrained Systems

    Get PDF
    The Cyber Physical Systems (CPS) paradigm is based on the deployment of interconnected heterogeneous devices and systems, so interoperability is at the heart of any CPS architecture design. In this sense, the adoption of standard and generic data formats for data representation and communication, e.g., XML or JSON, effectively addresses the interoperability problem among heterogeneous systems. Nevertheless, the verbosity of those standard data formats usually demands system resources that may overload the resource-constrained devices typically deployed in CPS. In this work we present Context- and Template-based Compression (CTC), a data compression approach targeted at resource-constrained devices, which reduces the resources needed to transmit, store, and process data models. Additionally, we provide a benchmark evaluation and comparison with current implementations of the Efficient XML Interchange (EXI) processor, which is promoted by the World Wide Web Consortium (W3C) and is the most prominent XML compression mechanism nowadays. Interestingly, the results show that CTC outperforms EXI implementations in terms of memory usage and speed while keeping similar compression rates. In conclusion, CTC is shown to be a good candidate for managing standard data model representation formats in CPS composed of resource-constrained devices.
    Research partially supported by the European Union Horizon 2020 Programme under Grant Agreement Number H2020-EeB-2015/680708 - HIT2GAP (Highly Innovative building control Tools Tackling the energy performance GAP); also partially supported by the Department of Education, Universities and Research of the Basque Government under Grant IT980-16 and by the Spanish Research Council under Grant TIN2016-79897-P.
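The core idea, agreeing on a template in advance so that only values need to be transmitted, can be sketched as follows. This is a toy illustration of the principle only, not the CTC algorithm itself, which also exploits shared context between endpoints; all names here are invented for the example:

```python
import json
import zlib

def compress_with_template(record, template_keys):
    """Encode only the values of a record, in the agreed key order.

    Because both endpoints know template_keys, the verbose JSON keys
    never travel over the wire; zlib then shrinks the value list.
    """
    values = [record[k] for k in template_keys]
    return zlib.compress(json.dumps(values).encode("utf-8"))

def decompress_with_template(blob, template_keys):
    """Rebuild the full record from the value list and the template."""
    values = json.loads(zlib.decompress(blob))
    return dict(zip(template_keys, values))
```

The payload shrinks because the structural overhead (keys, braces) is amortized into a one-time agreement, which is exactly what makes verbose formats like XML and JSON tolerable on constrained devices.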

    Design of a GF(64)-LDPC Decoder Based on the EMS Algorithm

    No full text
    This paper presents the architecture, performance, and implementation results of a serial GF(64)-LDPC decoder based on a reduced-complexity version of the Extended Min-Sum algorithm. The main contributions of this work concern the variable node processing, the codeword decision, and the elementary check node processing. Post-synthesis area results show that the decoder occupies less than 20% of a Virtex 4 FPGA for a decoding throughput of 2.95 Mbps. The implemented decoder performs within 0.7 dB of the Belief Propagation algorithm for different code lengths and rates. Moreover, the proposed architecture can be easily adapted to decode very high Galois field orders, such as GF(4096) or higher, by slightly modifying a marginal part of the design.
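For intuition, the binary min-sum check-node rule that the Extended Min-Sum generalizes to GF(q) message vectors can be written compactly. This is a simplified binary sketch for orientation, not the GF(64) elementary check node of the paper:

```python
import numpy as np

def min_sum_check_node(llrs):
    """Binary min-sum check-node update.

    For each edge, the outgoing message's sign is the product of the
    other incoming signs, and its magnitude is the minimum of the other
    incoming magnitudes, computed for all edges at once by tracking the
    two smallest magnitudes.
    """
    llrs = np.asarray(llrs, dtype=float)
    signs = np.sign(llrs)
    total_sign = np.prod(signs)          # product over all edges
    mags = np.abs(llrs)
    order = np.argsort(mags)
    min1, min2 = mags[order[0]], mags[order[1]]
    # each edge excludes itself: the argmin edge gets the second minimum
    out_mags = np.where(np.arange(len(llrs)) == order[0], min2, min1)
    return total_sign * signs * out_mags  # total_sign * sign_i removes edge i's sign
```

The EMS algorithm applies the same "exclude self, keep the best few candidates" structure to truncated GF(q) message lists, which is where its complexity reduction comes from.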