
    S-TREE: Self-Organizing Trees for Data Clustering and Online Vector Quantization

    This paper introduces S-TREE (Self-Organizing Tree), a family of models that use unsupervised learning to construct hierarchical representations of data and online tree-structured vector quantizers. The S-TREE1 model, which features a new tree-building algorithm, can be implemented with various cost functions. An alternative implementation, S-TREE2, which uses a new double-path search procedure, is also developed. S-TREE2 implements an online procedure that approximates an optimal (unstructured) clustering solution while imposing a tree-structure constraint. The performance of the S-TREE algorithms is illustrated with data clustering and vector quantization examples, including a Gauss-Markov source benchmark and an image compression application. S-TREE performance on these tasks is compared with the standard tree-structured vector quantizer (TSVQ) and the generalized Lloyd algorithm (GLA). The image reconstruction quality of S-TREE2 approaches that of GLA while requiring less than 10% of the computing time. S-TREE1 and S-TREE2 also compare favorably with the standard TSVQ in both the time needed to create the codebook and the quality of image reconstruction. Office of Naval Research (N00014-95-10409, N00014-95-0G57)
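The operation shared by S-TREE and the standard TSVQ is encoding a vector by greedily descending the code tree. A minimal sketch of that single-path search, under an assumed dict-based node layout (`tsvq_encode` and the toy tree are illustrative, not the paper's data structures; S-TREE2's double-path search would keep the two best children at each level instead of one):

```python
def tsvq_encode(x, node):
    """Greedy single-path descent of a code tree: at each internal
    node, follow the child whose centroid is closest to x.
    Hypothetical node layout: {'centroid': [...], 'children': [...]}."""
    while node.get("children"):
        node = min(node["children"],
                   key=lambda c: sum((a - b) ** 2
                                     for a, b in zip(x, c["centroid"])))
    return node["centroid"]

# Toy 1-D tree: the root splits into codewords -1.0 and +1.0,
# and the right child is refined into 0.5 and 1.5.
tree = {"centroid": [0.0], "children": [
    {"centroid": [-1.0]},
    {"centroid": [1.0], "children": [
        {"centroid": [0.5]},
        {"centroid": [1.5]},
    ]},
]}
```

Each query then costs one distance computation per tree level rather than a scan of the whole codebook, which is why tree-structured quantizers trade a small loss in reconstruction quality for a large gain in speed.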

    Handling Massive N-Gram Datasets Efficiently

    This paper deals with two fundamental problems in handling large n-gram language models: indexing, that is, compressing the n-gram strings and associated satellite data without compromising their retrieval speed; and estimation, that is, computing the probability distribution of the strings from a large textual source. Regarding indexing, we describe compressed, exact and lossless data structures that simultaneously achieve high space reductions and no time degradation with respect to state-of-the-art solutions and related software packages. In particular, we present a compressed trie data structure in which each word following a context of fixed length k, i.e., its preceding k words, is encoded as an integer whose value is proportional to the number of words that follow that context. Since the number of words following a given context is typically very small in natural languages, we lower the space of the representation to compression levels never achieved before. Despite the significant savings in space, our technique introduces a negligible penalty at query time. Regarding estimation, we present a novel algorithm for estimating modified Kneser-Ney language models, which have emerged as the de facto choice for language modeling in both academia and industry thanks to their relatively low perplexity. Estimating such models from large textual sources poses the challenge of devising algorithms that make parsimonious use of the disk. The state-of-the-art algorithm uses three sorting steps in external memory; we show an improved construction that requires only one sorting step by exploiting the properties of the extracted n-gram strings.
    With an extensive experimental analysis performed on billions of n-grams, we show an average improvement of 4.5X in the total running time over the state-of-the-art approach. Comment: Published in ACM Transactions on Information Systems (TOIS), February 2019, Article No. 2.
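The context-based remapping behind the compressed trie can be illustrated as follows: instead of storing each word's global vocabulary ID, store its rank among the distinct successors of its context, so the stored integers are bounded by the (typically tiny) number of successors and compress well. A minimal in-memory sketch (`context_rank_encode` is a hypothetical helper, not the paper's implementation):

```python
from collections import defaultdict

def context_rank_encode(ngrams):
    """Replace each word's global ID with its rank among the distinct
    successors of its context; the stored integers are then bounded by
    the number of successors of that context."""
    successors = defaultdict(set)
    for *context, word in ngrams:
        successors[tuple(context)].add(word)
    ranks = {ctx: {w: i for i, w in enumerate(sorted(ws))}
             for ctx, ws in successors.items()}
    return [ranks[tuple(context)][word] for *context, word in ngrams]

trigrams = [("the", "quick", "fox"),
            ("the", "quick", "dog"),
            ("a", "lazy", "dog")]
codes = context_rank_encode(trigrams)  # small, context-local integers
```

Because the code values are small, a variable-length or fixed-width-per-context integer encoding can store them in far fewer bits than global vocabulary IDs would need.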

    Hashing for Similarity Search: A Survey

    Similarity search (nearest neighbor search) is the problem of retrieving, from a large database, the data items whose distances to a query item are smallest. Various methods have been developed to address this problem, and recently much effort has been devoted to approximate search. In this paper, we present a survey of one of the main solutions, hashing, which has been widely studied since the pioneering work on locality-sensitive hashing. We divide the hashing algorithms into two main categories: locality-sensitive hashing, which designs hash functions without exploring the data distribution, and learning to hash, which learns hash functions according to the data distribution. We review them from various aspects, including hash function design, distance measures, and search schemes in the hash-coding space.
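The first category can be illustrated with the classic random-hyperplane construction for cosine similarity: each random hyperplane contributes one bit of the code, and vectors at a small angle tend to receive similar codes. A minimal sketch (the function name and parameter choices are illustrative):

```python
import random

def hyperplane_hash(v, planes):
    """One bit per random hyperplane: the sign of the dot product.
    Vectors at a small angle tend to agree on most bits."""
    return tuple(int(sum(a * b for a, b in zip(v, p)) >= 0.0)
                 for p in planes)

random.seed(0)
planes = [[random.gauss(0.0, 1.0) for _ in range(3)] for _ in range(8)]

h_query = hyperplane_hash([1.0, 0.9, 1.1], planes)
h_near  = hyperplane_hash([1.1, 1.0, 0.9], planes)    # small angle to query
h_far   = hyperplane_hash([-1.0, -0.9, -1.1], planes)  # opposite direction
```

Candidates are then retrieved from the buckets whose codes match (or nearly match) the query's code, avoiding a linear scan; learning-to-hash methods instead fit the hash function to the data distribution rather than drawing it at random.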

    A quick search method for audio signals based on a piecewise linear representation of feature trajectories

    This paper presents a new method for quick similarity-based search through long unlabeled audio streams to detect and locate audio clips provided by users. The method involves feature-dimension reduction based on a piecewise linear representation of a sequential feature trajectory extracted from a long audio stream. Two techniques enable us to obtain the piecewise linear representation: dynamic segmentation of feature trajectories and a segment-based Karhunen-Loève (KL) transform. In principle, the proposed search method guarantees the same search results as a search without the proposed feature-dimension reduction. Experimental results indicate significant improvements in search speed. For example, the proposed method reduced the total search time to approximately 1/12 that of previous methods and detected queries in approximately 0.3 seconds from a 200-hour audio database. Comment: 20 pages, to appear in IEEE Transactions on Audio, Speech and Language Processing.
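The segmentation step can be sketched as a greedy scan that extends the current segment while every interior point stays close to the chord between the segment endpoints. This simplified variant stands in for the paper's dynamic segmentation procedure (its exact criterion is not reproduced here), and the segment-based KL transform applied afterwards is not shown:

```python
def segment_trajectory(points, tol):
    """Greedy segmentation of a 2-D trajectory: extend the current
    segment while every interior point lies within `tol` of the chord
    from the segment start to the candidate end point. Returns the
    indices of the segment boundaries."""
    breaks, start = [0], 0
    for end in range(2, len(points)):
        x0, y0 = points[start]
        x1, y1 = points[end]
        def deviation(i):
            x, y = points[i]
            t = (x - x0) / (x1 - x0) if x1 != x0 else 0.0
            return abs(y - (y0 + t * (y1 - y0)))
        if max(deviation(i) for i in range(start + 1, end)) > tol:
            breaks.append(end - 1)   # close the segment before `end`
            start = end - 1
    breaks.append(len(points) - 1)
    return breaks

# A trajectory that rises linearly and then falls: one corner at index 2.
points = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 1.0), (4.0, 0.0)]
breaks = segment_trajectory(points, tol=0.1)
```

Only the segment endpoints (and per-segment transform parameters) then need to be stored and compared, which is where the feature-dimension reduction comes from.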

    Similarity search in the blink of an eye with compressed indices

    Nowadays, data is represented by vectors. Retrieving the vectors, among millions or billions, that are similar to a given query is a ubiquitous problem of relevance to a wide range of applications. In this work, we present new techniques for creating faster and smaller indices to run these searches. To this end, we introduce a novel vector compression method, Locally-adaptive Vector Quantization (LVQ), that simultaneously reduces the memory footprint and improves search performance, with minimal impact on search accuracy. LVQ is designed to work optimally in conjunction with graph-based indices, reducing their effective bandwidth while enabling random-access-friendly fast similarity computations. Our experimental results show that LVQ, combined with key optimizations for graph-based indices in modern datacenter systems, establishes the new state of the art in terms of performance and memory footprint. For billions of vectors, LVQ outcompetes the second-best alternatives: (1) in the low-memory regime, by up to 20.7x in throughput with up to a 3x reduction in memory footprint, and (2) in the high-throughput regime, by 5.8x with 1.4x less memory.
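The "locally adaptive" idea can be sketched in its simplest scalar form: quantize each vector with its own per-vector range rather than a global one, so the codes stay 8 bits wide while the quantization error adapts to each vector's spread. This is a simplified stand-in for LVQ, which adds refinements beyond this sketch (`lvq_like_encode`/`lvq_like_decode` are hypothetical names):

```python
def lvq_like_encode(v, bits=8):
    """Per-vector scalar quantization: map each component to an
    integer in [0, 2**bits - 1] using this vector's own min/max
    as the quantization range."""
    lo, hi = min(v), max(v)
    scale = (hi - lo) / (2 ** bits - 1) or 1.0  # avoid 0 for constant vectors
    codes = [round((x - lo) / scale) for x in v]
    return codes, lo, scale

def lvq_like_decode(codes, lo, scale):
    """Reconstruct approximate float components from the codes."""
    return [lo + c * scale for c in codes]

v = [0.0, 0.5, 1.0, 0.25]
codes, lo, scale = lvq_like_encode(v)
recon = lvq_like_decode(codes, lo, scale)
```

Because each stored vector shrinks to one byte per dimension plus two scalars, a graph traversal reads far less memory per visited node, which is where the bandwidth reduction for graph-based indices comes from.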

    Video content analysis for intelligent forensics

    The networks of surveillance cameras installed in public places and on private premises continuously record video with the aim of detecting and preventing unlawful activities, which heightens the importance of video content analysis applications for both real-time (i.e. analytic) and post-event (i.e. forensic) analysis. This thesis focuses on four key aspects of video content analysis, namely: 1. moving object detection and recognition; 2. correction of colours in video frames and recognition of the colours of moving objects; 3. make and model recognition of vehicles and identification of their type; 4. detection and recognition of text in outdoor scenes. To address the first issue, the first part of the thesis presents a framework that efficiently detects and recognizes moving objects in videos, targeting the problem of object detection in the presence of complex backgrounds. The object detection part of the framework relies on a background modelling technique and a novel post-processing step in which the contours of the foreground regions (i.e. moving objects) are refined by classifying edge segments as belonging either to the background or to the foreground. Further, a novel feature descriptor is devised for classifying moving objects into humans, vehicles and background; it captures the texture information present in the silhouettes of foreground objects. To address the second issue, a framework for the correction and recognition of the true colours of objects in videos is presented, with novel noise reduction, colour enhancement and colour recognition stages. The colour recognition stage uses temporal information to reliably recognize the true colours of moving objects across multiple frames.
    The proposed framework is specifically designed to perform robustly on videos of poor quality caused by surrounding illumination, camera sensor imperfections and artefacts due to heavy compression. In the third part of the thesis, a framework for vehicle make and model recognition and type identification is presented. As part of this work, a novel feature representation technique for the distinctive representation of vehicle images has been developed; it uses dense feature description and a mid-level feature encoding scheme to capture the texture in the frontal view of vehicles. The proposed method is insensitive to minor in-plane rotation and skew within the image, and the framework can be extended to any number of vehicle classes without retraining. Another important contribution of this work is the publication of a comprehensive, up-to-date dataset of vehicle images to support future research in this domain. The problem of text detection and recognition in images is addressed in the last part of the thesis. A novel technique is proposed that exploits the colour information in the image to identify text regions; apart from detection, the colour information is also used to segment characters from words. The identified characters are recognized using shape features and supervised learning, and finally a lexicon-based alignment procedure finalizes the recognition of the strings present in word images. Extensive experiments have been conducted on benchmark datasets to analyse the performance of the proposed algorithms. The results show that the proposed moving object detection and recognition technique outperformed well-known baseline techniques, and the proposed framework for the correction and recognition of object colours in video frames achieved all the aforementioned goals.
    The performance analysis of the vehicle make and model recognition framework on multiple datasets has shown the strength and reliability of the technique in various scenarios, and the experimental results for the text detection and recognition framework on benchmark datasets have revealed the potential of the proposed scheme for accurate detection and recognition of text in the wild.
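The background modelling stage that the object detection framework builds on can be sketched with a classic exponential running average: the background estimate drifts toward each new frame, and pixels that differ from it by more than a threshold are flagged as foreground. This is a generic stand-in, not the thesis's algorithm; the threshold and learning rate are arbitrary illustrative choices:

```python
def update_background(background, frame, alpha=0.05, thresh=30.0):
    """Exponential running-average background model for grayscale
    images given as nested lists. Returns the updated background and
    a boolean foreground mask."""
    fg_mask = [[abs(f - b) > thresh for f, b in zip(frow, brow)]
               for frow, brow in zip(frame, background)]
    new_bg = [[(1.0 - alpha) * b + alpha * f for f, b in zip(frow, brow)]
              for frow, brow in zip(frame, background)]
    return new_bg, fg_mask

# One-row grayscale "image": the second pixel changes abruptly.
bg0 = [[100.0, 100.0]]
new_bg, mask = update_background(bg0, [[100.0, 200.0]])
```

The raw mask from such a model is noisy around object contours, which is what the thesis's edge-segment classification step is designed to refine.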

    End-to-End Neural Network-based Speech Recognition for Mobile and Embedded Devices

    Doctoral dissertation, Seoul National University Graduate School, Department of Electrical and Computer Engineering, College of Engineering, August 2020. Advisor: Wonyong Sung. Real-time automatic speech recognition (ASR) on mobile and embedded devices has been of great interest in recent years. Deep neural network-based ASR demands a large number of computations, while mobile devices have limited memory bandwidth and power. Server-based implementations are often employed, but they increase latency and raise privacy concerns; therefore, the need for on-device ASR systems is growing. Recurrent neural networks (RNNs) are often used as the ASR model. RNN implementations on embedded devices can suffer from excessive DRAM accesses, because the parameter size of a neural network usually exceeds that of the cache memory; moreover, the parameters of an RNN cannot be reused across multiple time steps because of its feedback structure. To solve this problem, multi-time-step parallelizable models are applied to speech recognition. The multi-time-step parallelization approach computes multiple output samples at a time with the parameters fetched from DRAM. Since the number of DRAM accesses is reduced in proportion to the number of parallelization steps, a high processing speed can be achieved for a parallelizable model. In this thesis, a connectionist temporal classification (CTC) model is constructed by combining simple recurrent units (SRUs) and depthwise 1-dimensional convolution layers for multi-time-step parallelization. Both character and word-piece models are developed for the CTC model, and the corresponding RNN-based language models are used for beam-search decoding. A competitive WER on the WSJ corpus is achieved with a total model size of approximately 15 MB. The system runs in real time on a single ARM core without a GPU or special hardware. A low-latency on-device speech recognition system with a simple gated convolutional network (SGCN) is also proposed.
    The SGCN shows competitive recognition accuracy even with 1M parameters. 8-bit quantization is applied to reduce the memory size and computation time. The proposed system supports online recognition with a 0.4 s latency limit and runs at 0.2 RTF using only a single 900 MHz CPU core. In addition, an attention-based model with a depthwise convolutional encoder is proposed. Convolutional encoders enable faster training and inference of attention models than recurrent neural network-based ones. However, convolutional models often require a very large receptive field to achieve high recognition accuracy, which increases not only the parameter size but also the computational cost and run-time memory footprint. A convolutional encoder with a short receptive field often suffers from looping or skipping problems; we believe this is due to the time invariance of convolutions, and we attempt to remedy the issue by adding positional information to the convolution-based encoder. It is shown that the word error rate (WER) of a convolutional encoder with a short receptive field can be reduced significantly by augmenting it with positional information, and visualization results are presented to demonstrate the effectiveness of incorporating positional information. A streaming end-to-end ASR model is also developed by applying monotonic chunkwise attention.
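The CTC acoustic models described above emit a per-frame distribution over labels plus a blank symbol, which a decoder collapses into a transcript. A minimal sketch of best-path (greedy) CTC decoding; the thesis uses beam search with RNN language models, so this greedy variant and the toy label set are illustrative only:

```python
def ctc_greedy_decode(logits, blank=0):
    """Best-path CTC decoding: pick the argmax label per frame,
    collapse consecutive repeats, then drop blanks."""
    best = [max(range(len(frame)), key=frame.__getitem__) for frame in logits]
    out, prev = [], blank
    for label in best:
        if label != blank and label != prev:
            out.append(label)
        prev = label
    return out

# Per-frame scores over a toy label set {0: blank, 1: 'a', 2: 'b'}.
logits = [[0.1, 0.8, 0.1],    # 'a'
          [0.1, 0.7, 0.2],    # 'a' again -> collapsed
          [0.9, 0.05, 0.05],  # blank
          [0.1, 0.1, 0.8]]    # 'b'
decoded = ctc_greedy_decode(logits)
```

Because each frame is decoded independently here, the per-frame scores for several time steps can be produced in one batched pass, which is exactly what the multi-time-step parallelization exploits.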