S-TREE: Self-Organizing Trees for Data Clustering and Online Vector Quantization
This paper introduces S-TREE (Self-Organizing Tree), a family of models that use unsupervised learning to construct hierarchical representations of data and online tree-structured vector quantizers. The S-TREE1 model, which features a new tree-building algorithm, can be implemented with various cost functions. An alternative implementation, S-TREE2, which uses a new double-path search procedure, is also developed. S-TREE2 implements an online procedure that approximates an optimal (unstructured) clustering solution while imposing a tree-structure constraint. The performance of the S-TREE algorithms is illustrated with data clustering and vector quantization examples, including a Gauss-Markov source benchmark and an image compression application. S-TREE performance on these tasks is compared with the standard tree-structured vector quantizer (TSVQ) and the generalized Lloyd algorithm (GLA). The image reconstruction quality with S-TREE2 approaches that of GLA while taking less than 10% of computer time. S-TREE1 and S-TREE2 also compare favorably with the standard TSVQ in both the time needed to create the codebook and the quality of image reconstruction.
Office of Naval Research (N00014-95-10409, N00014-95-0G57)
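The speed advantage of tree-structured quantizers over exhaustive search comes from greedy descent: at each node, only the children are compared, so encoding cost grows with tree depth rather than codebook size. A minimal sketch of that descent (illustrative only; the S-TREE tree-building rules are described in the paper itself):

```python
import numpy as np

def tsvq_encode(x, node):
    """Greedily descend a tree of centroids: at each node, follow the
    child whose centroid is nearest to x, until reaching a leaf."""
    path = []
    while node.get("children"):
        dists = [np.linalg.norm(x - c["centroid"]) for c in node["children"]]
        best = int(np.argmin(dists))
        path.append(best)
        node = node["children"][best]
    return node["centroid"], path

# Toy 1-D tree: the root splits into two clusters, each with two leaves.
tree = {"centroid": np.array([0.0]), "children": [
    {"centroid": np.array([-1.0]), "children": [
        {"centroid": np.array([-1.5])}, {"centroid": np.array([-0.5])}]},
    {"centroid": np.array([1.0]), "children": [
        {"centroid": np.array([0.5])}, {"centroid": np.array([1.5])}]},
]}

code, path = tsvq_encode(np.array([0.6]), tree)
# 0.6 descends right (nearest to 1.0), then left (nearest to 0.5)
```

The returned path doubles as the variable-length code emitted by a tree-structured quantizer.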
Handling Massive N-Gram Datasets Efficiently
This paper deals with two fundamental problems concerning the handling of
large n-gram language models: indexing, that is, compressing the n-gram strings
and associated satellite data without compromising their retrieval speed; and
estimation, that is, computing the probability distribution of the strings from
a large textual source. Regarding the problem of indexing, we describe
compressed, exact and lossless data structures that achieve, at the same time,
high space reductions and no time degradation with respect to state-of-the-art
solutions and related software packages. In particular, we present a compressed
trie data structure in which each word following a context of fixed length k,
i.e., its preceding k words, is encoded as an integer whose value is
proportional to the number of words that follow such context. Since the number
of words following a given context is typically very small in natural
languages, we lower the space of representation to compression levels that were
never achieved before. Despite the significant savings in space, our technique
introduces a negligible penalty at query time. Regarding the problem of
estimation, we present a novel algorithm for estimating modified Kneser-Ney
language models, that have emerged as the de-facto choice for language modeling
in both academia and industry, thanks to their relatively low perplexity
performance. Estimating such models from large textual sources poses the
challenge of devising algorithms that make a parsimonious use of the disk. The
state-of-the-art algorithm uses three sorting steps in external memory: we show
an improved construction that requires only one sorting step thanks to
exploiting the properties of the extracted n-gram strings. With an extensive
experimental analysis performed on billions of n-grams, we show an average
improvement of 4.5X on the total running time of the state-of-the-art approach.
Comment: Published in ACM Transactions on Information Systems (TOIS), February 2019, Article No. 2
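The key observation behind the trie compression described above is that the identifier of a word can be made local to its context, so the stored integers stay small. A simplified sketch of that remapping idea (the paper's actual structure layers succinct integer codes on top of this; names here are illustrative):

```python
from collections import defaultdict

def remap_to_context_ranks(ngrams):
    """Assign each word following a context a small integer id local to
    that context. Ids are bounded by the number of distinct words seen
    after the context, which is typically tiny in natural language."""
    vocab_per_ctx = defaultdict(dict)
    remapped = []
    for *context, word in ngrams:
        ctx = tuple(context)
        local = vocab_per_ctx[ctx]
        if word not in local:
            local[word] = len(local)   # next free local id for this context
        remapped.append((ctx, local[word]))
    return remapped, vocab_per_ctx

ngrams = [("the", "cat"), ("the", "dog"), ("a", "cat"), ("the", "cat")]
remapped, _ = remap_to_context_ranks(ngrams)
# local ids stay small even when the global vocabulary is huge
```

Because the remapped values are drawn from a small range per context, they compress far better than global vocabulary ids.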
Hashing for Similarity Search: A Survey
Similarity search (nearest neighbor search) is the problem of finding, in a
large database, the data items whose distances to a query item are the smallest.
Various methods have been developed to address this problem, and recently much
effort has been devoted to approximate search. In this paper, we present a
survey of one of the main solutions, hashing, which has been widely studied
since the pioneering work on locality-sensitive hashing. We divide the hashing
algorithms into two main categories: locality-sensitive hashing, which designs
hash functions without exploring the data distribution, and learning to hash,
which learns hash functions according to the data distribution. We review them
from various aspects, including hash function design, distance measures, and
search schemes in the hash coding space.
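A concrete instance of the first category is random-hyperplane hashing for cosine similarity: each bit records which side of a random hyperplane a vector falls on, so vectors at a small angle agree on most bits. A minimal sketch:

```python
import numpy as np

def simhash(vectors, n_bits, seed=0):
    """Random-hyperplane LSH: each bit is the sign of a projection onto a
    random direction; similar (small-angle) vectors collide on most bits."""
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((n_bits, vectors.shape[1]))
    return (vectors @ planes.T > 0).astype(np.uint8)

def hamming(a, b):
    return int(np.sum(a != b))

x = np.array([[1.0, 0.2, 0.0]])
y = np.array([[0.9, 0.3, 0.1]])   # nearly parallel to x
z = np.array([[-1.0, 0.0, 0.5]])  # far from x
hx, hy, hz = simhash(x, 64)[0], simhash(y, 64)[0], simhash(z, 64)[0]
# hamming(hx, hy) is expected to be much smaller than hamming(hx, hz)
```

The expected fraction of differing bits equals the angle between the two vectors divided by pi, which is what makes Hamming distance in the code space a proxy for cosine distance.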
A quick search method for audio signals based on a piecewise linear representation of feature trajectories
This paper presents a new method for a quick similarity-based search through
long unlabeled audio streams to detect and locate audio clips provided by
users. The method involves feature-dimension reduction based on a piecewise
linear representation of a sequential feature trajectory extracted from a long
audio stream. Two techniques enable us to obtain a piecewise linear
representation: the dynamic segmentation of feature trajectories and the
segment-based Karhunen-Loève (KL) transform. The proposed search method
guarantees, in principle, the same search results as a search without the
proposed feature-dimension reduction. Experimental results indicate
significant improvements in search speed. For example, the proposed method
reduced the total search time to approximately 1/12 that of previous methods
and detected queries in approximately 0.3 seconds from a 200-hour audio
database.
Comment: 20 pages, to appear in IEEE Transactions on Audio, Speech and Language Processing
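The "dynamic segmentation" step above amounts to splitting a feature trajectory wherever a straight line can no longer approximate it within a tolerance. A generic greedy sketch of that idea (not the paper's exact procedure; the error bound and function names are illustrative):

```python
import numpy as np

def piecewise_linear_segments(traj, max_err):
    """Greedy dynamic segmentation: extend the current segment while the
    straight line between its endpoints stays within max_err of every
    intermediate sample; otherwise start a new segment."""
    breakpoints = [0]
    start = 0
    for end in range(2, len(traj)):
        seg = traj[start:end + 1]
        t = np.linspace(0.0, 1.0, len(seg))
        line = seg[0] + t * (seg[-1] - seg[0])
        if np.max(np.abs(seg - line)) > max_err:
            breakpoints.append(end - 1)  # close the segment one sample back
            start = end - 1
    breakpoints.append(len(traj) - 1)
    return breakpoints

# A ramp up then down splits cleanly at the peak.
traj = np.array([0.0, 1.0, 2.0, 3.0, 2.0, 1.0, 0.0])
bps = piecewise_linear_segments(traj, max_err=0.1)
```

Each segment can then be summarized by its two endpoints (plus a segment-local transform), which is the source of the dimension reduction.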
Similarity search in the blink of an eye with compressed indices
Nowadays, data is represented by vectors. Retrieving those vectors, among
millions and billions, that are similar to a given query is a ubiquitous
problem of relevance for a wide range of applications. In this work, we present
new techniques for creating faster and smaller indices to run these searches.
To this end, we introduce a novel vector compression method, Locally-adaptive
Vector Quantization (LVQ), that simultaneously reduces memory footprint and
improves search performance, with minimal impact on search accuracy. LVQ is
designed to work optimally in conjunction with graph-based indices, reducing
their effective bandwidth while enabling random-access-friendly fast similarity
computations. Our experimental results show that LVQ, combined with key
optimizations for graph-based indices in modern datacenter systems, establishes
the new state of the art in terms of performance and memory footprint. For
billions of vectors, LVQ outcompetes the second-best alternatives: (1) in the
low-memory regime, by up to 20.7x in throughput with up to a 3x memory
footprint reduction, and (2) in the high-throughput regime by 5.8x with 1.4x
less memory.
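The "locally-adaptive" part of LVQ refers to choosing quantization constants per vector rather than per dataset. A stripped-down sketch of that core idea (scalar quantization with a per-vector range; the published method adds further refinements this omits):

```python
import numpy as np

def quantize_per_vector(x, n_bits=8):
    """Scale each vector individually to its own [min, max] range and
    round to n_bits integer codes; the per-vector constants adapt the
    quantization grid to each vector's local value range."""
    levels = 2 ** n_bits - 1
    lo = x.min(axis=1, keepdims=True)
    hi = x.max(axis=1, keepdims=True)
    scale = (hi - lo) / levels
    codes = np.round((x - lo) / scale).astype(np.uint8)
    return codes, lo, scale

def dequantize(codes, lo, scale):
    return codes * scale + lo

x = np.random.default_rng(0).standard_normal((4, 16)).astype(np.float32)
codes, lo, scale = quantize_per_vector(x)
err = np.max(np.abs(dequantize(codes, lo, scale) - x))
# reconstruction error is bounded by half a quantization step per vector
```

Storing one byte per dimension plus two constants per vector is what shrinks memory footprint while keeping distance computations cheap and random-access friendly.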
Video content analysis for intelligent forensics
The networks of surveillance cameras installed in public places and private territories continuously record video data with the aim of detecting and preventing unlawful activities. This enhances the importance of video content analysis applications, either for real-time (i.e. analytic) or post-event (i.e. forensic) analysis. In this thesis, the primary focus is on four key aspects of video content analysis, namely: 1. moving object detection and recognition; 2. correction of colours in the video frames and recognition of colours of moving objects; 3. make and model recognition of vehicles and identification of their type; and 4. detection and recognition of text information in outdoor scenes.
To address the first issue, a framework is presented in the first part of the thesis that efficiently detects and recognizes moving objects in videos. The framework targets the problem of object detection in the presence of complex background. The object detection part of the framework relies on background modelling technique and a novel post processing step where the contours of the foreground regions (i.e. moving object) are refined by the classification of edge segments as belonging either to the background or to the foreground region. Further, a novel feature descriptor is devised for the classification of moving objects into humans, vehicles and background. The proposed feature descriptor captures the texture information present in the silhouette of foreground objects.
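Background modelling, which the framework above builds on, can be illustrated by its simplest form, a running-average model: the background estimate slowly absorbs the scene, and pixels that deviate sharply are flagged as foreground. A generic sketch (not the thesis's model, which adds edge-segment classification on top):

```python
import numpy as np

def update_background(bg, frame, alpha=0.05):
    """Running-average background model: blend each new frame into the
    background estimate; slow changes are absorbed, fast ones are not."""
    return (1 - alpha) * bg + alpha * frame

def foreground_mask(bg, frame, thresh=25.0):
    """Pixels far from the background estimate are foreground."""
    return np.abs(frame - bg) > thresh

bg = np.zeros((4, 4))          # learned background (all dark)
frame = np.zeros((4, 4))
frame[1:3, 1:3] = 200.0        # a bright moving object enters
mask = foreground_mask(bg, frame)   # the 2x2 object region is flagged
bg = update_background(bg, frame)
```

The contour-refinement step described in the thesis then decides, per edge segment, whether the boundary belongs to the object or to the background model.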
To address the second issue, a framework for the correction and recognition of true colours of objects in videos is presented with novel noise reduction, colour enhancement and colour recognition stages. The colour recognition stage makes use of temporal information to reliably recognize the true colours of moving objects in multiple frames. The proposed framework is specifically designed to perform robustly on videos that have poor quality because of surrounding illumination, camera sensor imperfection and artefacts due to high compression.
In the third part of the thesis, a framework for vehicle make and model recognition and type identification is presented. As a part of this work, a novel feature representation technique for distinctive representation of vehicle images has emerged. The feature representation technique uses dense feature description and mid-level feature encoding scheme to capture the texture in the frontal view of the vehicles. The proposed method is insensitive to minor in-plane rotation and skew within the image. The capability of the proposed framework can be enhanced to any number of vehicle classes without re-training. Another important contribution of this work is the publication of a comprehensive up to date dataset of vehicle images to support future research in this domain.
The problem of text detection and recognition in images is addressed in the last part of the thesis. A novel technique is proposed that exploits the colour information in the image for the identification of text regions. Apart from detection, the colour information is also used to segment characters from the words. The recognition of identified characters is performed using shape features and supervised learning. Finally, a lexicon based alignment procedure is adopted to finalize the recognition of strings present in word images.
Extensive experiments have been conducted on benchmark datasets to analyse the performance of the proposed algorithms. The results show that the proposed moving object detection and recognition technique surpassed well-known baseline techniques. The proposed framework for the correction and recognition of object colours in video frames achieved all the aforementioned goals. The performance analysis of the vehicle make and model recognition framework on multiple datasets has shown the strength and reliability of the technique in various scenarios. Finally, the experimental results for the text detection and recognition framework on benchmark datasets have revealed the potential of the proposed scheme for accurate detection and recognition of text in the wild.
End-to-End Neural Network-based Speech Recognition for Mobile and Embedded Devices
Thesis (Ph.D.) -- Seoul National University Graduate School, Dept. of Electrical and Computer Engineering, College of Engineering, August 2020.
Real-time automatic speech recognition (ASR) on mobile and embedded devices has been of great interest in recent years. Deep neural network-based automatic speech recognition demands a large number of computations, while the memory bandwidth and power budget of mobile devices are limited. A server-based implementation is often employed, but this increases latency and raises privacy concerns. Therefore, the need for an on-device ASR system is increasing. Recurrent neural networks (RNNs) are often used for the ASR model. RNN implementations on embedded devices can suffer from excessive DRAM accesses, because the parameter size of a neural network usually exceeds that of the cache memory. Also, the parameters of an RNN cannot be reused for multiple time steps due to its feedback structure. To solve this problem, multi-time-step parallelizable models are applied for speech recognition. The multi-time-step parallelization approach computes multiple output samples at a time with the parameters fetched from DRAM. Since the number of DRAM accesses can be reduced in proportion to the number of parallelization steps, a high processing speed can be achieved for the parallelizable model.
In this thesis, a connectionist temporal classification (CTC) model is constructed by combining simple recurrent units (SRUs) and depthwise 1-dimensional convolution layers for multi-time-step parallelization. Both character and word-piece models are developed for the CTC model, and the corresponding RNN-based language models are used for beam-search decoding. A competitive word error rate (WER) on the WSJ corpus is achieved with a total model size of approximately 15 MB. The system operates at real-time speed using only a single ARM core, without a GPU or special hardware.
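SRUs are parallelization-friendly because all of their matrix products can be computed for every time step in one batched pass over the weights; only a cheap elementwise recurrence remains sequential. A simplified sketch of an SRU layer (following the general SRU formulation; this is not the thesis's exact implementation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sru_layer(x, W, Wf, Wr, bf, br):
    """Simple recurrent unit: the three matrix products are computed for
    all T time steps at once (one pass over the weights), and only the
    elementwise recurrence over the cell state runs sequentially."""
    xt = x @ W                 # (T, d) candidate values, all steps at once
    f = sigmoid(x @ Wf + bf)   # (T, d) forget gates
    r = sigmoid(x @ Wr + br)   # (T, d) reset gates
    c = np.zeros(x.shape[1])
    hs = []
    for t in range(x.shape[0]):            # elementwise-only recurrence
        c = f[t] * c + (1 - f[t]) * xt[t]
        hs.append(r[t] * np.tanh(c) + (1 - r[t]) * x[t])
    return np.stack(hs)

T, d = 5, 3
rng = np.random.default_rng(0)
x = rng.standard_normal((T, d))
out = sru_layer(x, *(rng.standard_normal((d, d)) for _ in range(3)),
                bf=np.zeros(d), br=np.zeros(d))
```

Because the expensive weight fetches are shared across time steps, DRAM traffic per output sample drops, which is the property the multi-time-step parallelization exploits.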
A low-latency on-device speech recognition system with a simple gated convolutional network (SGCN) is also proposed. The SGCN shows competitive recognition accuracy even with 1M parameters. 8-bit quantization is applied to reduce the memory size and computation time. The proposed system supports online recognition with a 0.4 s latency limit and operates at 0.2 RTF with only a single 900 MHz CPU core.
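The 8-bit quantization mentioned above can be illustrated by a generic symmetric weight-quantization scheme: one scale per tensor maps weights onto int8 so that inference can run in integer arithmetic. A sketch of the general technique (not necessarily the thesis's exact scheme):

```python
import numpy as np

def quantize_weights_8bit(w):
    """Symmetric 8-bit quantization: map weights to int8 with a single
    per-tensor scale chosen from the largest absolute weight."""
    scale = np.max(np.abs(w)) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

w = np.array([-1.0, 0.0, 0.4, 1.0])
q, scale = quantize_weights_8bit(w)
# dequantized values are recovered as q * scale
```

Halving or quartering the bytes per weight shrinks both the model footprint and the memory bandwidth needed per inference, which is what makes the 0.2 RTF figure reachable on a single low-power core.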
In addition, an attention-based model with the depthwise convolutional encoder is proposed. Convolutional encoders enable faster training and inference of attention models than recurrent neural network-based ones. However, convolutional models often require a very large receptive field to achieve high recognition accuracy, which not only increases the parameter size but also the computational cost and run-time memory footprint. A convolutional encoder with a short receptive field length often suffers from looping or skipping problems. We believe that this is due to the time-invariance of convolutions. We attempt to remedy this issue by adding positional information to the convolution-based encoder. It is shown that the word error rate (WER) of a convolutional encoder with a short receptive field size can be reduced significantly by augmenting it with positional information. Visualization results are presented to demonstrate the effectiveness of incorporating positional information. The streaming end-to-end ASR model is also developed by applying monotonic chunkwise attention.
1 Introduction
1.1 End-to-End Automatic Speech Recognition with Neural Networks
1.2 Challenges on On-device Implementation of Neural Network-based ASR
1.3 Parallelizable Neural Network Architecture
1.4 Scope of Dissertation
2 Simple Recurrent Units for CTC-based End-to-End Speech Recognition
2.1 Introduction
2.2 Related Works
2.3 Speech Recognition Algorithm
2.3.1 Acoustic modeling
2.3.2 Character-based model
2.3.3 Word piece-based model
2.3.4 Decoding
2.4 Experimental Results
2.4.1 Acoustic models
2.4.2 Word piece based speech recognition
2.4.3 Execution time analysis
2.5 Concluding Remarks
3 Low-Latency Lightweight Streaming Speech Recognition with 8-bit Quantized Depthwise Gated Convolutional Neural Networks
3.1 Introduction
3.2 Simple Gated Convolutional Networks
3.2.1 Model structure
3.2.2 Multi-time-step parallelization
3.3 Training CTC AM with SGCN
3.3.1 Regularization with symmetrical weight noise injection
3.3.2 8-bit quantization
3.4 Experimental Results
3.4.1 Experimental setting
3.4.2 Results on WSJ eval92
3.4.3 Implementation on the embedded system
3.5 Concluding Remarks
4 Effect of Adding Positional Information on Convolutional Neural Networks for End-to-End Speech Recognition
4.1 Introduction
4.2 Related Works
4.3 Model Description
4.4 Experimental Results
4.4.1 Effect of receptive field size
4.4.2 Visualization
4.4.3 Comparison with other models
4.5 Concluding Remarks
5 Convolution-based Attention Model with Positional Encoding for Streaming Speech Recognition
5.1 Introduction
5.2 Related Works
5.3 End-to-End Model for Speech Recognition
5.3.1 Model description
5.3.2 Monotonic chunkwise attention
5.3.3 Positional encoding
5.4 Experimental Results
5.4.1 Effect of positional encoding
5.4.2 Comparison with other models
5.4.3 Execution time analysis
5.5 Concluding Remarks
6 Conclusion
Abstract (In Korean)