Fixed-Point Performance Analysis of Recurrent Neural Networks
Recurrent neural networks have shown excellent performance in many
applications; however, they require increased complexity in hardware- or
software-based implementations. The hardware complexity can be substantially
reduced by minimizing the word-length of weights and signals. This work analyzes
the fixed-point performance of recurrent neural networks using a retrain-based
quantization method. The quantization sensitivity of each layer in RNNs is
studied, and the overall fixed-point optimization results, which minimize the
capacity of weights without sacrificing performance, are presented.
Language modeling and phoneme recognition examples are used
A Parallel Decomposition Scheme for Solving Long-Horizon Optimal Control Problems
We present a temporal decomposition scheme for solving long-horizon optimal
control problems. In the proposed scheme, the time domain is decomposed into a
set of subdomains with partially overlapping regions. Subproblems associated
with the subdomains are solved in parallel to obtain local primal-dual
trajectories that are assembled to obtain the global trajectories. We provide a
sufficient condition that guarantees convergence of the proposed scheme. This
condition states that the effect of perturbations on the boundary conditions
(i.e., initial state and terminal dual/adjoint variable) should decay
asymptotically as one moves away from the boundaries. This condition also
reveals that the scheme converges if the size of the overlap is sufficiently
large and that the convergence rate improves with the size of the overlap. We
prove that linear quadratic problems satisfy the asymptotic decay condition,
and we discuss numerical strategies to determine if the condition holds in more
general cases. We draw upon a non-convex optimal control problem to illustrate
the performance of the proposed scheme
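The decomposition of the time domain into partially overlapping subdomains can be sketched as follows. This covers only the partitioning step, not the parallel subproblem solves or the assembly of trajectories, and the function name `overlapping_subdomains`, the equal-length split, and the half-open index convention are assumptions of this example rather than details from the paper.

```python
def overlapping_subdomains(horizon, num_parts, overlap):
    """Split time steps 0..horizon-1 into `num_parts` windows.

    Each window is a half-open interval (start, end) that extends its
    non-overlapping core by `overlap` steps on both sides, clipped to the
    horizon. Assumption: cores are equal-length, with any remainder
    assigned to the last window.
    """
    base = horizon // num_parts
    windows = []
    for k in range(num_parts):
        core_start = k * base
        core_end = horizon if k == num_parts - 1 else (k + 1) * base
        windows.append((max(0, core_start - overlap),
                        min(horizon, core_end + overlap)))
    return windows


if __name__ == "__main__":
    # 12 time steps, 3 subdomains, overlap of 2 steps on each side.
    print(overlapping_subdomains(12, 3, 2))
```

With an overlap of zero the windows reduce to a plain partition; increasing the overlap enlarges each subproblem, which is the quantity the convergence condition in the abstract refers to.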
B+-tree Index Optimization by Exploiting Internal Parallelism of Flash-based Solid State Drives
Previous research addressed the potential problems of the hard-disk-oriented
design of DBMSs on flashSSDs. In this paper, we focus on exploiting potential
benefits of flashSSDs. First, we examine the internal parallelism issues of
flashSSDs by running benchmarks on various flashSSDs. Then, we suggest
algorithm-design principles in order to best benefit from the internal
parallelism. We present a new I/O request concept, called psync I/O, that can
exploit the internal parallelism of flashSSDs in a single process. Based on
these ideas, we introduce B+-tree optimization methods in order to utilize
internal parallelism. By integrating the results of these methods, we present a
B+-tree variant, PIO B-tree. We confirmed that each optimization method
substantially enhances the index performance. Consequently, PIO B-tree enhanced
B+-tree's insert performance by a factor of up to 16.3, while improving
point-search performance by a factor of 1.2. The range search of PIO B-tree was
up to 5 times faster than that of the B+-tree. Moreover, PIO B-tree
outperformed other flash-aware indexes in various synthetic workloads. We also
confirmed that PIO B-tree outperforms B+-tree in index traces collected inside
the PostgreSQL DBMS with the TPC-C benchmark.
Comment: VLDB201
SME supply chain collaboration innovation using an online hub
Note: Proceedings of the 8th International Conference on Innovation & Managemen
A study on carbon neutral city plan of Sejong city
Note: International Conference on Sustainable Building Asi
FPGA-Based Low-Power Speech Recognition with Recurrent Neural Networks
In this paper, a neural network based real-time speech recognition (SR)
system is developed using an FPGA for very low-power operation. The implemented
system employs two recurrent neural networks (RNNs); one is a
speech-to-character RNN for acoustic modeling (AM) and the other is for
character-level language modeling (LM). The system also employs a statistical
word-level LM to improve the recognition accuracy. The results of the AM, the
character-level LM, and the word-level LM are combined using a fairly simple
N-best search algorithm instead of the hidden Markov model (HMM) based network.
The RNNs are implemented using massively parallel processing elements (PEs) for
low latency and high throughput. The weights are quantized to 6 bits to store
all of them in the on-chip memory of an FPGA. The proposed algorithm is
implemented on a Xilinx XC7Z045, and the system can operate much faster than
real-time.
Comment: Accepted to SiPS 201
Workload-aware Automatic Parallelization for Multi-GPU DNN Training
Deep neural networks (DNNs) have emerged as successful solutions for a variety
of artificial intelligence applications, but their very large and deep models
impose high computational requirements during training. Multi-GPU
parallelization is a popular option to accelerate demanding computations in DNN
training, but most state-of-the-art multi-GPU deep learning frameworks not only
require users to have an in-depth understanding of the implementation of the
frameworks themselves, but also apply parallelization in a straightforward way
without optimizing GPU utilization. In this work, we propose a workload-aware
auto-parallelization framework (WAP) for DNN training, where the work is
automatically distributed to multiple GPUs based on the workload
characteristics. We evaluate WAP using TensorFlow with popular DNN benchmarks
(AlexNet and VGG-16), and show competitive training throughput compared with
the state-of-the-art frameworks, and also demonstrate that WAP automatically
optimizes GPU assignment based on the workload's compute requirements, thereby
improving energy efficiency.
Comment: This paper is accepted in ICASSP201
