Batch Size Influence on Performance of Graphic and Tensor Processing Units during Training and Inference Phases
The impact of the largest possible batch size (chosen for better runtime) on the
performance of graphics processing units (GPUs) and tensor processing units (TPUs)
during the training and inference phases is investigated. Numerous runs of the
selected deep neural network (DNN) were performed on the standard MNIST and
Fashion-MNIST datasets. A significant speedup was obtained even for extremely
small-scale usage of Google TPUv2 units (8 cores only) in comparison to the quite
powerful NVIDIA Tesla K80 GPU card: up to 10x for the training stage (without
taking overheads into account) and up to 2x for the prediction stage (with and
without taking overheads into account). The precise speedup values depend on the
utilization level of the TPUv2 units and increase with the volume of data being
processed, but for the datasets used in this work (MNIST and Fashion-MNIST with
images of size 28x28) the speedup was observed for batch sizes >512 images in the
training phase and >40 000 images in the prediction phase. These results were
obtained without detriment to the prediction accuracy and loss, which were equal
for the GPU and TPU runs up to the 3rd significant digit for the MNIST dataset
and up to the 2nd significant digit for the Fashion-MNIST dataset.
Comment: 10 pages, 7 figures, 2 table
Fixed Point Analysis Workflow for efficient Design of Convolutional Neural Networks in Hearing Aids
Neural networks (NN) are a powerful tool to tackle complex problems in hearing aid research, but their use on hearing aid hardware is currently limited by memory and processing power. To enable training under these constraints, a fixed point analysis and a memory-friendly power-of-two quantization scheme (replacing multiplications with shift operations) have been implemented, extending TensorFlow, a standard framework for training neural networks, and the QKeras package [1, 2]. The implemented fixed point analysis detects quantization issues such as overflows, underflows, precision problems and zero gradients. The analysis is performed for each layer in every epoch, separately for weights, biases and activations. With this information the quantization can be optimized, e.g. by modifying the bit width, the number of integer bits, or by switching the quantization scheme to power-of-two quantization. To demonstrate the applicability of this method, a case study was conducted in which a CNN was trained to predict the Ideal Ratio Mask (IRM) for noise reduction in audio signals. The dataset consists of speech samples from the TIMIT dataset mixed with noise from the Urban Sound 8k and VAD datasets at 0 dB SNR. The CNN was trained in floating point, in fixed point and with power-of-two quantization. The CNN architecture consists of six convolutional layers followed by three dense layers. From an initial memory footprint of 1.9 MB for 468k float32 weights, the power-of-two quantized network is reduced to 236 kB, while the Short Term Objective Intelligibility (STOI) improvement drops only from 0.074 to 0.067. Despite the quantization only a minimal drop in performance was observed, while saving up to 87.5 % of memory, thus being suited for employment in a hearing ai
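The power-of-two idea and the memory figures quoted above can be illustrated with a minimal sketch. It is not the authors' TensorFlow/QKeras extension: the function names, the exponent range and the 4-bit weight width are assumptions made for illustration (4 bits is simply the width at which a float32 model shrinks by 87.5 %).

```python
import math

def quantize_pow2(w, min_exp=-7, max_exp=0):
    """Round a weight to the nearest signed power of two in the log domain.

    The exponent range is a hypothetical choice, not taken from the paper.
    """
    if w == 0.0:
        return 0.0
    exp = round(math.log2(abs(w)))
    exp = max(min_exp, min(max_exp, exp))  # clamp to the representable range
    return math.copysign(2.0 ** exp, w)

def shift_multiply(x_int, exp):
    """Multiply an integer activation by 2**exp using shifts only."""
    return x_int << exp if exp >= 0 else x_int >> -exp

def footprint_bytes(n_weights, bits_per_weight):
    """Memory needed to store the weights, in bytes."""
    return n_weights * bits_per_weight / 8

if __name__ == "__main__":
    print(quantize_pow2(0.3))            # nearest power of two to 0.3
    print(shift_multiply(40, -2))        # 40 * 2**-2 via a right shift
    full = footprint_bytes(468_000, 32)  # float32 baseline
    small = footprint_bytes(468_000, 4)  # assumed 4-bit power-of-two weights
    print(full, small, 1 - small / full)
```

Under these assumptions the float32 baseline comes to about 1.87 MB and the 4-bit version to about 234 kB, close to the 1.9 MB and 236 kB reported for the rounded weight count of 468k.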
Convolutional Neural Networks for Speech Controlled Prosthetic Hands
Speech recognition is one of the key topics in artificial intelligence, as it
is one of the most common forms of communication in humans. Researchers have
developed many speech-controlled prosthetic hands in the past decades,
utilizing conventional speech recognition systems that use a combination of
neural network and hidden Markov model. Recent advancements in general-purpose
graphics processing units (GPGPUs) enable intelligent devices to run deep
neural networks in real-time. Thus, state-of-the-art speech recognition systems
have rapidly shifted from the paradigm of composite subsystems optimization to
the paradigm of end-to-end optimization. However, a low-power embedded GPGPU
cannot run these speech recognition systems in real-time. In this paper, we
show the development of deep convolutional neural networks (CNN) for speech
control of prosthetic hands that run in real-time on an NVIDIA Jetson TX2
developer kit. First, the device captures speech and converts it into 2D features
(such as a spectrogram). The CNN receives the 2D features and classifies the hand
gestures. Finally, the hand gesture classes are sent to the prosthetic hand
motion control system. The whole system is written in Python with Keras, a deep
learning library that has a TensorFlow backend. Our experiments on the CNN
demonstrate 91% accuracy and a 2 ms running time for producing hand gestures (text
output) from speech commands, which can be used to control the prosthetic hands
in real-time.
Comment: 2019 First International Conference on Transdisciplinary AI
(TransAI), Laguna Hills, California, USA, 2019, pp. 35-4
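The first stage of the pipeline described in this abstract, turning a 1D speech signal into a 2D feature map, can be sketched as a plain magnitude spectrogram. This is an illustration only, not the authors' feature extraction: the function name, frame length, hop size and Hann window are assumptions, and a real system would use an FFT library rather than this direct DFT.

```python
import cmath
import math

def spectrogram(signal, frame_len=64, hop=32):
    """Magnitude spectrogram: rows are time frames, columns are frequency bins.

    frame_len and hop are hypothetical values chosen for the example.
    """
    # Hann window to reduce spectral leakage at the frame edges
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_len)]
        bins = []
        for k in range(frame_len // 2 + 1):  # non-negative frequencies only
            s = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                    for n in range(frame_len))
            bins.append(abs(s))
        frames.append(bins)
    return frames

if __name__ == "__main__":
    # Synthetic tone with exactly 8 cycles per 64-sample frame
    tone = [math.sin(2 * math.pi * 8 * n / 64) for n in range(256)]
    spec = spectrogram(tone)
    print(len(spec), len(spec[0]))  # number of frames, number of bins
```

The resulting list of frames is the kind of 2D array that a CNN classifier would take as input in place of an image.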
Periodic orbits in chaotic systems simulated at low precision
Non-periodic solutions are an essential property of chaotic dynamical systems. Simulations with deterministic finite-precision numbers, however, always yield orbits that are eventually periodic. With 64-bit double-precision floating-point numbers such periodic orbits are typically negligible due to very long periods. The emerging trend to accelerate simulations with low-precision numbers, such as 16-bit half-precision floats, raises questions on the fidelity of such simulations of chaotic systems. Here, we revisit the 1-variable logistic map and the generalised Bernoulli map with various number formats and precisions: floats, posits and logarithmic fixed-point. Simulations are improved with higher precision but stochastic rounding prevents periodic orbits even at low precision. For larger systems the performance gain from low-precision simulations is often reinvested in higher resolution or complexity, increasing the number of variables. In the Lorenz 1996 system, the period lengths of orbits increase exponentially with the number of variables. Moreover, invariant measures are better approximated with an increased number of variables than with increased precision. Extrapolating to large simulations of natural systems, such as million-variable climate models, periodic orbit lengths are far beyond reach of present-day computers. Such orbits are therefore not expected to be problematic compared to high-precision simulations but the deviation of both from the continuum solution remains unclear
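The claim that deterministic finite-precision orbits are always eventually periodic can be checked directly for the logistic map at 16-bit precision. The sketch below emulates half-precision floats by rounding after every arithmetic operation with Python's `struct` format `"e"`; the function names, the parameter r = 4 and the starting value are assumptions for illustration and are not taken from the paper.

```python
import struct

def to_half(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]

def logistic_step(x, r=4.0):
    """One step of x -> r * x * (1 - x), rounding to half precision after each operation."""
    one_minus_x = to_half(1.0 - x)
    return to_half(r * to_half(x * one_minus_x))

def find_orbit(x0, r=4.0):
    """Return (transient length, period) of the half-precision orbit starting at x0.

    There are finitely many 16-bit states, so a state must eventually repeat.
    """
    x = to_half(x0)
    seen = {}  # state -> step index at which it was first visited
    step = 0
    while x not in seen:
        seen[x] = step
        x = logistic_step(x, r)
        step += 1
    return seen[x], step - seen[x]

if __name__ == "__main__":
    transient, period = find_orbit(0.3)
    print("transient:", transient, "period:", period)
```

Because the state space holds at most 65536 values, the loop always terminates; repeating the experiment from different starting values shows how short such orbits can be at this precision.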
Area-Efficient FPGA Implementation of Minimalistic Convolutional Neural Network Using Residue Number System
Convolutional Neural Networks (CNNs) are a promising tool for image recognition tasks in computer vision systems. However, most well-known CNN implementations require a significant amount of memory for storing weights during training and operation. To reduce the resource costs of CNN implementation, we propose an architecture that is split into hardware and software parts for performance optimization. We also propose to use Residue Number System (RNS) arithmetic in the hardware part, which implements the convolutional layer of the CNN. Software simulation using Matlab 2017b shows that a CNN with a minimum number of layers can be trained quickly and successfully. Hardware simulation using the FPGA Kintex7 xc7k70tfbg484-2 demonstrates that using RNS in the convolutional layer of the CNN reduces hardware costs by 32% compared with the traditional approach based on the binary number system
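The appeal of RNS for a convolutional layer is that a multiply-accumulate splits into small independent operations, one per modulus, with no carries between channels. The sketch below shows this in software; the moduli set (7, 8, 9) and all function names are hypothetical, since the abstract does not state which moduli or reconstruction method the hardware uses.

```python
from math import prod

MODULI = (7, 8, 9)  # hypothetical pairwise-coprime moduli; dynamic range 7*8*9 = 504

def to_rns(x):
    """Encode a non-negative integer as its residues modulo each modulus."""
    return tuple(x % m for m in MODULI)

def rns_mac(acc, a, b):
    """Multiply-accumulate acc + a*b, computed independently in each residue channel."""
    return tuple((r + p * q) % m for r, p, q, m in zip(acc, a, b, MODULI))

def from_rns(residues):
    """Decode residues back to an integer with the Chinese Remainder Theorem."""
    big_m = prod(MODULI)
    total = 0
    for r, m in zip(residues, MODULI):
        m_i = big_m // m
        total += r * m_i * pow(m_i, -1, m)  # pow(..., -1, m) is the modular inverse
    return total % big_m

if __name__ == "__main__":
    pixels = [3, 1, 4, 1]
    weights = [2, 7, 1, 8]
    acc = to_rns(0)
    for p, w in zip(pixels, weights):
        acc = rns_mac(acc, to_rns(p), to_rns(w))
    print(acc, from_rns(acc))  # residues and the decoded dot product
```

The decoded result is only correct while the true dot product stays below the product of the moduli, so a real design would size the moduli set to the layer's value range.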