
    Accelerating Training of Deep Neural Networks via Sparse Edge Processing

    We propose a reconfigurable hardware architecture for deep neural networks (DNNs) capable of online training and inference, which uses algorithmically pre-determined, structured sparsity to significantly lower memory and computational requirements. This novel architecture introduces the notion of edge-processing to provide flexibility and combines junction pipelining and operational parallelization to speed up training. The overall effect is to reduce network complexity by factors up to 30x and training time by up to 35x relative to GPUs, while maintaining high fidelity of inference results. This has the potential to enable extensive parameter searches and development of the largely unexplored theoretical foundation of DNNs. The architecture automatically adapts itself to different network sizes given available hardware resources. As proof of concept, we show results obtained for different bit widths.
    Comment: Presented at the 26th International Conference on Artificial Neural Networks (ICANN) 2017 in Alghero, Italy.
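    The paper's specific connection pattern, edge-processing scheme, and junction pipelining are not reproduced here, but the core idea of algorithmically pre-determined structured sparsity can be illustrated with a minimal sketch: each output neuron is wired to a small, fixed, deterministically chosen subset of inputs, so only those weights are ever stored or updated. The strided selection, layer sizes, and fan-in below are illustrative assumptions, not the paper's configuration.

        import numpy as np

        def make_structured_mask(n_in, n_out, fan_in):
            """Pre-determined sparse connectivity: every output neuron keeps a fixed,
            algorithmically chosen set of `fan_in` inputs (here a simple strided pattern)."""
            mask = np.zeros((n_out, n_in), dtype=bool)
            stride = n_in // fan_in
            for o in range(n_out):
                mask[o, (o + np.arange(fan_in) * stride) % n_in] = True
            return mask

        n_in, n_out, fan_in = 784, 128, 32            # fan-in 32 instead of 784: ~24x fewer weights
        mask = make_structured_mask(n_in, n_out, fan_in)
        W = np.random.randn(n_out, n_in) * mask       # only masked positions are stored/updated
        x = np.random.randn(n_in)
        y = np.maximum(W @ x, 0.0)                    # forward pass of the sparse layer (ReLU)

    Because the connectivity is fixed before training, hardware can hard-wire it rather than store indices, which is the source of the memory and compute savings the abstract describes.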

    Efficient machine learning: models and accelerations

    One of the key enablers of the recent unprecedented success of machine learning is the adoption of very large models. Modern machine learning models, such as deep neural networks, typically consist of multiple cascaded layers and millions to hundreds of millions of parameters (i.e., weights). Larger-scale models tend to enable the extraction of more complex high-level features and therefore lead to a significant improvement in overall accuracy. On the other hand, the layered deep structure and large model sizes also demand increased computational capability and memory. To achieve higher scalability, performance, and energy efficiency for deep learning systems, two orthogonal research and development trends have attracted enormous interest: acceleration and model compression. The underlying goal of both is to maintain high model quality so that predictions remain accurate. In this thesis, we address these two problems and utilize different computing paradigms to solve real-life deep learning problems.
    To explore these two domains, this thesis first presents the cogent confabulation network for the sentence completion problem. We use the Chinese language as a case study to describe our exploration of cogent confabulation based text recognition models. The exploration and optimization of these models have been conducted through various comparisons, and the optimized network offers better accuracy for sentence completion. To accelerate sentence completion on a multi-processing system, we propose a parallel framework for the confabulation recall algorithm. The parallel implementation reduces runtime, improves recall accuracy by breaking the fixed evaluation order and introducing more generalization, and maintains balanced progress in status updates among all neurons. A lexicon scheduling algorithm is presented to further improve model performance.
    As deep neural networks have proven effective for many real-life applications and are deployed on low-power devices, we then investigate accelerating neural network inference using a hardware-friendly computing paradigm, stochastic computing. It is an approximate computing paradigm that requires a small hardware footprint and achieves high energy efficiency (a minimal bit-stream sketch follows this abstract). Applying stochastic computing to deep convolutional neural networks, we design the functional hardware blocks and optimize them jointly to minimize the accuracy loss due to the approximation. Synthesis results show that the proposed design achieves remarkably low hardware cost and power/energy consumption.
    Modern neural networks usually contain a huge number of parameters that cannot fit into embedded devices, so compression of deep learning models, together with acceleration, attracts our attention. We introduce structured-matrix-based neural networks to address this problem. The circulant matrix is one such structured matrix: it can be represented by a single vector, so the matrix is compressed. We further investigate a more flexible structure based on the circulant matrix, called the block-circulant matrix. It partitions a matrix into several smaller blocks and makes each submatrix circulant, so the compression ratio is controllable. With the help of Fourier transform based equivalent computation (sketched after this abstract), inference of the deep neural network can be accelerated energy-efficiently on FPGAs. We also optimize the training algorithm for block-circulant matrix based neural networks to obtain high accuracy after compression.
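    As a rough illustration of the stochastic computing paradigm mentioned above (not the thesis's actual hardware blocks): a value in [0, 1] is encoded as a random bit stream whose fraction of ones equals the value, and multiplication then reduces to a single AND gate per bit pair. The stream length and operand values below are arbitrary choices.

        import numpy as np

        rng = np.random.default_rng(0)
        N = 4096                                    # bit-stream length; precision grows with N

        def to_stream(p):
            """Unipolar stochastic encoding of p in [0, 1]: P(bit = 1) = p."""
            return (rng.random(N) < p).astype(np.uint8)

        a, b = 0.6, 0.3
        prod = to_stream(a) & to_stream(b)          # multiplication = bitwise AND of independent streams
        print(prod.mean())                          # ~= 0.18 = a * b, up to sampling noise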
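    The Fourier transform based equivalent computation for block-circulant layers can be sketched as follows: each k x k block of the weight matrix is circulant and therefore defined by a single length-k vector, and each block-vector product becomes an element-wise product in the frequency domain. This is a generic NumPy sketch of that identity, not the thesis's FPGA design; block sizes and values are arbitrary.

        import numpy as np
        from scipy.linalg import circulant

        def block_circulant_matvec(w_vecs, x, k):
            """y = W @ x where block (i, j) of W is the circulant matrix whose first
            column is w_vecs[i][j]; storage drops from p*q*k*k to p*q*k values and each
            block multiply costs O(k log k) via the convolution theorem."""
            p, q = len(w_vecs), len(w_vecs[0])
            X = np.fft.fft(x.reshape(q, k), axis=1)            # FFT of each input segment
            Y = np.zeros((p, k), dtype=complex)
            for i in range(p):
                for j in range(q):
                    Y[i] += np.fft.fft(w_vecs[i][j]) * X[j]    # element-wise product in frequency domain
            return np.fft.ifft(Y, axis=1).real.reshape(p * k)

        rng = np.random.default_rng(0)
        p, q, k = 2, 3, 4
        w_vecs = [[rng.standard_normal(k) for _ in range(q)] for _ in range(p)]
        x = rng.standard_normal(q * k)
        W = np.block([[circulant(w_vecs[i][j]) for j in range(q)] for i in range(p)])
        print(np.allclose(W @ x, block_circulant_matvec(w_vecs, x, k)))   # True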

    Neuromorphic Systems for Pattern Recognition and Uav Trajectory Planning

    Detection and control are two essential components of an intelligent system. This thesis investigates novel techniques in both areas, with a focus on handwritten text recognition and UAV flight control. Recognizing handwritten text is a challenging task due to the many different writing styles and the lack of clear boundaries between adjacent characters. The difficulty is greatly increased if the detection algorithm is based solely on pattern matching, without information about the dynamics of the handwriting trajectories. Motivated by these challenges, this thesis first investigates the pattern recognition problem. We use offline handwritten text recognition as a case study to explore the performance of a recurrent belief propagation model. We first develop a probabilistic inference network to post-process the recognition results of a deep Convolutional Neural Network (CNN) (e.g., LeNet) and collect individual characters to form words. The output of the inference network is a set of words and their probabilities. A series of post-processing and improvement techniques are then introduced to further increase the recognition accuracy. We study the performance of the proposed model through various comparisons. The results show that it significantly improves accuracy by correcting deletion, insertion, and replacement errors, which are the main sources of invalid candidate words.
    Deep Reinforcement Learning (DRL) has been widely applied to control autonomous systems because it provides solutions for various complex decision-making tasks that previously could not be solved with deep learning alone. To enable autonomous Unmanned Aerial Vehicles (UAVs), this thesis presents a two-level trajectory planning framework for UAVs in an indoor environment. A sequence of waypoints is selected at the higher level, which leads the UAV from its current position to the destination. At the lower level, an optimal trajectory is generated analytically between each pair of adjacent waypoints. The goal of trajectory generation is to maintain the stability of the UAV, and the goal of waypoint planning is to select waypoints with the lowest control thrust over the entire trip while avoiding collisions with obstacles. The entire framework is implemented using DRL, which learns the highly complicated and nonlinear interaction between the two levels and the impact of the environment. Given the pre-planned trajectory, this thesis further presents an actor-critic reinforcement learning framework that realizes continuous trajectory control of the UAV through a set of desired waypoints. We construct a deep neural network and develop reinforcement learning for better trajectory tracking. In addition, Field Programmable Gate Array (FPGA) based hardware acceleration is designed for energy-efficient real-time control.
    If we are to integrate the trajectory planning model into a UAV system for real-time on-board planning, a key challenge is how to deliver the required performance under strict memory and computational constraints. Techniques that compress Deep Neural Network (DNN) models attract our attention because they allow optimized neural network models to be deployed efficiently on platforms with limited energy and storage capacity. However, conventional model compression techniques prune the DNN after it is fully trained, which is very time-consuming, especially when the model is trained using DRL. To overcome this limitation, we present an early-phase integrated neural network weight compression system for DRL-based waypoint planning (a minimal pruning sketch follows below). By applying pruning at an early phase, the DRL model can be compressed without significant training overhead. By tightly integrating pruning and retraining at the early phase, we achieve a higher model compression rate, further reduce memory and computing complexity, and improve the success rate compared to the original work.
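    A minimal sketch of the early-phase pruning idea described above, using plain NumPy and a dummy gradient in place of a real DRL policy gradient: the weights are pruned by magnitude after a short warm-up, and the mask is then kept applied for the rest of training, rather than pruning a fully trained model. The sparsity level, warm-up length, and learning rate are illustrative assumptions, not the thesis's settings.

        import numpy as np

        def magnitude_mask(w, sparsity):
            """Keep the largest-magnitude weights, zero out the given fraction."""
            thresh = np.quantile(np.abs(w), sparsity)
            return (np.abs(w) >= thresh).astype(w.dtype)

        rng = np.random.default_rng(0)
        w = rng.standard_normal(1000) * 0.1
        mask = np.ones_like(w)
        for step in range(200):
            grad = rng.standard_normal(1000) * 0.01      # stand-in for a DRL policy gradient
            w -= 0.1 * grad
            w *= mask                                    # pruned weights stay at zero
            if step == 20:                               # "early phase": prune long before convergence
                mask = magnitude_mask(w, sparsity=0.8)   # drop 80% of the weights
                w *= mask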

    Speeding up Convolutional Neural Networks with Low Rank Expansions

    The focus of this paper is speeding up the evaluation of convolutional neural networks. While delivering impressive results across a range of computer vision and machine learning tasks, these networks are computationally demanding, limiting their deployability. Convolutional layers generally consume the bulk of the processing time, so in this work we present two simple schemes for drastically speeding up these layers. This is achieved by exploiting cross-channel or filter redundancy to construct a low rank basis of filters that are rank-1 in the spatial domain. Our methods are architecture agnostic, and can be easily applied to existing CPU and GPU convolutional frameworks for tuneable speedup performance. We demonstrate this with a real world network designed for scene text character recognition, showing a possible 2.5x speedup with no loss in accuracy, and a 4.5x speedup with less than 1% drop in accuracy, while still achieving state-of-the-art performance on standard benchmarks.
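    The single-filter case of the idea is easy to sketch: a rank-1 k x k filter is an outer product of a vertical and a horizontal 1-D filter, so one 2-D convolution can be replaced by two cheaper 1-D convolutions. The paper's actual schemes build a shared low-rank basis across channels and filters; the sketch below, with an arbitrary random filter and image, only demonstrates the underlying separability identity.

        import numpy as np
        from scipy.signal import convolve2d

        rng = np.random.default_rng(0)
        f = rng.standard_normal((5, 5))                 # original 5x5 filter
        U, s, Vt = np.linalg.svd(f)
        v_col = U[:, :1] * np.sqrt(s[0])                # 5x1 vertical filter
        v_row = Vt[:1, :] * np.sqrt(s[0])               # 1x5 horizontal filter
        f_rank1 = v_col @ v_row                         # best rank-1 approximation of f

        img = rng.standard_normal((64, 64))
        direct = convolve2d(img, f_rank1, mode='valid')               # O(k^2) multiplies per pixel
        separable = convolve2d(convolve2d(img, v_col, mode='valid'),  # O(2k) multiplies per pixel
                               v_row, mode='valid')
        print(np.max(np.abs(direct - separable)))       # ~1e-14: the two are equivalent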

    Hardware Considerations for Signal Processing Systems: A Step Toward the Unconventional.

    As we progress into the future, signal processing algorithms are becoming more computationally intensive and power hungry, while the desire for mobile products and low power devices is also increasing. An integrated ASIC solution is one of the primary ways chip developers can improve performance and add functionality while keeping the power budget low. This work discusses ASIC hardware for both conventional and unconventional signal processing systems, and how integration, error resilience, emerging devices, and new algorithms can be leveraged by signal processing systems to further improve performance and enable new applications. Specifically, this work presents three case studies: 1) a conventional and highly parallel mixed-signal cross-correlator ASIC for a weather satellite performing real-time synthetic aperture imaging, 2) an unconventional native stochastic computing architecture enabled by memristors, and 3) two unconventional sparse neural network ASICs for feature extraction and object classification. As improvements from technology scaling alone slow down, and the demand for energy efficient mobile electronics increases, such optimization techniques at the device, circuit, and system level will become more critical to advance signal processing capabilities in the future.
    PhD, Electrical Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/116685/1/knagphil_1.pd