100 research outputs found

    High performance and error resilient probabilistic inference system for machine learning

    Get PDF
    Many real-world machine learning applications can be considered as inferring the best label assignment of maximum a posteriori probability (MAP) problems. Since these MAP problems are NP-hard in general, they are often dealt with using approximate inference algorithms on Markov random field (MRF) such as belief propagation (BP). However, this approximate inference is still computationally demanding, and thus custom hardware accelerators have been attractive for high performance and energy efficiency. There are various custom hardware implementations that employ BP to achieve reasonable performance for the real-world applications such as stereo matching. Due to lack of convergence guarantees, however, BP often fails to provide the right answer, thus degrading performance of the hardware. Therefore, we consider sequential tree-reweighted message passing (TRW-S), which avoids many of these convergence problems with BP via sequential execution of its computations but challenges parallel implementation for high throughput. In this work, therefore, we propose a novel streaming hardware architecture that parallelizes the sequential computations of TRW-S. Experimental results on stereo matching benchmarks show promising performance of our hardware implementation compared to the software implementation as well as other BP-based custom hardware or GPU implementations. From this result, we further demonstrate video-rate speed and high quality stereo matching using a hybrid CPU+FPGA platform. We propose three frame-level optimization techniques to fully exploit computational resources of a hybrid CPU+FPGA platform and achieve significant speed-up. We first propose a message reuse scheme which is guided by simple scene change detection. This scheme allows a current inference to be made based on a determination of whether the current result is expected to be similar to the inference result of the previous frame. We also consider frame level parallelization to process multiple frames in parallel using multiple FPGAs available in the platform. This parallelized hardware procedure is further pipelined with data management in CPU to overlap the execution time of the two and thereby reduce the entire processing time of the stereo video sequence. From experimental results with the real-world stereo video sequences, we see video-rate speed of our stereo matching system for QVGA stereo videos. Next, we consider error resilience of the message passing hardware for energy efficient hardware implementation. Modern nanoscale CMOS process technologies suffer in reliability caused by process, temperature and voltage variations. Conventional approaches to deal with such unreliability (e.g., design for the worst-case scenario) are complex and inefficient in terms of hardware resources and energy consumption. As machine learning applications are inherently probabilistic and robust to errors, statistical error compensation (SEC) techniques can play a significant role in achieving robust and energy-efficient implementation. SEC embraces the statistical nature of errors and utilizes statistical and probabilistic techniques to build robust systems. Energy-efficiency is obtained by trading off the enhanced robustness with energy. In this work, we analyze the error resilience of our message passing inference hardware subject to the hardware errors (e.g. errors caused by timing violation in circuits) and explore application of a popular SEC technique, algorithmic noise tolerance (ANT), to this hardware. Analysis and simulations show that the TRW-S message passing hardware is tolerant to small magnitude arithmetic errors, but large magnitude errors cause significantly inaccurate inference results which need to be corrected using SEC. Experimental results show that the proposed ANT-based hardware can tolerate an error rate of 21.3%, with performance degradation of only 3.5 % with an energy savings of 39.7 %, compared to an error-free hardware. Lastly, we extend our TRW-S hardware toward a general purpose machine learning framework. We propose advanced streaming architecture with flexible choice of MRF setting to achieve 10-40x speedup across a variety of computer vision applications. Furthermore, we provide better theoretical understanding of error resiliency of TRW-S, and of the implication of ANT for TRW-S, under more general MRF setting, along with strong empirical support

    DPP-PMRF: Rethinking Optimization for a Probabilistic Graphical Model Using Data-Parallel Primitives

    Full text link
    We present a new parallel algorithm for probabilistic graphical model optimization. The algorithm relies on data-parallel primitives (DPPs), which provide portable performance over hardware architecture. We evaluate results on CPUs and GPUs for an image segmentation problem. Compared to a serial baseline, we observe runtime speedups of up to 13X (CPU) and 44X (GPU). We also compare our performance to a reference, OpenMP-based algorithm, and find speedups of up to 7X (CPU).Comment: LDAV 2018, October 201

    A 16-nm SoC for Noise-Robust Speech and NLP Edge AI Inference With Bayesian Sound Source Separation and Attention-Based DNNs

    Get PDF
    The proliferation of personal artificial intelligence (AI) -assistant technologies with speech-based conversational AI interfaces is driving the exponential growth in the consumer Internet of Things (IoT) market. As these technologies are being applied to keyword spotting (KWS), automatic speech recognition (ASR), natural language processing (NLP), and text-to-speech (TTS) applications, it is of paramount importance that they provide uncompromising performance for context learning in long sequences, which is a key benefit of the attention mechanism, and that they work seamlessly in polyphonic environments. In this work, we present a 25-mm 2^2 system-on-chip (SoC) in 16-nm FinFET technology, codenamed SM6, which executes end-to-end speech-enhancing attention-based ASR and NLP workloads. The SoC includes: 1) FlexASR, a highly reconfigurable NLP inference processor optimized for whole-model acceleration of bidirectional attention-based sequence-to-sequence (seq2seq) deep neural networks (DNNs); 2) a Markov random field source separation engine (MSSE), a probabilistic graphical model accelerator for unsupervised inference via Gibbs sampling, used for sound source separation; 3) a dual-core Arm Cortex A53 CPU cluster, which provides on-demand single Instruction/multiple data (SIMD) fast fourier transform (FFT) processing and performs various application logic (e.g., expectation–maximization (EM) algorithm and 8-bit floating-point (FP8) quantization); and 4) an always-on M0 subsystem for audio detection and power management. Measurement results demonstrate the efficiency ranges of 2.6–7.8 TFLOPs/W and 4.33–17.6 Gsamples/s/W for FlexASR and MSSE, respectively; MSSE denoising performance allowing 6 ×\times smaller ASR model to be stored on-chip with negligible accuracy loss; and 2.24-mJ energy consumption while achieving real-time throughput, end-to-end, and per-frame ASR latencies of 18 ms

    Algorithms for massively parallel, event-based hardware

    Full text link

    Maximum Persistency via Iterative Relaxed Inference with Graphical Models

    Full text link
    We consider the NP-hard problem of MAP-inference for undirected discrete graphical models. We propose a polynomial time and practically efficient algorithm for finding a part of its optimal solution. Specifically, our algorithm marks some labels of the considered graphical model either as (i) optimal, meaning that they belong to all optimal solutions of the inference problem; (ii) non-optimal if they provably do not belong to any solution. With access to an exact solver of a linear programming relaxation to the MAP-inference problem, our algorithm marks the maximal possible (in a specified sense) number of labels. We also present a version of the algorithm, which has access to a suboptimal dual solver only and still can ensure the (non-)optimality for the marked labels, although the overall number of the marked labels may decrease. We propose an efficient implementation, which runs in time comparable to a single run of a suboptimal dual solver. Our method is well-scalable and shows state-of-the-art results on computational benchmarks from machine learning and computer vision.Comment: Reworked version, submitted to PAM

    Fundamentals

    Get PDF
    Volume 1 establishes the foundations of this new field. It goes through all the steps from data collection, their summary and clustering, to different aspects of resource-aware learning, i.e., hardware, memory, energy, and communication awareness. Machine learning methods are inspected with respect to resource requirements and how to enhance scalability on diverse computing architectures ranging from embedded systems to large computing clusters
    corecore