2,626 research outputs found

    Predictive intelligence to the edge through approximate collaborative context reasoning

    Get PDF
    We focus on Internet of Things (IoT) environments where a network of sensing and computing devices is responsible for locally processing contextual data, reasoning, and collaboratively inferring the appearance of a specific phenomenon (event). Pushing processing and knowledge inference to the edge of the IoT network allows the complexity of the event reasoning process to be distributed into many manageable pieces and physically located at the source of the contextual information. This enables a huge volume of rich data streams to be processed in real time that would be prohibitively complex and costly to deliver to a traditional centralized Cloud system. We propose a lightweight, energy-efficient, distributed, adaptive, multiple-context-perspective event reasoning model under uncertainty on each IoT device (sensor/actuator). Each device senses and processes context data and infers events based on different local context perspectives: (i) expert knowledge on event representation, (ii) outlier inference, and (iii) deviation from locally predicted context. This novel approximate reasoning paradigm is achieved through a contextualized, collaborative belief-driven clustering process, in which clusters of devices are formed according to their belief in the presence of events. Our distributed and federated intelligence model efficiently identifies any localized abnormality in the contextual data in light of event reasoning by aggregating local degrees of belief, and it updates and adjusts its knowledge to contextual data outliers and novelty detection. We provide a comprehensive experimental and comparative assessment of our model over real contextual data against other localized and centralized event detection models, and we show the benefits stemming from its adoption: up to three orders of magnitude less energy consumption with high quality of inference.
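The abstract's belief-driven clustering can be illustrated with a minimal sketch: devices hold a local degree of belief in an event, nearby beliefs are grouped into clusters, and each cluster fuses its members' beliefs to decide whether an event is present. The greedy grouping, the averaging fusion rule, and all names (`Device`, `cluster_by_belief`, the 0.8 decision threshold) are illustrative assumptions, not the paper's actual algorithm.

```python
from dataclasses import dataclass

@dataclass
class Device:
    device_id: int
    belief: float  # local degree of belief in the event, in [0, 1]

def cluster_by_belief(devices, threshold=0.2):
    """Group devices whose beliefs lie within `threshold` of a cluster's
    first member. A greedy one-pass stand-in for collaborative
    belief-driven clustering (illustrative only)."""
    clusters = []
    for dev in sorted(devices, key=lambda d: d.belief):
        if clusters and dev.belief - clusters[-1][0].belief <= threshold:
            clusters[-1].append(dev)
        else:
            clusters.append([dev])
    return clusters

def aggregate_belief(cluster):
    """Fuse local beliefs by simple averaging (one possible fusion rule)."""
    return sum(d.belief for d in cluster) / len(cluster)

devices = [Device(0, 0.91), Device(1, 0.88), Device(2, 0.12), Device(3, 0.85)]
clusters = cluster_by_belief(devices)
# clusters whose fused belief exceeds an (assumed) event threshold of 0.8
event_clusters = [c for c in clusters if aggregate_belief(c) > 0.8]
```

Here the outlier device with belief 0.12 is isolated in its own cluster, while the three agreeing devices fuse into a single high-belief cluster that signals the event.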

    ์˜จ-๋””๋ฐ”์ด์Šค ํ•ฉ์„ฑ๊ณฑ ์‹ ๊ฒฝ๋ง ์—ฐ์‚ฐ ๊ฐ€์†๊ธฐ๋ฅผ ์œ„ํ•œ ๊ณ ์„ฑ๋Šฅ ์—ฐ์‚ฐ ์œ ๋‹› ์„ค๊ณ„

    Get PDF
    Doctoral dissertation -- Seoul National University Graduate School, Department of Electrical and Computer Engineering, College of Engineering, August 2020. Advisor: Taewhan Kim. Optimizing computing units for an on-device neural network accelerator can bring lower energy and latency and higher throughput, and may enable unprecedented new applications. This dissertation studies two specific optimization opportunities of the multiply-accumulate (MAC) unit for on-device neural network accelerators that stem from precision quantization methodology. First, we propose an enhanced MAC processing unit structure that efficiently processes mixed-precision models in which the majority of operations are in low precision. Precisely, the two essential contributions are: (1) a MAC unit structure supporting two precision modes, designed to fully utilize its computation logic when processing lower-precision data, which brings more computational efficiency for mixed-precision models whose major operations are in lower precision; and (2) for a set of input CNNs, we formulate the exploration of the size of a single internal multiplier in the MAC unit to derive an economical instance, in terms of computation and energy cost, of the MAC unit structure across all network layers. Experimental results with two well-known CNN models, AlexNet and VGG-16, and two experimental precision settings show that the proposed units can reduce the computational cost per multiplication by 4.68∼30.3% and save energy cost by 43.3% on average over conventional units. Second, we propose an acceleration technique for processing multiplication operations using stochastic computing (SC). MUX-FSM based SC, which employs a MUX controlled by an FSM to generate a bit sequence of a binary number to count up for a MAC operation, considerably reduces the hardware cost of implementing MAC operations compared with the traditional stochastic number generator (SNG) based SC.
Nevertheless, although it offers a very economical hardware implementation, the existing MUX-FSM based SC still does not meet the multiplication processing time required for wide adoption of on-device neural networks in practice. In addition, conventional enhancements have limitations in terms of sub-maximal cycle reduction, parameter conversion cost, and so on. This work proposes a solution to the problem of further speeding up conventional MUX-FSM based SC. Precisely, we analyze the bit counting pattern produced by the MUX-FSM and replace the counting redundancy with a shift operation, significantly reducing the length of the required bit sequence and theoretically speeding up the worst-case multiplication processing time by 2X or more. Experiments show that our enhanced SC technique shortens the average processing time by 38.8% over the conventional MUX-FSM based SC.
Table of contents:
1 Introduction
  1.1 Neural network accelerator and its optimizations
  1.2 Necessity of optimizing the computational block of a neural network accelerator
  1.3 Contributions of this dissertation
2 MAC Design Considering Mixed Precision
  2.1 Motivation
  2.2 Internal multiplier size determination
  2.3 Proposed hardware structure
  2.4 Experiments
    2.4.1 Implementation of reference MAC units
    2.4.2 Area, wirelength, power, energy, and performance of MAC units for AlexNet
    2.4.3 Area, wirelength, power, energy, and performance of MAC units for VGG-16
    2.4.4 Power saving by clock gating
3 Speeding up MUX-FSM based Stochastic Computing Unit Design
  3.1 Motivations
    3.1.1 MUX-FSM based SC and previous enhancements
  3.2 The proposed MUX-FSM based SC
    3.2.1 Refined algorithm for stochastic computing
  3.3 The supporting hardware architecture
    3.3.1 Bit counter with shift operation
    3.3.2 Controller
    3.3.3 Combining with preceding architectures
  3.4 Experiments
    3.4.1 Experiment setup
    3.4.2 Generating the input bit selection pattern
    3.4.3 Performance comparison
    3.4.4 Hardware area and energy comparison
4 Conclusions
  4.1 MAC Design Considering Mixed Precision
  4.2 Speeding up MUX-FSM based Stochastic Computing Unit Design
Abstract (in Korean)
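The counting-redundancy observation behind the speedup can be sketched with a software emulation: if the FSM selects bit i of an operand for 2**i consecutive cycles, the counter repeats the same increment throughout that run, so the whole run can collapse into one shift-add cycle. This is only an illustration of the cycle-count argument (function names and the single-operand weighted count are assumptions; in the real unit the counted stream is combined with the other operand to form a product).

```python
def mux_fsm_cycles(b, n):
    """Cycle-level emulation of plain MUX-FSM bit counting for an
    n-bit operand b: bit i is selected for 2**i cycles, each cycle
    adding that same bit to the counter (the counting redundancy)."""
    total, cycles = 0, 0
    for i in range(n):
        bit = (b >> i) & 1
        for _ in range(2 ** i):  # same bit counted over and over
            total += bit
            cycles += 1
    return total, cycles

def shift_enhanced_cycles(b, n):
    """Replace each redundant counting run with a single shift-add,
    cutting the cycle count from 2**n - 1 to n."""
    total, cycles = 0, 0
    for i in range(n):
        total += ((b >> i) & 1) << i  # one cycle instead of 2**i
        cycles += 1
    return total, cycles
```

Both routines produce the same weighted count (the value of b itself), but the emulated cycle count drops from 2**n - 1 to n, which is the kind of worst-case reduction the abstract describes.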

    Efficient end-to-end learning for quantizable representations

    Full text link
    Embedding representation learning via neural networks is at the core of modern similarity-based search. While much effort has been put into developing algorithms for learning binary Hamming code representations for search efficiency, this approach still requires a linear scan of the entire dataset for each query and trades off search accuracy through binarization. To this end, we consider the problem of directly learning, end to end, a quantizable embedding representation and the sparse binary hash code, which can be used to construct an efficient hash table that not only provides a significant reduction in the number of data points searched but also achieves state-of-the-art search accuracy, outperforming previous state-of-the-art deep metric learning methods. We also show that the optimal sparse binary hash code in a mini-batch can be computed exactly in polynomial time by solving a minimum cost flow problem. Our results on the CIFAR-100 and ImageNet datasets show state-of-the-art search accuracy in precision@k and NMI metrics while providing up to 98X and 478X search speedup, respectively, over exhaustive linear search. The source code is available at https://github.com/maestrojeong/Deep-Hash-Table-ICML18. Comment: Accepted and to appear at ICML 2018. Camera-ready version.
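The way a sparse binary code avoids a full linear scan can be shown with a minimal sketch: each active bit of a code serves as a bucket key in an inverted index, so a query only scans items sharing at least one active bit. This is a generic illustration of hash-table lookup over sparse codes, not the paper's exact construction; all names here are assumptions.

```python
import numpy as np

def build_hash_table(codes):
    """Index items by the active-bit positions of their sparse binary
    codes (an illustrative inverted index)."""
    table = {}
    for idx, code in enumerate(codes):
        for bit in np.flatnonzero(code):
            table.setdefault(int(bit), []).append(idx)
    return table

def query(table, code):
    """Return candidate item ids sharing at least one active bit with
    the query code; only these need a distance computation."""
    candidates = set()
    for bit in np.flatnonzero(code):
        candidates.update(table.get(int(bit), []))
    return candidates

codes = np.array([[1, 0, 1, 0],
                  [0, 1, 0, 0],
                  [1, 0, 0, 1]])
table = build_hash_table(codes)
# item 1 shares no active bit with the query, so only items 0 and 2 are scanned
candidates = query(table, np.array([1, 0, 0, 0]))
```

The search reduction comes from scanning only the returned candidate set instead of the whole dataset, which is where speedups like the reported 98X and 478X originate.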

    A system for learning statistical motion patterns

    Get PDF
    Analysis of motion patterns is an effective approach for anomaly detection and behavior prediction. Current approaches to the analysis of motion patterns depend on known scenes where objects move in predefined ways. It is highly desirable to automatically construct object motion patterns that reflect the knowledge of the scene. In this paper, we present a system for automatically learning motion patterns for anomaly detection and behavior prediction, based on a proposed algorithm for robustly tracking multiple objects. In the tracking algorithm, foreground pixels are clustered using a fast, accurate fuzzy k-means algorithm. Growing and prediction of the cluster centroids of foreground pixels ensure that each cluster centroid is associated with a moving object in the scene. In the algorithm for learning motion patterns, trajectories are clustered hierarchically using spatial and temporal information, and each motion pattern is then represented by a chain of Gaussian distributions. Based on the learned statistical motion patterns, statistical methods are used to detect anomalies and predict behaviors. Our system is tested using image sequences acquired, respectively, from a crowded real traffic scene and a model traffic scene. Experimental results show the robustness of the tracking algorithm, the efficiency of the algorithm for learning motion patterns, and the encouraging performance of the algorithms for anomaly detection and behavior prediction.
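The "chain of Gaussians" representation lends itself to a simple likelihood-based anomaly test: score a new trajectory against the chain's per-step Gaussians and flag it when the likelihood is low. The isotropic Gaussians, the average-log-likelihood score, and all names below are illustrative assumptions; the paper's actual statistical tests may differ.

```python
import math

def gaussian_logpdf(x, y, mu, sigma):
    """Log-density of an isotropic 2-D Gaussian with mean mu."""
    dx, dy = x - mu[0], y - mu[1]
    return -(dx * dx + dy * dy) / (2 * sigma * sigma) - math.log(2 * math.pi * sigma * sigma)

def trajectory_score(traj, pattern, sigma=1.0):
    """Average log-likelihood of a trajectory under a motion pattern
    modeled as a chain of Gaussians, one per trajectory step
    (a hypothetical scoring rule for illustration)."""
    n = min(len(traj), len(pattern))
    return sum(gaussian_logpdf(x, y, pattern[i], sigma)
               for i, (x, y) in enumerate(traj[:n])) / n

pattern = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]    # chain of Gaussian means
normal = [(0.1, 0.0), (1.0, 0.1), (2.0, -0.1)]    # close to the pattern
anomalous = [(0.0, 3.0), (1.0, 3.0), (2.0, 3.0)]  # far from the pattern
normal_score = trajectory_score(normal, pattern)
anomalous_score = trajectory_score(anomalous, pattern)
```

A trajectory whose score falls below a learned threshold would be reported as an anomaly; the same score can rank candidate patterns for behavior prediction.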
    • โ€ฆ
    corecore