
    Contextual Bandit Modeling for Dynamic Runtime Control in Computer Systems

    Modern operating systems and microarchitectures provide a myriad of mechanisms for monitoring and affecting system operation and resource utilization at runtime. Dynamic runtime control of these mechanisms can tailor system operation to the characteristics and behavior of the current workload, resulting in improved performance. However, developing effective models for system control can be challenging. Existing methods often require extensive manual effort, computation time, and domain knowledge to identify relevant low-level performance metrics, to relate those metrics and high-level control decisions to workload performance, and to evaluate the resulting control models. This dissertation develops a general framework, based on the contextual bandit, for describing and learning effective models for runtime system control. Random profiling is used to characterize the relationship between workload behavior, system configuration, and performance. The framework is evaluated in the context of two applications of progressive complexity: first, the selection of paging modes (Shadow Paging, Hardware-Assisted Paging) in the Xen virtual machine memory manager; second, the utilization of hardware memory prefetching for multi-core, multi-tenant workloads with cross-core contention for shared memory resources, such as the last-level cache and memory bandwidth. The resulting models for both applications are competitive with existing runtime control approaches. For paging mode selection, the resulting model matches the performance of the state of the art while substantially reducing the computational requirements of profiling. For hardware memory prefetcher utilization, the resulting models are the first to provide dynamic control of hardware prefetchers using workload statistics. Finally, a correlation-based feature selection method is evaluated for identifying relevant low-level performance metrics related to hardware memory prefetching.
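    To make the contextual-bandit framing concrete, the following is a minimal sketch, not the dissertation's implementation: an epsilon-greedy bandit with one linear reward model per arm chooses a paging mode from low-level performance metrics. The arm names, the counter features, and the linear reward model are illustrative assumptions.

```python
import numpy as np

# Illustrative contextual bandit for paging-mode selection.
# Arms and context features are assumptions, not the dissertation's exact setup.
ARMS = ["shadow_paging", "hardware_assisted_paging"]

class EpsilonGreedyLinearBandit:
    def __init__(self, n_features, n_arms, epsilon=0.1, lr=0.01):
        self.epsilon = epsilon
        self.lr = lr
        # One linear reward model per arm: predicted reward = w . context
        self.w = np.zeros((n_arms, n_features))

    def select(self, context):
        if np.random.rand() < self.epsilon:        # explore (akin to random profiling)
            return np.random.randint(len(self.w))
        return int(np.argmax(self.w @ context))    # exploit the learned models

    def update(self, arm, context, reward):
        # One SGD step toward the observed reward, for the chosen arm only
        error = reward - self.w[arm] @ context
        self.w[arm] += self.lr * error * context

# Hypothetical context: normalized hardware counters for the current workload
bandit = EpsilonGreedyLinearBandit(n_features=3, n_arms=len(ARMS))
context = np.array([0.7, 0.2, 0.5])  # e.g. TLB-miss rate, page-fault rate, IPC
arm = bandit.select(context)
reward = 0.9                         # e.g. measured workload throughput
bandit.update(arm, context, reward)
```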

    Efficient machine learning: models and accelerations

    One of the key enablers of the recent unprecedented success of machine learning is the adoption of very large models. Modern machine learning models typically consist of multiple cascaded layers, such as deep neural networks, and contain at least millions to hundreds of millions of parameters (i.e., weights). Larger-scale models tend to enable the extraction of more complex high-level features and therefore lead to significant improvements in overall accuracy. On the other hand, the layered deep structure and large model sizes also demand increased computational capability and memory. To achieve higher scalability, performance, and energy efficiency for deep learning systems, two orthogonal research and development trends have attracted enormous interest: acceleration and model compression. The underlying goal of both is to maintain model quality so that predictions remain accurate. In this thesis, we address these two problems and utilize different computing paradigms to solve real-life deep learning problems. To explore these two domains, the thesis first presents the cogent confabulation network for the sentence completion problem. We use the Chinese language as a case study to describe our exploration of cogent-confabulation-based text recognition models. These models were explored and optimized through various comparisons, and the optimized network offered better sentence completion accuracy. To accelerate sentence completion on a multi-processing system, we propose a parallel framework for the confabulation recall algorithm. The parallel implementation reduces runtime, improves recall accuracy by breaking the fixed evaluation order and introducing more generalization, and maintains balanced progress in status updates among all neurons. A lexicon scheduling algorithm is presented to further improve model performance. Because deep neural networks have proven effective in many real-life applications and are deployed on low-power devices, we then investigated accelerating neural network inference using a hardware-friendly computing paradigm, stochastic computing. It is an approximate computing paradigm that requires a small hardware footprint and achieves high energy efficiency. Applying stochastic computing to deep convolutional neural networks, we design the functional hardware blocks and optimize them jointly to minimize the accuracy loss due to the approximation. Synthesis results show that the proposed design achieves remarkably low hardware cost and power/energy consumption. Modern neural networks usually involve a huge number of parameters that cannot fit into embedded devices, so compressing deep learning models together with accelerating them attracts our attention. We introduce structured-matrix-based neural networks to address this problem. The circulant matrix is one such structured matrix: the entire matrix can be represented by a single vector, so the matrix is compressed. We further investigate a more flexible structure based on the circulant matrix, the block-circulant matrix, which partitions a matrix into several smaller blocks and makes each submatrix circulant; the compression ratio is thus controllable. With the help of Fourier-transform-based equivalent computation, inference of such deep neural networks can be accelerated energy-efficiently on FPGAs. We also optimize the training algorithm for block-circulant-matrix-based neural networks to obtain high accuracy after compression.
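    As a worked illustration of the Fourier-transform-based equivalent computation, here is a generic sketch of the standard circulant identity, not the thesis code: a circulant matrix is fully defined by its first column c, and the product Cx equals IFFT(FFT(c) * FFT(x)), so an n-by-n block costs O(n log n) instead of O(n^2) and is stored as n values instead of n^2. The block size and partitioning below are illustrative.

```python
import numpy as np

def circulant_matvec(c, x):
    """Multiply the circulant matrix defined by first column c with x,
    using the identity C @ x = IFFT(FFT(c) * FFT(x))."""
    return np.real(np.fft.ifft(np.fft.fft(c) * np.fft.fft(x)))

def block_circulant_matvec(first_cols, x, block):
    """first_cols[i][j] is the defining vector of block (i, j) of a weight
    matrix partitioned into circulant blocks; each block stores `block`
    numbers instead of block**2, which is the compression."""
    y = np.zeros(len(first_cols) * block)
    for i, row_blocks in enumerate(first_cols):
        for j, c in enumerate(row_blocks):
            y[i*block:(i+1)*block] += circulant_matvec(c, x[j*block:(j+1)*block])
    return y

# Sanity check against the dense computation for one random block
rng = np.random.default_rng(0)
n = 8
c = rng.standard_normal(n)
x = rng.standard_normal(n)
dense = np.array([[c[(i - j) % n] for j in range(n)] for i in range(n)])
assert np.allclose(dense @ x, circulant_matvec(c, x))
```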

    Hardware Considerations for Signal Processing Systems: A Step Toward the Unconventional.

    As we progress into the future, signal processing algorithms are becoming more computationally intensive and power hungry, while the desire for mobile products and low-power devices is also increasing. An integrated ASIC solution is one of the primary ways chip developers can improve performance and add functionality while keeping the power budget low. This work discusses ASIC hardware for both conventional and unconventional signal processing systems, and how integration, error resilience, emerging devices, and new algorithms can be leveraged by signal processing systems to further improve performance and enable new applications. Specifically, this work presents three case studies: 1) a conventional and highly parallel mixed-signal cross-correlator ASIC for a weather satellite performing real-time synthetic aperture imaging, 2) an unconventional native stochastic computing architecture enabled by memristors, and 3) two unconventional sparse neural network ASICs for feature extraction and object classification. As improvements from technology scaling alone slow down and the demand for energy-efficient mobile electronics increases, such optimization techniques at the device, circuit, and system level will become more critical to advancing signal processing capabilities in the future.
    PhD, Electrical Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/116685/1/knagphil_1.pd
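    For intuition about what a parallel cross-correlator computes, here is a minimal NumPy sketch of lag-domain cross-correlation between two digitized antenna signals; the signal lengths, noise level, and delay are illustrative assumptions, and each lag would correspond to one multiply-accumulate lane in a hardware correlator.

```python
import numpy as np

def cross_correlate(a, b, max_lag):
    """Cross-correlation of two digitized signals for lags -max_lag..max_lag;
    in a parallel correlator ASIC, every lag is computed by its own
    multiply-accumulate lane."""
    n = len(a)
    return np.array([np.sum(a[max(0, -k):n - max(0, k)] *
                            b[max(0, k):n - max(0, -k)])
                     for k in range(-max_lag, max_lag + 1)])

rng = np.random.default_rng(1)
sig = rng.standard_normal(4096)
delayed = np.roll(sig, 5) + 0.1 * rng.standard_normal(4096)  # 5-sample delay
corr = cross_correlate(sig, delayed, max_lag=16)
print(corr.argmax() - 16)  # correlation peak near lag 5 recovers the delay
```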

    ํšจ์œจ์ ์ธ ์ถ”๋ก ์„ ์œ„ํ•œ ํ•˜๋“œ์›จ์–ด ์นœํ™”์  ์‹ ๊ฒฝ๋ง ๊ตฌ์กฐ ๋ฐ ๊ฐ€์†๊ธฐ ์„ค๊ณ„

    PhD dissertation -- Seoul National University Graduate School: College of Engineering, Department of Electrical and Computer Engineering, August 2020. Advisor: ์ดํ˜์žฌ. Research on deep learning, currently the most prominent machine learning method, is actively pursued on both the hardware and the software side. On the software side, optimization methods such as designing mobile-oriented neural network architectures and compressing trained models are studied to enable efficient inference while maintaining high accuracy; on the hardware side, research proceeds in parallel on accelerators that deliver fast inference and high energy efficiency for a given trained deep learning model. Going beyond these existing optimization and design methods, this dissertation aims to build more efficient inference systems by applying new hardware design techniques and model transformation methods. First, a more efficient deep learning accelerator was designed by adopting stochastic computing, a new hardware design approach. Stochastic computing is a circuit design method based on probabilistic operations that can implement the same arithmetic circuits with far fewer transistors than a conventional binary system. In particular, the multiplication that dominates deep learning computation requires an array multiplier in a binary system, whereas stochastic computing implements it with a single AND gate. Prior work has designed stochastic-computing-based deep learning accelerators, but their recognition accuracy lagged far behind binary circuits. To resolve this, the dissertation designs an accelerator that uses unipolar encoding to increase computational accuracy, and proposes sharing each stochastic number generator among multiple neurons to reduce the generator overhead. Second, for higher inference speedups, a method is presented that transforms the neural network architecture instead of compressing a trained model. Prior results show that applying model compression to recent architectures yields high compression ratios for the weight parameters but only marginal improvements in actual inference speed. This limited practical speedup stems from structural limitations of the network architecture itself, and changing the architecture is the most fundamental remedy; based on this observation, the dissertation proposes an architecture transformation method that achieves higher speedups than prior work. Finally, a neural architecture search method is presented that expands the search space so that each layer can have a different structure while keeping training feasible. Architecture search in prior work finds the structure of a basic unit, the cell, and replicates it to build one large network. Because only a single cell structure is used, position-dependent information such as the size of the input feature map or of the weight parameters is ignored. This dissertation presents a method that resolves these problems while training stably. In addition, a new penalty is devised that constrains not only the amount of computation but also the number of memory accesses, helping the search find more efficient architectures.
    Deep learning is the most promising machine learning algorithm, and it is already used in real life. Indeed, the latest smartphones use neural networks for better photographs and voice recognition. However, as the performance of neural networks has improved, their hardware cost has increased dramatically. Until the past few years, much research focused on only a single side, hardware or software, so the actual cost hardly improved. Therefore, hardware and software co-optimization is needed to achieve further improvement. For this reason, this dissertation proposes an efficient inference system that considers the hardware accelerator in the network architecture design. The first part of the dissertation is a deep neural network accelerator with stochastic computing. The main goal is an efficient stochastic computing hardware design for a convolutional neural network. It includes a stochastic ReLU and an optimized max function, which are key components of the convolutional neural network. To avoid the range limitation problem of stochastic numbers and increase the signal-to-noise ratio, we perform weight normalization and upscaling. In addition, to reduce the overhead of binary-to-stochastic conversion, we propose a scheme for sharing stochastic number generators among the neurons in the convolutional neural network. The second part of the dissertation is a neural architecture transformation. Network recasting is proposed, and it enables network architecture transformation. The primary goal of this method is to accelerate the inference process through the transformation, but there can be many other practical applications. The method is based on block-wise recasting: it recasts each source block in a pre-trained teacher network to a target block in a student network. For the recasting, a target block is trained such that its output activation approximates that of the source block. Such block-by-block recasting in a sequential manner transforms the network architecture while preserving accuracy. The method can transform an arbitrary teacher network type into an arbitrary student network type, and it can even generate a mixed-architecture network that consists of two or more types of block. Network recasting can generate a network with fewer parameters and/or activations, which reduces the inference time significantly. Naturally, it can also be used for network compression by recasting a trained network into a smaller network of the same type. The third part of the dissertation is a fine-grained neural architecture search. InheritedNAS is a fine-grained architecture search method that starts from the coarse-grained architecture found by a cell-based architecture search. A fine-grained architecture has a very large search space, so it is hard to find directly. A stage-independent search is therefore proposed: it divides the entire network into several stages and trains each stage independently. To break the dependency between stages, a two-point matching distillation method is also proposed. Operation pruning is then applied to remove unimportant operations, using block-wise rather than node-wise pruning. In addition, a hardware-aware latency penalty is proposed that covers not only FLOPs but also memory accesses.
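    The AND-gate multiplication that motivates the stochastic-computing accelerator can be shown in a few lines. This is a generic illustration of unipolar stochastic computing, not the dissertation's hardware: a value x in [0, 1] is encoded as a bitstream whose bits are 1 with probability x, and ANDing two independent streams yields a stream whose mean approximates the product. Accuracy grows with stream length, which is the precision/latency trade-off such accelerators tune.

```python
import numpy as np

def to_stochastic(x, length, rng):
    """Unipolar encoding: each bit is 1 with probability x (x in [0, 1]).
    In hardware this is a comparator fed by a random number generator."""
    return rng.random(length) < x

rng = np.random.default_rng(42)
x, y, length = 0.8, 0.5, 4096
sx = to_stochastic(x, length, rng)  # streams must be statistically independent
sy = to_stochastic(y, length, rng)
product_stream = sx & sy            # one AND gate replaces an array multiplier
print(product_stream.mean())        # ~0.4 = 0.8 * 0.5
```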

    Comparative study of state-of-the-art machine learning models for analytics-driven embedded systems

    Analytics-driven embedded systems are gaining a foothold faster than ever in the current digital era. The Internet of Things (IoT) has generated an entire ecosystem of devices, communicating and exchanging data automatically in an interconnected global network. The ability to efficiently process and utilize the enormous amount of data generated by an ensemble of embedded devices such as RFID tags and sensors enables engineers to build smart real-world systems. An analytics-driven embedded system explores and processes the data in situ or remotely to identify patterns in the behavior of the system, which in turn can be used to automate actions and endow a device with decision-making capability. Designing an intelligent data processing model is paramount for reaping the benefits of data analytics, because a poorly designed analytics infrastructure would degrade the system's performance and effectiveness. Many different aspects of this data make analytics a complex and challenging task, and hence a suitable candidate for big data. Big data is mainly characterized by its high volume, hugely varied data types, and high speed of data receipt; all these properties mandate choosing the correct data mining techniques for designing the analytics model. Image datasets, such as face recognition or satellite imagery, perform better with deep learning algorithms; time-series datasets, such as sensor data from wearable devices, give better results with clustering and supervised learning models; and a regression model suits multivariate datasets such as appliance energy prediction or forest fire data best. Each machine learning task has a varied range of algorithms that can be used in combination to create an intelligent data analysis model. In this study, a comprehensive comparative analysis was conducted, using different datasets freely available in an online machine learning repository, to analyze the performance of state-of-the-art machine learning algorithms. The WEKA data mining toolkit was used to evaluate C4.5, Naïve Bayes, Random Forest, kNN, SVM, and Multilayer Perceptron for classification models. Linear regression, Gradient Boosting Machine (GBM), Multilayer Perceptron, kNN, Random Forest, and Support Vector Machines (SVM) were applied to datasets fit for regression machine learning. Datasets were trained and analyzed in different experimental setups, and a qualitative comparative analysis was performed with k-fold cross-validation (CV) and paired t-tests in the Weka Experimenter.
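    The study ran its protocol in WEKA; as a hedged sketch of the same evaluation idea in Python, the snippet below compares two of the listed classifiers with matched 10-fold CV scores and a paired t-test. The dataset and models here are illustrative stand-ins, and WEKA's Experimenter applies a corrected variant of this test across repeated CV runs rather than the plain version shown.

```python
from scipy.stats import ttest_rel
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = load_breast_cancer(return_X_y=True)
cv = KFold(n_splits=10, shuffle=True, random_state=0)  # same folds for both models

# Per-fold accuracies for two of the compared classifier families
rf_scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=cv)
nb_scores = cross_val_score(GaussianNB(), X, y, cv=cv)

# Paired t-test over the matched folds: is the mean difference significant?
t_stat, p_value = ttest_rel(rf_scores, nb_scores)
print(rf_scores.mean(), nb_scores.mean(), p_value)
```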

    On Information-centric Resiliency and System-level Security in Constrained, Wireless Communication

    The Internet of Things (IoT) interconnects many heterogeneous embedded devices, either locally with each other or globally with the Internet. These things are resource-constrained, e.g., battery-powered, and typically communicate via low-power and lossy wireless links. Communication needs to be secured and relies on crypto-operations that are often resource-intensive and in conflict with the device constraints. These demanding operational conditions on the cheapest possible hardware, the unreliable wireless transmission, and the need for protection against common threats from the wider Internet impose severe challenges on IoT networks. In this thesis, we advance the current state of the art in two dimensions. Part I assesses information-centric networking (ICN) for the IoT, a network paradigm that promises enhanced reliability for data retrieval in constrained edge networks. ICN lacks a lower-layer definition, which, however, is the key to enabling device sleep cycles and exclusive wireless media access. This part of the thesis designs and evaluates an effective media access strategy for ICN to reduce the energy consumption and wireless interference on constrained IoT nodes. Part II examines the performance of hardware and software crypto-operations executed on off-the-shelf IoT platforms. A novel system design enables the accessibility and auto-configuration of crypto-hardware through an operating system. One main focus is the generation of random numbers in the IoT. This part of the thesis further designs and evaluates Physical Unclonable Functions (PUFs) to provide novel randomness sources that generate highly unpredictable secrets on low-cost devices that lack hardware-based security features. This thesis takes a practical view on the constrained IoT and is accompanied by real-world implementations and measurements. We contribute open-source software, automation tools, a simulator, and reproducible measurement results from real IoT deployments using off-the-shelf hardware. The large-scale experiments in an open-access testbed provide a direct starting point for future research.
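    As a hedged illustration of turning a biased physical source, for example SRAM startup bits used as a PUF-style randomness source, into unbiased random bits, here is the classic von Neumann extractor. The source model is an assumption for the sketch, not the thesis's exact design, which combines PUF responses with further post-processing.

```python
import random

def von_neumann_extract(bits):
    """Debias a stream of independent but biased bits: examine
    non-overlapping pairs, emit 0 for '01' and 1 for '10', drop '00'/'11'."""
    out = []
    for i in range(0, len(bits) - 1, 2):
        a, b = bits[i], bits[i + 1]
        if a != b:
            out.append(a)
    return out

# Simulated biased source, e.g. SRAM cells that power up as 1 70% of the time
raw = [1 if random.random() < 0.7 else 0 for _ in range(10000)]
unbiased = von_neumann_extract(raw)
print(sum(raw) / len(raw), sum(unbiased) / len(unbiased))  # ~0.7 vs ~0.5
```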

    Energy Efficient Neocortex-Inspired Systems with On-Device Learning

    Shifting compute workloads from the cloud toward edge devices can significantly improve the overall latency of inference and learning; on the other hand, this paradigm shift exacerbates the resource constraints on the edge devices. Neuromorphic computing architectures, inspired by neural processes, are natural substrates for edge devices. They offer co-located memory, in-situ training, energy efficiency, high memory density, and compute capacity in a small form factor. Owing to these features, there has recently been a rapid proliferation of hybrid CMOS/memristor neuromorphic computing systems. However, most of these systems offer limited plasticity, target either spatial or temporal input streams, and have not been demonstrated on large-scale heterogeneous tasks. There is a critical knowledge gap in designing scalable neuromorphic systems that can support hybrid plasticity for spatio-temporal input streams on edge devices. This research proposes Pyragrid, a low-latency and energy-efficient neuromorphic computing system for processing spatio-temporal information natively on the edge. Pyragrid is a full-scale custom hybrid CMOS/memristor architecture with analog computational modules and an underlying digital communication scheme. Pyragrid is designed for hierarchical temporal memory, a biomimetic sequence memory algorithm inspired by the neocortex. It features a novel synthetic synapse representation that enables dynamic synaptic pathways with reduced memory usage and interconnect. The dynamic growth of the synaptic pathways is emulated in the physical behavior of the memristor device, while synaptic modulation is enabled through a custom training scheme optimized for area and power. Pyragrid features data reuse, in-memory computing, and event-driven sparse local computing to reduce data movement by ~44x and to increase system throughput and power efficiency by ~3x and ~161x, respectively, over a custom CMOS digital design. The innate sparsity in Pyragrid results in overall robustness to noise and device failure, particularly when processing visual input and predicting time-series sequences. Porting the proposed system to edge devices can enhance their computational capability, response time, and battery life.
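    Hierarchical temporal memory operates on sparse distributed representations (SDRs), and its matching step is a simple overlap count over active bits. The sketch below is generic to the algorithm, not Pyragrid's circuits, and the sizes and sparsity are illustrative; it hints at why sparsity yields both event-driven computation and robustness to noise.

```python
import numpy as np

def sdr_overlap(active_a, active_b):
    """Overlap of two SDRs given as sets of active-bit indices; only the
    few active bits are touched, which is the event-driven, sparse part."""
    return len(active_a & active_b)

n_bits, n_active = 2048, 40  # ~2% sparsity, typical for HTM-style encodings
rng = np.random.default_rng(7)
a = set(rng.choice(n_bits, n_active, replace=False).tolist())
b = set(rng.choice(n_bits, n_active, replace=False).tolist())
noisy_a = set(list(a)[:-4]) | set(rng.choice(n_bits, 4).tolist())  # corrupted copy of a

print(sdr_overlap(a, b))        # unrelated codes: overlap near zero
print(sdr_overlap(a, noisy_a))  # noisy copy: still high overlap -> noise robustness
```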

    Deployment of Deep Neural Networks on Dedicated Hardware Accelerators

    Deep Neural Networks (DNNs) have established themselves as powerful tools for a wide range of complex tasks, for example computer vision or natural language processing. DNNs are notoriously demanding on compute resources, and as a result, dedicated hardware accelerators are being developed for all use cases, from hyperscale cloud environments for training DNNs to inference devices in embedded systems. These accelerators implement intrinsics for complex operations directly in hardware; a common example is an intrinsic for matrix multiplication. However, there is a gap between the ecosystems of applications for deep learning practitioners and hardware accelerators. How DNNs can efficiently utilize the specialized hardware intrinsics is still mainly determined by human hardware and software experts. Methods to automatically utilize hardware intrinsics in DNN operators are a subject of active research. Existing literature often takes transformation-driven approaches, which aim to establish a sequence of program rewrites and data-layout transformations such that the hardware intrinsic can be used to compute the operator. However, the complexity of this task has not yet been explored, especially for less frequently used operators like Capsule Routing. Not only is the implementation of DNN operators with intrinsics challenging; their optimization on the target device is also difficult. Hardware-in-the-loop tools are often used for this problem: they use latency measurements of implementation candidates to find the fastest one. However, specialized accelerators can have memory and programming limitations, so not every arithmetically correct implementation is a valid program for the accelerator, and these invalid implementations can lead to unnecessarily long optimization times. This work investigates the complexity of transformation-driven processes that automatically embed hardware intrinsics into DNN operators, exploring it with a custom, graph-based intermediate representation (IR). While operators like fully connected layers can be handled with reasonable effort, increasing operator complexity or advanced data-layout transformations can lead to scaling issues. Building on these insights, this work proposes a novel method to embed hardware intrinsics into DNN operators based on a dataflow analysis. The dataflow embedding method allows exploring how intrinsics and operators match without explicit transformations; from the results it can derive the data layout and program structure necessary to compute the operator with the intrinsic. A prototype implementation for a dedicated hardware accelerator demonstrates state-of-the-art performance for a wide range of convolutions while being agnostic to the data layout. For some operators in the benchmark, the presented method can also generate alternative implementation strategies to improve hardware utilization, resulting in a geo-mean speed-up of ×2.813 while reducing the memory footprint. Lastly, by curating the initial set of possible implementations for the hardware-in-the-loop optimization, the median time-to-solution is reduced by a factor of ×2.40. At the same time, the possibility of prolonged searches due to a bad initial set of implementations is reduced, improving the optimization's robustness by ×2.35.
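    To illustrate what embedding a matrix-multiplication intrinsic into a DNN operator entails, here is the standard im2col sketch that lowers a 2-D convolution to one matrix multiply; this is the textbook transformation-driven example, whereas the thesis's dataflow-based method derives such layouts without explicit rewrites. Shapes and the single-channel setting are illustrative assumptions.

```python
import numpy as np

def im2col(x, kh, kw):
    """Unfold an HxW input into a matrix whose columns are kh*kw patches,
    so the convolution becomes a single matrix multiply (the 'intrinsic')."""
    h, w = x.shape
    oh, ow = h - kh + 1, w - kw + 1
    cols = np.empty((kh * kw, oh * ow))
    for i in range(oh):
        for j in range(ow):
            cols[:, i * ow + j] = x[i:i + kh, j:j + kw].ravel()
    return cols, (oh, ow)

x = np.arange(25, dtype=float).reshape(5, 5)
k = np.ones((3, 3))
cols, (oh, ow) = im2col(x, 3, 3)
y = (k.ravel() @ cols).reshape(oh, ow)  # one GEMM/GEMV call on the accelerator

# Reference: direct sliding-window convolution (cross-correlation form)
ref = np.array([[np.sum(x[i:i+3, j:j+3] * k) for j in range(ow)] for i in range(oh)])
assert np.allclose(y, ref)
```

    The data-layout cost is the point: im2col duplicates input elements across patch columns, which is exactly the kind of transformation overhead a dataflow-based embedding tries to reason about before committing to an implementation.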

    Advances in Human-Robot Interaction

    Rapid advances in the field of robotics have made it possible to use robots not just in industrial automation but also in entertainment, rehabilitation, and home service. Since robots will likely affect many aspects of human existence, fundamental questions of human-robot interaction must be formulated and, if at all possible, resolved. Some of these questions are addressed in this collection of papers by leading HRI researchers.