
    Gunrock: GPU Graph Analytics

    For large-scale graph analytics on the GPU, the irregularity of data access and control flow, and the complexity of programming GPUs, have presented two significant challenges to developing a programmable high-performance graph library. "Gunrock", our graph-processing system designed specifically for the GPU, uses a high-level, bulk-synchronous, data-centric abstraction focused on operations on a vertex or edge frontier. Gunrock achieves a balance between performance and expressiveness by coupling high-performance GPU computing primitives and optimization strategies with a high-level programming model that allows programmers to quickly develop new graph primitives with small code size and minimal GPU programming knowledge. We characterize the performance of various optimization strategies and evaluate Gunrock's overall performance on different GPU architectures across a wide range of graph primitives, from traversal-based and ranking algorithms to triangle counting and bipartite-graph-based algorithms. The results show that on a single GPU, Gunrock has on average at least an order of magnitude speedup over Boost and PowerGraph, comparable performance to the fastest GPU hardwired primitives and to CPU shared-memory graph libraries such as Ligra and Galois, and better performance than any other GPU high-level graph library.
    Comment: 52 pages; invited paper to ACM Transactions on Parallel Computing (TOPC); an extended version of the PPoPP'16 paper "Gunrock: A High-Performance Graph Processing Library on the GPU".
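    As a rough illustration of the frontier-centric model described above, the sketch below computes BFS depths by repeatedly "advancing" a vertex frontier and filtering out visited vertices. It is a plain-Python analogue of the abstraction only, not Gunrock's actual C++/CUDA API; on the GPU each advance/filter step would run as a bulk-parallel kernel.

        def bfs_frontier(adj, source):
            """adj: dict mapping each vertex to a list of its neighbors."""
            depth = {source: 0}
            frontier = [source]
            while frontier:
                # Advance: expand all frontier vertices in bulk.
                # Filter: keep only newly discovered vertices.
                next_frontier = []
                for u in frontier:
                    for v in adj[u]:
                        if v not in depth:
                            depth[v] = depth[u] + 1
                            next_frontier.append(v)
                frontier = next_frontier
            return depth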

    Mapping and Optimizing Communication in ROS 2-based Applications on Configurable System-on-Chip Platforms

    The Robot Operating System (ROS) is the de facto standard for designing and implementing robotics applications. Several previous works deal with the integration of heterogeneous accelerators into ROS-based applications. One of these approaches is ReconROS, which enables nodes to be mapped completely to hardware. The follow-up work fpgaDDS extends ReconROS with an intra-FPGA data distribution service that processes topic-based communication between nodes entirely in hardware. However, this approach is strictly limited to communication between nodes implemented in hardware. This paper introduces gateways to close the gap between topic communication in hardware and software. Gateways aim to reduce data transfers between hardware and software by synchronizing a hardware- and a software-mapped topic. As a result, data must be transferred only once, compared to a separate data transmission for each subscribing hardware node in the baseline. Our measurements show significant speedups in multi-subscriber scenarios with large message sizes. From the conclusions of these measurements, we present a methodology for the communication mapping of ROS 2 computation graphs. In the evaluation, an autonomous driving real-world example benefits from the gateway and achieves a speedup of 1.4.
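    The gateway idea can be sketched in plain software terms: a single node subscribes to a topic on one side of the hardware/software boundary and republishes it on the other, so a message crosses the boundary once and then fans out to any number of subscribers. This is a hypothetical rclpy analogue for illustration only; the paper's gateways bridge an intra-FPGA DDS and ROS 2, and the topic names here are invented.

        import rclpy
        from rclpy.node import Node
        from std_msgs.msg import ByteMultiArray

        class Gateway(Node):
            def __init__(self):
                super().__init__('gateway')
                # 'topic_hw' and 'topic_sw' are made-up names for illustration.
                self.pub = self.create_publisher(ByteMultiArray, 'topic_sw', 10)
                self.sub = self.create_subscription(
                    ByteMultiArray, 'topic_hw', self.forward, 10)

            def forward(self, msg):
                # One transfer across the boundary; fan-out happens on this side.
                self.pub.publish(msg)

        def main():
            rclpy.init()
            rclpy.spin(Gateway())

        if __name__ == '__main__':
            main()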

    HyPar: Towards Hybrid Parallelism for Deep Learning Accelerator Array

    With the rise of artificial intelligence in recent years, Deep Neural Networks (DNNs) have been widely used in many domains. To achieve high performance and energy efficiency, hardware acceleration of DNNs (especially of inference) is intensively studied both in academia and industry. However, we still face two challenges: large DNN models and datasets, which incur frequent off-chip memory accesses; and the training of DNNs, which is not well explored in recent accelerator designs. To truly provide high-throughput and energy-efficient acceleration for the training of deep and large models, we inevitably need to use multiple accelerators and exploit coarse-grain parallelism, beyond the fine-grain parallelism inside a layer considered in most existing architectures. This poses the key research question of seeking the best organization of computation and dataflow among the accelerators. In this paper, we propose HyPar, a solution that determines layer-wise parallelism for deep neural network training with an array of DNN accelerators. HyPar partitions the feature map tensors (input and output), the kernel tensors, the gradient tensors, and the error tensors across the DNN accelerators. A partition constitutes the choice of parallelism for the weighted layers. The optimization target is to search for a partition that minimizes the total communication during the training of a complete DNN. To solve this problem, we propose a communication model to explain the source and amount of communication, and then use a hierarchical layer-wise dynamic programming method to search for the partition of each layer.
    Comment: To appear in the 2019 25th International Symposium on High-Performance Computer Architecture (HPCA 2019).
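    To make the layer-wise dynamic program concrete, here is a toy version: each layer picks data or model parallelism, paying a per-layer communication cost plus a transition cost that depends on the neighboring layer's choice. The cost model and numbers are invented stand-ins, not HyPar's actual communication model.

        def choose_parallelism(intra, trans):
            """intra[i][c]: comm. cost of layer i under choice c;
            trans[c1][c2]: cost of switching from c1 to c2 between layers."""
            choices = ('data', 'model')
            best = {c: intra[0][c] for c in choices}       # DP over layers
            for layer in intra[1:]:
                best = {c: layer[c] + min(best[p] + trans[p][c]
                                          for p in choices)
                        for c in choices}
            return min(best.values())

        intra = [{'data': 1, 'model': 4}, {'data': 3, 'model': 1}]
        trans = {'data': {'data': 0, 'model': 2},
                 'model': {'data': 2, 'model': 0}}
        print(choose_parallelism(intra, trans))            # 1 + 2 + 1 = 4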

    ๊ณ ์„ฑ๋Šฅ ์ธ๊ณต ์‹ ๊ฒฝ๋ง์„ ์œ„ํ•œ ๋ฉ”๋ชจ๋ฆฌ ๋ ˆ์ด์•„์›ƒ ๋ฐ ์ปดํ“จํŒ… ๊ธฐ๋ฒ•

    Doctoral dissertation -- Seoul National University Graduate School, Department of Electrical and Computer Engineering, February 2021. Advisor: Taewhan Kim.
    Although the demand for exploiting neural networks is steadily increasing, there are many design challenges, since deep neural networks (DNNs) entail excessive memory and computation cost. This dissertation studies a number of new techniques for effectively processing DNN inference operations.
    Firstly, we attempt to overcome the limitation that the maximal computation speedup is bounded by the total number of non-zero bits of the weights. Precisely, this work, based on signed-digit encoding, (1) proposes a transformation technique which converts the two's complement representation of every weight into a set of signed-digit representations with the minimal number of essential bits, (2) formulates the problem of selecting the signed-digit representations of weights that maximize the parallelism of bit-level multiplication on the weights as a multi-objective shortest path problem, so as to achieve maximal digit-index-by-digit-index (i.e., column-wise) compression of the weights, and solves it efficiently using an approximation algorithm, and (3) proposes a novel supporting acceleration architecture (DWP) with no additional inclusion of non-trivial hardware. In addition, we (4) propose a variant of DWP to support bit-level parallel multiplication with the capability of predicting a tight worst-case latency of the parallel processing. Through experiments on several representative models using the ImageNet dataset, it is shown that our proposed approach reduces the number of essential bits by 69% on AlexNet, 74% on VGG-16, and 68% on ResNet-152, by which our accelerator reduces the inference computation time by up to 3.57x over conventional bit-level weight pruning.
    Secondly, a new algorithm for extracting common kernels and convolutions to maximally eliminate the redundant operations among the convolutions in binary- and ternary-weight convolutional neural networks is presented. Specifically, we propose (1) a new algorithm of common kernel extraction to overcome the local and limited exploration of common kernel candidates by the existing method, and subsequently apply (2) a new concept of common convolution extraction to maximally eliminate the redundancy in the convolution operations. In addition, our algorithm is able to (3) minimize the number of resulting kernels for the convolutions, thereby saving the total memory access latency for kernels. Experimental results on ternary-weight VGG-16 demonstrate that our convolution optimization algorithm is very effective: it reduces the total number of operations for all convolutions by 25.8-26.3% and the total number of execution cycles on a hardware platform by 22.4%, while using 2.7-3.8% fewer kernels than convolution with the common kernels extracted by the state-of-the-art algorithm.
    Finally, we propose solutions for maintaining the accuracy of DNNs under unfitted compression, in which all distinct weights of the compressed DNN cannot be entirely contained in on-chip memory. Precisely, given an access sequence of weights, (1) the first problem is to arrange the weights in off-chip memory so that the number of accesses to the off-chip memory (equivalently, the energy consumed by the accesses) is minimized, and (2) the second problem is to devise a strategy of selecting a weight block in on-chip memory for replacement when a block miss occurs, with the objective of minimizing the total energy consumed by the off-chip memory accesses and the overhead of scanning indexes for block replacement. Through experiments with a compressed AlexNet model, it is shown that our solutions reduce the total energy consumption of the off-chip memory accesses, including the scanning overhead, by 34.2% on average over the use of an unoptimized memory layout and an LRU replacement scheme.
    Table of contents:
    1 Introduction
      1.1 Deep Neural Networks and Its Challenges
      1.2 Redundant Weight Elimination Methods in DNN
      1.3 Redundant Representation Elimination Methods in DNN
      1.4 Contributions of This Dissertation
    2 Bit-level Weight Pruning Techniques for High-Performance Neural Networks
      2.1 Preliminary
        2.1.1 Bit-level Weight Pruning in Binary Representation
        2.1.2 Bit-level Weight Pruning in Signed-digit Representation
        2.1.3 CSD Representation Conversion
      2.2 Motivations
        2.2.1 Inefficiency in Two's Complement Representation
        2.2.2 Inability to Exploit Signed-digit Representation
      2.3 Signed-digit Representation-based Deeper Weight Pruning
        2.3.1 Generating Signed-digit Representations
        2.3.2 Selecting Signed-digit Representations for Maximal Parallelism
        2.3.3 Extension to the Low-precision Weights
      2.4 Supporting Hardware Architecture
        2.4.1 Technique for Using a Single Bit to Encode Ternary Value
        2.4.2 Structure of Supporting Architecture
        2.4.3 Memory Analysis
        2.4.4 Full Utilization of Accumulation Adders
        2.4.5 Modification for Hybrid Approach
      2.5 Bit-level Intra-weight Pruning
        2.5.1 Signed-digit Representation Conversion
        2.5.2 Encoding Technique
        2.5.3 Supporting Hardware Architecture
      2.6 Experimental Results
        2.6.1 Essential Bits
        2.6.2 Memory Usage
        2.6.3 Performance
        2.6.4 Area
        2.6.5 Energy Efficiency
    3 Convolution Computation Techniques for High-Performance Neural Networks
      3.1 Motivations
        3.1.1 Limited Space Exploration for Common Kernels
        3.1.2 Inability to Exploit Common Expressions of Convolution Values
      3.2 The Proposed Algorithm
        3.2.1 Common Kernel Extraction
        3.2.2 Common Convolution Extraction
        3.2.3 Memory Access Minimization
      3.3 Hardware Implementation
      3.4 Experimental Results
        3.4.1 Experimental Setup
        3.4.2 Assessing Effectiveness of ConvOpt-op and ConvOpt-mem
        3.4.3 Measuring Performance through Hardware Implementation
        3.4.4 Running Time of ConvOpt
    4 Memory Layout and Block Replacement Techniques for High-Performance Neural Networks
      4.1 Motivation
      4.2 Algorithms for Off-chip Memory Access Optimization for DNNs with Unfitted Compression
        4.2.1 Algorithm for Off-chip Memory Layout
        4.2.2 Algorithm for On-chip Memory Block Replacement
        4.2.3 Exploitation of Parallel Computing
      4.3 Experimental Results
        4.3.1 Experimental Setup
        4.3.2 Assessing the Effectiveness of Mem-layout
        4.3.3 Assessing the Effectiveness of MIN-k Combined with Mem-layout
    5 Conclusions
      5.1 Bit-level Weight Pruning Techniques for High-Performance Neural Networks
      5.2 Convolution Computation Techniques for High-Performance Neural Networks
      5.3 Memory Layout and Block Replacement Techniques for High-Performance Neural Networks
    Abstract (In Korean)
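    As a small illustration of the signed-digit encoding behind technique (1) in the abstract above, the sketch below converts an integer weight into its non-adjacent form (NAF, the canonical signed-digit form), which has the minimal number of nonzero ("essential") digits. It shows the starting representation only; the dissertation's selection among representations for column-wise compression is not reproduced here.

        def to_naf(n):
            """Two's complement integer -> signed digits in {-1, 0, 1},
            least significant digit first, with no two adjacent nonzeros."""
            digits = []
            while n != 0:
                if n % 2:
                    d = 2 - (n % 4)   # pick +1/-1 so the next digit is 0
                    n -= d
                else:
                    d = 0
                digits.append(d)
                n //= 2
            return digits

        # 7 = 0b111 needs three nonzero bits, but NAF(7) = [-1, 0, 0, 1],
        # i.e. 8 - 1, needs only two essential digits.
        assert sum(d * 2**i for i, d in enumerate(to_naf(7))) == 7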

    Communication-Efficient Jaccard Similarity for High-Performance Distributed Genome Comparisons

    The Jaccard similarity index is an important measure of the overlap of two sets, widely used in machine learning, computational genomics, information retrieval, and many other areas. We design and implement SimilarityAtScale, the first communication-efficient distributed algorithm for computing the Jaccard similarity among pairs of large datasets. Our algorithm provides an efficient encoding of this problem as a multiplication of sparse matrices. Both the encoding and the sparse matrix product are performed in a way that minimizes data movement in terms of communication and synchronization costs. We apply our algorithm to obtain the similarity among all pairs of a set of large samples of genomes. This task is a key part of modern metagenomics analysis and an ever-growing need due to the increasing availability of high-throughput DNA sequencing data. The resulting scheme is the first to enable accurate Jaccard distance derivations for massive datasets, using large-scale distributed-memory systems. We package our routines in a tool, called GenomeAtScale, that combines the proposed algorithm with tools for processing input sequences. Our evaluation on real data illustrates that one can use GenomeAtScale to effectively employ tens of thousands of processors to reach new frontiers in large-scale genomic and metagenomic analysis. While GenomeAtScale can be used to foster DNA research, the more general underlying SimilarityAtScale algorithm may be used for high-performance distributed similarity computations in other data analytics application domains.
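    A minimal sketch of the matrix encoding referred to above: with samples as rows of a sparse 0/1 matrix, one sparse matrix product yields all pairwise intersection sizes, from which Jaccard similarities follow. This single-node scipy version shows the encoding only; the paper's contribution is performing it with minimal communication on distributed-memory machines.

        import numpy as np
        from scipy.sparse import csr_matrix

        def pairwise_jaccard(A):
            """A: csr_matrix (n_samples x n_features) with 0/1 entries."""
            inter = (A @ A.T).toarray().astype(float)  # |a & b| for all pairs
            sizes = np.asarray(A.sum(axis=1)).ravel()  # |a| for each sample
            union = sizes[:, None] + sizes[None, :] - inter
            union[union == 0] = 1.0                    # guard 0/0 (empty sets)
            return inter / union

        A = csr_matrix(np.array([[1, 1, 0, 0],
                                 [1, 0, 1, 0],
                                 [1, 1, 0, 0]]))
        print(pairwise_jaccard(A))  # J(row0, row2) = 1.0, J(row0, row1) = 1/3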

    On the Real-Time Performance, Robustness and Accuracy of Medical Image Non-Rigid Registration

    Three critical issues about medical image non-rigid registration are performance, robustness, and accuracy. A registration method that is capable of responding in a timely manner with an accurate alignment, and is robust against variation of the image intensity and missing data, is desirable for clinical use. This work addresses all three of these issues.

    Unacceptable execution time of non-rigid registration (NRR) often presents a major obstacle to its routine clinical use. We present a hybrid data partitioning method to parallelize an NRR method on a cooperative architecture, which enables us to get closer to the goal: accelerating using the architecture rather than designing a parallel algorithm from scratch. To further accelerate the GPU part, a GPU optimization tool is provided to automatically optimize the GPU execution configuration.

    Missing data and variation of the intensity are two severe challenges for the robustness of a registration method. A novel point-based NRR method is presented to resolve the mapping function (deformation field) when point correspondences are missing. The novelty of this method lies in incorporating a finite element biomechanical model into an Expectation and Maximization (EM) framework to resolve the correspondence and the mapping function simultaneously. This method is extended to deal with the deformation induced by tumor resection, which imposes another challenge, i.e., incomplete intra-operative MRI. The registration is formulated as a three-variable (correspondence, deformation field, and resection region) functional minimization problem and resolved by a nested Expectation and Maximization framework. The experimental results show the effectiveness of this method in correcting the deformation in the vicinity of the tumor. To deal with the variation of the intensity, two different methods are developed depending on the specific application. For mono-modality registration on delayed-enhanced cardiac MRI and cine MRI, a hybrid registration method is designed by unifying both intensity- and feature-point-based metrics into one cost function. The experiment on the moving propagation of suspicious myocardial infarction shows the effectiveness of this hybrid method. For multi-modality registration on MRI and CT, a Mutual Information (MI)-based NRR is developed by modeling the underlying deformation as a Free-Form Deformation (FFD). MI is sensitive to the variation of the intensity due to equidistant bins. We overcome this disadvantage by designing a Top-to-Down K-means clustering method to naturally group similar intensities into one bin. The experiment shows this method can increase the accuracy of MI-based registration.

    In image registration, a finite element biomechanical model is usually employed to simulate the underlying movement of the soft tissue. We develop a multi-tissue mesh generation method to build a heterogeneous biomechanical model that realistically simulates the underlying movement of the brain. We focus on four critical mesh properties: tissue-dependent resolution, fidelity to tissue boundaries, smoothness of mesh surfaces, and element quality. Each mesh property can be controlled on a per-tissue level. The experiments comparing the homogeneous model with the heterogeneous model demonstrate the effectiveness of the heterogeneous model in improving registration accuracy.
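    The clustering-based binning can be sketched briefly: intensities are grouped by k-means instead of equidistant bins before estimating mutual information from the joint histogram of bin labels. Plain k-means stands in here for the dissertation's Top-to-Down K-means, and the random images are placeholders.

        import numpy as np
        from sklearn.cluster import KMeans

        def kmeans_bins(img, k):
            """Assign each pixel to one of k intensity clusters."""
            return KMeans(n_clusters=k, n_init=10, random_state=0) \
                .fit_predict(img.reshape(-1, 1))

        def mutual_information(bx, by, k):
            joint = np.zeros((k, k))
            np.add.at(joint, (bx, by), 1)     # joint histogram of bin labels
            pxy = joint / joint.sum()
            px = pxy.sum(axis=1, keepdims=True)
            py = pxy.sum(axis=0, keepdims=True)
            nz = pxy > 0                      # skip empty cells in the sum
            return float((pxy[nz] * np.log(pxy[nz] / (px * py)[nz])).sum())

        fixed = np.random.rand(64, 64)
        moving = np.random.rand(64, 64)
        mi = mutual_information(kmeans_bins(fixed, 16), kmeans_bins(moving, 16), 16)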

    Acceleration of a Dynamically Packed Oblique Sparse Projection Random Forest

    The proliferation of scientific and industrial sensors is causing an accelerating deluge of data, the processing of which into actionable knowledge requires fast and accurate machine learning methods. A class of algorithms suited to processing these large amounts of data is decision forests, widely used methods known for their versatility, state-of-the-art inference, and fast model training. Oblique Sparse Projection Forests (OSPFs) are a subset of decision forests which provide data inference superior to other methods. Despite providing state-of-the-art inference and having a computational complexity similar to other popular decision forests, there are no OSPF implementations that scale beyond trivially sized datasets. We explore whether OSPF training and inference speeds can compete with other popular decision forest variants despite an algorithmic incompatibility that prevents OSPFs from using traditional forest training optimizations. First, using R, we implement a highly extensible proof-of-concept version of a recently conceived OSPF, Randomer Forest, shown to provide state-of-the-art results on many datasets, and provide this system for general use via CRAN. We then develop and implement a postprocessing method, Forest Packing, which packs the nodes of a trained forest into a novel data structure and modifies the ensemble traversal method to accelerate forest-based inference. Finally, we develop FastRerF, an optimized version of Randomer Forest which dynamically performs forest packing during training. The initial implementation in R provided training speeds in line with other decision forest systems and scaled better with additional resources, but used an excessive amount of memory and provided slow inference speeds. The development of Forest Packing increased inference throughput by almost an order of magnitude compared to other systems while greatly reducing prediction latency. FastRerF model training is faster than other popular decision forest systems when using similar parameters, and it trains Random Forests faster than the current state of the art. Overall, we provide data scientists a novel OSPF system with R and Python front ends that trains and predicts faster than other decision forest implementations.
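    The core idea of forest packing, laying trained tree nodes out in contiguous arrays so that ensemble traversal touches memory predictably, can be sketched as follows. This simplified version packs each scikit-learn tree into flat arrays and votes across the ensemble; the actual Forest Packing data structure (and OSPF oblique projections) is more elaborate.

        import numpy as np
        from sklearn.datasets import make_classification
        from sklearn.ensemble import RandomForestClassifier

        X, y = make_classification(n_samples=200, n_features=10, random_state=0)
        forest = RandomForestClassifier(n_estimators=8, random_state=0).fit(X, y)

        # Pack each tree into contiguous arrays (struct-of-arrays layout).
        packed = [(t.tree_.children_left, t.tree_.children_right,
                   t.tree_.feature, t.tree_.threshold,
                   t.tree_.value[:, 0, :].argmax(axis=1))
                  for t in forest.estimators_]

        def predict_one(x, n_classes=2):
            votes = np.zeros(n_classes, dtype=int)
            for left, right, feat, thr, leaf_class in packed:
                i = 0
                while left[i] != -1:          # -1 marks a leaf node
                    i = left[i] if x[feat[i]] <= thr[i] else right[i]
                votes[leaf_class[i]] += 1     # majority vote over trees
            return votes.argmax()

        assert predict_one(X[0]) in (0, 1)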