11 research outputs found

    Get Out of the Valley: Power-Efficient Address Mapping for GPUs

    GPU memory systems adopt a multi-dimensional hardware structure to provide the bandwidth necessary to support 100s to 1000s of concurrent threads. On the software side, GPU-compute workloads also use multi-dimensional structures to organize the threads. We observe that these structures can combine unfavorably and create significant resource imbalance in the memory subsystem, causing low performance and poor power-efficiency. The key issue is that which memory address bits exhibit high variability is highly application-dependent. To solve this problem, we first provide an entropy analysis approach tailored to the highly concurrent memory request behavior in GPU-compute workloads. Our window-based entropy metric captures the information content of each address bit of the memory requests that are likely to co-exist in the memory system at runtime. Using this metric, we find that GPU-compute workloads exhibit entropy valleys distributed throughout the lower-order address bits. This indicates that efficient GPU address mapping schemes need to harvest entropy from broad address-bit ranges and concentrate it into the bits used for channel and bank selection in the memory subsystem. This insight leads us to propose the Page Address Entropy (PAE) mapping scheme, which concentrates the entropy of the row, channel and bank bits of the input address into the bank and channel bits of the output address. PAE maps straightforwardly to hardware and can be implemented with a tree of XOR gates. PAE improves performance by 1.31x and power-efficiency by 1.25x compared to state-of-the-art permutation-based address mapping.
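
    As an illustration of the two ideas in this abstract, the sketch below computes a window-based entropy for a single address bit over groups of requests and folds several high-entropy source bits into one bank/channel bit with an XOR reduction, the way an XOR-gate tree would in hardware. It is a minimal model under assumed parameters (window size, bit positions), not the paper's exact PAE definition.

```python
import math
from collections import Counter

def windowed_bit_entropy(addresses, bit, window=256):
    """Average Shannon entropy of one address bit, measured over windows of
    requests that are likely to co-exist in the memory system."""
    entropies = []
    for start in range(0, len(addresses) - window + 1, window):
        bits = [(a >> bit) & 1 for a in addresses[start:start + window]]
        counts = Counter(bits).values()
        entropies.append(-sum(c / window * math.log2(c / window) for c in counts))
    return sum(entropies) / len(entropies) if entropies else 0.0

def xor_fold(addr, source_bits, target_bit):
    """Fold the entropy of several source bits into one target bit with an
    XOR tree; the rest of the address is left untouched."""
    folded = 0
    for b in source_bits:
        folded ^= (addr >> b) & 1
    return (addr & ~(1 << target_bit)) | (folded << target_bit)
```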

    Intra-cluster coalescing to reduce GPU NoC pressure

    GPUs continue to increase the number of streaming multiprocessors (SMs) to provide increasingly higher compute capabilities. To construct a scalable crossbar network-on-chip (NoC) that connects the SMs to the memory controllers, a cluster structure is introduced in modern GPUs in which several SMs are grouped together to share a network port. Because of network port sharing, clustered GPUs face severe NoC congestion, which creates a critical performance bottleneck. In this paper, we target redundant network traffic to mitigate GPU NoC congestion. In particular, we observe that in many GPU-compute applications, different SMs in a cluster access shared data. Issuing redundant requests to access the same memory location wastes valuable NoC bandwidth; we find on average 19.4% (and up to 48%) of the requests to be redundant. To reduce redundant NoC traffic, we propose intra-cluster coalescing (ICC) to merge memory requests from different SMs in a cluster. Our evaluation results show that ICC achieves an average performance improvement of 9.7% (and up to 33%) over a conventional design.
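
    The sketch below is a toy model of the coalescing idea: requests from different SMs in the same cluster that hit the same cache line are merged into a single NoC request, and the fraction of merged requests corresponds to the redundancy the paper measures. The cache-line size and the request format are assumptions for illustration only.

```python
def coalesce_cluster_requests(requests, line_bytes=128):
    """Merge memory requests from different SMs in a cluster that target the
    same cache line, so only one request crosses the NoC (illustrative model).
    `requests` is a list of (sm_id, byte_address) tuples."""
    pending = {}  # cache-line index -> list of requesting SMs
    for sm_id, addr in requests:
        pending.setdefault(addr // line_bytes, []).append(sm_id)
    # One NoC request per distinct line; the reply is fanned out to all SMs.
    noc_requests = [(line * line_bytes, sms) for line, sms in pending.items()]
    redundancy = 1 - len(noc_requests) / max(len(requests), 1)
    return noc_requests, redundancy

# Three SMs touch two distinct cache lines -> one third of the requests are redundant.
print(coalesce_cluster_requests([(0, 0), (1, 64), (2, 256)]))
```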

    Boustrophedonic Frames: Quasi-Optimal L2 Caching for Textures in GPUs

    © 2023 Copyright held by the owner/author(s). This document is made available under the CC-BY-NC-ND 4.0 license http://creativecommons.org/licenses/by-nc-nd/4.0/. This document is the Accepted version of a Published Work that appeared in final form in the 32nd International Conference on Parallel Architectures and Compilation Techniques (PACT), Vienna, Austria, October 2023. To access the final edited and published work see https://doi.org/10.1109/PACT58117.2023.00019
    Literature is plentiful in works exploiting cache locality for GPUs. A majority of them explore replacement or bypassing policies. In this paper, however, we surpass this exploration by fabricating a formal proof for a no-overhead quasi-optimal caching technique for caching textures in graphics workloads. Textures make up a significant part of main memory traffic in mobile GPUs, which contributes to the total GPU energy consumption. Since texture accesses use a shared L2 cache, improving the L2 texture caching efficiency would decrease main memory traffic, thus improving energy efficiency, which is crucial for mobile GPUs. Our proposal reaches quasi-optimality by exploiting the frame-to-frame reuse of textures in graphics. We do this by traversing frames in a boustrophedonic manner w.r.t. the frame-to-frame tile order. We first approximate the texture access trace to a circular trace and then forge a formal proof for our proposal being optimal for such traces. We also complement the proof with empirical data that demonstrates the quasi-optimality of our no-cost proposal.
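
    The core traversal idea can be sketched in a few lines: render the tiles of every other frame in reverse order, so the textures touched at the end of one frame are the first ones reused in the next frame and are still resident in the shared L2. This is an illustrative model of the traversal order only, not the optimality proof or the cache itself.

```python
def boustrophedonic_tile_order(num_tiles, frame_index):
    """Tile processing order that alternates direction every frame, so the
    tiles rendered last in frame N are rendered first in frame N+1 and their
    textures are likely still resident in the L2 cache."""
    order = list(range(num_tiles))
    return order if frame_index % 2 == 0 else order[::-1]

# Frame 0 ends on tile 3; frame 1 starts on tile 3, reusing its textures.
print(boustrophedonic_tile_order(4, 0))  # [0, 1, 2, 3]
print(boustrophedonic_tile_order(4, 1))  # [3, 2, 1, 0]
```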

    CD-Xbar: a converge-diverge crossbar network for high-performance GPUs

    Modern GPUs feature an increasing number of streaming multiprocessors (SMs) to boost system throughput. How to construct an efficient and scalable network-on-chip (NoC) for future high-performance GPUs is particularly critical. Although a mesh network is a widely used NoC topology in manycore CPUs for scalability and simplicity reasons, it is ill-suited to GPUs because of the many-to-few-to-many traffic pattern observed in GPU-compute workloads. Although a crossbar NoC is a natural fit, it does not scale to large SM counts while operating at high frequency. In this paper, we propose the converge-diverge crossbar (CD-Xbar) network with round-robin routing and topology-aware concurrent thread array (CTA) scheduling. CD-Xbar consists of two types of crossbars, a local crossbar and a global crossbar. A local crossbar converges input ports from the SMs into so-called converged ports; the global crossbar diverges these converged ports to the last-level cache (LLC) slices and memory controllers. CD-Xbar provides routing path diversity through the converged ports. Round-robin routing and topology-aware CTA scheduling balance network traffic among the converged ports within a local crossbar and across crossbars, respectively. Compared to a mesh with the same bisection bandwidth, CD-Xbar reduces NoC active silicon area and power consumption by 52.5 and 48.5 percent, respectively, while at the same time improving performance by 13.9 percent on average. CD-Xbar performs within 2.9 percent of an idealized fully-connected crossbar. We further demonstrate CD-Xbar's scalability, flexibility and improved performance per Watt (by 17.1 percent) over state-of-the-art GPU NoCs, which are highly customized and non-scalable.
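
    A minimal software model of the converge-diverge idea is sketched below: a local crossbar folds the SMs of a cluster onto a small number of converged ports chosen round-robin, and a global crossbar then delivers the request to the target LLC slice. The port count and routing granularity are assumptions; the paper's topology-aware CTA scheduling is not modeled here.

```python
import itertools

class ConvergeDivergeXbar:
    """Toy two-level crossbar: SM inputs converge onto a few converged ports
    (picked round-robin for load balance), which a global crossbar then
    diverges to the last-level cache slices and memory controllers."""

    def __init__(self, num_converged_ports):
        self._next_port = itertools.cycle(range(num_converged_ports))

    def route(self, sm_id, llc_slice):
        converged_port = next(self._next_port)    # local crossbar decision
        return converged_port, llc_slice          # global crossbar delivers

xbar = ConvergeDivergeXbar(num_converged_ports=4)
print(xbar.route(sm_id=13, llc_slice=2))  # e.g. (0, 2)
```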

    Feluca: A two-stage graph coloring algorithm with color-centric paradigm on GPU

    In this paper, we propose a two-stage high-performance graph coloring algorithm, called Feluca, aiming to address the challenges of graph coloring on GPUs. Feluca combines the recursion-based method with the sequential spread-based method. In the first stage, Feluca uses a recursive routine to color a majority of the vertices in the graph. Then, it switches to the sequential spread method to color the remaining vertices in order to avoid the conflicts of the recursive algorithm. Moreover, the following techniques are proposed to further improve graph coloring performance: i) a new method to eliminate the cycles in the graph; ii) a top-down scheme to avoid the atomic operation originally required for color selection; and iii) a novel color-centric coloring paradigm to improve the degree of parallelism for the sequential spread part. All these newly developed techniques, together with further GPU-specific optimizations such as coalesced memory access, comprise an efficient parallel graph coloring solution in Feluca. We have conducted extensive experiments on NVIDIA GPUs. The results show that Feluca can achieve a 1.76x to 12.98x speedup over the state-of-the-art algorithms.
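
    To make the two-stage structure concrete, here is a small CPU-side sketch (not Feluca's GPU kernels): a speculative first pass colors every vertex the way a data-parallel pass would, possibly leaving conflicts, and a sequential second pass recolors only the conflicting vertices. The cycle elimination, top-down color selection and color-centric paradigm of the paper are not modeled.

```python
def two_stage_coloring(adj, k=4):
    """Stage 1 speculatively assigns one of k colors per vertex (as a
    data-parallel pass would, possibly creating conflicts); stage 2 then
    sequentially recolors the vertices still in conflict with a neighbour."""
    n = len(adj)
    color = [v % k for v in range(n)]                        # speculative stage
    conflicted = [v for v in range(n)
                  if any(color[u] == color[v] for u in adj[v])]
    for v in conflicted:                                      # sequential spread
        used = {color[u] for u in adj[v]}
        c = 0
        while c in used:
            c += 1
        color[v] = c
    return color

# A triangle needs three colors; stage 2 fixes the speculative conflicts.
print(two_stage_coloring([[1, 2], [0, 2], [0, 1]], k=2))  # e.g. [2, 1, 0]
```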

    Intra-cluster coalescing and distributed-block scheduling to reduce GPU NoC pressure

    GPUs continue to boost the number of streaming multiprocessors (SMs) to provide increasingly higher compute capabilities. To construct a scalable crossbar network-on-chip (NoC) that connects the SMs to the memory controllers, a cluster structure is introduced in modern GPUs in which several SMs are grouped together to share a network port. Because of network port sharing, clustered GPUs face severe NoC congestion, which creates a critical performance bottleneck. In this paper, we target redundant network traffic to mitigate GPU NoC congestion. In particular, we observe that in many GPU-compute applications, different SMs in a cluster access shared data. Sending redundant requests to access the same memory location wastes valuable NoC bandwidth; we find on average 19 percent (and up to 48 percent) of the requests to be redundant. To remove redundant NoC traffic, we propose distributed-block scheduling, intra-cluster coalescing (ICC) and the coalesced cache (CC) to coalesce L1 cache misses within and across SMs in a cluster. Our evaluation results show that distributed-block scheduling, ICC and CC are complementary and improve both performance and energy consumption. We report an average performance improvement of 15 percent (and up to 67 percent) while at the same time reducing system energy by 6 percent (and up to 19 percent) and improving the energy-delay product (EDP) by 19 percent on average (and up to 53 percent), compared to state-of-the-art distributed CTA scheduling.
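
    The scheduling side of the proposal can be pictured with a small mapping sketch: consecutive thread blocks, which often touch neighbouring data, are placed on SMs of the same cluster so that their L1 misses can be coalesced before entering the NoC. The mapping function below is a hedged illustration of that placement idea, not the paper's exact distributed-block policy.

```python
def cluster_aware_cta_mapping(num_ctas, num_clusters, sms_per_cluster):
    """Place consecutive CTAs on SMs of the same cluster so that blocks which
    share data also share a network port (hypothetical placement policy)."""
    mapping = {}
    for cta in range(num_ctas):
        cluster = (cta // sms_per_cluster) % num_clusters
        sm_within_cluster = cta % sms_per_cluster
        mapping[cta] = (cluster, sm_within_cluster)
    return mapping

# CTAs 0-3 land in cluster 0, CTAs 4-7 in cluster 1, and so on.
print(cluster_aware_cta_mapping(num_ctas=8, num_clusters=2, sms_per_cluster=4))
```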

    Acceleration of CNN Computation on a PIM-enabled GPU system

    Ph.D. dissertation, Seoul National University, Department of Electrical and Computer Engineering, August 2022. Advisor: Hyuk-Jae Lee (이혁재).
    Recently, convolutional neural networks (CNNs) have been widely used in image processing and computer vision. CNNs are composed of various layers, such as the computation-intensive convolutional layers and the memory-intensive fully connected, batch normalization, and activation layers. GPUs are often used to accelerate CNNs, but performance is limited by the high computational cost and memory usage of the convolution. In addition, the growing demand for high-resolution image applications increases the burden of data movement between the GPU and memory. By performing computation inside the memory, processing-in-memory (PIM) is expected to mitigate the overhead caused by data transfer, so a system that combines a host GPU with PIM is promising for processing CNNs. First, prior studies exploited approximate computing to reduce the computational cost of the convolution. However, they only reduced the amount of computation, so performance remains bottlenecked by memory bandwidth due to the increased memory intensity; in addition, the load imbalance between warps caused by approximation further inhibits performance improvement. This dissertation proposes a PIM solution that reduces both data movement and computation through Approximate Data Comparison in PIM (ADC-PIM). Instead of determining value similarity on the GPU, the ADC-PIM located in memory compares the similarity and transfers only the selected data to the GPU. The GPU performs convolution on the representative data transferred from the ADC-PIM and reuses the calculated results based on the similarity information. To limit the increase in memory latency caused by the in-memory data comparison, a two-level PIM architecture that exploits both the DRAM bank and TSV stages is proposed. To ease load balancing on the GPU, the ADC-PIM reorganizes the data by assigning the representative data to proper addresses computed from the comparison results, so no extra work is needed on the GPU.
    Second, to resolve the memory bottleneck caused by the high memory usage of non-convolutional layers such as batch normalization, these layers are accelerated with PIM. Previous studies also accelerated non-convolutional layers with PIM, but the performance gains were limited because they simply assumed that the GPU and PIM operate sequentially. The proposed method accelerates CNN training with a pipelined execution of the GPU and PIM, exploiting the fact that the non-convolutional operations are performed per channel of the output feature map: the PIM performs the non-convolutional operations on each channel of the output feature map as soon as the GPU has completed its convolution. To balance the convolution and non-convolution jobs in the weight-update and feature-map-gradient computations of back-propagation, the non-convolution jobs are distributed appropriately between the two passes. In addition, a memory scheduling algorithm based on bank ownership between the host and PIM is proposed to minimize the overall execution time when the host and PIM access memory simultaneously.
    Finally, a GPU-based PIM architecture for image processing applications is proposed. A programmable GPU-based PIM is attractive because it enables the use of well-crafted software development kits (SDKs) such as CUDA and OpenCL; however, the large on-chip SRAM of a GPU makes it difficult to mount a sufficient number of computing units on the logic die. This dissertation proposes a lightweight GPU architecture suited to PIM, together with well-matched optimization strategies that consider both the characteristics of image applications and the logic-die constraints. Data is allocated to each computing unit so that the memory access pattern and data locality of image processing applications are preserved, and a prefetcher that leverages this pattern-aware allocation significantly reduces the number of active warps and the on-chip SRAM size of the PIM. This allows the logic-die constraints to be satisfied and a greater number of computing units to be integrated on the logic die.
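
    To illustrate the channel-wise GPU/PIM pipeline described above, the following sketch overlaps a placeholder convolution stage with a placeholder non-convolution stage. The names gpu_conv and pim_nonconv are hypothetical stand-ins for the two devices, and the thread pool only models the overlap in ordering, not the dissertation's hardware implementation.

```python
from concurrent.futures import ThreadPoolExecutor

def pipelined_cnn_layer(num_channels, gpu_conv, pim_nonconv):
    """Channel-wise pipeline: while the 'GPU' computes the convolution for
    channel c+1, the 'PIM' worker applies the non-convolutional layers to the
    channel that just finished."""
    results = [None] * num_channels
    with ThreadPoolExecutor(max_workers=1) as pim:
        inflight = None                               # (channel, future)
        for c in range(num_channels):
            fmap_c = gpu_conv(c)                      # GPU stage for channel c
            if inflight is not None:
                ch, fut = inflight
                results[ch] = fut.result()            # collect finished PIM work
            inflight = (c, pim.submit(pim_nonconv, fmap_c))
        if inflight is not None:
            ch, fut = inflight
            results[ch] = fut.result()
    return results

# Example with trivial placeholder stages:
print(pipelined_cnn_layer(4, gpu_conv=lambda c: c * 10,
                          pim_nonconv=lambda x: x + 1))  # [1, 11, 21, 31]
```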

    Locality-aware CTA Clustering for modern GPUs

    Cache is designed to exploit locality; however, the role of on-chip L1 data caches on modern GPUs is often awkward. The locality among global memory requests from different SMs (Streaming Multiprocessors) is predominantly harvested by the commonly-shared L2 with long access latency, while the in-core locality, which is crucial for performance delivery, is handled explicitly by user-controlled scratchpad memory. In this work, we disclose another type of data locality that has long been ignored but has performance-boosting potential: the inter-CTA locality. Exploiting such locality is rather challenging due to unclear hardware feasibility, an unknown and inaccessible underlying CTA scheduler, and small in-core cache capacity. To address these issues, we first conduct a thorough empirical exploration on various modern GPUs and demonstrate that inter-CTA locality can be harvested, both spatially and temporally, on the L1 or L1/Tex unified cache. Through a further quantification process, we prove the significance and commonality of such locality among GPU applications, and discuss whether such reuse is exploitable. By leveraging these insights, we propose the concept of CTA Clustering and its associated software-based techniques to reshape the default CTA scheduling in order to group the CTAs with potential reuse together on the same SM. Our techniques require no hardware modification and can be directly deployed on existing GPUs. In addition, we incorporate these techniques into an integrated framework for automatic inter-CTA locality optimization. We evaluate our techniques using a wide range of popular GPU applications on all modern generations of NVIDIA GPU architectures. The results show that our proposed techniques significantly improve cache performance by reducing L2 cache transactions by 55%, 65%, 29%, and 28% on average for Fermi, Kepler, Maxwell and Pascal, respectively, leading to an average of 1.46x, 1.48x, 1.45x, and 1.41x (up to 3.8x, 3.6x, 3.1x, and 3.3x) performance speedups for applications with algorithm-related inter-CTA reuse.
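
    A minimal sketch of the software side of such CTA clustering is shown below: the kernel keeps its launched block id but recomputes a logical block id so that blocks with assumed inter-CTA reuse (here, blocks with adjacent logical ids) end up co-located. The round-robin dispatch assumption and the cluster size are hypothetical; the paper derives its remapping from the empirically observed behavior of each GPU's CTA scheduler.

```python
def clustered_block_id(hw_block_id, num_blocks, cluster_size):
    """Remap the hardware-assigned block id to a logical id such that each
    group of `cluster_size` adjacent logical ids is served by one SM, assuming
    (hypothetically) that the scheduler hands out consecutive hardware ids
    round-robin across SMs."""
    num_sms = num_blocks // cluster_size
    sm = hw_block_id % num_sms          # SM this hardware block lands on
    slot = hw_block_id // num_sms       # how many blocks that SM already got
    return sm * cluster_size + slot     # adjacent logical ids share an SM

# With 8 blocks and 2 SMs: hardware ids 0, 2, 4, 6 become logical 0..3 on SM 0.
print([clustered_block_id(b, num_blocks=8, cluster_size=4) for b in range(8)])
```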

    Fault Tolerant and Energy Efficient One-Sided Matrix Decompositions on Heterogeneous Systems with GPUs

    Heterogeneous computing systems with both CPUs and GPUs have become a class of widely used hardware architecture in supercomputers. As heterogeneous systems deliver higher computational performance, they are being built with an increasing number of complex components, and it is anticipated that these systems will be more susceptible to hardware faults while consuming more power. Numerical linear algebra libraries are used in a wide spectrum of high-performance scientific applications. Among numerical linear algebra operations, one-sided matrix decompositions can sometimes take a large portion of execution time or even dominate the whole scientific application execution. Due to their computational characteristics, one-sided matrix decompositions are very suitable for computing platforms such as heterogeneous systems with CPUs and GPUs, and much work has been done to implement and optimize them on such systems. However, it is challenging to make one-sided matrix decompositions run stably and with high performance on computing platforms that are unreliable and consume a lot of energy. In this thesis, we therefore aim to develop novel fault tolerance and energy efficiency optimizations for one-sided matrix decompositions on heterogeneous systems with CPUs and GPUs.
    To improve reliability and energy efficiency, extensive research has been done on developing and optimizing fault tolerance methods and energy-saving strategies for one-sided matrix decompositions. However, current designs still have several limitations: (1) little has been done on developing and optimizing fault tolerance methods for one-sided matrix decompositions on heterogeneous systems with GPUs; (2) limited by their protection coverage and strength, existing fault tolerance works provide insufficient protection when applied to one-sided matrix decompositions on heterogeneous systems with GPUs; (3) lacking knowledge of the algorithms, existing system-level energy-saving solutions cannot achieve the optimal energy savings, because the workload prediction they rely on is potentially inaccurate and costly when applied to one-sided matrix decompositions; and (4) it is challenging to apply both fault tolerance techniques and energy-saving strategies to one-sided matrix decompositions at the same time, given that their current designs are not naturally compatible with each other.
    To address the first problem, based on the original Algorithm-Based Fault Tolerance (ABFT), we develop the first ABFT for matrix decomposition on heterogeneous systems with GPUs, together with novel protection against storage errors and several optimization techniques specifically for GPUs. As for the second problem, we design a novel checksum scheme for ABFT that allows the data stored in matrices to be encoded in two dimensions. This stronger checksum encoding mechanism enables much stronger protection, including enhanced error-propagation protection. In addition, we introduce a more efficient checking scheme: by prioritizing the checksum verification according to the sensitivity of matrix operations to soft errors, with an optimized checksum verification kernel for GPUs, we can achieve strong protection for matrix decompositions with comparable overhead. For the third problem, to improve energy efficiency for one-sided matrix decompositions, we introduce an algorithm-based energy-saving approach designed to maximize energy savings by utilizing algorithmic characteristics. Our approach can predict program execution behavior much more accurately, which is difficult for system-level solutions for applications with variable execution characteristics, and experiments show that it leads to much higher energy savings than existing works. Finally, for the fourth problem, we propose a novel energy-saving approach for one-sided matrix decompositions on heterogeneous systems with GPUs that allows energy-saving strategies and fault tolerance techniques to be enabled at the same time without bringing performance impact or extra energy cost.
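
    The two-dimensional checksum idea from the second contribution can be illustrated with a small NumPy sketch: the matrix is extended with a row of column sums and a column of row sums, and a single corrupted element is then located by the intersection of the inconsistent row and column and repaired from the residual. This is a textbook-style ABFT encoding under a single-error assumption, simplified from the thesis' scheme and with no GPU offload.

```python
import numpy as np

def encode_2d_checksums(A):
    """Extend A with a column of row sums and a row of column sums so the
    data is encoded in both dimensions."""
    A = np.asarray(A, dtype=float)
    full = np.zeros((A.shape[0] + 1, A.shape[1] + 1))
    full[:-1, :-1] = A
    full[:-1, -1] = A.sum(axis=1)           # row checksums
    full[-1, :-1] = A.sum(axis=0)           # column checksums
    full[-1, -1] = A.sum()
    return full

def detect_and_correct(full, tol=1e-6):
    """Locate a single corrupted data element via the row/column residuals
    and repair it in place (single-error model)."""
    data = full[:-1, :-1]
    row_res = full[:-1, -1] - data.sum(axis=1)
    col_res = full[-1, :-1] - data.sum(axis=0)
    bad_rows = np.flatnonzero(np.abs(row_res) > tol)
    bad_cols = np.flatnonzero(np.abs(col_res) > tol)
    if bad_rows.size == 1 and bad_cols.size == 1:
        i, j = bad_rows[0], bad_cols[0]
        full[i, j] += row_res[i]             # adding the residual cancels the error
    return full

M = encode_2d_checksums([[1.0, 2.0], [3.0, 4.0]])
M[0, 1] += 5.0                               # inject a soft error
print(detect_and_correct(M)[:-1, :-1])       # recovers [[1., 2.], [3., 4.]]
```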