3 research outputs found

    Acceleration of CNN Computation on a PIM-enabled GPU system

    Get PDF
    Doctoral dissertation (Ph.D.) -- Seoul National University Graduate School: College of Engineering, Department of Electrical and Computer Engineering, August 2022. Advisor: ์ดํ˜์žฌ.
Recently, convolutional neural networks (CNNs) have been widely used in image processing and computer vision. CNNs are composed of various layers, such as the computation-intensive convolutional layer and the memory-intensive fully connected, batch normalization, and activation layers. GPUs are often used to accelerate CNNs, but performance is limited by the high computational cost and memory usage of the convolution. In addition, the increasing demand for high-resolution image applications increases the burden of data movement between the GPU and memory. By performing computations in memory, processing-in-memory (PIM) is expected to mitigate the overhead caused by data transfer, so a system that combines a host GPU with PIM is promising for processing CNNs. First, prior studies exploited approximate computing to reduce the computational cost. However, they only reduced the amount of computation, so performance remains bottlenecked by memory bandwidth due to the increased memory intensity. In addition, the load imbalance between warps caused by approximation also inhibits performance improvement. This dissertation proposes a PIM solution that reduces both data movement and computation through Approximate Data Comparison in PIM (ADC-PIM). Instead of determining value similarity on the GPU, the ADC-PIM, located in memory, compares the similarity and transfers only the selected data to the GPU. The GPU performs convolution only on the representative data transferred from the ADC-PIM and reuses the calculated results based on the similarity information. To limit the increase in memory latency due to the data comparison, a two-level PIM architecture that exploits both the DRAM bank stage and the TSV stage is proposed. To ease load balancing on the GPU, the ADC-PIM reorganizes data by assigning the representative data to proper addresses that are computed based on the comparison result. Second, to resolve the memory bottleneck caused by their high memory usage, non-convolutional layers are accelerated with PIM. Previous studies also accelerated non-convolutional layers with PIM, but their performance gains were limited because they assumed that the GPU and PIM operate strictly sequentially. The proposed method accelerates CNN training with a pipelined execution of the GPU and PIM, exploiting the fact that non-convolution operations are performed per channel of the output feature map: the PIM performs non-convolution operations on the output-feature-map channels for which the GPU has already completed the convolution. To balance the jobs between convolution and non-convolution in the weight-update and feature-map-gradient calculations of the backward pass, the non-convolution jobs are distributed appropriately between the two.
In addition, a memory scheduling algorithm based on bank ownership between the host and PIM is proposed to minimize the overall execution time when the host and PIM access memory simultaneously. Finally, a GPU-based PIM architecture for image processing applications is proposed. A programmable GPU-based PIM is attractive because it can use well-crafted software development kits (SDKs) such as CUDA and OpenCL, but the large on-chip SRAM of a GPU makes it difficult to place a sufficient number of computing units on the logic die. This dissertation proposes a GPU-based PIM architecture with well-matched optimization strategies that consider both the characteristics of image applications and the logic-die constraints. Data are allocated to each computing unit so that data locality and the data access pattern are preserved, and a prefetcher that leverages this pattern-aware data allocation significantly reduces the number of active warps and the on-chip SRAM size of the PIM. This allows the logic-die constraints to be satisfied and a greater number of computing units to be integrated on the logic die.
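The bank-ownership idea can be pictured with a toy single-bank queue model; the pd_th knob mirrors the PD_TH threshold named in the table of contents below, while the one-request-per-step service model and the exact hand-back policy are simplifying assumptions of this sketch.

```python
from collections import deque

def schedule(host_q, pim_q, pd_th=4):
    """Serve one request per step on a single bank. PIM may own the bank for
    at most pd_th consecutive requests before ownership reverts to the host."""
    timeline, pim_burst = [], 0
    while host_q or pim_q:
        serve_pim = bool(pim_q) and (not host_q or pim_burst < pd_th)
        if serve_pim:
            timeline.append(("pim", pim_q.popleft()))
            pim_burst += 1
        else:
            timeline.append(("host", host_q.popleft()))
            pim_burst = 0                  # host turn resets the PIM burst
    return timeline

# 3 host requests and 6 PIM requests contending for one bank:
print(schedule(deque(range(3)), deque(range(6))))
```

Raising pd_th favors PIM throughput at the cost of host latency, which is the trade-off the dissertation tunes when computing an optimal PD_TH value.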
Table of Contents
Chapter 1: Introduction. 1.1 Research background; 1.2 Research content; 1.3 Dissertation organization
Chapter 2: Background. 2.1 High Bandwidth Memory; 2.2 Processing-In-Memory; 2.3 GPU architecture and execution model
Chapter 3: Accelerating convolution through PIM-based approximate data comparison and approximate computation. 3.1 Related work (3.1.1 Approximate computing in CNNs; 3.1.2 CNN acceleration using processing-in-memory); 3.2 Motivation (3.2.1 Approximation opportunities for convolution on GPUs; 3.2.2 Problems in approximate convolution); 3.3 Proposed ADC-PIM design (3.3.1 Overview; 3.3.2 Data similarity comparison; 3.3.3 ADC-PIM architecture; 3.3.4 Data reorganization for load balancing); 3.4 Approximate convolution on the GPU (3.4.1 Approximate convolution via instruction skip; 3.4.2 Architectural support for approximate convolution); 3.5 Experimental results and analysis (3.5.1 Experimental setup; 3.5.2 Impact of each proposed method; 3.5.3 Performance comparison with prior work; 3.5.4 Energy consumption comparison; 3.5.5 Design overhead analysis; 3.5.6 Impact on accuracy); 3.6 Chapter conclusion
Chapter 4: Accelerating CNN training through pipelined execution of convolutional and non-convolutional layers. 4.1 Related work (4.1.1 Mitigating the memory bottleneck of non-CONV layers; 4.1.2 Memory scheduling between host and PIM); 4.2 Motivation (4.2.1 Opportunity from running CONV and non-CONV concurrently; 4.2.2 Host/PIM request-handling efficiency versus PIM priority); 4.3 Proposed host-PIM memory scheduling algorithm (4.3.1 Host-PIM system overview; 4.3.2 PIM-duration-based memory scheduling; 4.3.3 Computing the optimal PD_TH value); 4.4 Proposed CNN training flow (4.4.1 Forward pass; 4.4.2 Backward pass); 4.5 Experimental results and analysis (4.5.1 Experimental setup; 4.5.2 Per-layer execution time; 4.5.3 Effect of non-CONV job distribution in the backward pass; 4.5.4 Whole-network execution time; 4.5.5 Accuracy of the optimal PD_TH estimation and convergence of the selection algorithm); 4.6 Chapter conclusion
Chapter 5: A lightweight GPU architecture for PIM exploiting the data access patterns of image processing. 5.1 Related work (5.1.1 Processing-in-memory; 5.1.2 CTA scheduling on GPUs; 5.1.3 Prefetching on GPUs); 5.2 Motivation (5.2.1 Inefficiency of conventional GPU architectures for image processing on a PIM GPU system); 5.3 Proposed GPU-based PIM system (5.3.1 Overview; 5.3.2 Access-pattern-aware CTA allocation; 5.3.3 PIM GPU architecture); 5.4 Experimental results and analysis (5.4.1 Experimental setup; 5.4.2 In-depth analysis; 5.4.3 Performance comparison with prior work; 5.4.4 Cache miss rate and memory traffic; 5.4.5 Energy consumption comparison; 5.4.6 PIM area and power analysis); 5.5 Chapter conclusion
Chapter 6: Conclusion. References. Abstract

    Towards Data Reliable, Low-Power, and Repairable Resistive Random Access Memories

    Get PDF
    A series of breakthroughs in memristive devices have demonstrated the potential of memristor arrays to serve as next-generation resistive random access memories (ReRAM), which are fast, low-power, ultra-dense, and non-volatile. However, memristors' unique device characteristics also make them prone to several sources of error. Owing to the stochastic filamentary nature of memristive devices, various recoverable errors can affect the data reliability of a ReRAM, and permanent device failures further limit its lifetime. This dissertation develops low-power solutions for more reliable and longer-enduring ReRAM systems. We first look into a data reliability issue known as write disturbance: writing into a memristor in a crossbar can disturb the values stored in other memristors on the same memory line as the target cell. Such disturbance accumulates over time and may eventually lead to complete data corruption. To address this problem, we propose the use of two regular memristors on each word to keep track of the disturbance accumulation and trigger a refresh that restores the weakened data once it becomes necessary. We also investigate the considerable variation in the write-time characteristics of individual memristors. Under such variation, conventional fixed-pulse write schemes not only waste significant energy but also cannot guarantee reliable completion of write operations. We address this by proposing an adaptive write scheme that adjusts the width of the write pulse for each memristor. Our scheme embeds an online monitor to detect the completion of a write operation and takes into account the parasitic effect of line-shared devices in access-transistor-free memristive arrays. We further use this method to shorten the test time of memory march algorithms by eliminating the verifying read that commonly follows each write in their test sequences. Finally, we propose a novel mechanism that extends the lifetime of a ReRAM by protecting it against hard errors, exploiting a unique feature of bipolar memristive devices: an unorthodox use of complementary resistive switches (a particular implementation of memristive devices) provides an "in-place spare" for each memory cell at negligible extra cost. The in-place spares are then utilized by a repair scheme to repair, at page-level granularity, memristive devices that have failed in a stuck-at-ON state. Furthermore, we explore the use of in-place spares in lieu of other memory reliability and yield-enhancement solutions, such as error correction codes (ECC) and spare rows. We demonstrate that with the in-place spares, we can achieve the same lifetime as a baseline ReRAM with either significantly fewer spare rows or a lighter-weight ECC, both of which save energy and area.
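The disturbance bookkeeping can be sketched in a few lines of Python. This is a software stand-in, not the thesis design: the per-word counter models what the two tracking memristors per word encode physically, and the threshold and callback are illustrative assumptions.

```python
DISTURB_THRESHOLD = 100   # writes a word can absorb before a refresh is needed

class WordLine:
    """One memory line; writing any word disturbs the others on the line."""
    def __init__(self, n_words):
        self.disturb = [0] * n_words   # stand-in for the two tracking memristors

    def write(self, target, refresh):
        for w in range(len(self.disturb)):
            if w == target:
                self.disturb[w] = 0        # a real write fully restores the cell
            else:
                self.disturb[w] += 1       # neighbours on the line weaken
                if self.disturb[w] >= DISTURB_THRESHOLD:
                    refresh(w)             # rewrite the weakened word
                    self.disturb[w] = 0

line = WordLine(8)
refreshed = []
for i in range(150):
    line.write(i % 2, refreshed.append)    # hammer words 0 and 1
print("refreshed words:", refreshed)       # -> words 2..7, once each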

    Associative Memristive Memory for Approximate Computing in GPUs

    No full text
    Using associative memories to enable computing-with-memory is a promising approach to improving energy efficiency. Associative memories can be tightly coupled with processing elements to store and later recall function responses for a subset of input values. This approach avoids the actual function execution on the processing element to save energy. The challenge, however, is to reduce the energy consumption of the associative memory modules themselves. Here we address the challenge of designing ultra-low-power associative memories. We use memristive parts for the memory implementation and demonstrate the energy-saving potential of integrating associative memristive memory (AMM) into graphics processing units (GPUs). To reduce the energy consumption of AMM modules, we leverage approximate computing, which benefits from application-level tolerance to errors: we employ voltage overscaling on AMM modules, which deliberately relaxes their search criteria to approximately match stored patterns within a 2-bit Hamming distance of the search pattern. This introduces some errors into the computation that are tolerable for the target applications. We further reduce the energy consumption by employing purely resistive crossbar architectures for the AMM modules. To evaluate the proposed architecture, we integrate AMM modules with the floating-point units of an AMD Southern Islands GPU and run four image processing kernels on the AMM-integrated GPU. Our experimental results show that employing AMM modules reduces the energy consumption of these kernels by 23%-45% compared to a baseline GPU without AMM. The image processing kernels tolerate the errors resulting from approximate search operations, maintaining an acceptable image quality, i.e., a PSNR above 30 dB.
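The approximate recall can be modeled compactly. In this sketch the Python dict stands in for the memristive crossbar and an explicit Hamming check stands in for the voltage-overscaled search; all identifiers are illustrative assumptions.

```python
def hamming(a, b):
    """Bit distance between two integer-encoded operand patterns."""
    return bin(a ^ b).count("1")

class AMM:
    """Associative table: recall a stored response for near-enough inputs."""
    def __init__(self, max_distance=2):
        self.table = {}                  # input pattern -> precomputed response
        self.max_distance = max_distance

    def store(self, pattern, response):
        self.table[pattern] = response

    def search(self, pattern):
        for stored, response in self.table.items():
            if hamming(stored, pattern) <= self.max_distance:
                return response          # hit: skip the FPU execution entirely
        return None                      # miss: fall back to exact computation

amm = AMM()
amm.store(0b10110100, 42.0)
print(amm.search(0b10110110))            # 1 bit away -> recalled: 42.0
print(amm.search(0b01001011))            # far away -> None (compute exactly)
```

Widening max_distance increases hit rate and energy savings but also the error introduced, which is why the paper bounds it at 2 bits for acceptable PSNR.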
