Design Guidelines for High-Performance SCM Hierarchies
With emerging storage-class memory (SCM) nearing commercialization, there is
evidence that it will deliver the much-anticipated high density and access
latencies within only a few factors of DRAM. Nevertheless, the
latency-sensitive nature of memory-resident services makes seamless integration
of SCM in servers questionable. In this paper, we ask how best to introduce SCM in such servers to improve overall performance/cost over
existing DRAM-only architectures. We first show that even with the most
optimistic latency projections for SCM, the higher memory access latency
results in prohibitive performance degradation. However, we find that
deployment of a modestly sized high-bandwidth 3D stacked DRAM cache makes the
performance of an SCM-mostly memory system competitive. The high degree of
spatial locality that memory-resident services exhibit not only simplifies the
DRAM cache's design as page-based, but also enables the amortization of
increased SCM access latencies and the mitigation of SCM's read/write latency
disparity.
We identify the set of memory hierarchy design parameters that play a key
role in the performance and cost of a memory system combining an SCM technology
and a 3D stacked DRAM cache. We then introduce a methodology to drive
provisioning for each of these design parameters under a target
performance/cost goal. Finally, we use our methodology to derive concrete
results for specific SCM technologies. With PCM as a case study, we show that a
two bits/cell technology hits the performance/cost sweet spot, reducing the
memory subsystem cost by 40% while keeping performance within 3% of the best
performing DRAM-only system, whereas single-level and triple-level cell
organizations are impractical for use as memory replacements.
Comment: Published at MEMSYS'1
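As a concrete reading of such a provisioning rule, the toy sketch below picks the cheapest configuration whose performance stays within a 3% slack of the best baseline. All numbers except the MLC row, which echoes the abstract, are hypothetical stand-ins:

```python
# Toy version of a performance/cost provisioning rule: pick the cheapest
# configuration whose performance is within `slack` of the best baseline.
# Numbers are hypothetical; only the MLC row echoes the abstract.

configs = {
    # name: (relative performance, relative memory subsystem cost)
    "DRAM-only": (1.00, 1.00),
    "SLC-PCM + DRAM cache": (0.99, 1.10),   # hypothetical: fast but costly
    "MLC-PCM + DRAM cache": (0.97, 0.60),   # abstract: -3% perf, -40% cost
    "TLC-PCM + DRAM cache": (0.80, 0.55),   # hypothetical: too slow
}

def provision(configs, slack=0.03):
    best_perf = max(p for p, _ in configs.values())
    eligible = {n: (p, c) for n, (p, c) in configs.items()
                if p >= (1 - slack) * best_perf}
    return min(eligible, key=lambda n: eligible[n][1])

print(provision(configs))  # -> MLC-PCM + DRAM cache
```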
A DRAM-Based Neural Network Accelerator Architecture for Binary Neural Networks
Thesis (PhD) -- Seoul National University Graduate School: Department of Computer Science and Engineering, College of Engineering, February 2021. Advisor: Sungjoo Yoo.
In convolutional neural network applications, most of the computation consists of the multiply-and-accumulate operations of the convolution and fully-connected layers. From the hardware perspective (i.e., in the gate-level circuits), these operations are performed as many dot products between feature-map and kernel vectors. Since feature maps and kernels are 3D or 4D tensors, the vectors unrolled from them are reused many times across the matrix multiplications. As DNN throughput increases, the power consumption and the performance bottleneck caused by this data movement become more critical issues. More importantly, power consumption due to off-chip memory accesses dominates total power, since an off-chip memory access consumes several hundred times more power than a computation. Accelerator throughput is on the order of several hundred GOPS to several TOPS, but memory bandwidth is below 25.6 GB/s (DDR4) or 34 GB/s (LPDDR4).
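A back-of-envelope roofline estimate makes this imbalance concrete. The sketch below uses assumed peak-throughput and reuse numbers, chosen only for illustration, not measurements from this dissertation:

```python
# Back-of-envelope roofline estimate: is an accelerator compute-bound or
# memory-bound? Illustrative only; peak numbers are assumptions.

def sustainable_ops(bandwidth_gbs: float, ops_per_byte: float) -> float:
    """Ops/s sustainable when every operand must come from off-chip DRAM."""
    return bandwidth_gbs * 1e9 * ops_per_byte

peak_ops = 2e12                 # assumed 2 TOPS compute peak
for name, bw in [("DDR4", 25.6), ("LPDDR4", 34.0)]:
    # Assume 0.25 ops/byte: one multiply-accumulate (2 ops) per 8 bytes
    # fetched (e.g., FP32 weight + activation with no on-chip reuse).
    achievable = sustainable_ops(bw, ops_per_byte=0.25)
    print(f"{name}: {achievable/1e9:.1f} GOPS of a {peak_ops/1e9:.0f} GOPS peak "
          f"-> {100*achievable/peak_ops:.1f}% utilization without data reuse")
```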
Reducing the network size and/or the amount of data moved mitigates both the data-movement power and the performance bottleneck. Among such algorithms, quantization is widely used. Binary neural networks (BNNs) reduce precision dramatically, down to 1 bit. Their accuracy is still well below that of FP16 networks, but it is improving continuously through various studies. On the dataflow side, redundant data movement can be reduced by increasing data reuse. These two methods are widely applied in accelerators because they require no additional computation at inference time.
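To see why 1-bit precision maps so well onto simple hardware, consider the standard XNOR-popcount formulation of a binary dot product. This is a textbook identity, shown here as an illustrative sketch rather than code from this dissertation:

```python
# Binary dot product via XOR + popcount: with vector elements in {-1, +1}
# encoded as bits (1 -> +1, 0 -> -1),
#   dot(a, b) = n - 2 * popcount(a XOR b)
# (equivalently 2 * popcount(XNOR) - n). Illustrative sketch only.

def binary_dot(a_bits: int, b_bits: int, n: int) -> int:
    """Dot product of two n-element {-1,+1} vectors packed as integers."""
    disagreements = bin((a_bits ^ b_bits) & ((1 << n) - 1)).count("1")
    return n - 2 * disagreements

# Example (LSB-first packing): a = [+1, -1, +1, +1] -> 0b1101,
#                              b = [+1, +1, -1, +1] -> 0b1011
# Elementwise products: +1, -1, -1, +1 -> sum = 0
print(binary_dot(0b1101, 0b1011, n=4))  # -> 0
```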
In this dissertation, I present 1) a DRAM-based accelerator architecture and 2) a DRAM refresh method that mitigates the performance loss caused by refresh. The two methods are orthogonal, so they can be integrated into a single DRAM chip and operate independently.
First, we propose a DRAM-based accelerator architecture capable of massive, wide vector dot-product operations. In the field of CNN accelerators to which BNNs can be applied, computing-in-memory (CIM) structures that use the memory cell array for vector dot-product operations are being actively studied. Since DRAM stores all of the neural network's data, computing there is advantageous for reducing the amount of data transferred. The proposed architecture computes by exploiting DRAM's basic operations, without changing the cell-array structure.
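One known way DRAM's basic operation can compute is Ambit-style triple-row activation, which the dissertation's background chapter on bit-wise operation in DRAM covers: activating three rows simultaneously yields a bitwise majority through charge sharing, from which AND and OR follow. A toy functional model of that published technique (not this dissertation's specific row-operator design):

```python
# Toy model of Ambit-style triple-row activation: charge sharing across
# three DRAM rows yields bitwise MAJ(a, b, c); AND/OR are special cases.
# Illustration of the published technique, not this dissertation's design.

def maj3(a: int, b: int, c: int) -> int:
    return (a & b) | (b & c) | (a & c)

def dram_and(a: int, b: int) -> int:
    return maj3(a, b, 0)        # control row pre-set to all zeros

def dram_or(a: int, b: int) -> int:
    return maj3(a, b, ~0)       # control row pre-set to all ones

a, b = 0b1100, 0b1010
assert dram_and(a, b) == a & b
assert dram_or(a, b) == a | b
print(bin(dram_and(a, b)), bin(dram_or(a, b)))  # 0b1000 0b1110
```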
The second method reduces the performance degradation and power consumption caused by DRAM refresh. While DRAM performs its periodic refresh, it can neither read nor write data, so system performance drops. The proposed refresh method tests the retention characteristics inside the DRAM chip during self-refresh and extends the refresh period according to those characteristics. Since it operates independently inside the DRAM, it can be applied to any system that uses DRAM, deep neural network accelerators included.
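To see what a longer refresh period is worth, a quick calculation with typical DDR4 timing values (assumed here for illustration; the evaluated devices may differ) gives the fraction of time a rank is blocked by refresh:

```python
# Fraction of time a DRAM rank is unavailable due to refresh:
# one REF command every tREFI, each blocking the rank for tRFC.
# Typical DDR4 8Gb values assumed for illustration.

T_REFI = 7.8e-6     # base refresh interval (s)
T_RFC  = 350e-9     # refresh cycle time for an 8Gb device (s)

for multiplier in (1, 2, 4):    # 1x = baseline 64 ms retention window
    overhead = T_RFC / (T_REFI * multiplier)
    print(f"{multiplier}x refresh period -> "
          f"{100 * overhead:.2f}% of time spent refreshing")
# Doubling the refresh period halves both the blocked time and the
# refresh energy, which is the lever the proposed profiling exploits.
```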
We surveyed the system integration, including the software stack, required to use the in-DRAM accelerator from a deep learning framework. As a result, we expect that the in-DRAM accelerator can be controlled by combining the existing TVM compiler and the FPGA-based TVM/VTA accelerator with the memory-controller implementation verified in the refresh experiments and a custom compiler. In addition, we added a performance-simulation function for the in-DRAM accelerator to PyTorch. When a neural network runs in PyTorch, it reports the computation latency and the data-movement latency of each layer executed on the in-DRAM accelerator. Being able to predict hardware performance while co-designing the network is a significant advantage.
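Such a per-layer report can be prototyped with PyTorch's forward hooks. The sketch below is a minimal illustration of the idea, not the dissertation's simulator; the bandwidth and compute-rate constants are assumptions, and the cost model is a placeholder:

```python
# Minimal sketch of per-layer latency reporting via PyTorch forward hooks.
# The cost model below is a placeholder, not the dissertation's simulator.
import torch
import torch.nn as nn

DRAM_BW = 25.6e9  # assumed off-chip bandwidth, bytes/s

def report_latency(module, inputs, output):
    moved = sum(t.numel() * t.element_size() for t in inputs) \
            + output.numel() * output.element_size()
    data_lat = moved / DRAM_BW                     # data-movement estimate
    comp_lat = 0.0
    if isinstance(module, nn.Conv2d):
        # 2 * MACs / assumed 1 TOPS in-DRAM compute rate (placeholder)
        k = module.kernel_size[0] * module.kernel_size[1]
        macs = output.numel() * module.in_channels * k
        comp_lat = 2 * macs / 1e12
    print(f"{module.__class__.__name__}: compute {comp_lat*1e6:.1f} us, "
          f"data movement {data_lat*1e6:.1f} us")

model = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU())
for layer in model:
    layer.register_forward_hook(report_latency)
model(torch.randn(1, 3, 32, 32))
```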
Abstract
Contents
List of Tables
List of Figures
Chapter 1 Introduction
Chapter 2 Background
2.1 Neural Network Operation
2.2 Data Movement Overhead
2.3 Binary Neural Networks
2.4 Computing-in-Memory
2.5 Memory Bottleneck due to Refresh
Chapter 3 In-DRAM Neural Network Accelerator
3.1 Backgrounds
3.1.1 DRAM Hierarchy
3.1.2 DRAM Basic Operation
3.1.3 DRAM Commands with Timing Parameters
3.1.4 Bit-wise Operation in DRAM
3.2 Motivations
3.3 Proposed Architecture
3.3.1 Operation Examples of Row Operator
3.3.2 Convolutions on DRAM Chip
3.4 Data Flow
3.4.1 Input Broadcasting in DRAM
3.4.2 Input Data Movement With M2V
3.4.3 Internal Data Movement With SiD
3.4.4 Data Partitioning for Parallel Operation
3.5 Experiments
3.5.1 Performance Estimation
3.5.2 Configuration of In-DRAM Accelerator
3.5.3 Improving the Accuracy of BNN
3.5.4 Comparison with the Existing Works
3.6 Discussion
3.6.1 Performance Comparison with ASIC Accelerators
3.6.2 Challenges of the Proposed Architecture
3.7 Conclusion
Chapter 4 Reducing DRAM Refresh Power Consumption by Runtime Profiling of Retention Time and Dual-row Activation
4.1 Introduction
4.2 Background
4.3 Related Works
4.4 Observations
4.5 Solution Overview
4.6 Runtime Profiling
4.6.1 Basic Operation
4.6.2 Profiling Multiple Rows in Parallel
4.6.3 Temperature, Data Backup and Error Check
4.7 Dual-row Activation
4.8 Experiments
4.8.1 Experimental Setup
4.8.2 Refresh Period Improvement
4.8.3 Power Reduction
4.9 Conclusion
Chapter 5 System Integration
5.1 Integrate the Proposed Methods
5.2 Software Stack
Chapter 6 Conclusion
Bibliography
Abstract in Korean
Understanding and Improving the Latency of DRAM-Based Memory Systems
Over the past two decades, the storage capacity and access bandwidth of main
memory have improved tremendously, by 128x and 20x, respectively. These
improvements are mainly due to the continuous technology scaling of DRAM
(dynamic random-access memory), which has been used as the physical substrate
for main memory. In stark contrast with capacity and bandwidth, DRAM latency
has remained almost constant, reducing by only 1.3x in the same time frame.
Therefore, long DRAM latency continues to be a critical performance bottleneck
in modern systems. Increasing core counts and the emergence of ever more data-intensive, latency-critical applications further stress the importance of providing low-latency memory access.
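To put these trends on a common scale, the annualized rates implied by the abstract's figures can be computed directly (the 20-year window is an assumption for illustration):

```python
# Annualized improvement implied by total gains over ~20 years.
# The 128x / 20x / 1.3x figures come from the abstract; the 20-year
# window is an assumption for illustration.
years = 20
for metric, total in [("capacity", 128.0), ("bandwidth", 20.0),
                      ("latency", 1.3)]:
    annual = total ** (1 / years)
    print(f"{metric}: {total}x total -> {100 * (annual - 1):.1f}% per year")
# capacity: ~27.5% per year; bandwidth: ~16.2%; latency: only ~1.3% --
# the widening gap that motivates this dissertation.
```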
In this dissertation, we identify three main problems that contribute significantly to the long latency of DRAM accesses. To address these problems, we
present a series of new techniques. Our new techniques significantly improve
both system performance and energy efficiency. We also examine the critical
relationship between supply voltage and latency in modern DRAM chips and
develop new mechanisms that exploit this voltage-latency trade-off to improve
energy efficiency.
The key conclusion of this dissertation is that augmenting DRAM architecture
with simple and low-cost features, and developing a better understanding of
manufactured DRAM chips together lead to significant memory latency reduction
as well as energy efficiency improvement. We hope and believe that the proposed
architectural techniques and the detailed experimental data and observations on
real commodity DRAM chips presented in this dissertation will enable
development of other new mechanisms to improve the performance, energy
efficiency, or reliability of future memory systems.
Comment: PhD Dissertation
Efficient fault tolerance for selected scientific computing algorithms on heterogeneous and approximate computer architectures
Scientific computing and simulation technology play an essential role in solving central challenges in science and engineering. The high computational power of heterogeneous computer architectures makes it possible to accelerate applications in these domains, which are often dominated by compute-intensive mathematical tasks. Scientific, economic, and political decision processes increasingly rely on such applications and therefore create a strong demand for correct and trustworthy results. However, continued semiconductor technology scaling poses increasingly serious threats to the reliability and efficiency of upcoming devices. Various reliability threats can cause crashes, or can silently produce erroneous results. Software-based fault tolerance techniques can protect algorithmic tasks by adding appropriate operations to detect and correct errors at runtime. The major challenges are the runtime overhead of such operations and rounding errors in floating-point arithmetic that can cause false positives. The end of Dennard scaling makes it increasingly difficult to improve compute efficiency across semiconductor technology generations. Approximate computing exploits the inherent error resilience of many applications to achieve efficiency gains with respect to, for instance, power, energy, and execution time. However, scientific applications often impose strict accuracy requirements that demand careful use of approximation techniques.
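A canonical instance of such software-based protection is algorithm-based fault tolerance (ABFT) for linear algebra, where a checksum carried alongside the computation detects errors and the detection threshold must absorb floating-point rounding to avoid false positives. A minimal sketch of that general idea (an illustration, not this thesis's specific method):

```python
# Algorithm-based fault tolerance (ABFT) sketch for y = A @ x:
# with column checksums c = sum of A's rows, sum(y) must equal c @ x
# up to rounding. Generic illustration, not the thesis's exact scheme.
import numpy as np

def checked_matvec(A: np.ndarray, x: np.ndarray, rtol: float = 1e-10):
    c = A.sum(axis=0)                  # column checksums, computed once
    y = A @ x
    expected = c @ x
    # Tolerance must absorb floating-point rounding to avoid false positives.
    if abs(y.sum() - expected) > rtol * max(1.0, abs(expected)):
        raise RuntimeError("fault detected in matrix-vector product")
    return y

rng = np.random.default_rng(0)
A, x = rng.standard_normal((100, 100)), rng.standard_normal(100)
y = checked_matvec(A, x)               # passes: rounding stays below rtol
y[3] += 1e-3                           # inject a silent error...
c = A.sum(axis=0)
print(abs(y.sum() - c @ x) > 1e-10 * abs(c @ x))  # ...detector now trips: True
```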
This thesis provides fault tolerance and approximate computing methods that enable the reliable and efficient execution of linear algebra operations and Conjugate Gradient solvers on heterogeneous and approximate computer architectures. The presented fault tolerance techniques detect and correct errors at runtime with low runtime overhead and high error coverage. At the same time, these techniques are exploited to enable the execution of Conjugate Gradient solvers on approximate hardware by monitoring the underlying error resilience and adjusting the approximation error accordingly. In addition, parameter evaluation and estimation methods are presented that determine the computational efficiency of application executions on approximate hardware.
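As a generic illustration of the monitoring half of this approach (not the thesis's specific scheme), the sketch below runs Conjugate Gradient with a periodically recomputed true residual, a standard guard against silent drift in the cheap residual recurrence:

```python
# Conjugate Gradient with periodic true-residual checks, a common building
# block for running iterative solvers on unreliable/approximate hardware.
# Toy sketch; the thesis's actual monitoring/adjustment scheme differs.
import numpy as np

def cg_checked(A, b, tol=1e-8, check_every=10, max_iter=1000):
    x = np.zeros_like(b)
    r = b - A @ x
    p, rs = r.copy(), r @ r
    for k in range(1, max_iter + 1):
        Ap = A @ p
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap                 # recurrence residual (cheap)
        if k % check_every == 0:
            r = b - A @ x               # true residual (costly, robust)
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            return x, k
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x, max_iter

# Symmetric positive-definite test system
rng = np.random.default_rng(1)
M = rng.standard_normal((50, 50))
A = M @ M.T + 50 * np.eye(50)
b = rng.standard_normal(50)
x, iters = cg_checked(A, b)
print(iters, np.linalg.norm(A @ x - b))
```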
An extensive experimental evaluation shows the efficiency and efficacy of the presented methods with respect to the runtime overhead of detecting and correcting errors, the error coverage, and the energy reduction achieved when executing the Conjugate Gradient solvers on approximate hardware.