4 research outputs found
Memory Centric Characterization and Analysis of SPEC CPU2017 Suite
In this paper we provide a comprehensive, memory-centric characterization of the SPEC CPU2017 benchmark suite, using a number of mechanisms including dynamic binary instrumentation, measurements on native hardware using hardware performance counters, and OS-based tools.
We present a number of results, including working set sizes, memory capacity consumption, and memory bandwidth utilization of various workloads. Our experiments reveal that the SPEC CPU2017 workloads are surprisingly memory intensive, with approximately 50% of all dynamic instructions being memory operations. We also show that there is a large variation in the memory footprint and bandwidth utilization profiles across the suite, with some benchmarks using as much as 16 GB of main memory and up to 2.3 GB/s of memory bandwidth.
We also analyze the instruction execution and distribution of the suite and find that the average instruction count of SPEC CPU2017 workloads is an order of magnitude higher than that of SPEC CPU2006. In addition, we find that the FP benchmarks of the SPEC CPU2017 suite have higher compute requirements: on average, FP workloads execute three times as many compute operations as INT workloads.
Comment: 12 pages, 133 figures. A short version of this work was published in the Proceedings of the 2019 ACM/SPEC International Conference on Performance Engineering.
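As a concrete illustration of the counter-based part of this methodology, the sketch below estimates the fraction of dynamic instructions that access memory by running a command under Linux `perf stat`. The event names mem_inst_retired.all_loads and mem_inst_retired.all_stores are an assumption (they exist on recent Intel cores); other microarchitectures expose equivalent counters under different names, and the paper's exact tooling may differ.

    # Minimal sketch: fraction of retired instructions that are loads/stores,
    # measured with hardware performance counters through `perf stat`.
    import subprocess
    import sys

    EVENTS = "instructions,mem_inst_retired.all_loads,mem_inst_retired.all_stores"

    def memory_instruction_fraction(cmd: list[str]) -> float:
        """Run `cmd` under perf stat; return (loads + stores) / instructions."""
        result = subprocess.run(
            ["perf", "stat", "-x", ",", "-e", EVENTS, "--", *cmd],
            capture_output=True, text=True,
        )
        counts = {}
        for line in result.stderr.splitlines():   # perf stat reports on stderr
            fields = line.split(",")              # CSV: value,unit,event,...
            if len(fields) > 2 and fields[0].isdigit():
                counts[fields[2]] = int(fields[0])
        loads = counts.get("mem_inst_retired.all_loads", 0)
        stores = counts.get("mem_inst_retired.all_stores", 0)
        return (loads + stores) / counts["instructions"]

    if __name__ == "__main__":
        frac = memory_instruction_fraction(sys.argv[1:] or ["ls", "-lR", "/usr"])
        print(f"memory instructions: {frac:.1%} of all dynamic instructions")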
RAMPART: RowHammer Mitigation and Repair for Server Memory Systems
RowHammer attacks are a growing security and reliability concern for DRAMs
and computer systems as they can induce many bit errors that overwhelm error
detection and correction capabilities. System-level solutions are needed as
process technology and circuit improvements alone are unlikely to provide
complete protection against RowHammer attacks in the future. This paper
introduces RAMPART, a novel approach to mitigating RowHammer attacks and
improving server memory system reliability by remapping addresses in each DRAM
in a way that confines RowHammer bit flips to a single device for any victim
row address. When RAMPART is paired with Single Device Data Correction (SDDC)
and patrol scrub, error detection and correction methods in use today, the
system can detect and correct bit flips from a successful attack, allowing the
memory system to heal itself. RAMPART is compatible with DDR5 RowHammer
mitigation features, as well as a wide variety of algorithmic and probabilistic
tracking methods. We also introduce BRC-VL, a variation of DDR5 Bounded Refresh Configuration (BRC) that improves system performance by reducing mitigation overhead, and show that it works well with probabilistic sampling methods to combat traditional and victim-focused mitigation attacks such as Half-Double. The
combination of RAMPART, SDDC, and scrubbing enables stronger RowHammer
resistance by correcting bit flips from one successful attack. Uncorrectable
errors are much less likely, requiring two successful attacks before the memory
system is scrubbed.
Comment: 16 pages, 13 figures. A version of this paper will appear in the Proceedings of MEMSYS2
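To make the remapping idea concrete, the toy model below gives each DRAM device on a rank a device-specific rotation of the row-address bits (a hypothetical mapping chosen purely for illustration; RAMPART's actual remapping function is defined in the paper). The physical neighbors of a hammered row then correspond to different logical victim rows in each device, so any one victim row address collects bit flips from at most one device, which SDDC can correct. Row-address width and device count are likewise illustrative assumptions.

    # Toy demonstration: per-device row remapping confines RowHammer
    # disturbances for any single victim row address to one device.
    W = 16                       # row-address width in bits (assumed)
    N = 1 << W                   # rows per bank
    DEVICES = 10                 # devices covered by SDDC on one rank (assumed)

    def rotl(v: int, s: int) -> int:
        """Rotate a W-bit row address left by s bits."""
        s %= W
        return ((v << s) | (v >> (W - s))) & (N - 1)

    def rotr(v: int, s: int) -> int:
        return rotl(v, W - (s % W))

    def victims(aggressor: int) -> dict[int, list[int]]:
        """Map each logical victim row to the devices where it is disturbed."""
        out: dict[int, list[int]] = {}
        for dev in range(DEVICES):
            phys = rotl(aggressor, dev)          # device-internal aggressor row
            for nb in ((phys - 1) % N, (phys + 1) % N):
                logical = rotr(nb, dev)          # logical row disturbed in dev
                out.setdefault(logical, []).append(dev)
        return out

    v = victims(0x2C3D)
    print(f"{len(v)} distinct victim rows across {DEVICES} devices")
    print("max devices disturbing any one victim row:",
          max(len(d) for d in v.values()))       # 1: flips stay correctable

Because each victim row address sees flips from a single device, SDDC combined with patrol scrub can detect and repair them before a second successful attack accumulates errors in another device.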
Last-Level Cache Partitioning via Memory Virtual Channels
Ph.D. dissertation -- Seoul National University Graduate School: College of Engineering, Department of Electrical and Computer Engineering, February 2023. Advisor: Jangwoo Kim.
Ensuring fairness or providing isolation between multiple workloads with distinct characteristics that are collocated on a single, shared-memory system is a challenge. Recent multicore processors provide last-level cache (LLC) hardware partitioning to support isolation, with the cache partitioning often specified by the user. While more LLC capacity often results in higher performance, in this dissertation we identify through real-machine experiments that a workload allocated more LLC capacity can suffer worse performance, which we refer to as MiW (more is worse).
Through various controlled experiments, we identify that the other workload, allocated less LLC capacity, incurs LLC misses more frequently. It stresses the main memory system shared by both workloads and degrades the performance of the former workload even though LLC partitioning is in place (a balloon effect).
To resolve this problem, we propose virtualizing the data path of main memory controllers and dedicating memory virtual channels (mVCs) to each group of applications, grouped for LLC partitioning (a toy sketch of this mechanism appears after the table of contents below). mVCs can further fine-tune the performance of groups by differentiating buffer sizes among mVCs. They can reduce the total system cost by executing latency-critical and throughput-oriented workloads together on shared machines, whose performance criteria could otherwise be met only on dedicated machines. Experiments on a simulated chip multiprocessor show that our proposals effectively eliminate the MiW phenomenon, hence providing additional opportunities for workload consolidation in a datacenter. Our case study demonstrates potential machine-count savings of 21.8% with mVC, where the consolidation would otherwise violate a service level objective (SLO).
Multicore processor-based systems have recently drawn attention from academia and industry and are in wide use. On a multicore system, applications with different characteristics execute concurrently and share system resources, most notably the last-level cache (LLC) and main memory. Guaranteeing fairness among applications that share these resources, or isolating an application from interference by the others, is difficult. To address this, recent multicore processors have begun to provide LLC partitioning in hardware, with which a user can allocate as much LLC as desired to a given application and shield it from interference by other applications. Although performance usually improves as more LLC capacity is allocated, this work confirms through hardware experiments that an application allocated more LLC capacity can instead suffer degraded performance (MiW, more is worse). Various controlled experiments reveal that the application allocated less LLC capacity incurs LLC misses more often; it intensifies the stress on the main memory system shared by the applications and degrades the other application's performance despite the isolation provided by LLC partitioning.
To resolve the MiW phenomenon, this work proposes virtualizing the data path of the main memory controller and dedicating memory virtual channels (mVCs) to each application group formed for LLC partitioning. With mVCs, each application group is virtualized as if it owned an independent data path, so even if one group monopolizes its data path it cannot degrade the performance of the other applications, creating a mutually isolated environment. In addition, mVC buffer sizes can be adjusted to fine-tune the performance of each group.
Introducing mVCs can reduce overall system cost. When a latency-critical application and a throughput-oriented application run together, the latency target could not be met without mVCs, whereas with mVCs the performance criteria are satisfied while the total system cost decreases. Simulation results on a chip multiprocessor show that the MiW phenomenon is effectively eliminated and that additional opportunities for consolidating applications in a datacenter are provided. A case study shows that adopting mVC can save system cost by up to 21.8%, where the service level objective (SLO) would be violated without mVC.
1. Introduction 1
1.1 Research Contributions 5
1.2 Outline 6
2. Background 7
2.1 Cache Hierarchy and Policies 7
2.2 Cache Partitioning 10
2.3 Benchmarks 15
2.3.1 Working Set Size 16
2.3.2 Top-down Analysis 17
2.3.3 Profiling Tools 19
3. More-is-Worse Phenomenon 21
3.1 More LLC Leading to Performance Drop 21
3.2 Synthetic Workload Evaluation 27
3.3 Impact on Latency-critical Workloads 31
3.4 Workload Analysis 33
3.5 The Root Cause of the MiW Phenomenon 35
3.6 Limitations of Existing Solutions 41
3.6.1 Memory Bandwidth Throttling 41
3.6.2 Fairness-aware Memory Scheduling 44
4. Virtualizing Memory Channels 49
4.1 Memory Virtual Channel (mVC) 50
4.2 mVC Buffer Allocation Strategies 52
4.3 Evaluation 57
4.3.1 Experimental Setup 57
4.3.2 Reproducing Hardware Results 59
4.3.3 Mitigating MiW through mVC 60
4.3.4 Evaluation on Four Groups 64
4.3.5 Potentials for Operating Cost Savings with mVC 66
5. Related Work 71
5.1 Component-wise QoS/Fairness for Shared Resources 71
5.2 Holistic Approaches to QoS/Fairness 73
5.3 MiW on Recent Architectures 74
6. Conclusion 76
6.1 Discussion 78
6.2 Future Work 79
Bibliography 81
Abstract in Korean 89
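The sketch promised in the abstract above: a toy queueing model of the balloon effect and the mVC remedy. Group A is latency-critical and issues memory requests sparsely; group B floods the shared memory data path. With one shared request buffer, B's backlog queues ahead of A's requests and inflates A's latency even though the LLC itself is partitioned; with per-group virtual channels arbitrated round-robin, A's latency stays near the unloaded value. The arrival rates, buffer organization, and round-robin arbiter are illustrative assumptions, not the dissertation's simulator configuration.

    # Toy model: shared memory-request buffer vs. per-group virtual channels.
    import random
    from collections import deque

    SERVICE = 1                  # cycles to service one memory request
    CYCLES = 100_000

    def run(virtual_channels: bool) -> float:
        """Return group A's average memory latency in cycles."""
        rng = random.Random(0)   # same arrival stream for both configurations
        if virtual_channels:
            queues = {"A": deque(), "B": deque()}
        else:
            queues = {"shared": deque()}
        rr = deque(queues)       # round-robin order over the buffers
        busy_until = lat_sum = lat_cnt = 0
        for now in range(CYCLES):
            # Arrivals: A is sparse (2% of cycles), B floods (90% of cycles).
            if rng.random() < 0.02:
                queues["A" if virtual_channels else "shared"].append(("A", now))
            if rng.random() < 0.90:
                queues["B" if virtual_channels else "shared"].append(("B", now))
            # Service one request at a time, rotating across the buffers.
            if now >= busy_until:
                for _ in range(len(rr)):
                    rr.rotate(-1)
                    q = queues[rr[0]]
                    if q:
                        grp, t = q.popleft()
                        if grp == "A":
                            lat_sum += now - t + SERVICE
                            lat_cnt += 1
                        busy_until = now + SERVICE
                        break
        return lat_sum / lat_cnt

    print(f"group A avg latency, shared buffer : {run(False):6.1f} cycles")
    print(f"group A avg latency, per-group mVCs: {run(True):6.1f} cycles")

With the shared buffer, group A's requests wait behind group B's backlog; with per-group channels, the arbiter reaches A's buffer within a cycle or two of each arrival, which is the isolation property the dissertation attributes to mVCs.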