Optimisation of computational fluid dynamics applications on multicore and manycore architectures
This thesis presents a number of optimisations used for mapping the underlying computational patterns of finite volume CFD applications onto the architectural features of modern multicore and manycore processors. Their effectiveness and impact are demonstrated in a block-structured and an unstructured code, each of a size representative of industrial applications, across a variety of processor architectures that make up contemporary high-performance computing systems.
The importance of vectorization, and the ways through which it can be achieved, is demonstrated in both structured and unstructured solvers, together with the impact that the underlying data layout can have on performance. The utility of auto-tuning for ensuring performance portability across multiple architectures is demonstrated and used for selecting optimal parameters such as prefetch distances for software prefetching or tile sizes for strip mining/loop tiling. On the manycore architectures, running more than one thread per physical core is found to be crucial for good performance on processors with in-order core designs, but not required on out-of-order architectures. For architectures with high-bandwidth memory packages, exploiting that memory, whether explicitly or implicitly, is shown to be imperative for best performance.
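The empirical auto-tuning approach described above can be sketched in a few lines: time a blocked loop nest for a set of candidate tile sizes and keep the fastest. This is a minimal illustration of the idea, not the thesis's tuner; the kernel (a tiled 2-D sum of squares) and the candidate sizes are arbitrary choices.

```python
import time

def blocked_kernel(a, n, tile):
    """Sum of squares over an n x n array, traversed tile by tile
    (strip mining/loop tiling improves cache reuse on large arrays)."""
    total = 0.0
    for ii in range(0, n, tile):
        for jj in range(0, n, tile):
            for i in range(ii, min(ii + tile, n)):
                for j in range(jj, min(jj + tile, n)):
                    total += a[i][j] * a[i][j]
    return total

def autotune_tile(a, n, candidates):
    """Empirical auto-tuning: time each candidate tile size, keep the best."""
    best_tile, best_time = None, float("inf")
    for tile in candidates:
        start = time.perf_counter()
        blocked_kernel(a, n, tile)
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best_tile, best_time = tile, elapsed
    return best_tile

n = 128
a = [[float(i + j) for j in range(n)] for i in range(n)]
print(autotune_tile(a, n, [8, 16, 32, 64]))
```

In a real tuner the search would run per architecture and the winning parameters would be cached, which is what makes the approach a portability tool rather than a one-off experiment.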
The implementation of all of these optimisations led to application speed-ups ranging between 2.7X and 3X on the multicore CPUs and 5.7X to 24X on the manycore processors.
Open Access
Exploiting memory allocations in clusterized many-core architectures
Power efficiency has become the most important requirement for future embedded systems. Modern designs, like those released on mobile devices, reveal that clusterization is the way to improve energy efficiency. However, such architectures are still limited by the memory subsystem (i.e., memory latency problems). This work investigates an alternative approach that exploits on-chip data locality to a large extent, through distributed shared memory systems that permit efficient reuse of on-chip mapped data in clusterized many-core architectures. First, this work reviews the current literature on memory allocations and explores the limitations of cluster-based many-core architectures. Then, several memory allocations are introduced and benchmarked, in terms of scalability, performance, and energy, against the conventional centralized shared memory solution, to reveal which memory allocation is the most appropriate for future mobile architectures. Our results show that distributed shared memory allocations bring performance gains and opportunities to reduce energy consumption.
Medium access control in wireless network-on-chip: a context analysis
© 2018 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
Wireless on-chip communication is a promising candidate to address the performance and efficiency issues that arise when scaling current NoC techniques to manycore processors. A WNoC can serve global and broadcast traffic with ultra-low latency even in thousand-core chips, thus acting as a natural complement to conventional, throughput-oriented wireline NoCs. However, the development of the MAC strategies needed to efficiently share the wireless medium among the increasing number of cores remains a considerable challenge, given the singularities of the environment and the novelty of the research area. In this position article, we present a context analysis describing the physical constraints, performance objectives, and traffic characteristics of the on-chip communication paradigm. We summarize the main differences with respect to traditional wireless scenarios, and then discuss their implications on the design of MAC protocols for manycore WNoC, with the ultimate goal of kickstarting this largely unexplored research area.
Peer reviewed. Postprint (author's final draft).
Energy-efficient cache design techniques using STT-RAM
Thesis (Ph.D.), February 2019.
Over the last decade, the capacity of on-chip caches has continuously increased to mitigate the memory wall problem. However, SRAM, the dominant memory technology for caches, is not suitable for such large caches because of its low density and large static power. One way to mitigate these downsides of SRAM caches is to replace SRAM with a more efficient memory technology. Spin-Transfer Torque RAM (STT-RAM), one of the emerging memory technologies, is a promising candidate as an alternative to SRAM: it can compensate for the drawbacks of SRAM with its non-volatility and small cell size. However, STT-RAM has poor write characteristics, namely high write energy and long write latency, so simply replacing SRAM with STT-RAM increases cache energy. To overcome the poor write characteristics of STT-RAM, this dissertation explores three design techniques for energy-efficient caches using STT-RAM.
The first part of the dissertation focuses on combining STT-RAM with an exclusive cache hierarchy. Exclusive caches are known to provide higher effective cache capacity than inclusive caches by removing duplicated copies of cache blocks across hierarchy levels. However, in exclusive cache hierarchies, every block evicted from the upper-level cache is written back to the last-level cache regardless of its dirtiness, thereby incurring extra write overhead. This makes it challenging to use STT-RAM for exclusive last-level caches, given its high write energy and long write latency. To mitigate this problem, we design an SRAM/STT-RAM hybrid cache architecture based on reuse distance prediction.
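To make the reuse-distance idea concrete, the sketch below computes stack reuse distances from a block address trace and uses a threshold to decide SRAM versus STT-RAM placement. The threshold, the trace, and the placement rule are illustrative assumptions for exposition, not the predictor designed in the dissertation.

```python
def reuse_distances(trace):
    """For each access, the number of distinct blocks touched since the
    previous access to the same block (None on first access)."""
    last_seen = {}
    distances = []
    for i, block in enumerate(trace):
        if block in last_seen:
            # distinct blocks accessed strictly between the two uses
            distances.append(len(set(trace[last_seen[block] + 1:i])))
        else:
            distances.append(None)
        last_seen[block] = i
    return distances

def place_blocks(trace, threshold):
    """Toy placement policy: blocks re-used within `threshold` distinct
    blocks go to the small, write-friendly SRAM portion; the rest, whose
    writes are rarer per unit time, can tolerate STT-RAM."""
    placement = {}
    for block, dist in zip(trace, reuse_distances(trace)):
        if dist is not None:
            placement[block] = "SRAM" if dist <= threshold else "STT-RAM"
    return placement

trace = ["A", "B", "A", "C", "D", "E", "B"]
print(reuse_distances(trace))   # [None, None, 1, None, None, None, 4]
print(place_blocks(trace, 2))   # {'A': 'SRAM', 'B': 'STT-RAM'}
```

A hardware predictor would of course approximate this with bounded state (e.g. sampled sets and counters) rather than scanning the full trace.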
The second part of the dissertation explores trade-offs in the design of volatile STT-RAM caches. Because the write operation of STT-RAM is inefficient, various solutions have been proposed to tackle this inefficiency. One of them is redesigning the STT-RAM cell for better write characteristics at the cost of a shortened retention time (i.e., volatile STT-RAM). Since retention failure of STT-RAM is a stochastic process, the extra overhead of periodic scrubbing with an error-correcting code (ECC) is required to tolerate such failures. With an analysis based on an analytic STT-RAM model, we have conducted extensive experiments on various volatile STT-RAM cache design parameters, including scrubbing period, ECC strength, and target failure rate. The experimental results show the impact of these parameter variations on last-level cache energy and performance, and provide a guideline for designing a volatile STT-RAM cache with ECC and scrubbing.
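The stochastic retention behaviour mentioned above is commonly modelled with an exponential failure law: a cell retains its value for an exponentially distributed time with mean τ = τ₀·exp(Δ), where Δ is the thermal stability factor and τ₀ is on the order of 1 ns. The sketch below uses that standard model to relate the scrubbing period to the per-bit failure probability; the parameter values are illustrative, not the dissertation's.

```python
import math

TAU0 = 1e-9  # attempt period, ~1 ns (a common assumption in STT-RAM models)

def retention_time(delta):
    """Mean retention time of a cell with thermal stability factor delta."""
    return TAU0 * math.exp(delta)

def bit_failure_prob(delta, scrub_period):
    """Probability that a bit flips within one scrubbing interval,
    assuming exponentially distributed retention failures."""
    return 1.0 - math.exp(-scrub_period / retention_time(delta))

# Lowering delta improves write energy/latency but shortens retention,
# so a shorter scrubbing period (or stronger ECC) is needed to hold a
# target failure rate.
for delta in (35, 45, 55):
    print(delta, bit_failure_prob(delta, scrub_period=1e-3))
```

This is exactly the trade-off space the chapter sweeps: Δ (write cost vs. retention), scrubbing period (energy overhead vs. exposure window), and ECC strength (area vs. tolerated failures).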
The last part of the dissertation proposes Benzene, an energy-efficient distributed SRAM/STT-RAM hybrid cache architecture for manycore systems running multiple applications. It is based on the observation that a naive application of hybrid cache techniques to distributed caches in a manycore architecture suffers from limited energy reduction due to uneven utilization of the scarce SRAM. We propose two levels of optimization: intra-bank and inter-bank. Intra-bank optimization leverages a highly-associative cache design, achieving a more uniform distribution of writes within a bank. Inter-bank optimization evenly balances the amount of write-intensive data across the banks.
Abstract i
Contents iii
List of Figures vii
List of Tables xi
Chapter 1 Introduction 1
1.1 Exclusive Last-Level Hybrid Cache 2
1.2 Designing Volatile STT-RAM Cache 4
1.3 Distributed Hybrid Cache 5
Chapter 2 Background 9
2.1 STT-RAM 9
2.1.1 Thermal Stability 10
2.1.2 Read and Write Operation of STT-RAM 11
2.1.3 Failures of STT-RAM 11
2.1.4 Volatile STT-RAM 13
2.1.5 Related Work 14
2.2 Exclusive Last-Level Hybrid Cache 18
2.2.1 Cache Hierarchies 18
2.2.2 Related Work 19
2.3 Distributed Hybrid Cache 21
2.3.1 Prediction Hybrid Cache 21
2.3.2 Distributed Cache Partitioning 22
2.3.3 Related Work 23
Chapter 3 Exclusive Last-Level Hybrid Cache 27
3.1 Motivation 27
3.1.1 Exclusive Cache Hierarchy 27
3.1.2 Reuse Distance 29
3.2 Architecture 30
3.2.1 Reuse Distance Predictor 30
3.2.2 Hybrid Cache Architecture 32
3.3 Evaluation 34
3.3.1 Methodology 34
3.3.2 LLC Energy Consumption 35
3.3.3 Main Memory Energy Consumption 38
3.3.4 Performance 39
3.3.5 Area Overhead 39
3.4 Summary 39
Chapter 4 Designing Volatile STT-RAM Cache 41
4.1 Analysis 41
4.1.1 Retention Failure of a Volatile STT-RAM Cell 41
4.1.2 Memory Array Design 43
4.2 Evaluation 45
4.2.1 Methodology 45
4.2.2 Last-Level Cache Energy 46
4.2.3 Performance 51
4.3 Summary 52
Chapter 5 Distributed Hybrid Cache 55
5.1 Motivation 55
5.2 Architecture 58
5.2.1 Intra-Bank Optimization 59
5.2.2 Inter-Bank Optimization 63
5.2.3 Other Optimizations 67
5.3 Evaluation Methodology 69
5.4 Evaluation Results 73
5.4.1 Energy Consumption and Performance 73
5.4.2 Analysis of Intra-bank Optimization 76
5.4.3 Analysis of Inter-bank Optimization 78
5.4.4 Impact of Inter-Bank Optimization on Network Energy 79
5.4.5 Sensitivity Analysis 80
5.4.6 Implementation Overhead 81
5.5 Summary 82
Chapter 6 Conclusion 85
Bibliography 88
Abstract (in Korean) 101
Many-core and heterogeneous architectures: programming models and compilation toolchains
The abstract is in the attachment. 677. INGEGNERIA INFORMATICA [Computer Engineering]. Partially open; embargoed until 2021-10-02. Barchi, Francesco
Exceeding Conservative Limits: A Consolidated Analysis on Modern Hardware Margins
Modern large-scale computing systems (data centers, supercomputers, cloud and edge setups, and high-end cyber-physical systems) employ heterogeneous architectures that consist of multicore CPUs, general-purpose many-core GPUs, and programmable FPGAs. The effective utilization of these architectures poses several challenges, a primary one being power consumption. Voltage reduction is one of the most efficient methods to reduce the power consumption of a chip. With the rapid adoption of hardware accelerators (i.e., GPUs and FPGAs) in large data centers and other large-scale computing infrastructures, a comprehensive evaluation of the safe voltage reduction level for each different chip can be employed for efficient reduction of the total power. We present a survey of recent studies of voltage margin reduction at the system level for modern CPUs, GPUs, and FPGAs. The pessimistic voltage guardbands inserted by the silicon vendors can be exploited in all devices for significant power savings. On average, voltage reduction can reach 12% in multicore CPUs, 20% in manycore GPUs, and 39% in FPGAs.
Comment: Accepted for publication in IEEE Transactions on Device and Materials Reliability.
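To put the margins reported above in perspective: dynamic power scales roughly as P ∝ C·V²·f, so undervolting at fixed frequency yields a quadratic power saving. The sketch below computes the dynamic-power reduction implied by those average margins; it is a back-of-the-envelope model that ignores static power and any accompanying frequency change.

```python
def dynamic_power_saving(voltage_reduction):
    """Fractional dynamic-power saving from lowering the supply voltage by
    `voltage_reduction` (0..1) at constant frequency, since P ~ C * V^2 * f."""
    remaining_voltage = 1.0 - voltage_reduction
    return 1.0 - remaining_voltage ** 2

# Average exploitable voltage margins reported in the survey above.
for device, margin in [("CPU", 0.12), ("GPU", 0.20), ("FPGA", 0.39)]:
    print(f"{device}: {margin:.0%} lower V -> "
          f"~{dynamic_power_saving(margin):.0%} less dynamic power")
```

Under this simple model the quoted margins translate to roughly 23%, 36%, and 63% dynamic-power reductions, which is why guardband exploitation is so attractive at data-center scale.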