45 research outputs found
Generating and auto-tuning parallel stencil codes
In this thesis, we present a software framework, Patus, which generates high-performance stencil codes for different types of hardware platforms, including current multicore CPU and graphics processing unit architectures. The ultimate goals of the framework are productivity, portability (of both the code and its performance), and high performance on the target platform.
A stencil computation updates every grid point in a structured grid based on the values of its neighboring points. This class of computations occurs frequently in scientific and general purpose computing (e.g., in partial differential equation solvers or in image processing), justifying the focus on this kind of computation.
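As a concrete illustration (a generic Python sketch, not Patus-generated code), a 5-point Jacobi update for a 2-D diffusion problem replaces each interior point with the average of its four neighbors:

```python
def jacobi_step(u):
    """One 5-point stencil sweep over a square grid: each interior
    point becomes the average of its four von Neumann neighbors."""
    n = len(u)
    v = [row[:] for row in u]  # buffer for the next time step
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            v[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j] +
                              u[i][j - 1] + u[i][j + 1])
    return v

# A hot spot in the middle of a cold 5x5 grid diffuses outward.
grid = [[0.0] * 5 for _ in range(5)]
grid[2][2] = 1.0
grid = jacobi_step(grid)
```

A production stencil compiler would generate cache-blocked, vectorized versions of exactly this loop nest; the point here is only the neighbor-access pattern.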
The proposed key ingredients to achieve the goals of productivity, portability, and performance are domain specific languages (DSLs) and the auto-tuning methodology.
The Patus stencil specification DSL allows the programmer to express a stencil computation concisely, independently of hardware architecture-specific details. It thus increases programmer productivity by relieving them of low-level programming model issues and of manually applying hardware platform-specific code optimization techniques. The use of domain-specific languages also implies code reusability: once implemented, the same stencil specification can be reused on different hardware platforms, i.e., the specification code is portable across hardware architectures. Gearing the language towards a special purpose makes it amenable to more aggressive optimizations and therefore to potentially higher performance.
Auto-tuning provides performance and performance portability through automated adaptation of implementation-specific parameters to the characteristics of the hardware on which the code will run. Automating parameter tuning, which essentially amounts to solving an integer programming problem whose objective function is the code's measured performance as a function of the parameter configuration, also makes the system more productive to use than if the programmer had to fine-tune the code manually.
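The tuning loop itself can be sketched as an exhaustive search over a small parameter space (illustrative Python; the timing function below is a synthetic stand-in for a real benchmark run, and the parameter names are hypothetical):

```python
from itertools import product

def autotune(benchmark, space):
    """Exhaustively evaluate every parameter configuration and keep
    the one with the lowest measured cost (e.g., run time)."""
    best_cfg, best_cost = None, float("inf")
    for values in product(*space.values()):
        cfg = dict(zip(space.keys(), values))
        cost = benchmark(cfg)
        if cost < best_cost:
            best_cfg, best_cost = cfg, cost
    return best_cfg, best_cost

# Synthetic cost model standing in for a real timing run: pretend a
# cache-block size of 64 and an unroll factor of 4 are optimal.
def fake_benchmark(cfg):
    return abs(cfg["block"] - 64) + abs(cfg["unroll"] - 4)

space = {"block": [16, 32, 64, 128], "unroll": [1, 2, 4, 8]}
best, _ = autotune(fake_benchmark, space)
```

Real auto-tuners prune this search (e.g., with hill climbing or genetic methods), since the configuration space grows combinatorially.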
We show performance results for a variety of stencils for which Patus was used to generate the corresponding implementations. The selection includes stencils taken from two real-world applications: a simulation of the temperature within the human body during hyperthermia cancer treatment, and a seismic application. These examples demonstrate the framework's flexibility and its ability to produce high-performance code.
Disturbance Error Mitigation and RMW Performance Enhancement Techniques for Phase-Change Memory Systems
Thesis (Ph.D.) -- Seoul National University Graduate School: College of Engineering, Department of Electrical and Computer Engineering, August 2021.
Phase-change memory (PCM) marks the beginning of a new era of memory systems, owing to its attractive characteristics. Many memory manufacturers (e.g., Intel, SK Hynix, and Samsung) are developing related products. PCM can be applied in a variety of settings; it is not limited to extreme-scale databases. For example, PCM's non-volatility gives it low standby power; hence, computation-intensive applications and mobile applications (i.e., those with long memory idle times) are well suited to PCM-based computing systems.
Despite these fascinating features, PCM is still far from the general commercial market due to low reliability and long latency. In particular, low reliability has been a painful problem for PCM over the past decades. As semiconductor process technology has rapidly scaled down over the years, DRAM has reached the 10 nm class. In addition, it is reported that write disturbance errors (WDEs) will become a serious issue for PCM if it scales down below the 54 nm class. Addressing WDEs is therefore essential to make PCM competitive with DRAM. To overcome this problem, this dissertation proposes a novel approach that can restore meta-stable cells on demand by leveraging two-level SRAM-based tables, thereby significantly reducing the number of WDEs. Furthermore, a novel randomized approach is proposed to implement a replacement policy that would otherwise require hundreds of read ports on the SRAM.
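The randomized replacement idea can be illustrated in isolation (a generic sketch, not the dissertation's actual mechanism): instead of scanning every SRAM entry for the minimum, which in hardware would need one read port per entry, sample a few entries and evict the smallest counter among the sample:

```python
import random

def approx_min_victim(counters, sample_size, rng):
    """Pick a replacement victim by sampling `sample_size` random
    entries and returning the index of the smallest counter among
    them, instead of scanning all entries for the true minimum."""
    idx = rng.sample(range(len(counters)), sample_size)
    return min(idx, key=lambda i: counters[i])

rng = random.Random(0)
counters = [5, 1, 9, 7, 3, 8, 2, 6]
victim = approx_min_victim(counters, sample_size=3, rng=rng)
```

With a sample of the whole table the estimator degenerates to an exact minimum search; small samples trade a little accuracy for drastically cheaper hardware.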
The second problem of PCM is its long latency compared to DRAM. In particular, PCM tries to enhance its throughput by adopting a larger transaction unit; however, because this unit differs in size from the general-purpose processor's cache line, a read-modify-write (RMW) module must be introduced, which further degrades system performance. Since there has been no prior research on RMW in PCM-based memory systems, this dissertation proposes a novel architecture to enhance the overall performance and reliability of a PCM-based memory system containing an RMW module. The proposed architecture enhances data re-usability without introducing extra storage resources. Furthermore, a novel operation that merges commands regardless of command type is proposed to enhance performance notably.
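The underlying size mismatch can be sketched as follows (illustrative Python, with an assumed 256-byte PCM transaction unit and 64-byte cache lines, not the dissertation's design): coalescing line writes that fall into the same unit reduces the number of RMW operations:

```python
def coalesce_writes(line_addrs, unit=256):
    """Group cache-line write addresses by the larger PCM transaction
    unit they fall into: each group then costs a single
    read-modify-write instead of one RMW per cache line."""
    groups = {}
    for addr in line_addrs:
        groups.setdefault(addr // unit, set()).add(addr)
    return groups

# Four 64-byte line writes, but only two distinct 256-byte units,
# so two RMW operations suffice instead of four.
writes = [0x000, 0x040, 0x100, 0x140]
groups = coalesce_writes(writes)
```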
Another problem is the absence of a full simulation platform for PCM. While the publicly announced features of PCM-related products (i.e., Intel Optane) are scarce for reasons of confidentiality, the available information can be integrated to develop an architecture simulator that resembles the shipping product. To this end, this dissertation gathers all available information on the modules of a PCM controller and implements a dedicated simulator for future research purposes.

1 INTRODUCTION
1.1 Limitation of Traditional Main Memory Systems
1.2 Phase-Change Memory as Main Memory
1.2.1 Opportunities of PCM-based System
1.2.2 Challenges of PCM-based System
1.3 Dissertation Overview
2 BACKGROUND AND PREVIOUS WORK
2.1 Phase-Change Memory
2.2 Mitigation Schemes for Write Disturbance Errors
2.2.1 Write Disturbance Errors
2.2.2 Verification and Correction
2.2.3 Lazy Correction
2.2.4 Data Encoding-based Schemes
2.2.5 Sparse-Insertion Write Cache
2.3 Performance Enhancement for Read-Modify-Write
2.3.1 Traditional Read-Modify-Write
2.3.2 Write Coalescing for RMW
2.4 Architecture Simulators for PCM
2.4.1 NVMain
2.4.2 Ramulator
2.4.3 DRAMsim3
3 IN-MODULE DISTURBANCE BARRIER
3.1 Motivation
3.2 IMDB: In-Module Disturbance Barrier
3.2.1 Architectural Overview
3.2.2 Implementation of Data Structures
3.2.3 Modification of Media Controller
3.3 Replacement Policy
3.3.1 Replacement Policy for IMDB
3.3.2 Approximate Lowest Number Estimator
3.4 Putting All Together: Case Studies
3.5 Evaluation
3.5.1 Configuration
3.5.2 Architectural Exploration
3.5.3 Effectiveness of the Replacement Policy
3.5.4 Sensitivity to Main Table Configuration
3.5.5 Sensitivity to Barrier Buffer Size
3.5.6 Sensitivity to AppLE Group Size
3.5.7 Comparison with Other Studies
3.6 Discussion
3.7 Summary
4 INTEGRATION OF AN RMW MODULE IN A PCM-BASED SYSTEM
4.1 Motivation
4.2 Utilization of DRAM Cache for RMW
4.2.1 Architectural Design
4.2.2 Algorithm
4.3 Typeless Command Merging
4.3.1 Architectural Design
4.3.2 Algorithm
4.4 An Alternative Implementation: SRC-RMW
4.4.1 Implementation of SRC-RMW
4.4.2 Design Constraint
4.5 Case Study
4.6 Evaluation
4.6.1 Configuration
4.6.2 Speedup
4.6.3 Read Reliability
4.6.4 Energy Consumption: Selecting a Proper Page Size
4.6.5 Comparison with Other Studies
4.7 Discussion
4.8 Summary
5 AN ALL-INCLUSIVE SIMULATOR FOR A PCM CONTROLLER
5.1 Motivation
5.2 PCMCsim: PCM Controller Simulator
5.2.1 Architectural Overview
5.2.2 Underlying Classes of PCMCsim
5.2.3 Implementation of Contention Behavior
5.2.4 Modules of PCMCsim
5.3 Evaluation
5.3.1 Correctness of the Simulator
5.3.2 Comparison with Other Simulators
5.4 Summary
6 Conclusion
Abstract (In Korean)
Acknowledgment
Techniques of design optimisation for algorithms implemented in software
The overarching objective of this thesis was to develop tools for parallelising, optimising,
and implementing algorithms on parallel architectures, in particular General Purpose
Graphics Processors (GPGPUs). Two projects were chosen from different application areas
in which GPGPUs are used: a defence application involving image compression, and a
modelling application in bioinformatics (computational immunology). Each project had its
own specific objectives, as well as supporting the overall research goal.
The defence / image compression project was carried out in collaboration with the Jet Propulsion Laboratory. The specific questions were: to what extent an algorithm designed for bit-serial hardware implementation of the lossless compression of hyperspectral images on board unmanned aerial vehicles (UAVs) could be parallelised, whether GPGPUs could be used to implement that algorithm, and whether a software implementation, with or without GPGPU acceleration, could match the throughput of a dedicated hardware (FPGA) implementation.
The dependencies within the algorithm were analysed, and the algorithm parallelised. The
algorithm was implemented in software for GPGPU, and optimised. During the optimisation
process, profiling revealed less than optimal device utilisation, but no further optimisations
resulted in an improvement in speed. The design had hit a local maximum of performance.
Analysis of the arithmetic intensity and data-flow exposed flaws in the standard optimisation
metric of kernel occupancy used for GPU optimisation. Redesigning the implementation
with revised criteria (fused kernels, lower occupancy, and greater data locality) led to a new
implementation with 10x higher throughput. GPGPUs were shown to be viable for on-board
implementation of the CCSDS lossless hyperspectral image compression algorithm,
exceeding the performance of the hardware reference implementation, and providing
sufficient throughput for the next generation of image sensors as well.
The second project was carried out in collaboration with biologists at the University of
Arizona and involved modelling a complex biological system โ VDJ recombination involved
in the formation of T-cell receptors (TCRs). Generation of immune receptors (T cell receptor
and antibodies) by VDJ recombination is an enormously complex process, which can
theoretically synthesize greater than 10^18 variants. Originally thought to be a random
process, the underlying mechanisms clearly have a non-random nature that preferentially
creates a small subset of immune receptors in many individuals. Understanding this bias is a
longstanding problem in the field of immunology. Modelling the process of VDJ
recombination to determine the number of ways each immune receptor can be synthesized,
previously thought to be untenable, is a key first step in determining how this special
population is made. The computational tools developed in this thesis have allowed
immunologists for the first time to comprehensively test and invalidate a longstanding theory
(convergent recombination) for how this special population is created, while generating the
data needed to develop novel hypotheses.
Towards Computational Efficiency of Next Generation Multimedia Systems
To address the throughput demands of complex applications (such as multimedia), a next-generation system designer needs to co-design and co-optimize the hardware and software layers. Hardware and software knobs must be tuned in synergy to increase throughput efficiency. This thesis provides such algorithmic and architectural solutions while considering new technology challenges (power caps and memory aging). The goal is to maximize throughput efficiency under timing and hardware constraints.
A configurable vector processor for accelerating speech coding algorithms
The growing demand for voice-over-packet (VoIP) services and multimedia-rich applications has made the efficient, real-time implementation of low-bit-rate speech coders on embedded VLSI platforms increasingly important. Such speech coders are designed to substantially reduce bandwidth requirements, enabling dense multi-channel gateways in a small form factor. This, however, comes at a high computational cost that mandates the use of very high performance embedded processors.
This thesis investigates the potential acceleration of two major ITU-T speech coding
algorithms, namely G.729A and G.723.1, through their efficient implementation on a
configurable, extensible vector embedded CPU architecture. New scalar and vector ISA extensions were introduced, resulting in up to an 80% reduction in the dynamic instruction count of both workloads. These instructions were subsequently encapsulated into a parametric, hybrid SISD (scalar)/SIMD (vector) processor. This work presents the research and implementation of the vector datapath of this vector coprocessor, which is tightly coupled to a SPARC V8-compliant CPU; the optimization and simulation methodologies employed; and the use of Electronic System Level (ESL) techniques to rapidly design SIMD datapaths.
Optimizing AI at the Edge: from network topology design to MCU deployment
The first topic analyzed in the thesis will be Neural Architecture Search (NAS).
I will focus on two different tools that I developed, one to optimize the architecture of Temporal Convolutional Networks (TCNs), a convolutional model for time-series processing that has recently emerged, and one to optimize the data precision of tensors inside CNNs.
The first NAS proposed explicitly targets the optimization of the most peculiar architectural parameters of TCNs, namely dilation, receptive field, and the number of features in each layer. Note that this is the first NAS that explicitly targets these networks.
The second NAS proposed instead focuses on finding the most efficient data format for a target CNN, at the granularity of individual layer filters. Note that applying these two NASes in sequence allows an "application designer" to minimize the structure of the employed neural network, minimizing its number of operations or its memory usage.
After that, the second topic described is the optimization of neural network deployment on edge devices; exploiting the scarce resources of edge platforms is critical for efficient NN execution on MCUs.
To do so, I will introduce DORY (Deployment Oriented to memoRY), an automatic tool to deploy CNNs on low-cost MCUs.
DORY, in different steps, can manage the different levels of memory inside the MCU automatically, offload the computation workload (i.e., the different layers of a neural network) to dedicated hardware accelerators, and automatically generate ANSI C code that orchestrates off- and on-chip transfers with the computation phases.
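The off-/on-chip orchestration pattern such a tool emits can be sketched abstractly (illustrative Python of the generic double-buffering pattern; DORY itself generates ANSI C): with two on-chip buffers, the DMA load of the next tile overlaps the computation of the current one:

```python
def double_buffered_schedule(n_tiles):
    """Build a software-pipelined schedule: while tile i is being
    computed from one on-chip buffer, tile i+1 is DMA-loaded into
    the other buffer."""
    schedule = [("load", 0, "buf0")]  # prologue: fill first buffer
    for i in range(n_tiles):
        buf = "buf0" if i % 2 == 0 else "buf1"
        nxt = "buf1" if i % 2 == 0 else "buf0"
        phase = [("compute", i, buf)]
        if i + 1 < n_tiles:
            # Issue the next load before computing, so DMA and
            # compute proceed concurrently.
            phase.insert(0, ("load", i + 1, nxt))
        schedule.extend(phase)
    return schedule

sched = double_buffered_schedule(3)
```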
On top of this, I will introduce two optimized computation libraries that DORY can exploit to deploy TCNs and Transformers on edge efficiently.
I conclude the thesis with two applications in bio-signal analysis, i.e., heart rate tracking and sEMG-based gesture recognition.
Systematic Design Methods for Efficient Off-Chip DRAM Access
Typical design flows for digital hardware take, as their input, an abstract description
of computation and data transfer between logical memories. No existing commercial
high-level synthesis tool demonstrates the ability to map logical memory inferred from
a high level language to external memory resources. This thesis develops techniques for
doing this, specifically targeting off-chip dynamic memory (DRAM) devices. These are
a commodity technology in widespread use with standardised interfaces. In use, the
bandwidth of an external memory interface and the latency of memory requests asserted
on it may become the bottleneck limiting the performance of a hardware design. Careful
consideration of this is especially important when designing with DRAMs, whose latency
and bandwidth characteristics depend upon the sequence of memory requests issued by
a controller.
Throughout the work presented here, we pursue exact compile-time methods for designing
application-specific memory systems with a focus on guaranteeing predictable performance
through static analysis. This contrasts with much of the surveyed existing work,
which considers general purpose memory controllers and optimized policies which improve
performance in experiments run using simulation of suites of benchmark codes.
The work targets loop nests within imperative source code, extracting a mathematical representation of the loop-nest statements and their associated memory accesses, referred to as the "Polytope Model". We extend this mathematical representation to represent the physical DRAM "row" and "column" structures accessed when performing memory transfers.
From this augmented representation, we can automatically derive DRAM controllers
which buffer data in on-chip memory and transfer data in an efficient order. Buffering data and exploiting "reuse" of data is shown to enable up to a 50× reduction in the quantity of data transferred to external memory. Reordering memory transactions using knowledge of the physical layout of the DRAM device allows up to a 4× improvement in the efficiency of those data transfers.
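A toy model (illustrative Python, not the thesis's compile-time polytope method) shows why request order matters: grouping requests by DRAM row turns row misses, each of which costs a precharge/activate pair, into row hits:

```python
def row_hit_order(requests, row_size=1024):
    """Reorder memory requests so that accesses to the same DRAM row
    are issued back-to-back."""
    return sorted(requests, key=lambda addr: (addr // row_size, addr))

def count_row_misses(requests, row_size=1024):
    """Count how often the active row changes; each change costs a
    precharge/activate pair in a real DRAM device."""
    misses, open_row = 0, None
    for addr in requests:
        row = addr // row_size
        if row != open_row:
            misses, open_row = misses + 1, row
    return misses

# Alternating rows: 4 row misses as issued, only 2 after reordering.
reqs = [0, 2048, 64, 2112]
before = count_row_misses(reqs)
after = count_row_misses(row_hit_order(reqs))
```

Real controllers must balance such reordering against fairness and read/write turnaround; the compile-time approach above sidesteps this by fixing the order statically.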
Software for Exascale Computing - SPPEXA 2016-2019
This open access book summarizes the research done and results obtained in the second funding phase of the Priority Program 1648 "Software for Exascale Computing" (SPPEXA) of the German Research Foundation (DFG), presented at the SPPEXA Symposium in Dresden during October 21-23, 2019. In that respect, it both represents a continuation of Vol. 113 in Springer's series Lecture Notes in Computational Science and Engineering, the corresponding report of SPPEXA's first funding phase, and provides an overview of SPPEXA's contributions towards exascale computing in today's supercomputer technology. The individual chapters address one or more of the research directions (1) computational algorithms, (2) system software, (3) application software, (4) data management and exploration, (5) programming, and (6) software tools. The book has an interdisciplinary appeal: scholars from computational sub-fields in computer science, mathematics, physics, or engineering will find it of particular interest.