An Intelligent Framework for Oversubscription Management in CPU-GPU Unified Memory
This paper proposes a novel intelligent framework for oversubscription
management in CPU-GPU UVM. We analyze the current rule-based methods for GPU
memory oversubscription under unified memory and the current learning-based
methods for other computer architecture components. We then identify the
performance gap between the existing rule-based methods and the theoretical
upper bound, as well as the advantages of applying machine intelligence and the
limitations of the existing learning-based methods. The proposed framework
consists of an access pattern classifier followed by a pattern-specific
Transformer-based model trained with a novel loss function aimed at reducing
page thrashing. A policy engine leverages the model's predictions to perform
accurate page prefetching and pre-eviction. We evaluate the framework on a set
of 11 memory-intensive benchmarks from popular benchmark suites. Our solution
outperforms the state-of-the-art (SOTA) methods for oversubscription
management, reducing the number of pages thrashed by 64.4% under 125% memory
oversubscription relative to the baseline, where the SOTA method achieves only
a 17.3% reduction. It improves average IPC by 1.52X under 125% memory
oversubscription and by 3.66X under 150% memory oversubscription. It also
outperforms the existing learning-based methods for page address prediction,
improving top-1 accuracy on average by 6.45% (up to 41.2%) for a single GPGPU
workload and by 10.2% (up to 30.2%) for multiple concurrent GPGPU workloads.
Comment: arXiv admin note: text overlap with arXiv:2203.1267
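The two-stage pipeline the abstract describes (an access pattern classifier whose label selects a pattern-specific predictor, which in turn drives prefetching) can be illustrated with a toy sketch. The heuristics and function names below are illustrative stand-ins, not the paper's Transformer-based model:

```python
# Hypothetical sketch of the classify-then-predict prefetch pipeline.
# A real system would use a learned classifier and a Transformer-based
# predictor; here both stages are simple heuristics for illustration.

def classify_pattern(page_trace):
    """Label a page-fault trace as 'streaming' or 'irregular' (toy heuristic)."""
    deltas = [b - a for a, b in zip(page_trace, page_trace[1:])]
    if deltas and all(d == deltas[0] for d in deltas):
        return "streaming"
    return "irregular"

def predict_next_pages(page_trace, pattern, k=4):
    """Pattern-specific prediction: extrapolate the stride for streaming
    traces, fall back to recently faulted pages otherwise."""
    if pattern == "streaming":
        stride = page_trace[1] - page_trace[0]
        return [page_trace[-1] + stride * i for i in range(1, k + 1)]
    return sorted(set(page_trace))[-k:]  # naive fallback

trace = [100, 102, 104, 106]
pattern = classify_pattern(trace)
prefetch_candidates = predict_next_pages(trace, pattern)
```

A policy engine would then issue prefetches for `prefetch_candidates` and pre-evict pages absent from the predictions.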
Holistic Performance Analysis and Optimization of Unified Virtual Memory
The programming difficulty of creating GPU-accelerated high performance computing (HPC) codes has been greatly reduced by the advent of Unified Memory technologies that abstract the management of physical memory away from the developer. However, these systems incur substantial overhead that, paradoxically, grows for the codes where these technologies are most useful. While these technologies are increasingly adopted in modern HPC frameworks and applications, their performance cost reduces the efficiency of these systems and deters some developers from adopting them entirely. These systems are naturally difficult to optimize due to the large number of interconnected hardware and software components that must be untangled to perform thorough analysis.
In this thesis, we take the first deep dive into a functional implementation of a Unified Memory system, NVIDIA UVM, to evaluate the performance and characteristics of these systems. We show specific hardware and software interactions that cause serialization between host and devices. We further provide a quantitative evaluation of fault handling for various applications under different scenarios, including prefetching and oversubscription. Through lower-level analysis, we find that the driver workload depends on the interactions among application access patterns, GPU hardware constraints, and host OS components. These findings indicate that the cost of host OS components is significant and present across UM implementations. We also provide a proof-of-concept asynchronous approach to memory management in UVM that reduces system overhead and improves application performance. This study provides constructive insight into future implementations and systems, such as Heterogeneous Memory Management.
Architecture Optimizations for Memory Systems of Throughput Processors
Throughput-oriented processors, such as graphics processing units (GPUs), have been increasingly used to accelerate general-purpose computing, including machine learning models utilized in numerous disciplines. Thousands of concurrently running threads in a GPU demand a highly efficient memory subsystem for data supply. In this dissertation, we study the memory architecture of traditional GPUs and reveal that this architecture, initially designed for graphics processing, is less efficient at handling general-purpose computing tasks. We propose several memory architecture optimizations with two primary objectives: (1) optimize the current memory architecture for more efficient handling of general-purpose computing tasks; (2) improve the overall performance of GPUs.
This dissertation has four major parts: (1) The first part deals with L2 cache inefficiency. A key factor that affects the memory subsystem is the order of memory accesses. While reordering memory accesses at the L2 cache has large potential benefits for both the cache and DRAM, little work has been conducted to exploit this. In this work, we investigate the largely unexplored opportunity of L2 cache access reordering. We propose the Cache Access Reordering Tree (CART), a novel architecture that can improve memory subsystem efficiency by actively reordering memory accesses at the L2 cache to be cache-friendly and DRAM-friendly. (2) The second part deals with the miss handling architecture (MHA) in GPUs. A conventional MHA is static in the sense that it provides a fixed number of MSHR entries to track primary misses, and a fixed number of slots within each entry to track secondary misses. This leads to severe entry or slot under-utilization and a poor match to practical workloads, as the number of memory requests to different cache lines can vary significantly. We propose the Dynamically Linked MSHR (DL-MSHR), a novel approach that dynamically forms MSHR entries from a pool of available slots. This approach can self-adapt to primary-miss-predominant applications by forming more entries with fewer slots, and to secondary-miss-predominant applications by forming fewer entries with more slots per entry. (3) The third part aims to improve the performance of Unified Virtual Memory (UVM), which was recently introduced into GPUs. We propose CAPTURE (Capacity-Aware Prefetch with True Usage-Reflected Eviction), a novel microarchitecture scheme that implements coordinated prefetch-eviction for GPU UVM management. CAPTURE utilizes GPU memory status and memory access history to dynamically adjust prefetching and "capture" accurate remaining page-reuse opportunities for improved eviction.
(4) In the fourth part, we propose a comprehensive UVM benchmark suite named UVMBench to facilitate future research on UVM.
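The DL-MSHR idea from part (2) is essentially a data-structure change: entries are formed on demand from a shared slot pool rather than being fixed in number and width. A minimal software model, purely illustrative and not the hardware design, might look like:

```python
# Toy model of dynamically linked MSHRs: miss-status entries are formed on
# demand from a shared pool of slots, so the structure adapts to both
# primary-miss-heavy and secondary-miss-heavy workloads.

class DLMSHR:
    def __init__(self, total_slots):
        self.free_slots = total_slots
        self.entries = {}          # cache line -> list of pending requests

    def on_miss(self, line, request):
        """Record a miss; returns 'primary', 'secondary', or 'stall'."""
        if self.free_slots == 0:
            return "stall"                       # slot pool exhausted
        self.free_slots -= 1
        if line in self.entries:
            self.entries[line].append(request)   # secondary miss: grow entry
            return "secondary"
        self.entries[line] = [request]           # primary miss: new entry
        return "primary"

    def on_fill(self, line):
        """Memory returned the line: release all slots linked to its entry."""
        self.free_slots += len(self.entries.pop(line, []))
```

Because slots are shared, a workload with many secondary misses consumes few entries with long request lists, while a scattered workload forms many single-slot entries, which is the self-adaptation the abstract describes.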
The Virtual Block Interface: A Flexible Alternative to the Conventional Virtual Memory Framework
Computers continue to diversify with respect to system designs, emerging
memory technologies, and application memory demands. Unfortunately, continually
adapting the conventional virtual memory framework to each possible system
configuration is challenging, and often results in performance loss or requires
non-trivial workarounds. To address these challenges, we propose a new virtual
memory framework, the Virtual Block Interface (VBI). We design VBI based on the
key idea that delegating memory management duties to hardware can reduce the
overheads and software complexity associated with virtual memory. VBI
introduces a set of variable-sized virtual blocks (VBs) to applications. Each
VB is a contiguous region of the globally-visible VBI address space, and an
application can allocate each semantically meaningful unit of information
(e.g., a data structure) in a separate VB. VBI decouples access protection from
memory allocation and address translation. While the OS controls which programs
have access to which VBs, dedicated hardware in the memory controller manages
the physical memory allocation and address translation of the VBs. This
approach enables several architectural optimizations to (1) efficiently and
flexibly cater to different and increasingly diverse system configurations, and
(2) eliminate key inefficiencies of conventional virtual memory. We demonstrate
the benefits of VBI with two important use cases: (1) reducing the overheads of
address translation (for both native execution and virtual machine
environments), as VBI reduces the number of translation requests and associated
memory accesses; and (2) two heterogeneous main memory architectures, where VBI
increases the effectiveness of managing fast memory regions. For both cases,
VBI significantly improves performance over conventional virtual memory.
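The central decoupling in VBI (the OS controls only access rights to virtual blocks, while memory-controller hardware owns allocation and translation) can be sketched in a few lines. This is a loose software analogy with invented names, not VBI's actual interface:

```python
# Illustrative split of duties in a VBI-like design: the OS tracks only
# which process may touch which virtual block (VB); a memory-controller
# model performs lazy physical allocation and address translation.

class MemoryController:
    def __init__(self):
        self.next_frame = 0
        self.translation = {}      # (vb_id, page) -> physical frame

    def translate(self, vb_id, page):
        # Allocate physical memory lazily, on first touch of a VB page.
        if (vb_id, page) not in self.translation:
            self.translation[(vb_id, page)] = self.next_frame
            self.next_frame += 1
        return self.translation[(vb_id, page)]

class ProtectionTable:
    """The OS side: pure access control, no translation state."""
    def __init__(self):
        self.permissions = set()   # (pid, vb_id) pairs allowed to access

    def grant(self, pid, vb_id):
        self.permissions.add((pid, vb_id))

    def check(self, pid, vb_id):
        return (pid, vb_id) in self.permissions

def access(prot, mc, pid, vb_id, page):
    if not prot.check(pid, vb_id):
        raise PermissionError("VB not attached to this process")
    return mc.translate(vb_id, page)
```

Note that the protection check never consults translation state, which is what lets the hardware reorganize physical placement without OS involvement.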
Evaluation of Distributed Programming Models and Extensions to Task-based Runtime Systems
High Performance Computing (HPC) has always been a key foundation for scientific simulation and discovery. More recently, the training of deep learning models has further accelerated the demand for computational power and lower-precision arithmetic. In this era following the end of Dennard scaling, when Moore's Law seemingly still holds true to a lesser extent, it is no coincidence that HPC systems are equipped with multi-core CPUs and a variety of massively parallel hardware accelerators. Coupled with interconnect speed improvements lagging behind those of computational power, the current state of HPC systems is heterogeneous and extremely complex.
This is both a great challenge to software stacks and their ability to extract performance from these systems, and a great opportunity to innovate at the programming-model level, exploring different approaches and proposing new solutions. With usability, portability, and performance as the main factors to consider, this dissertation first evaluates the ability of several widely used parallel programming models (MPI, MPI+OpenMP, and task-based runtime systems) to manage the load imbalance among the processes computing the LU factorization of a large dense matrix stored in the Block Low-Rank (BLR) format.
Next, I propose a number of optimizations and implement them in PaRSEC's Dynamic Task Discovery (DTD) model, including user-level graph trimming and direct Application Programming Interface (API) calls for data broadcast operations, to further extend the limits of the STF model. The Parameterized Task Graph (PTG) approach in PaRSEC, on the other hand, is the most scalable approach for many different applications; I therefore explore the possibility of combining the algorithmic benefits of Communication-Avoiding (CA) methods with the communication-computation overlapping provided by runtime systems, using a 2D five-point stencil as the test case. This broad evaluation and extension of programming models highlights the ability of task-based runtime systems to achieve scalable performance and portability on contemporary heterogeneous HPC systems. Finally, I summarize the profiling capability of the PaRSEC runtime system and demonstrate with a use case its important role in identifying performance bottlenecks that lead to optimizations.
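The "user-level graph trimming" optimization mentioned above amounts to pruning a task DAG before submission: only tasks whose results are actually needed, directly or transitively, are kept. A generic illustration (not PaRSEC's API) of that reachability pass:

```python
# Generic task-graph trimming sketch: walk backwards from the required
# outputs through the dependency edges and keep only reachable tasks.

def trim(deps, needed):
    """deps: task -> list of predecessor tasks it depends on.
    needed: iterable of tasks whose results are required.
    Returns the set of tasks that must actually be executed."""
    keep, stack = set(), list(needed)
    while stack:
        task = stack.pop()
        if task not in keep:
            keep.add(task)
            stack.extend(deps.get(task, []))   # visit predecessors
    return keep
```

In a discovery-based model like DTD, dropping the unreachable tasks at the user level spares the runtime from inserting and then discarding them, which is where the scalability benefit comes from.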
GPU accelerated path tracing of massive scenes
This article presents a solution to path tracing of massive scenes on multiple GPUs. Our approach analyzes the memory access pattern of a path tracer and defines how the scene data should be distributed across up to 16 GPUs with minimal effect on performance. The key concept is that the parts of the scene with the highest number of memory accesses are replicated on all GPUs.
We propose two methods for maximizing the performance of path tracing when working with partially distributed scene data. Both methods work at the memory management level, so path tracer data structures do not have to be redesigned, making our approach applicable to other path tracers with only minor changes to their code. As a proof of concept, we have enhanced the open-source Blender Cycles path tracer.
The approach was validated on scenes of sizes up to 169 GB. We show that only 1-5% of the scene data needs to be replicated to all machines for such large scenes. On smaller scenes we have verified that the performance is very close to rendering a fully replicated scene. In terms of scalability, we have achieved a parallel efficiency of over 94% using up to 16 GPUs.
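The distribution rule described above (replicate the most frequently accessed scene parts everywhere, distribute the rest) can be sketched as a greedy plan. The chunking granularity, the accesses-per-byte ranking, and the budget parameter are assumptions of this sketch, not details from the article:

```python
# Illustrative distribution planner: replicate the hottest scene chunks on
# every GPU until a replication budget (fraction of total scene bytes) is
# spent; everything else is distributed across GPUs.

def plan_distribution(chunk_sizes, access_counts, replicate_fraction):
    """chunk_sizes / access_counts: dicts keyed by chunk id.
    Returns (replicated_chunks, distributed_chunks)."""
    total = sum(chunk_sizes.values())
    budget = replicate_fraction * total
    replicated, distributed, used = [], [], 0
    # Rank chunks by memory-access density (accesses per byte), hottest first.
    ranked = sorted(chunk_sizes,
                    key=lambda c: access_counts[c] / chunk_sizes[c],
                    reverse=True)
    for cid in ranked:
        if used + chunk_sizes[cid] <= budget:
            replicated.append(cid)
            used += chunk_sizes[cid]
        else:
            distributed.append(cid)
    return replicated, distributed
```

With a small budget, only compact, heavily traversed structures (e.g. acceleration-structure nodes) end up replicated, matching the article's observation that a few percent of the data absorbs most accesses.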
Parallelism Management for Concurrently Running Parallel Applications
Thesis (Ph.D.) -- Seoul National University Graduate School: College of Engineering, Dept. of Electrical and Computer Engineering, 2020. 8. Bernhard Egger.
Running multiple parallel jobs on the same multicore machine is becoming more important for improving utilization of the given hardware resources. While co-location of parallel jobs is common practice, it remains a challenge for current parallel runtime systems to execute multiple parallel applications simultaneously and efficiently. Conventional parallelization runtimes such as OpenMP generate a fixed number of worker threads, typically as many as there are cores in the system, to utilize all physical core resources. On such runtime systems, applications may not achieve their peak performance even when given full use of all physical cores. Moreover, the OS kernel must manage all worker threads generated by all running parallel applications, which can incur large management costs as the number of co-located applications grows.
In this thesis, we focus on improving runtime performance for co-located parallel applications. To achieve this goal, the first idea of this work is to use spatial scheduling to execute multiple co-located parallel applications simultaneously. Spatial scheduling, which provides distinct core resources to applications, is considered a promising and scalable approach for executing co-located applications. Despite its growing importance, there remain two fundamental research issues with this approach. First, spatial scheduling requires runtime support for parallel applications to run efficiently under spatial core allocations that can change at runtime. Second, the scheduler needs to assign the proper number of core resources to each application, depending on the applications' performance characteristics, to achieve better runtime performance.
To this end, this thesis presents three novel runtime-level techniques to efficiently execute co-located parallel applications with spatial scheduling. First, we present a cooperative runtime technique that provides malleable parallel execution for OpenMP applications. Malleable execution means that applications can dynamically adapt their degree of parallelism to varying core resource availability. It allows parallel applications to run efficiently under changing core resource availability, in contrast to conventional runtime systems that do not adjust an application's degree of parallelism. Second, this thesis introduces an analytical performance model that can estimate resource utilization and the performance of parallel programs as a function of the provided core resources. We observe that the performance of parallel loops is typically limited by memory performance, and we employ queueing theory to model it. The queueing-system-based approach allows us to estimate performance using closed-form equations and hardware performance counters.
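The flavor of such a closed-form, memory-limited speedup estimate can be caricatured in one formula. The sketch below models memory as a single saturating server; it is a deliberate simplification for illustration, not the thesis's hierarchical queueing model, and the parameter names are invented:

```python
# Caricature of a memory-limited speedup model: each core issues memory
# requests at req_rate; the memory system serves at most service_rate
# requests per unit time. Below saturation the loop scales linearly with
# cores; beyond it, throughput (and hence speedup) is pinned by memory.

def predicted_speedup(cores, req_rate, service_rate):
    return min(cores, service_rate / req_rate)
```

The appeal of closed-form models of this kind is that both parameters can be derived from hardware performance counters at runtime, with no offline profiling.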
Third, we present a core allocation framework to manage core resources between co-located parallel applications. With analytical modeling, we observe that maximizing both CPU utilization and memory bandwidth usage can generally lead to better performance compared to conventional core allocation policies that maximize only CPU usage. The presented core allocation framework optimizes utilization of multi-dimensional resources of CPU cores and memory bandwidth on multi-socket multicore systems based on the cooperative parallel runtime support and the analytical model.
1 Introduction 1
1.1 Motivation 1
1.2 Background 5
1.2.1 The OpenMP Runtime System 5
1.2.2 Target Multi-Socket Multicore Systems 7
1.3 Contributions 8
1.3.1 Cooperative Runtime Systems 9
1.3.2 Performance Modeling 9
1.3.3 Parallelism Management 10
1.4 Related Work 11
1.4.1 Cooperative Runtime Systems 11
1.4.2 Performance Modeling 12
1.4.3 Parallelism Management 14
1.5 Organization of this Thesis 15
2 Dynamic Spatial Scheduling with Cooperative Runtime Systems 17
2.1 Overview 17
2.2 Malleable Workloads 19
2.3 Cooperative OpenMP Runtime System 21
2.3.1 Cooperative User-Level Tasking 22
2.3.2 Cooperative Dynamic Loop Scheduling 27
2.4 Experimental Results 30
2.4.1 Standalone Application Performance 30
2.4.2 Performance in Spatial Core Allocation 33
2.5 Discussion 35
2.5.1 Contributions 35
2.5.2 Limitations and Future Work 36
2.5.3 Summary 37
3 Performance Modeling of Parallel Loops using Queueing Systems 38
3.1 Overview 38
3.2 Background 41
3.2.1 Queueing Models 41
3.2.2 Insights on Performance Modeling of Parallel Loops 43
3.2.3 Performance Analysis 46
3.3 Queueing Systems for Multi-Socket Multicores 54
3.3.1 Hierarchical Queueing Systems 54
3.3.2 Computing the Parameter Values 60
3.4 The Speedup Prediction Model 63
3.4.1 The Speedup Model 63
3.4.2 Implementation 64
3.5 Evaluation 65
3.5.1 64-core AMD Opteron Platform 66
3.5.2 72-core Intel Xeon Platform 68
3.6 Discussion 70
3.6.1 Applicability of the Model 70
3.6.2 Limitations of the Model 72
3.6.3 Summary 73
4 Maximizing System Utilization via Parallelism Management 74
4.1 Overview 74
4.2 Background 76
4.2.1 Modeling Performance Metrics 76
4.2.2 Our Resource Management Policy 79
4.3 NuPoCo: Parallelism Management for Co-Located Parallel Loops 82
4.3.1 Online Performance Model 82
4.3.2 Managing Parallelism 86
4.4 Evaluation of NuPoCo 90
4.4.1 Evaluation Scenario 1 90
4.4.2 Evaluation Scenario 2 98
4.5 MOCA: An Evolutionary Approach to Core Allocation 103
4.5.1 Evolutionary Core Allocation 104
4.5.2 Model-Based Allocation 106
4.6 Evaluation of MOCA 113
4.7 Discussion 118
4.7.1 Contributions and Limitations 118
4.7.2 Summary 119
5 Conclusion and Future Work 120
5.1 Conclusion 120
5.2 Future work 122
5.2.1 Improving Multi-Objective Core Allocation 122
5.2.2 Co-Scheduling of Parallel Jobs for HPC Systems 123
A Additional Experiments for the Performance Model 124
A.1 Memory Access Distribution and Poisson Distribution 124
A.1.1 Memory Access Distribution 124
A.1.2 Kolmogorov-Smirnov Test 127
A.2 Additional Performance Modeling Results 134
A.2.1 Results with Intel Hyperthreading 134
A.2.2 Results with Cooperative User-Level Tasking 134
A.2.3 Results with Other Loop Schedulers 138
A.2.4 Results with Different Number of Memory Nodes 138
B Other Research Contributions of the Author 141
B.1 Compiler and Runtime Support for Integrated CPU-GPU Systems 141
B.2 Modeling NUMA Architectures with Stochastic Tool 143
B.3 Runtime Environment for a Manycore Architecture 143
Abstract (in Korean) 159
Acknowledgements 161