Design and resource management of reconfigurable multiprocessors for data-parallel applications
FPGA (Field-Programmable Gate Array)-based custom reconfigurable computing machines have established themselves as low-cost and low-risk alternatives to ASIC (Application-Specific Integrated Circuit) implementations and general-purpose microprocessors in accelerating a wide range of computation-intensive applications. Most often they are Application-Specific Programmable Circuits (ASPCs), which are developer-programmable rather than user-programmable. The major disadvantages of ASPCs are minimal programmability and the significant time and energy overheads of the hardware reconfiguration required when the problem size exceeds the available reconfigurable resources; these problems are expected to become more serious as FPGA chip sizes increase. On the other hand, dominant high-performance computing systems, such as PC clusters and SMPs (Symmetric Multiprocessors), suffer from high communication latencies and/or scalability problems.
This research introduces low-cost, user-programmable and reconfigurable MultiProcessor-on-a-Programmable-Chip (MPoPC) systems for high-performance computing. It also proposes a resource management framework that deals with performance, power consumption and energy issues. These semi-customized systems significantly reduce runtime device reconfiguration by employing user-programmable processing elements that are reusable for different tasks in large, complex applications. For the sake of illustration, two different types of MPoPCs with hardware FPUs (floating-point units) are designed and implemented for credible performance evaluation and modeling: the coarse-grain MIMD (Multiple-Instruction, Multiple-Data) CG-MPoPC machine based on a processor IP (Intellectual Property) core, and the mixed-mode (MIMD, SIMD or M-SIMD) variant-grain HERA (HEterogeneous Reconfigurable Architecture) machine. In addition to alleviating the above difficulties, MPoPCs can offer several performance and energy advantages to our data-parallel applications when compared to ASPCs; they are simpler and more scalable, and require less verification time and cost. Various common computation-intensive benchmark algorithms, such as matrix-matrix multiplication (MMM) and LU factorization, are studied and their parallel solutions are shown for the two MPoPCs. The performance is evaluated with large sparse real-world matrices, primarily from power engineering. We expect even further performance gains on MPoPCs in the near future as FPGAs continue to improve. The innovative nature of this work has the potential to guide research in this emerging field of high-performance, low-cost reconfigurable computing.
The largest advantage of reconfigurable logic lies in its high degree of hardware customization and reconfiguration, which allows resources to be reused to match the computation and communication needs of applications. Therefore, a major effort in the presented design methodology for mixed-mode MPoPCs, like HERA, is devoted to effective resource management. A two-phase approach is applied. A mixed-mode weighted Task Flow Graph (w-TFG) is first constructed for any given application, where tasks are classified according to their most appropriate computing mode (e.g., SIMD or MIMD). At compile time, an architecture is customized and synthesized for the TFG using an Integer Linear Programming (ILP) formulation and a parameterized hardware component library. Various run-time scheduling schemes with different performance-energy objectives are proposed. A system-level energy model for HERA, which is based on low-level implementation data and run-time statistics, is proposed to guide performance-energy trade-off decisions. A parallel power flow analysis technique based on Newton's method is proposed and employed to verify the methodology.
Coarse-grained reconfigurable array architectures
Coarse-Grained Reconfigurable Array (CGRA) architectures accelerate the same inner loops that benefit from the high ILP support in VLIW architectures. By executing non-loop code on other cores, however, CGRAs can focus on such loops to execute them more efficiently. This chapter discusses the basic principles of CGRAs, and the wide range of design options available to a CGRA designer, covering a large number of existing CGRA designs. The impact of different options on flexibility, performance, and power-efficiency is discussed, as well as the need for compiler support. The ADRES CGRA design template is studied in more detail as a use case to illustrate the need for design space exploration, for compiler support and for the manual fine-tuning of source code.
The hArtes Tool Chain
This chapter describes the different design steps needed to go from legacy code to a transformed application that can be efficiently mapped on the hArtes platform
Supernode Transformation On Parallel Systems With Distributed Memory - An Analytical Approach
Supernode transformation, or tiling, is a technique that partitions algorithms to improve data locality and parallelism by balancing computation and inter-processor communication costs to achieve the shortest execution time. It groups multiple iterations of nested loops into supernodes that are assigned to processors for parallel processing. A supernode transformation can be described by supernode size and shape. This research focuses on supernode transformation on multi-processor architectures with distributed memory, including computer cluster systems and General-Purpose Graphics Processing Units (GPGPUs). The research involves supernode scheduling, supernode mapping to processors, and finding the optimal supernode size that achieves the shortest total running time. The algorithms considered are two nested loops with regular data dependencies. The Longest Common Subsequence problem is used as an illustration. A novel mathematical model for the total running time is established as a function of the supernode size, algorithm parameters such as the problem size and the data dependence, the computation time of each loop iteration, architecture parameters such as the number of processors, and the communication cost. The optimal supernode size is derived from this closed-form model. The model and the optimal supernode size provide better results than previous research and are verified by simulations on multi-processor systems including computer cluster systems and GPGPUs.
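As an illustration of the tiling idea described in this abstract, the following is a minimal sequential sketch, not the authors' implementation. It computes the Longest Common Subsequence length by visiting square supernodes along anti-diagonal waves; the function name `lcs_length_tiled` and the `tile` parameter are hypothetical, and no processor mapping or communication cost is modeled.

```python
def lcs_length_tiled(a: str, b: str, tile: int = 4) -> int:
    """Return the LCS length of a and b, filling the DP table tile by tile."""
    n, m = len(a), len(b)
    # dp[i][j] holds the LCS length of a[:i] and b[:j]; row 0 and column 0 stay 0.
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    tiles_i = (n + tile - 1) // tile  # number of tile rows
    tiles_j = (m + tile - 1) // tile  # number of tile columns
    # Tiles with the same ti + tj form one wave. They only depend on tiles of
    # earlier waves, so a parallel version could assign them to different processors.
    for wave in range(tiles_i + tiles_j - 1):
        for ti in range(max(0, wave - tiles_j + 1), min(tiles_i, wave + 1)):
            tj = wave - ti
            for i in range(ti * tile + 1, min((ti + 1) * tile, n) + 1):
                for j in range(tj * tile + 1, min((tj + 1) * tile, m) + 1):
                    if a[i - 1] == b[j - 1]:
                        dp[i][j] = dp[i - 1][j - 1] + 1
                    else:
                        dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[n][m]
```

The result does not depend on `tile`; in a distributed setting the tile size would instead trade per-tile computation against the number of boundary exchanges, which is the quantity the abstract's model optimizes.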
Parallelism Management for Co-Located Parallel Applications
Thesis (Ph.D.) -- Seoul National University Graduate School: College of Engineering, Department of Electrical and Computer Engineering, August 2020. Bernhard Egger.
Running multiple parallel jobs on the same multicore machine is becoming more important for improving utilization of the available hardware resources. While co-location of parallel jobs is common practice, efficiently executing multiple parallel applications simultaneously remains a challenge for current parallel runtime systems. Conventional parallelization runtimes such as OpenMP generate a fixed number of worker threads, typically as many as there are cores in the system, to utilize all physical core resources. On such runtime systems, applications may not achieve their peak performance even when given full use of all physical core resources. Moreover, the OS kernel needs to manage all worker threads generated by all running parallel applications, which may incur large management costs as the number of co-located applications increases.
In this thesis, we focus on improving runtime performance for co-located parallel applications. To achieve this goal, the first idea of this work is to use spatial scheduling to execute multiple co-located parallel applications simultaneously. Spatial scheduling, which provides distinct core resources to each application, is considered a promising and scalable approach for executing co-located applications. Despite the growing importance of spatial scheduling, there are still two fundamental research issues with this approach. First, spatial scheduling requires runtime support that lets parallel applications run efficiently under a spatial core allocation that can change at runtime. Second, the scheduler needs to assign the proper number of cores to each application, depending on the application's performance characteristics, for better runtime performance.
To this end, we present three novel runtime-level techniques to efficiently execute co-located parallel applications with spatial scheduling. First, we present a cooperative runtime technique that provides malleable parallel execution for OpenMP parallel applications. Malleable execution means that applications can dynamically adapt their degree of parallelism to the varying core resource availability. It allows parallel applications to run efficiently under changing core resource availability, in contrast to conventional runtime systems that do not adjust the degree of parallelism of the application. Second, this thesis introduces an analytical performance model that can estimate the resource utilization and the performance of parallel programs as a function of the provided core resources. We observe that the performance of parallel loops is typically limited by memory performance, and employ queueing theory to model the memory performance. The queueing system-based approach allows us to estimate the performance by using closed-form equations and hardware performance counters.
Third, we present a core allocation framework to manage core resources between co-located parallel applications. With analytical modeling, we observe that maximizing both CPU utilization and memory bandwidth usage can generally lead to better performance compared to conventional core allocation policies that maximize only CPU usage. The presented core allocation framework optimizes utilization of multi-dimensional resources of CPU cores and memory bandwidth on multi-socket multicore systems based on the cooperative parallel runtime support and the analytical model.
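The queueing-based performance modeling mentioned in this abstract can be illustrated with a much simpler stand-in. The sketch below is not the thesis's hierarchical model: it applies textbook Mean Value Analysis to a closed system with a single memory queue, where each core computes for `think_time` and then issues one memory request served in `service_time`. Both parameter names and both functions are hypothetical; in practice such values would have to be derived from measurements such as hardware performance counters.

```python
def mva_throughput(cores: int, think_time: float, service_time: float) -> float:
    """Throughput (requests per time unit) of `cores` cores sharing one memory queue."""
    queue_len = 0.0    # mean number of requests at the memory queue
    throughput = 0.0
    for n in range(1, cores + 1):
        # An arriving request waits for the requests already queued, then is served.
        response = service_time * (1.0 + queue_len)
        throughput = n / (think_time + response)
        queue_len = throughput * response  # Little's law
    return throughput


def predicted_speedup(cores: int, think_time: float, service_time: float) -> float:
    """Speedup of `cores` cores over one core under the single-queue assumption."""
    base = mva_throughput(1, think_time, service_time)
    return mva_throughput(cores, think_time, service_time) / base
```

Under these assumptions the predicted speedup never exceeds `(think_time + service_time) / service_time`, which matches the abstract's observation that memory performance limits the scaling of parallel loops.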
1 Introduction
1.1 Motivation
1.2 Background
1.2.1 The OpenMP Runtime System
1.2.2 Target Multi-Socket Multicore Systems
1.3 Contributions
1.3.1 Cooperative Runtime Systems
1.3.2 Performance Modeling
1.3.3 Parallelism Management
1.4 Related Work
1.4.1 Cooperative Runtime Systems
1.4.2 Performance Modeling
1.4.3 Parallelism Management
1.5 Organization of this Thesis
2 Dynamic Spatial Scheduling with Cooperative Runtime Systems
2.1 Overview
2.2 Malleable Workloads
2.3 Cooperative OpenMP Runtime System
2.3.1 Cooperative User-Level Tasking
2.3.2 Cooperative Dynamic Loop Scheduling
2.4 Experimental Results
2.4.1 Standalone Application Performance
2.4.2 Performance in Spatial Core Allocation
2.5 Discussion
2.5.1 Contributions
2.5.2 Limitations and Future Work
2.5.3 Summary
3 Performance Modeling of Parallel Loops using Queueing Systems
3.1 Overview
3.2 Background
3.2.1 Queueing Models
3.2.2 Insights on Performance Modeling of Parallel Loops
3.2.3 Performance Analysis
3.3 Queueing Systems for Multi-Socket Multicores
3.3.1 Hierarchical Queueing Systems
3.3.2 Computing the Parameter Values
3.4 The Speedup Prediction Model
3.4.1 The Speedup Model
3.4.2 Implementation
3.5 Evaluation
3.5.1 64-core AMD Opteron Platform
3.5.2 72-core Intel Xeon Platform
3.6 Discussion
3.6.1 Applicability of the Model
3.6.2 Limitations of the Model
3.6.3 Summary
4 Maximizing System Utilization via Parallelism Management
4.1 Overview
4.2 Background
4.2.1 Modeling Performance Metrics
4.2.2 Our Resource Management Policy
4.3 NuPoCo: Parallelism Management for Co-Located Parallel Loops
4.3.1 Online Performance Model
4.3.2 Managing Parallelism
4.4 Evaluation of NuPoCo
4.4.1 Evaluation Scenario 1
4.4.2 Evaluation Scenario 2
4.5 MOCA: An Evolutionary Approach to Core Allocation
4.5.1 Evolutionary Core Allocation
4.5.2 Model-Based Allocation
4.6 Evaluation of MOCA
4.7 Discussion
4.7.1 Contributions and Limitations
4.7.2 Summary
5 Conclusion and Future Work
5.1 Conclusion
5.2 Future Work
5.2.1 Improving Multi-Objective Core Allocation
5.2.2 Co-Scheduling of Parallel Jobs for HPC Systems
A Additional Experiments for the Performance Model
A.1 Memory Access Distribution and Poisson Distribution
A.1.1 Memory Access Distribution
A.1.2 Kolmogorov-Smirnov Test
A.2 Additional Performance Modeling Results
A.2.1 Results with Intel Hyperthreading
A.2.2 Results with Cooperative User-Level Tasking
A.2.3 Results with Other Loop Schedulers
A.2.4 Results with Different Number of Memory Nodes
B Other Research Contributions of the Author
B.1 Compiler and Runtime Support for Integrated CPU-GPU Systems
B.2 Modeling NUMA Architectures with Stochastic Tool
B.3 Runtime Environment for a Manycore Architecture
Abstract (in Korean)
Acknowledgements
Selective Vectorization for Short-Vector Instructions
Multimedia extensions are nearly ubiquitous in today's general-purpose processors. These extensions consist primarily of a set of short-vector instructions that apply the same opcode to a vector of operands. Vector instructions introduce a data-parallel component to processors that exploit instruction-level parallelism, and present an opportunity for increased performance. In fact, ignoring a processor's vector opcodes can leave a significant portion of the available resources unused. In order for software developers to find short-vector instructions generally useful, however, the compiler must target these extensions with complete transparency and consistent performance. This paper describes selective vectorization, a technique for balancing computation across a processor's scalar and vector units. Current approaches for targeting short-vector instructions directly adopt vectorizing technology first developed for supercomputers. Traditional vectorization, however, can degrade performance since it fails to account for a processor's scalar resources. We formulate selective vectorization in the context of software pipelining. Our approach creates software pipelines with shorter initiation intervals and, therefore, higher performance. A key aspect of selective vectorization is its ability to manage the transfer of operands between vector and scalar instructions. Even when operand transfer is expensive, our technique is sufficiently sophisticated to achieve significant performance gains. We evaluate selective vectorization on a set of SPEC FP benchmarks. On a realistic VLIW processor model, the approach achieves whole-program speedups of up to 1.35x over existing approaches. For individual loops, it provides speedups of up to 1.75x.
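The balance between scalar and vector units described in this abstract can be made concrete with a toy resource bound. The sketch below is not the paper's algorithm: it assumes a loop body of `n_ops` identical, independent operations, a vector width of `width` iterations, and it ignores operand transfer and dependence constraints. All names are hypothetical. It searches for the number of vectorized operations that minimizes the resource-constrained initiation interval of a loop body covering `width` iterations.

```python
def res_mii(n_ops: int, n_vectorized: int, width: int,
            scalar_units: int, vector_units: int) -> int:
    """Resource-bound initiation interval (cycles per `width` iterations)."""
    n_scalar = n_ops - n_vectorized
    # An operation left scalar must be issued once per iteration, i.e. `width` times.
    scalar_cycles = -(-(width * n_scalar) // scalar_units)  # ceiling division
    # A vectorized operation is issued once for all `width` iterations.
    vector_cycles = -(-n_vectorized // vector_units)
    return max(scalar_cycles, vector_cycles)


def best_split(n_ops: int, width: int,
               scalar_units: int, vector_units: int) -> tuple[int, int]:
    """Return (initiation interval, vectorized op count) with the smallest interval."""
    return min(
        (res_mii(n_ops, k, width, scalar_units, vector_units), k)
        for k in range(n_ops + 1)
    )
```

For example, with six operations, a vector width of 2, two scalar units and one vector unit, both full vectorization and no vectorization give an interval of 6 cycles, while `best_split(6, 2, 2, 1)` vectorizes three operations and reaches 3 cycles.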