287 research outputs found

    An Efficient OpenMP Runtime System for Hierarchical Architectures

    Get PDF
    Exploiting the full computational power of ever-deeper hierarchical multiprocessor machines requires a very careful distribution of threads and data over the underlying non-uniform architecture. The emergence of multi-core chips and NUMA machines makes it important to minimize the number of remote memory accesses, to favor cache affinities, and to guarantee fast completion of synchronization steps. By using the BubbleSched platform as a threading backend for the GOMP OpenMP compiler, we are able to easily transpose affinities of thread teams into scheduling hints using abstractions called bubbles. We then propose a scheduling strategy suited to nested OpenMP parallelism. Preliminary performance evaluations show a significant speedup improvement on a typical NAS OpenMP benchmark application.
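The bubble idea can be pictured with a tiny placement model (illustrative Python, not the BubbleSched API; machine and thread names are hypothetical): each nested OpenMP team becomes a bubble, and the scheduler keeps all threads of one bubble on the same NUMA node.

```python
# Toy machine: two NUMA nodes with two cores each (hypothetical names).
machine = {"node0": ["c0", "c1"], "node1": ["c2", "c3"]}

def schedule(bubbles, nodes):
    """Place each bubble (thread team) on a single NUMA node so that
    threads sharing data keep cache and memory affinity; bubbles are
    spread round-robin across nodes to balance load."""
    placement = {}
    node_names = list(nodes)
    for i, bubble in enumerate(bubbles):
        node = node_names[i % len(node_names)]
        cores = nodes[node]
        for j, thread in enumerate(bubble):
            placement[thread] = (node, cores[j % len(cores)])
    return placement

teams = [["t0", "t1"], ["t2", "t3"]]  # two nested OpenMP teams -> two bubbles
placement = schedule(teams, machine)
```

Threads of the same team end up co-located on one node while the two teams land on different nodes, which is the affinity hint the abstract describes.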

    An efficient multi-core implementation of a novel HSS-structured multifrontal solver using randomized sampling

    Full text link
    We present a sparse linear system solver that is based on a multifrontal variant of Gaussian elimination, and exploits low-rank approximation of the resulting dense frontal matrices. We use hierarchically semiseparable (HSS) matrices, which have low-rank off-diagonal blocks, to approximate the frontal matrices. For HSS matrix construction, a randomized sampling algorithm is used together with interpolative decompositions. The combination of the randomized compression with a fast ULV HSS factorization leads to a solver with lower computational complexity than the standard multifrontal method for many applications, resulting in speedups of up to 7-fold for problems in our test suite. The implementation targets many-core systems by using task parallelism with dynamic runtime scheduling. Numerical experiments show performance improvements over state-of-the-art sparse direct solvers. The implementation achieves high performance and good scalability on a range of modern shared memory parallel systems, including the Intel Xeon Phi (MIC). The code is part of a software package called STRUMPACK -- STRUctured Matrices PACKage, which also has a distributed memory component for dense rank-structured matrices.
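The randomized compression step can be illustrated with a minimal randomized range finder (plain Python for readability; the solver itself applies this block-wise to frontal matrices together with interpolative decompositions): sample Y = A·Ω with a Gaussian Ω, orthonormalize the samples, and reconstruct A ≈ Q(QᵀA).

```python
import random

random.seed(0)

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def gram_schmidt(vectors):
    """Orthonormalize row vectors with modified Gram-Schmidt, dropping
    numerically dependent ones -- this reveals the sampled rank."""
    basis = []
    for v in vectors:
        w = list(v)
        for q in basis:
            d = sum(wi * qi for wi, qi in zip(w, q))
            w = [wi - d * qi for wi, qi in zip(w, q)]
        norm = sum(wi * wi for wi in w) ** 0.5
        if norm > 1e-8:
            basis.append([wi / norm for wi in w])
    return basis

def randomized_range(A, samples):
    """Sample the range of A: Y = A @ Omega with Gaussian Omega, then
    orthonormalize the columns of Y."""
    n = len(A[0])
    Omega = [[random.gauss(0.0, 1.0) for _ in range(samples)] for _ in range(n)]
    Y = matmul(A, Omega)
    return gram_schmidt([list(col) for col in zip(*Y)])  # columns of Y

# rank-2 test matrix built from two outer products
u, v = [1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 0.0, 1.0]
A = [[u[i] * u[j] + 2.0 * v[i] * v[j] for j in range(4)] for i in range(4)]

Qt = randomized_range(A, samples=4)                 # rows span range(A)
B = matmul(Qt, A)                                   # compressed factor Q^T A
A_approx = matmul([list(r) for r in zip(*Qt)], B)   # Q (Q^T A)
err = max(abs(x - y) for ra, rb in zip(A, A_approx) for x, y in zip(ra, rb))
```

For an exactly rank-2 matrix the sampling recovers a two-vector basis and the reconstruction is exact up to rounding; for HSS blocks the same idea yields low-rank factors of the off-diagonal blocks.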

    Reproducibility, accuracy and performance of the Feltor code and library on parallel computer architectures

    Get PDF
    Feltor is a modular and free scientific software package. It allows developing platform-independent code that runs on a variety of parallel computer architectures ranging from laptop CPUs to multi-GPU distributed memory systems. Feltor consists of both a numerical library and a collection of application codes built on top of the library. Its main targets are two- and three-dimensional drift- and gyro-fluid simulations, with discontinuous Galerkin methods as the main numerical discretization technique. We observe that numerical simulations of a recently developed gyro-fluid model produce non-deterministic results in parallel computations. First, we show how we restore accuracy and bitwise reproducibility algorithmically and programmatically. In particular, we adopt an implementation of the exactly rounded dot product based on long accumulators, which avoids accuracy losses especially in parallel applications. However, reproducibility and accuracy alone fail to indicate correct simulation behaviour. In fact, in the physical model slightly different initial conditions lead to vastly different end states. This behaviour translates to its numerical representation. Pointwise convergence, even in principle, becomes impossible for long simulation times. In the second part, we explore important performance tuning considerations. We identify latency and memory bandwidth as the main performance indicators of our routines. Based on these, we propose a parallel performance model that predicts the execution time of algorithms implemented in Feltor and test our model on a selection of parallel hardware architectures. We are able to predict the execution time with a relative error of less than 25% for problem sizes between 0.1 and 1000 MB. Finally, we find that the product of latency and bandwidth gives a minimum array size per compute node to achieve a scaling efficiency above 50% (both strong and weak).
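The reproducibility problem and its fix can be demonstrated in a few lines. Here Python's `math.fsum` stands in for the long-accumulator exactly rounded sum (Feltor's implementation works in C++/CUDA): a naive chunked reduction depends on how the data was partitioned between threads, while the exactly rounded result does not.

```python
import math

data = [1.0, 1e16, 1.0, -1e16] * 1000   # ill-conditioned sum; true value is 2000

def chunked_sum(values, nchunks):
    """Naive parallel-style reduction: partial sums per 'thread', then a
    combine step.  The result depends on how the data was partitioned."""
    chunks = [values[i::nchunks] for i in range(nchunks)]
    return sum(sum(chunk) for chunk in chunks)

naive = chunked_sum(data, 4)   # small addends are lost to rounding
exact = math.fsum(data)        # exactly rounded, independent of order
```

The naive reduction disagrees with the exactly rounded value, and different chunk counts can disagree with each other, which is precisely the non-determinism the paper eliminates.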

    Architecture aware parallel programming in Glasgow parallel Haskell (GPH)

    Get PDF
    General purpose computing architectures are evolving quickly to become manycore and hierarchical: i.e. a core can communicate more quickly locally than globally. To be effective on such architectures, programming models must be aware of the communications hierarchy. This thesis investigates a programming model that aims to share the responsibility of task placement, load balance, thread creation, and synchronisation between the application developer and the runtime system. The main contribution of this thesis is the development of four new architecture-aware constructs for Glasgow parallel Haskell that exploit information about task size and aim to reduce communication for small tasks, preserve data locality, or distribute large units of work. We define a semantics for the constructs that specifies the sets of PEs that each construct identifies, and we check four properties of the semantics using QuickCheck. We report a preliminary investigation of architecture-aware programming models that abstract over the new constructs. In particular, we propose architecture-aware evaluation strategies and skeletons. We investigate three common paradigms (data parallelism, divide-and-conquer, and nested parallelism) on hierarchical architectures with up to 224 cores. The results show that the architecture-aware programming model consistently delivers better speedup and scalability than existing constructs, together with a dramatic reduction in execution time variability. We present a comparison of functional multicore technologies that reports some of the first multicore results for the Feedback Directed Implicit Parallelism (FDIP) and semi-explicit parallelism (GpH and Eden) languages. The comparison reflects the growing maturity of the field by systematically evaluating four parallel Haskell implementations on a common multicore architecture. The comparison contrasts the programming effort each language requires with the parallel performance delivered. We investigate the minimum thread granularity required to achieve satisfactory performance for three parallel functional language implementations on a multicore platform. The results show that GHC-GUM requires a larger thread granularity than Eden and GHC-SMP, and that the required granularity rises as the number of cores rises.
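One way to picture what an architecture-aware construct decides (a hypothetical Python model, not GpH syntax; node names and the size threshold are invented for illustration): small tasks are confined to the home PE's node to keep communication cheap, while large tasks may spread across the whole machine.

```python
# PEs grouped by node in a two-level communication hierarchy
hierarchy = {"nodeA": [0, 1, 2, 3], "nodeB": [4, 5, 6, 7]}

def candidate_pes(task_size, home_pe, hierarchy, threshold=100):
    """Architecture-aware placement policy: small tasks stay on the home
    PE's node to reduce communication and preserve data locality; large
    units of work may be distributed over the whole machine."""
    home_node = next(n for n, pes in hierarchy.items() if home_pe in pes)
    if task_size < threshold:
        return hierarchy[home_node]
    return [pe for pes in hierarchy.values() for pe in pes]
```

The thesis's constructs identify such sets of PEs semantically; this sketch only shows the local-versus-global trade-off they encode.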

    Generalizing Hierarchical Parallelism

    Full text link
    Since the days of OpenMP 1.0, computer hardware has become more complex, typically by specializing compute units for coarse- and fine-grained parallelism in incrementally deeper hierarchies of parallelism. Newer versions of OpenMP reacted by introducing new mechanisms for querying or controlling its individual levels, each time adding another concept such as places, teams, and progress groups. In this paper we propose going back to the roots of OpenMP in the form of nested parallelism, for a simpler model and more flexible handling of arbitrarily deep hardware hierarchies. Comment: IWOMP'23 preprint
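The paper's premise, that plain nested parallelism can cover arbitrarily deep hardware hierarchies, can be sketched as a recursive work partitioner (illustrative Python, not OpenMP): each hierarchy level splits its share of the work once more, like one nested parallel region per level.

```python
def split_work(items, hierarchy):
    """Recursively partition work along a hardware hierarchy description:
    a list means an inner level with one child per entry (nested parallel
    region); an int is a leaf level with that many cores."""
    if isinstance(hierarchy, int):               # leaf: number of cores
        return [items[i::hierarchy] for i in range(hierarchy)]
    k = len(hierarchy)
    top = [items[i::k] for i in range(k)]        # one chunk per child
    return [split_work(chunk, child) for chunk, child in zip(top, hierarchy)]

# hypothetical machine: 2 sockets x 2 cores
parts = split_work(list(range(8)), [2, 2])
```

Deeper hierarchies just mean deeper nesting of the description, with no extra concepts such as places or teams.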

    Controllers: an abstraction to ease the use of hardware accelerators

    Get PDF
    Nowadays the use of hardware accelerators, such as graphics processing units or Xeon Phi coprocessors, is key in solving computationally costly problems that require high performance computing. However, programming efficient solutions for these kinds of devices is a very complex task that relies on the manual management of memory transfers and configuration parameters. The programmer has to carry out a deep study of the particular data that needs to be computed at each moment, across different computing platforms, also considering architectural details. We introduce the controller concept as an abstract entity that allows the programmer to easily manage the communications and kernel launching details on hardware accelerators in a transparent way. This model also provides the possibility of defining and launching central processing unit kernels in multi-core processors with the same abstraction and methodology used for the accelerators. It internally combines different native programming models and technologies to exploit the potential of each kind of device. Additionally, the model allows the programmer to simplify the proper selection of values for several configuration parameters that can be chosen when a kernel is launched. This is done through a qualitative characterization process of the kernel code to be executed. Finally, we present the implementation of the controller model in a prototype library, together with its application in several case studies. Its use has led to reductions in development and porting costs, with significantly low overheads in execution times when compared to manually programmed and optimized solutions that directly use CUDA and OpenMP. Funding: MICINN (Spain) and the ERDF program of the European Union, HomProg-HetSys project (TIN2014-58876-P), and COST Action IC1305: Network for Sustainable Ultrascale Computing (NESUS)
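The controller concept reduces to a single dispatching object (a hypothetical Python sketch, not the prototype library's API; the real controllers target CUDA/OpenMP backends and additionally handle data transfers and launch-parameter selection).

```python
class Controller:
    """Sketch of a controller: one entity that hides kernel launching
    and, in a real implementation, memory transfers and configuration
    parameters behind a single backend-neutral interface."""
    def __init__(self, backend="cpu"):
        self.backend = backend
        self.kernels = {}

    def kernel(self, fn):
        """Register a kernel under its function name."""
        self.kernels[fn.__name__] = fn
        return fn

    def launch(self, name, data, **config):
        # a real controller would also pick launch parameters (grid and
        # block sizes, queue depths) and move buffers to the device
        return self.kernels[name](data, **config)

ctrl = Controller("cpu")

@ctrl.kernel
def saxpy(data, a=2.0):
    """A CPU kernel; with a GPU backend the same call would launch on the device."""
    x, y = data
    return [a * xi + yi for xi, yi in zip(x, y)]

out = ctrl.launch("saxpy", ([1.0, 2.0], [3.0, 4.0]), a=2.0)
```

The point of the abstraction is that `launch` looks identical whichever backend the controller wraps, which is what lets the same code target CPUs and accelerators.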

    λ™μ‹œμ— μ‹€ν–‰λ˜λŠ” 병렬 처리 μ–΄ν”Œλ¦¬μΌ€μ΄μ…˜λ“€μ„ μœ„ν•œ 병렬성 관리

    Get PDF
    Ph.D. thesis -- Seoul National University, Department of Electrical and Computer Engineering, August 2020. Advisor: Bernhard Egger. Running multiple parallel jobs on the same multicore machine is becoming more important to improve utilization of the given hardware resources. While co-location of parallel jobs is common practice, it still remains a challenge for current parallel runtime systems to efficiently execute multiple parallel applications simultaneously. Conventional parallelization runtimes such as OpenMP generate a fixed number of worker threads, typically as many as there are cores in the system, to utilize all physical core resources. On such runtime systems, applications may not achieve their peak performance even when given full use of all physical core resources. Moreover, the OS kernel needs to manage all worker threads generated by all running parallel applications, which can incur large management costs as the number of co-located applications grows. In this thesis, we focus on improving runtime performance for co-located parallel applications. To achieve this goal, the first idea of this work is to use spatial scheduling to execute multiple co-located parallel applications simultaneously. Spatial scheduling, which provides distinct core resources to each application, is considered a promising and scalable approach for executing co-located applications. Despite its growing importance, there are still two fundamental research issues with this approach. First, spatial scheduling requires runtime support for parallel applications to run efficiently under spatial core allocations that can change at runtime. Second, the scheduler needs to assign the proper number of cores to each application, depending on the applications' performance characteristics, for better runtime performance. To this end, this thesis presents three novel runtime-level techniques to efficiently execute co-located parallel applications with spatial scheduling. 
First, we present a cooperative runtime technique that provides malleable parallel execution for OpenMP applications. Malleable execution means that applications can dynamically adapt their degree of parallelism to the varying core resource availability; it allows parallel applications to run efficiently under changing core allocations, in contrast to conventional runtime systems that do not adjust the application's degree of parallelism. Second, this thesis introduces an analytical performance model that estimates resource utilization and the performance of parallel programs as a function of the provided core resources. We observe that the performance of parallel loops is typically limited by memory performance, and employ queueing theory to model the memory system. The queueing-based approach allows us to estimate performance using closed-form equations and hardware performance counters. Third, we present a core allocation framework to manage core resources between co-located parallel applications. Through analytical modeling, we observe that maximizing both CPU utilization and memory bandwidth usage generally leads to better performance than conventional core allocation policies that maximize only CPU usage. The presented framework optimizes the utilization of the multi-dimensional resources of CPU cores and memory bandwidth on multi-socket multicore systems, building on the cooperative parallel runtime support and the analytical model.
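The memory-centred performance argument can be caricatured in a few lines (a toy saturation model with invented numbers, far simpler than the thesis's queueing-theoretic model): speedup grows linearly in the core count until the cores' aggregate memory demand reaches the shared service capacity.

```python
def predicted_speedup(n_cores, req_rate, service_rate):
    """Toy memory-contention model: each core issues req_rate memory
    requests/s; the memory system serves at most service_rate requests/s
    in total.  Below saturation, speedup is linear in n_cores; above it,
    the application is memory-bound and adding cores does not help."""
    return min(n_cores, service_rate / req_rate)

# hypothetical rates: 1e8 requests/s per core, 4e8 requests/s capacity
curve = [predicted_speedup(n, 1e8, 4e8) for n in (1, 2, 4, 8, 16)]
```

A scheduler using such a prediction would stop granting cores to an application once its curve flattens, which is the intuition behind allocating cores by both CPU and memory bandwidth utilization.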