370,055 research outputs found

    A Parallel Dual Fast Gradient Method for MPC Applications

    We propose a parallel adaptive constraint-tightening approach to solve a linear model predictive control problem for discrete-time systems, based on inexact numerical optimization algorithms and operator splitting methods. The underlying algorithm first splits the original problem into as many independent subproblems as the length of the prediction horizon. Then, our algorithm computes a solution for these subproblems in parallel by exploiting auxiliary tightened subproblems in order to certify the control law in terms of suboptimality and recursive feasibility, along with closed-loop stability of the controlled system. Compared to prior approaches based on constraint tightening, our algorithm computes the tightening parameter for each subproblem to handle the propagation of errors introduced by the parallelization of the original problem. Our simulations show the computational benefits of the parallelization, with positive impacts on performance and numerical conditioning, when compared with a recent nonparallel adaptive tightening scheme. Comment: This technical report is an extended version of the paper "A Parallel Dual Fast Gradient Method for MPC Applications" by the same authors, submitted to the 54th IEEE Conference on Decision and Control.
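    To make the dual step concrete, the following is a minimal sketch of a dual fast (accelerated projected) gradient method for a generic inequality-constrained QP, using hypothetical toy data; the paper's horizon splitting, parallel subproblems, and adaptive tightening are not reproduced here.

```python
import numpy as np

# Minimal sketch of a dual fast (accelerated projected) gradient method
# for the QP  min 0.5 x'Hx + g'x  s.t.  Ax <= b,  on hypothetical toy
# data. The paper's horizon splitting and adaptive constraint tightening
# are intentionally omitted.

rng = np.random.default_rng(0)
n, m = 4, 6
M = rng.standard_normal((n, n))
H = M @ M.T + n * np.eye(n)            # strongly convex Hessian
g = rng.standard_normal(n)
A = rng.standard_normal((m, n))
b = rng.random(m)                      # b > 0, so x = 0 is strictly feasible

Hinv = np.linalg.inv(H)
L = np.linalg.norm(A @ Hinv @ A.T, 2)  # Lipschitz constant of the dual gradient

lam = np.zeros(m)                      # dual multipliers
y = lam.copy()                         # extrapolation point
t = 1.0
for _ in range(500):
    x = -Hinv @ (g + A.T @ y)          # minimizer of the Lagrangian at y
    lam_next = np.maximum(y + (A @ x - b) / L, 0.0)         # projected ascent
    t_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
    y = lam_next + ((t - 1.0) / t_next) * (lam_next - lam)  # Nesterov momentum
    lam, t = lam_next, t_next

x = -Hinv @ (g + A.T @ lam)            # primal solution recovered from duals
print("max constraint violation:", (A @ x - b).max())
```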

    Machine Learning for Run-Time Energy Optimisation in Many-Core Systems

    In recent years, the focus of computing has moved away from performance-centric serial computation to energy-efficient parallel computation. This necessitates run-time optimisation techniques to address the dynamic resource requirements of different applications on many-core architectures. In this paper, we report on intelligent run-time algorithms which have been experimentally validated for managing energy and application performance in many-core embedded systems. The algorithms are underpinned by a cross-layer system approach where the hardware, system software and application layers work together to optimise the energy-performance trade-off. Algorithm development is motivated by the biological process of how a human brain (acting as an agent) interacts with the external environment (system), changing their respective states over time. Each action taken yields a pay-off, and the agent eventually learns to take the best decisions in the future. In particular, our online approach uses a model-free reinforcement learning algorithm that selects the appropriate voltage-frequency scaling based on workload prediction to meet the applications' performance requirements, achieving energy savings of up to 16% in comparison to state-of-the-art techniques when tested on four ARM A15 cores of an ODROID-XU3 platform.
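    The abstract does not spell out the state/action encoding, so the following is a generic tabular Q-learning sketch over hypothetical workload states and voltage-frequency levels, with a toy energy/performance reward; it illustrates model-free learning of a DVFS policy rather than the paper's exact algorithm.

```python
import random

# Generic tabular Q-learning sketch for run-time DVFS selection.
# States, frequency levels, and the reward shape are hypothetical.

STATES = ["low", "medium", "high"]       # predicted workload classes
ACTIONS = [0.6, 1.0, 1.4, 2.0]           # hypothetical frequency levels (GHz)
ALPHA, GAMMA, EPS = 0.1, 0.9, 0.1

Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}

def reward(state, freq):
    # Toy trade-off: penalize energy (~freq^2) and missed performance targets.
    demand = {"low": 0.6, "medium": 1.0, "high": 1.6}[state]
    miss_penalty = 10.0 if freq < demand else 0.0
    return -(freq ** 2) - miss_penalty

state = random.choice(STATES)
for _ in range(20000):
    if random.random() < EPS:                      # epsilon-greedy exploration
        action = random.choice(ACTIONS)
    else:
        action = max(ACTIONS, key=lambda a: Q[(state, a)])
    r = reward(state, action)
    nxt = random.choice(STATES)                    # next predicted workload
    best_next = max(Q[(nxt, a)] for a in ACTIONS)
    Q[(state, action)] += ALPHA * (r + GAMMA * best_next - Q[(state, action)])
    state = nxt

for s in STATES:
    print(s, "->", max(ACTIONS, key=lambda a: Q[(s, a)]), "GHz")
```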

    Performance predictability of divide and conquer skeletons

    Parallel divide and conquer computations, encompassing a wide variety of applications, can be modeled and encapsulated as a high-level primitive called a skeleton. The paper deals with a skeleton designed for parallel divide and conquer algorithms that provides hypercubical communications among processes. The paper also introduces an accurate timing model designed for performance prediction of the proposed primitive. The timing analysis model presented here still characterizes the communication time through architecture parameters, but introduces a few novelties. The proposal is to introduce different kinds of components into the analytical model by associating a performance constant with each specific conceptual block of the skeleton. The trace files obtained from executing the resulting code with the skeleton are processed with linear regression techniques, yielding, among other information, the values of the parameters of those blocks. An extended example showing the relative accuracy of the proposed approach concludes the paper. Workshop de Procesamiento Distribuido y Paralelo (WPDP). Red de Universidades con Carreras en Informática (RedUNCI).
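    The regression idea can be sketched as follows: collect per-block counts from the trace files, then fit one performance constant per conceptual block with least squares. The block names, counts, and timings below are hypothetical.

```python
import numpy as np

# Sketch of the per-block regression idea: fit one performance constant
# per conceptual block of the divide-and-conquer skeleton (hypothetical
# blocks: split, base-case compute, hypercube exchange, combine), so that
# T ~ c_split*x1 + c_comp*x2 + c_comm*x3 + c_comb*x4.

# Each row: counts/volumes observed per block in one traced run
# (e.g., split operations, base-case work, bytes exchanged, merges).
X = np.array([
    [ 7,  8_000,  4_096,  7],
    [15, 16_000,  8_192, 15],
    [31, 32_000, 16_384, 31],
    [63, 64_000, 32_768, 63],
], dtype=float)
T = np.array([0.021, 0.042, 0.085, 0.171])   # measured run times (s)

coeffs, *_ = np.linalg.lstsq(X, T, rcond=None)
for name, c in zip(["split", "compute", "comm", "combine"], coeffs):
    print(f"cost per {name} unit: {c:.3e} s")

# Predict a run twice as large from the fitted constants.
print("predicted:", (2 * X[-1]) @ coeffs, "s")
```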

    TokenTLB+CUP: A Token-Based Page Classification with Cooperative Usage Prediction

    [EN] Discerning the private or shared condition of the data accessed by the applications is an increasingly decisive approach to achieving efficiency and scalability in multi- and many-core systems. Since most memory accesses in both sequential and parallel applications are to either private (accessed only by one core) or read-only (not written) data, devoting the full cost of coherence to every memory access results in sub-optimal performance and limits the scalability and efficiency of the multiprocessor. This paper introduces TokenTLB, a TLB-based page classification approach based on the exchange and counting of tokens. Token counting on TLBs is a natural and efficient way to classify memory pages, and it does not require the use of complex and undesirable persistent requests or arbitration. In addition, classification is extended with the Cooperative Usage Predictor (CUP), a token-based system-wide page usage predictor retrieved through TLB cooperation, in order to perform a classification unaffected by TLB size. Through cycle-accurate simulation we observed that TokenTLB spends 43.6% of cycles as private per page on average, and CUP further increases the time spent as private by 22.0%. CUP avoids 4 out of 5 TLB invalidations when compared to state-of-the-art predictors, thus providing far better prediction accuracy and making usage prediction an attractive mechanism for the first time. This work has been jointly supported by the MINECO and European Commission (FEDER funds) under projects TIN2015-66972-C5-1-R and TIN2015-66972-C5-3-R, and by the Fundación Séneca-Agencia de Ciencia y Tecnología de la Región de Murcia under project Jóvenes Líderes en Investigación 18956/JLI/13. Esteve Garcia, A.; Ros Bardisa, A.; Robles Martínez, A.; Gómez Requena, M. E. (2018). TokenTLB+CUP: A Token-Based Page Classification with Cooperative Usage Prediction. IEEE Transactions on Parallel and Distributed Systems, 29(5), 1188-1201. https://doi.org/10.1109/TPDS.2017.2782808
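    A much-simplified model of the token-counting idea (ignoring token-exchange messages, TLB eviction, and the CUP predictor): each page carries as many tokens as cores, and a core may classify a page as private only while its TLB holds all of that page's tokens.

```python
# Simplified model of token-based page classification (after TokenTLB):
# each page has a fixed number of tokens equal to the core count; a core
# may treat a page as private only while it holds *all* of its tokens.
# Token-exchange protocol details and CUP prediction are omitted.

NUM_CORES = 4

class PageTokens:
    def __init__(self):
        self.holders = {}                       # core_id -> token count

    def access(self, core):
        if not self.holders:                    # first touch: take all tokens
            self.holders[core] = NUM_CORES
        elif core not in self.holders:          # new sharer: request one token
            donor = max(self.holders, key=self.holders.get)
            self.holders[donor] -= 1
            self.holders[core] = 1

    def classification(self, core):
        # Private to `core` only if it holds every token for the page.
        if self.holders.get(core, 0) == NUM_CORES:
            return "private"
        return "shared"

page = PageTokens()
page.access(0)
print(page.classification(0))   # private: core 0 holds all 4 tokens
page.access(2)
print(page.classification(0))   # shared: a token migrated to core 2
```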

    EXPLORING MULTIPLE LEVELS OF PERFORMANCE MODELING FOR HETEROGENEOUS SYSTEMS

    The current trend in High-Performance Computing (HPC) is to extract concurrency from clusters that include heterogeneous resources such as General Purpose Graphical Processing Units (GPGPUs) and Field Programmable Gate Arrays (FPGAs). Although these heterogeneous systems can provide substantial performance for massively parallel applications, many of the available computing resources are often under-utilized due to inefficient application mapping, load balancing, and tuning. While several performance prediction models exist to efficiently tune applications, they often require significant computing architecture knowledge for reliable prediction. In addition, they do not address multiple levels of design space abstraction, and it is often difficult to choose a reliable prediction model for a given design. In this research, we develop a multi-level suite of performance prediction models for heterogeneous systems that primarily targets Synchronous Iterative Algorithms (SIAs). The modeling suite aims to produce accurate and straightforward application runtime prediction prior to the actual large-scale implementation. This suite addresses two levels of system abstraction: 1) low-level, where partial knowledge of the application implementation is present along with the system specifications, and 2) high-level, where the implementation details are minimal and only high-level computing system specifications are given. The performance prediction modeling suite is developed using our proposed Synchronous Iterative GPGPU Execution (SIGE) model for GPGPU clusters, motivated by the RC Amenability Test for Scalable Systems (RATSS) model for FPGA clusters. The low-level abstraction for GPGPU clusters consists of a regression-based performance prediction framework that statistically abstracts system architecture characteristics, enabling performance prediction without detailed architecture knowledge. In this framework, the overall execution time of an application is predicted using regression models developed for host-device computations and network-level communications performed in the algorithm. We have used a family of Spiking Neural Network (SNN) models and an Anisotropic Diffusion Filter (ADF) algorithm as SIA case studies for verification of the regression-based framework and achieved over 90% prediction accuracy compared to the actual implementations for several GPGPU cluster configurations tested. The results establish the adequacy of the low-level abstraction model for advanced, fine-grained performance prediction and design space exploration (DSE). The high-level abstraction consists of the following two primary modeling approaches: qualitative modeling that uses existing subjective-analytical models for computation and communication; and quantitative modeling that predicts computation and communication performance by measuring hardware events associated with objective-analytical models using micro-benchmarks. The performance prediction provided by the high-level abstraction approaches, albeit coarse-grained, delivers useful insight into application performance on the chosen heterogeneous system. A blend of the two high-level modeling approaches, labeled as hybrid modeling, is explored for insightful preliminary performance prediction. The performance prediction models in the multi-level suite are verified and compared for their accuracy and ease-of-use, allowing developers to choose a model that best satisfies their design space abstraction.
We also construct a roadmap that guides users from optimal Application-to-Accelerator (A2A) mapping to fine-grained performance prediction, thereby providing a hierarchical approach to optimal application porting on the target heterogeneous system. The end goal of this dissertation research is to offer the HPC community a thorough, non-architecture-specific performance prediction framework in the form of a hierarchical modeling suite that enables them to optimally utilize the heterogeneous resources.
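    As a sketch of the low-level regression abstraction, one can predict an application's iteration time as a regressed combination of computation and communication features; the features, measurements, and fitted model below are hypothetical.

```python
import numpy as np

# Sketch of the low-level regression framework: predict runtime as the
# sum of a host-device computation term and a network communication term,
# fit by least squares on profiling data (hypothetical features/values).

# Features per profiled configuration: [work items per GPU, bytes moved
# host<->device, bytes exchanged over the network per iteration].
X = np.array([
    [1e6, 4e6, 1e5],
    [2e6, 8e6, 2e5],
    [4e6, 1.6e7, 4e5],
    [8e6, 3.2e7, 8e5],
])
T = np.array([0.9, 1.7, 3.4, 6.7])             # measured iteration times (s)

# Fit with an intercept for fixed launch/synchronization overhead.
A = np.hstack([X, np.ones((len(X), 1))])
w, *_ = np.linalg.lstsq(A, T, rcond=None)

new_cfg = np.array([6e6, 2.4e7, 6e5, 1.0])
print("predicted iteration time:", new_cfg @ w, "s")
```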

    Parallelism Management for Concurrently Executed Parallel Applications

    Doctoral dissertation (Ph.D.) -- Seoul National University Graduate School: College of Engineering, Department of Electrical and Computer Engineering, August 2020. Advisor: Bernhard Egger. Running multiple parallel jobs on the same multicore machine is becoming more important to improve utilization of the given hardware resources. While co-location of parallel jobs is common practice, it still remains a challenge for current parallel runtime systems to efficiently execute multiple parallel applications simultaneously. Conventional parallelization runtimes such as OpenMP generate a fixed number of worker threads, typically as many as there are cores in the system, to utilize all physical core resources. On such runtime systems, applications may not achieve their peak performance even when given full use of all physical core resources. Moreover, the OS kernel needs to manage all worker threads generated by all running parallel applications, which may incur huge management costs as the number of co-located applications increases. In this thesis, we focus on improving runtime performance for co-located parallel applications. To achieve this goal, the first idea of this work is to use spatial scheduling to execute multiple co-located parallel applications simultaneously. Spatial scheduling, which provides distinct core resources to applications, is considered a promising and scalable approach for executing co-located applications. Despite the growing importance of spatial scheduling, there are still two fundamental research issues with this approach. First, spatial scheduling requires runtime support for parallel applications to run efficiently under spatial core allocations that can change at runtime. Second, the scheduler needs to assign the proper number of core resources to applications, depending on the applications' performance characteristics, for better runtime performance. To this end, in this thesis, we present three novel runtime-level techniques to efficiently execute co-located parallel applications with spatial scheduling. First, we present a cooperative runtime technique that provides malleable parallel execution for OpenMP parallel applications. Malleable execution means that applications can dynamically adapt their degree of parallelism to the varying core resource availability. It allows parallel applications to run efficiently at changing core resource availability, compared to conventional runtime systems that do not adjust the degree of parallelism of the application. Second, this thesis introduces an analytical performance model that can estimate resource utilization and the performance of parallel programs as a function of the provided core resources. We observe that the performance of parallel loops is typically limited by memory performance, and employ queueing theory to model the memory performance. The queueing-system-based approach allows us to estimate the performance by using closed-form equations and hardware performance counters. Third, we present a core allocation framework to manage core resources between co-located parallel applications. With analytical modeling, we observe that maximizing both CPU utilization and memory bandwidth usage can generally lead to better performance compared to conventional core allocation policies that maximize only CPU usage.
The presented core allocation framework optimizes the utilization of the multi-dimensional resources of CPU cores and memory bandwidth on multi-socket multicore systems, based on the cooperative parallel runtime support and the analytical model.
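    As an illustration of the queueing-theoretic idea, the following sketch models the memory system as an M/M/1 queue fed by n cores and derives a closed-form speedup estimate for a memory-bound loop; all parameter values are hypothetical stand-ins for what the thesis obtains from hardware performance counters.

```python
# Sketch of a queueing-based speedup estimate for a memory-bound parallel
# loop: the memory system is modeled as an M/M/1 queue fed by n cores.
# In the thesis the parameters come from hardware performance counters;
# the values below are hypothetical.

MU = 1.0 / 80e-9        # memory service rate: one request per 80 ns
LAM_CORE = 1.0e6        # memory request rate issued by a single core (req/s)
CPU_WORK = 1.0          # pure compute time of the whole loop (s)
REQS = 5.0e6            # total memory requests issued by the loop

def loop_time(n):
    lam = n * LAM_CORE                      # aggregate arrival rate
    if lam >= MU:
        return float("inf")                 # memory controller saturated
    wait = 1.0 / (MU - lam)                 # M/M/1 mean response time
    return CPU_WORK / n + (REQS / n) * wait # per-core compute + memory stalls

t1 = loop_time(1)
for n in (1, 2, 4, 8, 16):
    print(f"{n:2d} cores: predicted speedup {t1 / loop_time(n):5.2f}")
```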

    Optimization of communication intensive applications on HPC networks

    Communication is a necessary but overhead-inducing component of parallel programming. Its impact on application design and performance is due to several related aspects of a parallel job execution: network topology, routing protocol, suitability of the algorithm being used to the network, job placement, etc. This thesis is aimed at developing an understanding of how communication plays out on networks of high performance computing systems and exploring methods that can be used to improve the communication performance of large scale applications. Broadly speaking, three topics have been studied in detail in this thesis. The first of these topics is task mapping and job placement on practical installations of torus and dragonfly networks. Next, the use of supervised learning algorithms for conducting diagnostic studies of how communication evolves on networks is explored. Finally, the efficacy of packet-level simulations for prediction-based studies of communication performance on different networks using different network parameters is analyzed. The primary contribution of this thesis is the development of scalable diagnostic and prediction methods that can assist in the process of designing networks, adapting applications to future systems, and optimizing the execution of applications on existing systems. These methods include a supervised learning approach, a functional modeling tool (called Damselfly), and a PDES-based packet-level simulator (called TraceR), all of which are described in this thesis.
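    A small sketch of the task-mapping topic: a common quality metric for a mapping is weighted hop-bytes, the sum over communicating task pairs of bytes exchanged times the hop distance on the torus. The topology size and communication pattern below are hypothetical.

```python
import itertools

# Sketch of a task-mapping quality metric on a 3-D torus: weighted
# hop-bytes = sum over communicating task pairs of (bytes exchanged)
# x (torus hop distance between the nodes they are mapped to).
# Topology size and the communication matrix are hypothetical.

DIMS = (4, 4, 4)                    # 4x4x4 torus

def torus_hops(a, b):
    # Shortest wraparound distance summed over the three dimensions.
    return sum(min(abs(x - y), d - abs(x - y))
               for x, y, d in zip(a, b, DIMS))

nodes = list(itertools.product(*(range(d) for d in DIMS)))

# comm[i][j] = bytes task i sends to task j (toy 1-D ring pattern).
ntasks = 8
comm = [[0] * ntasks for _ in range(ntasks)]
for i in range(ntasks):
    comm[i][(i + 1) % ntasks] = 1024

def hop_bytes(mapping):
    return sum(comm[i][j] * torus_hops(nodes[mapping[i]], nodes[mapping[j]])
               for i in range(ntasks) for j in range(ntasks) if comm[i][j])

naive = list(range(ntasks))         # map tasks to the first 8 nodes in order
print("naive mapping hop-bytes:", hop_bytes(naive))
```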

    Big Data Application and System Co-optimization in Cloud and HPC Environment

    The emergence of big data requires powerful computational resources and memory subsystems that can be scaled efficiently to accommodate its demands. The cloud is a well-established computing paradigm that can offer customized computing and memory resources to meet the scalable demands of big data applications. In addition, the flexible pay-as-you-go pricing model offers opportunities for using resources at large scale with low cost and no infrastructure maintenance burdens. High performance computing (HPC), on the other hand, also has powerful infrastructure with the potential to support big data applications. In this dissertation, we explore application and system co-optimization opportunities to support big data in both cloud and HPC environments. Specifically, we explore the unique features of both application and system to seek overlooked optimization opportunities or tackle challenges that are difficult to address by looking at the application or system individually. Based on the characteristics of the workloads and their underlying systems, from which we derive the optimized deployment and runtime schemes, we divide the workloads into four categories: 1) memory intensive applications; 2) compute intensive applications; 3) both memory and compute intensive applications; 4) I/O intensive applications. When deploying memory intensive big data applications to the public clouds, one important yet challenging problem is selecting a specific instance type whose memory capacity is large enough to prevent out-of-memory errors while the cost is minimized without violating performance requirements. In this dissertation, we propose two techniques for efficient deployment of big data applications with dynamic and intensive memory footprints in the cloud. The first approach builds a performance-cost model that can accurately predict how, and by how much, virtual memory size would slow down the application and, consequently, impact the overall monetary cost. The second approach employs a lightweight memory usage prediction methodology based on dynamic meta-models adjusted by the application's own traits. The key idea is to eliminate periodical checkpointing and migrate the application only when the predicted memory usage exceeds the physical allocation. When moving compute intensive applications to the clouds, it is critical to make the applications scalable so that they can benefit from the massive cloud resources. In this dissertation, we first use Kirchhoff's law, one of the most widely used physical laws across engineering disciplines, as an example workload for our study. The key challenge of applying Kirchhoff's law to real-world applications at scale lies in the high, if not prohibitive, computational cost of solving a large number of nonlinear equations. In this dissertation, we propose a high-performance deep-learning-based approach for Kirchhoff analysis, namely HDK. HDK employs two techniques to improve performance: (i) early pruning of unqualified input candidates, which simplifies the equations and selects a meaningful input data range; (ii) parallelization of forward labelling, which executes the steps of the problem in parallel. When it comes to applications that are both memory and compute intensive in clouds, we use a blockchain system as a benchmark. Existing blockchain frameworks present a technical barrier for many users who want to modify or test out new research ideas in blockchains.
To make matters worse, many advantages of blockchain systems can be demonstrated only at large scales, which are not always available to researchers. In this dissertation, we develop an accurate and efficient emulation system to replay the execution of large-scale blockchain systems on tens of thousands of nodes in the cloud. For I/O intensive applications, we observe one important yet often neglected side effect of lossy scientific data compression. Lossy compression techniques have demonstrated promising results in significantly reducing scientific data sizes while guaranteeing the compression error bounds, but the compressed data sizes are often highly skewed and thus impact the performance of parallel I/O. Therefore, we believe it is critical to pay more attention to the unbalanced parallel I/O caused by lossy scientific data compression.
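    The memory-prediction deployment idea can be sketched as follows: fit a simple meta-model to early memory-usage samples, extrapolate the peak, and pick the cheapest instance type that fits; the meta-model form, samples, and instance catalog below are hypothetical.

```python
import numpy as np

# Sketch of the deployment idea for memory-intensive jobs: predict peak
# memory from early samples with a simple meta-model, then pick the
# cheapest instance whose memory fits. The quadratic model, samples, and
# instance catalog are hypothetical, not a real provider's offering.

samples_t = np.array([60.0, 120.0, 180.0, 240.0])   # seconds elapsed
samples_gb = np.array([3.1, 5.8, 9.2, 13.5])        # observed resident memory

model = np.polynomial.Polynomial.fit(samples_t, samples_gb, deg=2)
predicted_peak = float(model(600.0)) * 1.15         # extrapolate + 15% margin

catalog = [                     # (name, memory GiB, $/hour) -- hypothetical
    ("small",   16, 0.10),
    ("medium",  32, 0.19),
    ("large",   64, 0.37),
    ("xlarge", 128, 0.74),
]
fitting = [c for c in catalog if c[1] >= predicted_peak]
name, mem_gib, price = min(fitting, key=lambda c: c[2])
print(f"predicted peak ~{predicted_peak:.0f} GiB -> choose {name} (${price}/h)")
```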

    Analysis and design development of parallel 3-D mesh refinement algorithms for finite element electromagnetics with tetrahedra

    Optimal partitioning of three-dimensional (3-D) mesh applications necessitates dynamically determining and optimizing for the most time-inhibiting factors, such as load imbalance and communication volume. One challenge is to create an analytical model with which the programmer can focus on optimizing load imbalance or communication volume to reduce execution time. Another challenge is that the best individual performance of a specific mesh refinement demands precise study and the selection of a suitable computation strategy. Very-large-scale finite element method (FEM) applications require sophisticated capabilities for using the underlying parallel computer's resources in the most efficient way. Thus, classifying these requirements in a manner that is accessible to the programmer is crucial. This thesis contributes a simulation-based approach to the algorithm analysis and design of parallel 3-D FEM mesh refinement that utilizes Petri Nets (PN) as the modeling and simulation tool. PN models are implemented based on detailed software prototypes and system architectures, which imitate the behaviour of the parallel meshing process. Subsequently, estimates for performance measures are derived from discrete event simulations. New communication strategies are contributed in the thesis for parallel mesh refinement that pipeline the computation and communication time by means of a workload prediction approach and a task breaking point approach. To examine the performance of these new designs, PN models are created for modeling and simulating each of them, and their efficiency is justified by the simulation results. Also based on the PN modeling approach, the performance of a Random Polling Dynamic Load Balancing protocol has been examined. Finally, the PN models are validated against an MPI benchmarking program running on a real multiprocessor system. The advantages of the new pipelined communication designs, as well as the benefits of the PN approach for evaluating and developing high performance parallel mesh refinement algorithms, are demonstrated.
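    As a toy illustration of the pipelined communication strategies (the thesis evaluates them with Petri-net discrete-event simulation, not this closed-form estimate), the sketch below compares a refinement pass that communicates only after all computation finishes against one that overlaps each chunk's send with the next chunk's computation.

```python
# Toy comparison of sequential vs pipelined compute/communication for a
# refinement pass split into chunks at "task breaking points". All times
# are hypothetical; the thesis derives such estimates from Petri-net
# discrete-event simulation rather than this closed-form toy.

chunks = [(12.0, 5.0), (9.0, 7.0), (14.0, 4.0), (10.0, 6.0)]  # (compute ms, send ms)

# Sequential: refine everything, then communicate everything.
sequential = sum(c for c, _ in chunks) + sum(s for _, s in chunks)

# Pipelined: chunk i's send overlaps chunk i+1's compute.
t_compute = t_network = 0.0
for comp, send in chunks:
    t_compute += comp                             # CPU finishes this chunk
    t_network = max(t_network, t_compute) + send  # NIC starts once chunk is ready
pipelined = max(t_compute, t_network)

print(f"sequential: {sequential:.1f} ms, pipelined: {pipelined:.1f} ms")
```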