7 research outputs found
Spatial Scheduling and Load Balancing Techniques for the Argobots Runtime System
Thesis (Master's) -- Seoul National University Graduate School: College of Engineering, Department of Computer Science and Engineering, 2018. 8. Bernhard Egger.
Multi-core processors continue to evolve. While this shift carries massive parallel-processing potential, it also poses a significant challenge: keeping all cores busy while minimizing synchronization costs is difficult.
Argobots is a user-level threading framework for OpenMP applications on multi-core processors, designed and implemented by Argonne National Laboratory. In this thesis, we present the fundamental principles and benefits of load-balancing methods in the Argobots runtime system. We modify the structure of the Argobots runtime framework and introduce a new mechanism that balances the workload through spatial scheduling of user-level threads. The optimized Argobots runtime combines static and dynamic load-balancing techniques.
Using the NAS Parallel Benchmark suite, we assess and analyze the performance of the optimized Argobots runtime system. We achieve a performance improvement of between 10% and 20% on average. The gains are more pronounced when multiple Argobots applications run concurrently. To summarize, this thesis demonstrates the benefits of load balancing and spatial scheduling in the Argobots runtime system.
Abstract
Contents
List of Figures
List of Tables
Chapter 1 Introduction.......................................................1
1.1 Multi-core Multi-threaded Processors...........................1
1.2 Task Scheduling........................................................2
1.3 Parallel Runtimes......................................................3
1.4 Contributions and Outline...........................................4
Chapter 2 Background and Motivation..................................6
2.1 Background..............................................................6
2.1.1 Multi-threaded Parallelism.......................................6
2.1.2 OpenMP........................................................... 7
2.1.3 BOLT ................................................................8
2.1.4 ARGO...............................................................9
2.1.5 User-Level Threads............................................10
2.1.6 Load Balancing in Multi-core Programming ...........10
2.2 Related Work..........................................................12
2.3 Motivation.............................................................13
Chapter 3 Argobots Runtime Architecture............................15
3.1 Argo Architecture Layout..........................................15
3.2 Argo Runtime: Argobots...........................................16
3.2.1 User-Level Threads and Tasklets.............................17
3.2.2 Execution Streams.............................................19
3.2.3 Scheduler and Pool............................................20
3.2.4 Work Unit Migration..........................................23
Chapter 4 Implementation................................................24
4.1 Data Structures.......................................................26
4.2 Work Load Balancing...............................................28
4.3 Choosing the Threshold............................................30
4.4 Core Allocation.......................................................32
Chapter 5 Evaluation Results and Analysis...........................36
5.1 Benchmarks...........................................................36
5.2 Evaluation of Work Load Balancing............................38
5.3 Evaluation of Different Thresholds.............................39
5.4 Evaluation of Core Allocation....................................39
5.4.1 Executing Two Parallel Applications.....................40
5.4.2 Executing Three or More Parallel Applications.......42
Chapter 6 Conclusions.....................................................47
Bibliography...................................................................48
Abstract (in Korean)........................................................53
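The abstract and chapter outline above revolve around Argobots execution streams, pools, and user-level threads (ULTs, Sections 3.2.1-3.2.3). As a rough, minimal sketch of those building blocks (using the stock Argobots C API, not the modified runtime developed in the thesis), ULTs can be pushed into one shared pool so that whichever execution stream becomes idle next picks them up, which is already a simple form of dynamic load balancing:

/* Minimal Argobots sketch: several execution streams (ESs) draw ULTs from
 * one shared MPMC pool. Error checks omitted for brevity. */
#include <abt.h>
#include <stdio.h>

#define NUM_ES   4
#define NUM_ULTS 64

static void work(void *arg) {
    size_t i = (size_t)arg;
    printf("ULT %zu executed\n", i);
}

int main(int argc, char **argv) {
    ABT_xstream xstreams[NUM_ES];
    ABT_thread  ults[NUM_ULTS];
    ABT_pool    shared_pool;

    ABT_init(argc, argv);

    /* One FIFO pool with multi-producer/multi-consumer access, shared by
     * all secondary ESs: any idle ES can pop pending work from it. */
    ABT_pool_create_basic(ABT_POOL_FIFO, ABT_POOL_ACCESS_MPMC,
                          ABT_TRUE, &shared_pool);

    for (int i = 0; i < NUM_ES; i++)
        ABT_xstream_create_basic(ABT_SCHED_DEFAULT, 1, &shared_pool,
                                 ABT_SCHED_CONFIG_NULL, &xstreams[i]);

    /* Push ULTs into the shared pool; they run on whichever ES frees up. */
    for (size_t i = 0; i < NUM_ULTS; i++)
        ABT_thread_create(shared_pool, work, (void *)i,
                          ABT_THREAD_ATTR_NULL, &ults[i]);

    for (int i = 0; i < NUM_ULTS; i++) {
        ABT_thread_join(ults[i]);
        ABT_thread_free(&ults[i]);
    }
    for (int i = 0; i < NUM_ES; i++) {
        ABT_xstream_join(xstreams[i]);
        ABT_xstream_free(&xstreams[i]);
    }
    ABT_finalize();
    return 0;
}

The spatial scheduling and core allocation studied in the thesis (Chapter 4) additionally control which cores the execution streams occupy, especially when several applications co-run; that part is specific to the modified runtime and is not shown here.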
Cooperative hierarchical resource management for efficient composition of parallel software
Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2010. Cataloged from PDF version of thesis. Includes bibliographical references (p. 93-96).
There cannot be a thriving software industry in the upcoming manycore era unless programmers can compose arbitrary parallel codes without sacrificing performance. We believe that the efficient composition of parallel codes is best achieved by exposing unvirtualized hardware resources and sharing these cooperatively across parallel codes within an application. This thesis presents Lithe, a user-level framework that enables efficient composition of parallel software components. Lithe provides the basic primitives, standard interface, and thin runtime to enable parallel codes to efficiently use and share processing resources. Lithe can be inserted underneath the runtimes of legacy parallel software environments to provide bolt-on composability - without changing a single line of the original application code. Lithe can also serve as the foundation for building new parallel abstractions and runtime systems that automatically interoperate with one another. We have built and ported a wide range of interoperable scheduling, synchronization, and domain-specific libraries using Lithe. We show that the modifications needed are small and impose no performance penalty when running each library standalone. We also show that Lithe improves the performance of real-world applications composed of multiple parallel libraries by simply relinking them with the new library binaries. Moreover, the Lithe version of an application even outperformed a third-party expert-tuned implementation by being more adaptive to different phases of the computation.
by Heidi Pan. Ph.D.
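The composition problem described above arises because each parallel library normally assumes it owns the whole machine and spawns its own OS threads. A minimal interface sketch of the cooperative alternative (hypothetical names only; this is not Lithe's actual API) is a scheduler hierarchy in which a child asks its parent for hardware contexts ("harts") and receives them through callbacks rather than creating threads itself:

/* Hypothetical sketch of cooperative hierarchical resource sharing.
 * Names are illustrative, not Lithe's real interface. */
typedef struct scheduler scheduler_t;

struct scheduler {
    scheduler_t *parent;                        /* who grants us our harts      */
    void (*enter)(scheduler_t *self);           /* a hart was granted; run work */
    void (*request)(scheduler_t *self,
                    scheduler_t *child, int k); /* a child asks for k harts     */
    void (*yield)(scheduler_t *self);           /* hand the current hart back   */
};

/* A library runtime (e.g. an OpenMP or BLAS scheduler) asks its parent for
 * harts instead of oversubscribing the machine with its own OS threads. */
static void child_needs_workers(scheduler_t *child, int k) {
    child->parent->request(child->parent, child, k);  /* cooperative, no new OS threads */
}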
Adaptive architecture-transparent policy control in a distributed graph reducer
The end of the frequency-scaling era occurred around 2005, as clock frequencies
stalled for commodity architectures. Performance improvements that could in the past
be expected with each new hardware generation therefore needed to originate
elsewhere. Almost all computer architectures exhibit substantial and growing levels
of parallelism, and exploiting it became one of the key sources of performance and
scalability improvements. Alas, parallel programming proved much more difficult
than sequential programming, due to the need to specify coordination and parallelism
management aspects. Whilst low-level languages place this burden on the programmer,
reducing productivity and portability, semi-implicit approaches delegate the
responsibility to sophisticated compilers and run-time systems.
This thesis presents a study of adaptive load distribution based on work stealing,
using history and ancestry information in a distributed graph reducer for a
non-strict functional language. The results contribute to the exploration of more
flexible run-time-system-level parallelism control implementing a semi-explicit model
of parallelism, which offers productivity and a high level of abstraction by
delegating the responsibility for coordination to the run-time system.
After characterising a set of parallel functional applications, we study the use of
historical information to adapt the choice of the victim to steal from in a
work-stealing scheduler. We observe substantially lower numbers of messages for
data-parallel and nested applications. However, this heuristic fails in cases where
past application behaviour does not resemble future behaviour, for instance for
Divide-&-Conquer applications with a large number of very fine-grained threads and
generators of parallelism that move dynamically across processing elements. This
mechanism is not specific to the language or the run-time system, and applies to
other work-stealing schedulers.
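A minimal sketch of such a history-guided victim choice (hypothetical names and counters, not the thesis's actual run-time system) keeps per-PE statistics of past steal attempts and prefers the victim with the best past success rate, falling back to a random victim when no history is available:

/* Illustrative sketch: bias work-stealing victim selection towards PEs that
 * answered our recent steal attempts successfully. */
#include <stdlib.h>

#define NUM_PES 256

static unsigned steal_hits[NUM_PES];   /* successful steals from each PE  */
static unsigned steal_tries[NUM_PES];  /* steal attempts sent to each PE  */

static int choose_victim(int self) {
    int best = -1;
    double best_rate = 0.0;
    for (int pe = 0; pe < NUM_PES; pe++) {
        if (pe == self || steal_tries[pe] == 0) continue;
        double rate = (double)steal_hits[pe] / steal_tries[pe];
        if (rate > best_rate) { best_rate = rate; best = pe; }
    }
    if (best >= 0)
        return best;                    /* history suggests a good victim  */
    int pe = rand() % NUM_PES;          /* no useful history: pick randomly */
    return (pe == self) ? (pe + 1) % NUM_PES : pe;
}

/* After each steal reply, the caller would update the counters:
 *   steal_tries[victim]++; if (got_work) steal_hits[victim]++;            */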
Next, we focus on the other key work-stealing decision: which sparks, representing
potential parallelism, to donate. We investigate the effect of Spark Colocation on
the performance of five Divide-&-Conquer programs run on a cluster of up to 256 PEs.
When using Spark Colocation, the distributed graph reducer shares related work,
resulting in a higher degree of both potential and actual parallelism, and in more
fine-grained and less variable thread sizes. We validate this behaviour by observing
a reduction in average fetch times, but increased numbers of FETCH messages and of
inter-PE pointers under colocation, which nevertheless results in improved load
balance for three of the five benchmark programs. The results show high speedups and
speedup improvements with Spark Colocation for the three more regular and nested
applications, and performance degradation for two programs: one that is excessively
fine-grained and one that exhibits limited scalability. Overall, Spark Colocation
appears most beneficial for higher numbers of PEs, where improved load balance and a
higher degree of parallelism have more opportunities to pay off.
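As a rough sketch of the colocation idea (hypothetical data structures, not the actual implementation), a PE answering a steal request could prefer to donate a spark whose ancestor task's work has already been placed on the requesting PE, so that related work ends up on the same node:

/* Illustrative sketch of donating "related" sparks. Ancestor ids are assumed
 * to fit in the fixed-size table below. */
typedef struct {
    int id;        /* spark identifier                                     */
    int ancestor;  /* top-level task this spark descends from (ancestry)   */
} spark_t;

#define MAX_SPARKS    1024
#define MAX_ANCESTORS 1024

static spark_t spark_pool[MAX_SPARKS];        /* local potential parallelism        */
static int     num_sparks;
static int     ancestor_home[MAX_ANCESTORS];  /* last PE a given ancestor's work
                                                 was sent to, or -1 if none         */

/* Choose which spark to donate to the requesting PE `thief`: prefer one
 * related to work already sent there; otherwise fall back to the oldest. */
static int pick_spark_for(int thief) {
    for (int i = 0; i < num_sparks; i++)
        if (ancestor_home[spark_pool[i].ancestor] == thief)
            return i;                         /* related work: colocate it */
    return num_sparks > 0 ? 0 : -1;
}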
In more general terms, we show that a run-time system can beneficially use historical
information on past stealing successes, gathered dynamically and used within the same
run, together with ancestry information reconstructed dynamically at run time from
annotations. Moreover, the results support the view that different heuristics are
beneficial for applications using different parallelism patterns, underlining the
advantages of a flexible, architecture-transparent approach.
The Scottish Informatics and Computer Science Alliance (SICSA)