7 research outputs found
Spatial Scheduling and Load Balancing Techniques for the Argobots Runtime System
Thesis (Master's) -- Seoul National University Graduate School: College of Engineering, Department of Computer Science and Engineering, 2018. 8. Bernhard Egger.
Multi-core processors continue to evolve. While this shift carries massive parallel-processing potential, it also poses a significant challenge: keeping all cores busy while minimizing synchronization costs is difficult.
Argobots is a user-level threading framework for OpenMP applications on multi-core processors, designed and implemented by Argonne National Laboratory. In this thesis, we present the fundamental principles and benefits of load-balancing methods in the Argobots runtime system. We modify the structure of the Argobots runtime framework and introduce a new mechanism that balances the workload through spatial scheduling of user-level threads. The optimized Argobots runtime combines static and dynamic load-balancing techniques.
Using the NAS Parallel Benchmark suite, we assess and analyze the performance of the optimized Argobots runtime system. We achieve a performance improvement of between 10% and 20% on average. The gains are more pronounced when multiple Argobots applications run concurrently. To summarize, this thesis demonstrates the benefits of load balancing and spatial scheduling in the Argobots runtime system.
Abstract
Contents
List of Figures
List of Tables
Chapter 1 Introduction.......................................................1
1.1 Multi-core Multi-threaded Processors...........................1
1.2 Task Scheduling........................................................2
1.3 Parallel Runtimes......................................................3
1.4 Contributions and Outline...........................................4
Chapter 2 Background and Motivation..................................6
2.1 Background..............................................................6
2.1.1 Multi-threaded Parallelism.......................................6
2.1.2 OpenMP........................................................... 7
2.1.3 BOLT ................................................................8
2.1.4 ARGO...............................................................9
2.1.5 User-Level Threads............................................10
2.1.6 Load Balancing in Multi-core Programming ...........10
2.2 Related Work..........................................................12
2.3 Motivation.............................................................13
Chapter 3 Argobots Runtime Architecture............................15
3.1 Argo Architecture Layout..........................................15
3.2 Argo Runtime: Argobots...........................................16
3.2.1 User-Level Threads and Tasklets.............................17
3.2.2 Execution Streams.............................................19
3.2.3 Scheduler and Pool............................................20
3.2.4 Work Unit Migration..........................................23
Chapter 4 Implementation................................................24
4.1 Data Structures.......................................................26
4.2 Work Load Balancing...............................................28
4.3 Choosing the Threshold............................................30
4.4 Core Allocation.......................................................32
Chapter 5 Evaluation Results and Analysis...........................36
5.1 Benchmarks...........................................................36
5.2 Evaluation of Work Load Balancing............................38
5.3 Evaluation of Different Thresholds.............................39
5.4 Evaluation of Core Allocation....................................39
5.4.1 Executing Two Parallel Applications.....................40
5.4.2 Executing Three or More Parallel Applications.......42
Chapter 6 Conclusions.....................................................47
Bibliography...................................................................48
Abstract (in Korean)........................................................53
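The abstract and chapter outline above revolve around Argobots execution streams, pools, and user-level threads (ULTs, Sections 3.2.1-3.2.3). As a rough, minimal sketch of those building blocks (using the stock Argobots C API, not the modified runtime developed in the thesis), ULTs can be pushed into one shared pool so that whichever execution stream becomes idle next picks them up, which is already a simple form of dynamic load balancing:

/* Minimal Argobots sketch: several execution streams (ESs) draw ULTs from
 * one shared MPMC pool. Error checks omitted for brevity. */
#include <abt.h>
#include <stdio.h>

#define NUM_ES   4
#define NUM_ULTS 64

static void work(void *arg) {
    size_t i = (size_t)arg;
    printf("ULT %zu executed\n", i);
}

int main(int argc, char **argv) {
    ABT_xstream xstreams[NUM_ES];
    ABT_thread  ults[NUM_ULTS];
    ABT_pool    shared_pool;

    ABT_init(argc, argv);

    /* One FIFO pool with multi-producer/multi-consumer access, shared by
     * all secondary ESs: any idle ES can pop pending work from it. */
    ABT_pool_create_basic(ABT_POOL_FIFO, ABT_POOL_ACCESS_MPMC,
                          ABT_TRUE, &shared_pool);

    for (int i = 0; i < NUM_ES; i++)
        ABT_xstream_create_basic(ABT_SCHED_DEFAULT, 1, &shared_pool,
                                 ABT_SCHED_CONFIG_NULL, &xstreams[i]);

    /* Push ULTs into the shared pool; they run on whichever ES frees up. */
    for (size_t i = 0; i < NUM_ULTS; i++)
        ABT_thread_create(shared_pool, work, (void *)i,
                          ABT_THREAD_ATTR_NULL, &ults[i]);

    for (int i = 0; i < NUM_ULTS; i++) {
        ABT_thread_join(ults[i]);
        ABT_thread_free(&ults[i]);
    }
    for (int i = 0; i < NUM_ES; i++) {
        ABT_xstream_join(xstreams[i]);
        ABT_xstream_free(&xstreams[i]);
    }
    ABT_finalize();
    return 0;
}

The spatial scheduling and core allocation studied in the thesis (Chapter 4) additionally control which cores the execution streams occupy, especially when several applications co-run; that part is specific to the modified runtime and is not shown here.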
Cooperative hierarchical resource management for efficient composition of parallel software
Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2010. Cataloged from PDF version of thesis. Includes bibliographical references (p. 93-96).
There cannot be a thriving software industry in the upcoming manycore era unless programmers can compose arbitrary parallel codes without sacrificing performance. We believe that the efficient composition of parallel codes is best achieved by exposing unvirtualized hardware resources and sharing these cooperatively across parallel codes within an application. This thesis presents Lithe, a user-level framework that enables efficient composition of parallel software components. Lithe provides the basic primitives, standard interface, and thin runtime to enable parallel codes to efficiently use and share processing resources. Lithe can be inserted underneath the runtimes of legacy parallel software environments to provide bolt-on composability - without changing a single line of the original application code. Lithe can also serve as the foundation for building new parallel abstractions and runtime systems that automatically interoperate with one another. We have built and ported a wide range of interoperable scheduling, synchronization, and domain-specific libraries using Lithe. We show that the modifications needed are small and impose no performance penalty when running each library standalone. We also show that Lithe improves the performance of real-world applications composed of multiple parallel libraries by simply relinking them with the new library binaries. Moreover, the Lithe version of an application even outperformed a third-party expert-tuned implementation by being more adaptive to different phases of the computation.
by Heidi Pan. Ph.D.
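The composition problem described above arises because each parallel library normally assumes it owns the whole machine and spawns its own OS threads. A minimal interface sketch of the cooperative alternative (hypothetical names only; this is not Lithe's actual API) is a scheduler hierarchy in which a child asks its parent for hardware contexts ("harts") and receives them through callbacks rather than creating threads itself:

/* Hypothetical sketch of cooperative hierarchical resource sharing.
 * Names are illustrative, not Lithe's real interface. */
typedef struct scheduler scheduler_t;

struct scheduler {
    scheduler_t *parent;                        /* who grants us our harts      */
    void (*enter)(scheduler_t *self);           /* a hart was granted; run work */
    void (*request)(scheduler_t *self,
                    scheduler_t *child, int k); /* a child asks for k harts     */
    void (*yield)(scheduler_t *self);           /* hand the current hart back   */
};

/* A library runtime (e.g. an OpenMP or BLAS scheduler) asks its parent for
 * harts instead of oversubscribing the machine with its own OS threads. */
static void child_needs_workers(scheduler_t *child, int k) {
    child->parent->request(child->parent, child, k);  /* cooperative, no new OS threads */
}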
Adaptive architecture-transparent policy control in a distributed graph reducer
The end of the frequency-scaling era occurred around 2005, as clock frequencies
stalled for commodity architectures. Performance improvements that could in the past
be expected with each new hardware generation therefore needed to originate
elsewhere. Almost all computer architectures exhibit substantial and growing levels
of parallelism, and exploiting it became one of the key sources of performance and
scalability improvements. Alas, parallel programming proved much more difficult
than sequential programming, due to the need to specify coordination and parallelism
management aspects. Whilst low-level languages place this burden on the programmer,
reducing productivity and portability, semi-implicit approaches delegate the
responsibility to sophisticated compilers and run-time systems.
This thesis presents a study of adaptive load distribution based on work stealing,
using history and ancestry information in a distributed graph reducer for a
non-strict functional language. The results contribute to the exploration of more
flexible run-time-system-level parallelism control implementing a semi-explicit model
of parallelism, which offers productivity and a high level of abstraction by
delegating the responsibility for coordination to the run-time system.
After characterising a set of parallel functional applications, we study the use of
historical information to adapt the choice of the victim to steal from in a
work-stealing scheduler. We observe substantially lower numbers of messages for
data-parallel and nested applications. However, this heuristic fails in cases where
past application behaviour does not resemble future behaviour, for instance for
Divide-&-Conquer applications with a large number of very fine-grained threads and
generators of parallelism that move dynamically across processing elements. This
mechanism is not specific to the language or the run-time system, and applies to
other work-stealing schedulers.
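A minimal sketch of such a history-guided victim choice (hypothetical names and counters, not the thesis's actual run-time system) keeps per-PE statistics of past steal attempts and prefers the victim with the best past success rate, falling back to a random victim when no history is available:

/* Illustrative sketch: bias work-stealing victim selection towards PEs that
 * answered our recent steal attempts successfully. */
#include <stdlib.h>

#define NUM_PES 256

static unsigned steal_hits[NUM_PES];   /* successful steals from each PE  */
static unsigned steal_tries[NUM_PES];  /* steal attempts sent to each PE  */

static int choose_victim(int self) {
    int best = -1;
    double best_rate = 0.0;
    for (int pe = 0; pe < NUM_PES; pe++) {
        if (pe == self || steal_tries[pe] == 0) continue;
        double rate = (double)steal_hits[pe] / steal_tries[pe];
        if (rate > best_rate) { best_rate = rate; best = pe; }
    }
    if (best >= 0)
        return best;                    /* history suggests a good victim  */
    int pe = rand() % NUM_PES;          /* no useful history: pick randomly */
    return (pe == self) ? (pe + 1) % NUM_PES : pe;
}

/* After each steal reply, the caller would update the counters:
 *   steal_tries[victim]++; if (got_work) steal_hits[victim]++;            */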
Next, we focus on the other key work-stealing decision: which sparks, representing
potential parallelism, to donate. We investigate the effect of Spark Colocation on
the performance of five Divide-&-Conquer programs run on a cluster of up to 256 PEs.
When using Spark Colocation, the distributed graph reducer shares related work,
resulting in a higher degree of both potential and actual parallelism, and in more
fine-grained and less variable thread sizes. We validate this behaviour by observing
a reduction in average fetch times, but increased numbers of FETCH messages and of
inter-PE pointers under colocation, which nevertheless results in improved load
balance for three of the five benchmark programs. The results show high speedups and
speedup improvements with Spark Colocation for the three more regular and nested
applications, and performance degradation for two programs: one that is excessively
fine-grained and one that exhibits limited scalability. Overall, Spark Colocation
appears most beneficial for higher numbers of PEs, where improved load balance and a
higher degree of parallelism have more opportunities to pay off.
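As a rough sketch of the colocation idea (hypothetical data structures, not the actual implementation), a PE answering a steal request could prefer to donate a spark whose ancestor task's work has already been placed on the requesting PE, so that related work ends up on the same node:

/* Illustrative sketch of donating "related" sparks. Ancestor ids are assumed
 * to fit in the fixed-size table below. */
typedef struct {
    int id;        /* spark identifier                                     */
    int ancestor;  /* top-level task this spark descends from (ancestry)   */
} spark_t;

#define MAX_SPARKS    1024
#define MAX_ANCESTORS 1024

static spark_t spark_pool[MAX_SPARKS];        /* local potential parallelism        */
static int     num_sparks;
static int     ancestor_home[MAX_ANCESTORS];  /* last PE a given ancestor's work
                                                 was sent to, or -1 if none         */

/* Choose which spark to donate to the requesting PE `thief`: prefer one
 * related to work already sent there; otherwise fall back to the oldest. */
static int pick_spark_for(int thief) {
    for (int i = 0; i < num_sparks; i++)
        if (ancestor_home[spark_pool[i].ancestor] == thief)
            return i;                         /* related work: colocate it */
    return num_sparks > 0 ? 0 : -1;
}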
In more general terms, we show that a run-time system can beneficially use historical
information on past stealing successes, gathered dynamically and used within the same
run, together with ancestry information reconstructed dynamically at run time from
annotations. Moreover, the results support the view that different heuristics are
beneficial for applications using different parallelism patterns, underlining the
advantages of a flexible, architecture-transparent approach.
The Scottish Informatics and Computer Science Alliance (SICSA)