7 research outputs found

    Argobots λŸ°νƒ€μž„ μ‹œμŠ€ν…œμ„ μœ„ν•œ 곡간 λΆ„ν•  μŠ€μΌ€μ€„λ§ 및 λΆ€ν•˜ λΆ„μ‚° 기법

    Get PDF
    Thesis (M.S.)--Seoul National University Graduate School: College of Engineering, Dept. of Computer Science and Engineering, August 2018. Advisor: Bernhard Egger. As each day passes, multi-core processors continue to evolve. While this shift carries massive parallel-processing potential, it also poses a noteworthy challenge: keeping all cores busy while minimizing synchronization costs is difficult. Argobots is a framework based on user-level threading for OpenMP applications on multi-core processors, designed and implemented by Argonne National Laboratory. In this thesis, we present the fundamental theory and benefits of load balancing methods in the Argobots runtime system. We modify the structure of the Argobots runtime framework and introduce a new mechanism that balances the workload through spatial scheduling of user-level threads. The optimized Argobots combines static and dynamic load balancing methods. Using the NAS Parallel Benchmark suite, we measure and analyze the performance of the optimized Argobots runtime system and achieve a performance improvement of between 10% and 20% on average. The improvements are more pronounced when multiple Argobots applications run concurrently. In summary, this thesis demonstrates the benefits of spatial scheduling and load balancing in the Argobots runtime system.
    Contents:
    Abstract
    List of Figures
    List of Tables
    Chapter 1 Introduction
      1.1 Multi-core Multi-threaded Processors
      1.2 Task Scheduling
      1.3 Parallel Runtimes
      1.4 Contributions and Outline
    Chapter 2 Background and Motivation
      2.1 Background
        2.1.1 Multi-threaded Parallelism
        2.1.2 OpenMP
        2.1.3 BOLT
        2.1.4 ARGO
        2.1.5 User-Level Threads
        2.1.6 Load Balancing in Multi-core Programming
      2.2 Related Work
      2.3 Motivation
    Chapter 3 Argobots Runtime Architecture
      3.1 Argo Architecture Layout
      3.2 Argo Runtime: Argobots
        3.2.1 User-Level Threads and Tasklets
        3.2.2 Execution Streams
        3.2.3 Scheduler and Pool
        3.2.4 Work Unit Migration
    Chapter 4 Implementation
      4.1 Data Structures
      4.2 Work Load Balancing
      4.3 Choosing the Threshold
      4.4 Core Allocation
    Chapter 5 Evaluation Results and Analysis
      5.1 Benchmarks
      5.2 Evaluation of Work Load Balancing
      5.3 Evaluation of Different Thresholds
      5.4 Evaluation of Core Allocation
        5.4.1 Executing Two Parallel Applications
        5.4.2 Executing Three or More Parallel Applications
    Chapter 6 Conclusions
    Bibliography
    μš”μ•½ (Abstract in Korean)
    Master
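    The abstract above describes combining static spatial partitioning of user-level threads with dynamic load balancing across execution streams. A minimal illustrative sketch of that combination, not the thesis's actual implementation (the class names, the round-robin partition, and the steal-half policy are all assumptions for illustration):

```python
import collections

class ExecutionStream:
    """Simplified stand-in for an Argobots execution stream: a worker
    with its own pool (deque) of user-level work units."""
    def __init__(self, name):
        self.name = name
        self.pool = collections.deque()

def static_partition(work_units, streams):
    """Static load balancing: spatially partition the work units
    across the streams up front, round-robin."""
    for i, unit in enumerate(work_units):
        streams[i % len(streams)].pool.append(unit)

def steal_half(thief, streams):
    """Dynamic load balancing: an idle stream steals half of the
    work units from the currently fullest stream."""
    victim = max((s for s in streams if s is not thief),
                 key=lambda s: len(s.pool))
    n = len(victim.pool) // 2
    for _ in range(n):
        thief.pool.append(victim.pool.pop())
    return n

streams = [ExecutionStream(f"ES{i}") for i in range(4)]
static_partition(list(range(10)), streams)
streams[0].pool.clear()            # ES0 finishes its share early
moved = steal_half(streams[0], streams)
```

    The static pass gives each stream a fair initial share; the dynamic pass corrects imbalance that appears only at run time, which is the division of labor the abstract attributes to the optimized runtime.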

    Cooperative hierarchical resource management for efficient composition of parallel software

    Get PDF
    Thesis (Ph.D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2010. Cataloged from PDF version of thesis. Includes bibliographical references (p. 93-96). There cannot be a thriving software industry in the upcoming manycore era unless programmers can compose arbitrary parallel codes without sacrificing performance. We believe that the efficient composition of parallel codes is best achieved by exposing unvirtualized hardware resources and sharing these cooperatively across parallel codes within an application. This thesis presents Lithe, a user-level framework that enables efficient composition of parallel software components. Lithe provides the basic primitives, standard interface, and thin runtime to enable parallel codes to efficiently use and share processing resources. Lithe can be inserted underneath the runtimes of legacy parallel software environments to provide bolt-on composability, without changing a single line of the original application code. Lithe can also serve as the foundation for building new parallel abstractions and runtime systems that automatically interoperate with one another. We have built and ported a wide range of interoperable scheduling, synchronization, and domain-specific libraries using Lithe. We show that the modifications needed are small and impose no performance penalty when running each library standalone. We also show that Lithe improves the performance of real world applications composed of multiple parallel libraries by simply relinking them with the new library binaries. Moreover, the Lithe version of an application even outperformed a third-party expert-tuned implementation by being more adaptive to different phases of the computation. by Heidi Pan. Ph.D.
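    The Lithe abstract describes parent and child schedulers explicitly granting and returning unvirtualized processing resources ("harts") instead of letting the OS multiplex oversubscribed threads. A toy sketch of that cooperative hand-off, under the loud caveat that the method names and granting policy here are invented for illustration and are not Lithe's actual API:

```python
class Scheduler:
    """Toy cooperative scheduler: harts (hardware thread contexts)
    are granted explicitly between parent and child schedulers."""
    def __init__(self, name, harts=0):
        self.name = name
        self.harts = harts
        self.children = []

    def register(self, child):
        """A child runtime (e.g. a parallel library) registers
        under its parent scheduler."""
        self.children.append(child)

    def request(self, child, n):
        """Child asks for n harts; the parent grants what it can
        spare, keeping at least one hart for itself."""
        granted = min(n, max(self.harts - 1, 0))
        self.harts -= granted
        child.enter(granted)
        return granted

    def enter(self, n):
        """Callback: n harts begin running this scheduler's work."""
        self.harts += n

    def yield_harts(self, parent, n):
        """Cooperatively return harts to the parent when done."""
        n = min(n, self.harts)
        self.harts -= n
        parent.enter(n)

root = Scheduler("app", harts=8)
blas = Scheduler("blas-lib")
root.register(blas)
granted = root.request(blas, 4)   # library asks for 4 harts
```

    Because the grant is explicit and cooperative, two composed libraries never contend for the same core behind each other's backs, which is the source of the composition efficiency the thesis claims.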

    Adaptive architecture-transparent policy control in a distributed graph reducer

    Get PDF
    The end of the frequency scaling era occurred around 2005, when clock frequencies stalled for commodity architectures. Performance improvements that could previously be expected with each new hardware generation thus needed to originate elsewhere. Almost all computer architectures exhibit substantial and growing levels of parallelism, and exploiting it became one of the key sources of performance and scalability improvements. Alas, parallel programming proved much more difficult than sequential programming, due to the need to specify coordination and parallelism management aspects. Whilst low-level languages place this burden on programmers, reducing productivity and portability, semi-implicit approaches delegate the responsibility to sophisticated compilers and run-time systems. This thesis presents a study of adaptive load distribution based on work stealing using history and ancestry information in a distributed graph reducer for a non-strict functional language. The results contribute to the exploration of more flexible run-time-system-level parallelism control implementing a semi-explicit model of parallelism, which offers productivity and a high level of abstraction by delegating the responsibility for coordination to the run-time system. After characterising a set of parallel functional applications, we study the use of historical information to adapt the choice of the victim to steal from in a work stealing scheduler. We observe substantially lower numbers of messages for data-parallel and nested applications. However, this heuristic fails in cases where past application behaviour does not resemble future behaviour, for instance for Divide-&-Conquer applications with a large number of very fine-grained threads and generators of parallelism that move dynamically across processing elements. This mechanism is not specific to the language and the run-time system, and applies to other work stealing schedulers. 
    Next, we focus on the other key work stealing decision: which sparks (units of potential parallelism) to donate. We investigate the effect of Spark Colocation on the performance of five Divide-&-Conquer programs run on a cluster of up to 256 PEs. When using Spark Colocation, the distributed graph reducer shares related work, resulting in a higher degree of both potential and actual parallelism, and in more fine-grained and less variable thread sizes. We validate this behaviour by observing a reduction in average fetch times, but increased numbers of FETCH messages and inter-PE pointers under colocation, which nevertheless results in improved load balance for three of the five benchmark programs. The results show high speedups and speedup improvements for Spark Colocation for the three more regular and nested applications, and performance degradation for two programs: one that is excessively fine-grained and one exhibiting limited scalability. Overall, Spark Colocation appears most beneficial for higher numbers of PEs, where improved load balance and a higher degree of parallelism have more opportunities to pay off. In more general terms, we show that a run-time system can beneficially use historical information on past stealing successes, gathered dynamically and used within the same run, together with ancestry information reconstructed at run time from annotations. Moreover, the results support the view that different heuristics are beneficial for applications using different parallelism patterns, underlining the advantages of a flexible architecture-transparent approach. The Scottish Informatics and Computer Science Alliance (SICSA).
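    The abstract's first heuristic, choosing a steal victim from the history of past stealing successes, can be sketched in a few lines. This is an illustrative toy, not the thesis's graph reducer: the `PE` class, the single-entry history, and the random fallback are assumptions standing in for a full work-stealing scheduler:

```python
import random

class PE:
    """A processing element holding a count of pending sparks."""
    def __init__(self, pid, sparks=0):
        self.pid = pid
        self.sparks = sparks

def choose_victim(pes, thief, history):
    """History-based victim selection: prefer the victim of the
    thief's last successful steal; otherwise fall back to a random
    victim, as in a classic work stealing scheduler."""
    last = history.get(thief.pid)
    if last is not None and last.sparks > 0:
        return last
    return random.choice([p for p in pes if p is not thief])

def try_steal(pes, thief, history):
    victim = choose_victim(pes, thief, history)
    if victim.sparks > 0:
        victim.sparks -= 1
        thief.sparks += 1
        history[thief.pid] = victim    # remember the productive victim
        return True
    history.pop(thief.pid, None)       # forget a victim that ran dry
    return False

pes = [PE(0), PE(1, sparks=5), PE(2)]
history = {0: pes[1]}                  # PE 0 stole from PE 1 before
ok = try_steal(pes, pes[0], history)
```

    The heuristic pays off exactly when past behaviour predicts future behaviour; when generators of parallelism migrate across PEs, the remembered victim runs dry and the scheduler degrades to random stealing, matching the failure mode the abstract reports.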

    A scheduling framework for general-purpose parallel languages

    No full text