128 research outputs found

    Argobots: A Lightweight Low-Level Threading and Tasking Framework

    Get PDF
    In the past few decades, a number of user-level threading and tasking models have been proposed in the literature to address the shortcomings of OS-level threads, primarily with respect to cost and flexibility. Current state-of-the-art user-level threading and tasking models, however, either are too specific to applications or architectures or are not as powerful or flexible. In this paper, we present Argobots, a lightweight, low-level threading and tasking framework that is designed as a portable and performant substrate for high-level programming models or runtime systems. Argobots offers a carefully designed execution model that balances generality of functionality with providing a rich set of controls to allow specialization by end users or high-level programming models. We describe the design, implementation, and performance characterization of Argobots and present integrations with three high-level models: OpenMP, MPI, and colocated I/O services. Evaluations show that (1) Argobots, while providing richer capabilities, is competitive with existing simpler generic threading runtimes; (2) our OpenMP runtime offers more efficient interoperability capabilities than production OpenMP runtimes do; (3) when MPI interoperates with Argobots instead of Pthreads, it enjoys reduced synchronization costs and better latency-hiding capabilities; and (4) I/O services with Argobots reduce interference with colocated applications while achieving performance competitive with that of a Pthreads approach

    Language-Centric Performance Analysis of OpenMP Programs with Aftermath

    Get PDF
    International audienceWe present a new set of tools for the language-centric performance analysis and debugging of OpenMP programs that allows programmers to relate dynamic information from parallel execution to OpenMP constructs. Users can visualize execution traces, examine aggregate met-rics on parallel loops and tasks, such as load imbalance or synchronization overhead, and obtain detailed information on specific events, such as the partitioning of a loop's iteration space, its distribution to workers according to the scheduling policy and fine-grain synchronization. Our work is based on the Aftermath performance analysis tool and a ready-to-use, instrumented version of the LLVM/clang OpenMP run-time with negligible overhead for tracing. By analyzing the performance of the MG application of the NPB suite, we show that language-centric performance analysis in general and our tools in particular can help improve the performance of large-scale OpenMP applications significantly

    Performance analysis and tuning in multicore environments

    Get PDF
    Performance analysis is the task of monitor the behavior of a program execution. The main goal is to find out the possible adjustments that might be done in order improve the performance. To be able to get that improvement it is necessary to find the different causes of overhead. Nowadays we are already in the multicore era, but there is a gap between the level of development of the two main divisions of multicore technology (hardware and software). When we talk about multicore we are also speaking of shared memory systems, on this master thesis we talk about the issues involved on the performance analysis and tuning of applications running specifically in a shared Memory system. We move one step ahead to take the performance analysis to another level by analyzing the applications structure and patterns. We also present some tools specifically addressed to the performance analysis of OpenMP multithread application. At the end we present the results of some experiments performed with a set of OpenMP scientific application.Análisis de rendimiento es el área de estudio encargada de monitorizar el comportamiento de la ejecución de programas informáticos. El principal objetivo es encontrar los posibles ajustes que serán necesarios para mejorar el rendimiento. Para poder obtener esa mejora es necesario encontrar las principales causas de overhead. Actualmente estamos sumergidos en la era multicore, pero existe una brecha entre el nivel de desarrollo de sus dos principales divisiones (hardware y software). Cuando hablamos de multicore también estamos hablando de sistemas de memoria compartida. Nosotros damos un paso más al abordar el análisis de rendimiento a otro nivel por medio del estudio de la estructura de las aplicaciones y sus patrones. También presentamos herramientas de análisis de aplicaciones que son específicas para el análisis de rendimiento de aplicaciones paralelas desarrolladas con OpenMP. Al final presentamos los resultados de algunos experimentos realizados con un grupo de aplicaciones científicas desarrolladas bajo este modelo de programación.L'Anàlisi de rendiment és l'àrea d'estudi encarregada de monitorar el comportament de l'execució de programes informàtics. El principal objectiu és trobar els possibles ajustaments que seran necessaris per a millorar el rendiment. Per a poder obtenir aquesta millora és necessari trobar les principals causes de l'overhead (excessos de computació no productiva). Actualment estem immersos en l'era multicore, però existeix una rasa entre el nivell de desenvolupament de les seves dues principals divisions (maquinari i programari). Quan parlam de multicore, també estem parlant de sistemes de memòria compartida. Nosaltres donem un pas més per a abordar l'anàlisi de rendiment en un altre nivell per mitjà de l'estudi de l'estructura de les aplicacions i els seus patrons. També presentem eines d'anàlisis d'aplicacions que són específiques per a l'anàlisi de rendiment d'aplicacions paral·leles desenvolupades amb OpenMP. Al final, presentem els resultats d'alguns experiments realitzats amb un grup d'aplicacions científiques desenvolupades sota aquest model de programació

    On Designing Multicore-aware Simulators for Biological Systems

    Full text link
    The stochastic simulation of biological systems is an increasingly popular technique in bioinformatics. It often is an enlightening technique, which may however result in being computational expensive. We discuss the main opportunities to speed it up on multi-core platforms, which pose new challenges for parallelisation techniques. These opportunities are developed in two general families of solutions involving both the single simulation and a bulk of independent simulations (either replicas of derived from parameter sweep). Proposed solutions are tested on the parallelisation of the CWC simulator (Calculus of Wrapped Compartments) that is carried out according to proposed solutions by way of the FastFlow programming framework making possible fast development and efficient execution on multi-cores.Comment: 19 pages + cover pag

    STAPL-RTS: A Runtime System for Massive Parallelism

    Get PDF
    Modern High Performance Computing (HPC) systems are complex, with deep memory hierarchies and increasing use of computational heterogeneity via accelerators. When developing applications for these platforms, programmers are faced with two bad choices. On one hand, they can explicitly manage machine resources, writing programs using low level primitives from multiple APIs (e.g., MPI+OpenMP), creating efficient but rigid, difficult to extend, and non-portable implementations. Alternatively, users can adopt higher level programming environments, often at the cost of lost performance. Our approach is to maintain the high level nature of the application without sacrificing performance by relying on the transfer of high level, application semantic knowledge between layers of the software stack at an appropriate level of abstraction and performing optimizations on a per-layer basis. In this dissertation, we present the STAPL Runtime System (STAPL-RTS), a runtime system built for portable performance, suitable for massively parallel machines. While the STAPL-RTS abstracts and virtualizes the underlying platform for portability, it uses information from the upper layers to perform the appropriate low level optimizations that restore the performance characteristics. We outline the fundamental ideas behind the design of the STAPL-RTS, such as the always distributed communication model and its asynchronous operations. Through appropriate code examples and benchmarks, we prove that high level information allows applications written on top of the STAPL-RTS to attain the performance of optimized, but ad hoc solutions. Using the STAPL library, we demonstrate how this information guides important decisions in the STAPL-RTS, such as multi-protocol communication coordination and request aggregation using established C++ programming idioms. Recognizing that nested parallelism is of increasing interest for both expressivity and performance, we present a parallel model that combines asynchronous, one-sided operations with isolated nested parallel sections. Previous approaches to nested parallelism targeted either static applications through the use of blocking, isolated sections, or dynamic applications by using asynchronous mechanisms (i.e., recursive task spawning) which come at the expense of isolation. We combine the flexibility of dynamic task creation with the isolation guarantees of the static models by allowing the creation of asynchronous, one-sided nested parallel sections that work in tandem with the more traditional, synchronous, collective nested parallelism. This allows selective, run-time customizable use of parallelism in an application, based on the input and the algorithm

    Performance Analysis Of Pde Based Parallel Algorithms On Different Computer Architectures

    Get PDF
    Tez (Yüksek Lisans) -- İstanbul Teknik Üniversitesi, Bilişim Enstitüsü, 2009Thesis (M.Sc.) -- İstanbul Technical University, Institute of Informatics, 2009Son yıllarda dağıtık algoritmaların farklı platformlarda kullanılabilmesi platform ve uygulama bağımsız performans analizi uygulamaları ihtiyacını arttırmıştır. Farklı donanımları ve haberleşme metodlarını destekleyen uygulamalar kullanıcılara donanım ve yazılımdan bağımsız ortak bir zemin hazırladıkları için kolaylık sağlamaktadır. Kısmi fark denklemleri hesaplamalı bilim ve mühendisliğin bir çok alanında kullanılmaktadır (ısı, dalga yayılımı gibi). Bu denklemlerin sayısal çözümü yinelemeli yöntemler kullanılarak yapılmaktadır. Problemin boyutu ve hata değerine göre çözüme ulaşmak için gereken yineleme sayısı ve buna bağlı olarak süresi değişmektedir. Kısmi fark denklemelerinin tek işlemcili bilgisayarlardaki çözümü uzun sürdüğü ve yüksek boyutlarda hafızaları yetersiz kaldığı için paralelleştirilerek birden fazla bilgisayarın işlemcisi ve hafızası kullanılarak çözülmektedir. Tezimde eliptik kısmi fark denklemlerini Gauss-Seidel ve Successive Over-Relaxation (SOR) metodlarını kullanarak çözen paralel algoritmalar kullanılmıştır. Performans analizi ve eniyilemesi kabaca üç adımdan oluşmaktadır; ölçüm, sonuçların analizi, darboğazların tespit edilip yazılımda iyileştirme yapılması. Ölçüm aşamasında programın koşarken ürettiği performans bilgisi toplanır, toplanan bu veriler görselleştirme araçları ile anlaşılır hale getirilerek yorumlanır. Yorumlama aşamasında tespit edilen dar boğazlar belirlenir ve giderilme yöntemleri araştırılır. Gerekli iyileştirmeler yapılarak program yeniden analiz edilir. Bu aşamaların her birinde farklı uygulamalar kullanılabilir fakat tez çalışmamda uygulamaları tek çatı altında toplayan TAU kullanılmıştır. TAU (Tuning and Analysis Utilities) farklı donanımları ve işletim sistemlerini destekleyerek farklı paralelleştirme metodlarını analiz edebilmektedir. Açık kaynak kodlu olan TAU diğer açık kaynak kodlu uygulamalar ile uyumlu olup birçok seviyede bütünleşme sağlanmıştır. Bu tez çalışmasında, iki farklı platformda aynı uygulamanın performans analizi yapılarak platform farkının getirdiği farklılıklar incelenmektedir. Performans analizinde bir algoritmanın eniyilemesini yapmak için genel bir kural olmadığından her algoritma her platformda incelenerek gerekli değişiklikler yapılmalıdır. Bu bağlamda kullandığım PDE algoritmasının her iki sistemdeki analizi sonucu elde edilen bilgiler yorumlanmıştır.In last two decades, use of parallel algorithms on different architectures increased the need of architecture and application independent performance analysis tools. Tools that support different communication methods and hardware prepare a common ground regardless of equipments provided. Partial differential equations (PDE) are used in several applications (such as propagation of heat, wave) in computational science and engineering. These equations can be solved using iterative numerical methods. Problem size and error tolerance effects iteration count and computation time to solve equation. PDE computations take long time using single processor computers with sequential algorithms, and if data size gets bigger single processors memory may be insufficient. Thus, PDE?s are solved using parallel algorithms on multiple processors. In this thesis, elliptic partial differential equation is solved using Gauss-Seidel and Successive Over-Relaxation (SOR) methods parallel algorithms. Performance analysis and optimization basically has three steps; evaluation, analysis of gathered information, defining and optimizing bottlenecks. In evaluation, performance information is gathered while program runs, then observations are made on gathered information by using visualization tools. Bottlenecks are defined and optimization techniques are researched. Necessary improvements are made to analyze the program again. Different applications in each of these stages can be used but in this thesis TAU is used, which collects these applications under one roof. TAU (Tuning and Analysis Utilities) supports many hardware, operating systems and parallelization methods. TAU is an open source application and collaborates with other open source applications at different levels. In this thesis, differences based on performance analysis of an algorithm in different two architectures are investigated. In performance analysis and optimization there is no golden rule to speed up algorithm. Each algorithm must be analyzed on that specific architecture. In this context, the performance analysis of a PDE algorithm on two architectures has been interpreted.Yüksek LisansM.Sc

    Parallel computing 2011, ParCo 2011: book of abstracts

    Get PDF
    This book contains the abstracts of the presentations at the conference Parallel Computing 2011, 30 August - 2 September 2011, Ghent, Belgiu
    corecore