225 research outputs found

    Scalable RDMA performance in PGAS languages

    Get PDF
    Partitioned global address space (PGAS) languages provide a unique programming model that can span shared-memory multiprocessor (SMP) architectures, distributed memory machines, or cluster ofSMPs. Users can program large scale machines with easy-to-use, shared memory paradigms. In order to exploit large scale machines efficiently, PGAS language implementations and their runtime system must be designed for scalability and performance. The IBM XLUPC compiler and runtime system provide a scalable design through the use of the shared variable directory (SVD). The SVD stores meta-information needed to access shared data. It is dereferenced, in the worst case, for every shared memory access, thus exposing a potential performance problem. In this paper we present a cache of remote addresses as an optimization that will reduce the SVD access overhead and allow the exploitation of native (remote) direct memory accesses. It results in a significant performance improvement while maintaining the run-time portability and scalability.Postprint (published version

    RAxML-Cell: Parallel Phylogenetic Tree Inference on the Cell Broadband Engine

    Get PDF
    Phylogenetic tree reconstruction is one of the grand challenge problems in Bioinformatics. The search for a best-scoring tree with 50 organisms, under a reasonable optimality criterion, creates a topological search space which is as large as the number of atoms in the universe. Computational phylogeny is challenging even for the most powerful supercomputers. It is also an ideal candidate for benchmarking emerging multiprocessor architectures, because it exhibits various levels of fine and coarse-grain parallelism. In this paper, we present the porting, optimization, and evaluation of RAxML on the Cell Broadband Engine. RAxML is a provably efficient, hill climbing algorithm for computing phylogenetic trees based on the Maximum Likelihood (ML) method. The algorithm uses an embarrassingly parallel search method, which also exhibits data-level parallelism and control parallelism in the computation of the likelihood functions. We present the optimization of one of the currently fastest tree search algorithms, on a real Cell blade prototype. We also investigate problems and present solutions pertaining to the optimization of floating point code, control flow, communication, scheduling, and multi-level parallelization on the Cell

    Exploring coordinated software and hardware support for hardware resource allocation

    Get PDF
    Multithreaded processors are now common in the industry as they offer high performance at a low cost. Traditionally, in such processors, the assignation of hardware resources between the multiple threads is done implicitly, by the hardware policies. However, a new class of multithreaded hardware allows the explicit allocation of resources to be controlled or biased by the software. Currently, there is little or no coordination between the allocation of resources done by the hardware and the prioritization of tasks done by the software.This thesis targets to narrow the gap between the software and the hardware, with respect to the hardware resource allocation, by proposing a new explicit resource allocation hardware mechanism and novel schedulers that use the currently available hardware resource allocation mechanisms.It approaches the problem in two different types of computing systems: on the high performance computing domain, we characterize the first processor to present a mechanism that allows the software to bias the allocation hardware resources, the IBM POWER5. In addition, we propose the use of hardware resource allocation as a way to balance high performance computing applications. Finally, we propose two new scheduling mechanisms that are able to transparently and successfully balance applications in real systems using the hardware resource allocation. On the soft real-time domain, we propose a hardware extension to the existing explicit resource allocation hardware and, in addition, two software schedulers that use the explicit allocation hardware to improve the schedulability of tasks in a soft real-time system.In this thesis, we demonstrate that system performance improves by making the software aware of the mechanisms to control the amount of resources given to each running thread. In particular, for the high performance computing domain, we show that it is possible to decrease the execution time of MPI applications biasing the hardware resource assignation between threads. In addition, we show that it is possible to decrease the number of missed deadlines when scheduling tasks in a soft real-time SMT system.Postprint (published version

    Dynamic Multigrain Parallelization on the Cell Broadband Engine

    Get PDF
    This paper addresses the problem of orchestrating and scheduling parallelism at multiple levels of granularity on heterogeneous multicore processors. We present policies and mechanisms for adaptive exploitation and scheduling of multiple layers of parallelism on the Cell Broadband Engine. Our policies combine event-driven task scheduling with malleable loop-level parallelism, which is exposed from the runtime system whenever task-level parallelism leaves cores idle. We present a runtime system for scheduling applications with layered parallelism on Cell and investigate its potential with RAxML, a computational biology application which infers large phylogenetic trees, using the Maximum Likelihood (ML) method. Our experiments show that the Cell benefits significantly from dynamic parallelization methods, that selectively exploit the layers of parallelism in the system, in response to workload characteristics. Our runtime environment outperforms naive parallelization and scheduling based on MPI and Linux by up to a factor of 2.6. We are able to execute RAxML on one Cell four times faster than on a dual-processor system with Hyperthreaded Xeon processors, and 5--10\% faster than on a single-processor system with a dual-core, quad-thread IBM Power5 processor

    Scalable system software for high performance large-scale applications

    Get PDF
    In the last decades, high-performance large-scale systems have been a fundamental tool for scientific discovery and engineering advances. The sustained growth of supercomputing performance and the concurrent reduction in cost have made this technology available for a large number of scientists and engineers working on many different problems. The design of next-generation supercomputers will include traditional HPC requirements as well as the new requirements to handle data-intensive computations. Data intensive applications will hence play an important role in a variety of fields, and are the current focus of several research trends in HPC. Due to the challenges of scalability and power efficiency, next-generation of supercomputers needs a redesign of the whole software stack. Being at the bottom of the software stack, system software is expected to change drastically to support the upcoming hardware and to meet new application requirements. This PhD thesis addresses the scalability of system software. The thesis start at the Operating System level: first studying general-purpose OS (ex. Linux) and then studying lightweight kernels (ex. CNK). Then, we focus on the runtime system: we implement a runtime system for distributed memory systems that includes many of the system services required by next-generation applications. Finally we focus on hardware features that can be exploited at user-level to improve applications performance, and potentially included into our advanced runtime system. The thesis contributions are the following: Operating System Scalability: We provide an accurate study of the scalability problems of modern Operating Systems for HPC. We design and implement a methodology whereby detailed quantitative information may be obtained for each OS noise event. We validate our approach by comparing it to other well-known standard techniques to analyze OS noise, such FTQ (Fixed Time Quantum). Evaluation of the address translation management for a lightweight kernel: we provide a performance evaluation of different TLB management approaches ¿ dynamic memory mapping, static memory mapping with replaceable TLB entries, and static memory mapping with fixed TLB entries (no TLB misses) on a IBM BlueGene/P system. Runtime System Scalability: We show that a runtime system can efficiently incorporate system services and improve scalability for a specific class of applications. We design and implement a full-featured runtime system and programming model to execute irregular appli- cations on a commodity cluster. The runtime library is called Global Memory and Threading library (GMT) and integrates a locality-aware Partitioned Global Address Space communication model with a fork/join program structure. It supports massive lightweight multi-threading, overlapping of communication and computation and small messages aggregation to tolerate network latencies. We compare GMT to other PGAS models, hand-optimized MPI code and custom architectures (Cray XMT) on a set of large scale irregular applications: breadth first search, random walk and concurrent hash map access. Our runtime system shows performance orders of magnitude higher than other solutions on commodity clusters and competitive with custom architectures. User-level Scalability Exploiting Hardware Features: We show the high complexity of low-level hardware optimizations for single applications, as a motivation to incorporate this logic into an adaptive runtime system. We evaluate the effects of controllable hardware-thread priority mechanism that controls the rate at which each hardware-thread decodes instruction on IBM POWER5 and POWER6 processors. Finally, we show how to effectively exploits cache locality and network-on-chip on the Tilera many-core architecture to improve intra-core scalability

    Parallel and Distributed Computing

    Get PDF
    The 14 chapters presented in this book cover a wide variety of representative works ranging from hardware design to application development. Particularly, the topics that are addressed are programmable and reconfigurable devices and systems, dependability of GPUs (General Purpose Units), network topologies, cache coherence protocols, resource allocation, scheduling algorithms, peertopeer networks, largescale network simulation, and parallel routines and algorithms. In this way, the articles included in this book constitute an excellent reference for engineers and researchers who have particular interests in each of these topics in parallel and distributed computing

    Exploring coordinated software and hardware support for hardware resource allocation

    Get PDF
    Multithreaded processors are now common in the industry as they offer high performance at a low cost. Traditionally, in such processors, the assignation of hardware resources between the multiple threads is done implicitly, by the hardware policies. However, a new class of multithreaded hardware allows the explicit allocation of resources to be controlled or biased by the software. Currently, there is little or no coordination between the allocation of resources done by the hardware and the prioritization of tasks done by the software.This thesis targets to narrow the gap between the software and the hardware, with respect to the hardware resource allocation, by proposing a new explicit resource allocation hardware mechanism and novel schedulers that use the currently available hardware resource allocation mechanisms.It approaches the problem in two different types of computing systems: on the high performance computing domain, we characterize the first processor to present a mechanism that allows the software to bias the allocation hardware resources, the IBM POWER5. In addition, we propose the use of hardware resource allocation as a way to balance high performance computing applications. Finally, we propose two new scheduling mechanisms that are able to transparently and successfully balance applications in real systems using the hardware resource allocation. On the soft real-time domain, we propose a hardware extension to the existing explicit resource allocation hardware and, in addition, two software schedulers that use the explicit allocation hardware to improve the schedulability of tasks in a soft real-time system.In this thesis, we demonstrate that system performance improves by making the software aware of the mechanisms to control the amount of resources given to each running thread. In particular, for the high performance computing domain, we show that it is possible to decrease the execution time of MPI applications biasing the hardware resource assignation between threads. In addition, we show that it is possible to decrease the number of missed deadlines when scheduling tasks in a soft real-time SMT system

    Optimizing NANOS OpenMP for the IBM Cyclops multithreaded architecture

    Get PDF
    In this paper, we present two approaches to improve the execution of OpenMP applications on the IBM Cyclops multithreaded architecture. Both solutions are independent and they are focused to obtain better performance through a better management of the cache locality. The first solution is based on software modifications to the OpenMP runtime library to balance stack accesses across all data caches. The second solution is a small hardware modification to change the data cache mapping behavior, with the same goal. Both solutions help parallel applications to improve scalability and obtain better performance in this kind of architectures. In fact, they could also be applied to future multi-core processors. We have executed (using simulation) some of the NAS benchmarks to prove these proposals. They show how, with small changes in both the software and the hardware, we achieve very good scalability in parallel applications. Our results also show that standard execution environments oriented to multiprocessor architectures can be easily adapted to exploit multithreaded processors.Peer ReviewedPostprint (author's final draft

    Dynamic multigrain parallelization on the cell broadband engine

    Full text link
    • …
    corecore