
    Tackling Exascale Software Challenges in Molecular Dynamics Simulations with GROMACS

    GROMACS is a widely used package for biomolecular simulation, and over the last two decades it has evolved from small-scale efficiency to advanced heterogeneous acceleration and multi-level parallelism targeting some of the largest supercomputers in the world. Here, we describe some of the ways we have been able to realize this through parallelization on all levels, combined with a constant focus on absolute performance. Release 4.6 of GROMACS uses SIMD acceleration on a wide range of architectures, GPU offloading acceleration, and OpenMP and MPI parallelism within and between nodes, respectively. The recent work on acceleration made it necessary to revisit the fundamental algorithms of molecular simulation, including the concept of neighbor searching, and we discuss the present and future challenges we see for exascale simulation, in particular a very fine-grained task parallelism. We also discuss the software management, code peer review and continuous integration testing required for a project of this complexity. (EASC 2014 conference proceedings)
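
    The abstract names three nested levels of parallelism: MPI between nodes, OpenMP threads within a node, and SIMD inside each core. The sketch below shows how these levels can nest in generic C; it is not GROMACS code, and the force kernel is a stand-in. Compile with something like mpicc -fopenmp.

        /* Minimal sketch of three-level parallelism: MPI ranks between nodes,
         * OpenMP threads within a node, SIMD inside each thread's loop. */
        #include <mpi.h>
        #include <stdio.h>

        #define N 1024

        int main(int argc, char **argv) {
            MPI_Init(&argc, &argv);
            int rank;
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);

            float f[N];   /* this rank's slice of the force array */

            /* OpenMP threads split the slice across the cores of one node. */
            #pragma omp parallel for
            for (int i = 0; i < N; i += 16) {
                /* omp simd asks the compiler to emit SIMD instructions here. */
                #pragma omp simd
                for (int j = i; j < i + 16; j++)
                    f[j] = 0.5f * (float)(j + rank);   /* stand-in force kernel */
            }

            /* Combine a per-rank scalar (e.g. an energy term) across nodes. */
            float local = f[0], total = 0.0f;
            MPI_Reduce(&local, &total, 1, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);
            if (rank == 0) printf("total = %f\n", total);

            MPI_Finalize();
            return 0;
        }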

    Special Issue: Algorithm/Architecture Co-Exploration of Visual Computing on Emerging Platforms


    Leveraging Grammars For OpenMP Development in Supercomputing Environments

    This thesis proposes a solution to streamline the process of using supercomputing resources on Southern Methodist University's ManeFrame II supercomputer. A large segment of the research community that uses ManeFrame II belongs outside the computer science department and the Lyle School of Engineering. While these users know how to apply computation to their fields, their knowledge does not necessarily extend to the suite of tools and the operating system required to use ManeFrame II. To solve this, the thesis proposes an interface that lets those with little knowledge of Linux and SLURM use the supercomputing resources that SMU's Center for Scientific Computation provides. OpenMP is a compiler extension for C, C++ and Fortran that generates multithreaded binaries from in-code directives. With knowledge of OpenMP, researchers can already split their code into multiple threads of execution; however, because of the complexity of Linux and SLURM, using OpenMP on the supercomputer can be problematic. This thesis focuses on the use of ANTLR, a programming language recognition tool, to insert directives into code and generate batch files compatible with the supercomputer's scheduling software, SLURM. With the batch file, the user can then submit their code to the supercomputer. Additional tools around this core piece of software provide a usable interface. To make the tool accessible to those without a software background, the proposed forward-facing solution is a web interface where users upload their code and receive a batch file they can use to run it, eliminating the need for a new user to download, compile and run the ANTLR distribution. Additional tooling assists the user in finding empty nodes for code execution, testing the compilation of their code on the supercomputer, and running a timed sample of their code to confirm that OpenMP is producing a speedup in execution time.
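
    As a sketch of the generation step this pipeline ends in, the C program below writes a minimal SLURM batch file for an OpenMP binary, keeping --cpus-per-task and OMP_NUM_THREADS in step. The job name, thread count and executable path are illustrative assumptions, not ManeFrame II specifics; the thesis's tool would derive such values from its ANTLR parse of the user's code.

        /* Sketch: emit a SLURM batch script for an OpenMP job. */
        #include <stdio.h>

        int main(void) {
            const char *exe = "./a.out";   /* hypothetical compiled OpenMP binary */
            int threads = 8;               /* hypothetical thread count */

            FILE *out = fopen("job.sbatch", "w");
            if (!out) { perror("fopen"); return 1; }

            fprintf(out, "#!/bin/bash\n");
            fprintf(out, "#SBATCH --job-name=openmp-job\n");
            fprintf(out, "#SBATCH --nodes=1\n");
            fprintf(out, "#SBATCH --ntasks=1\n");
            fprintf(out, "#SBATCH --cpus-per-task=%d\n", threads);
            fprintf(out, "export OMP_NUM_THREADS=%d\n", threads);
            fprintf(out, "%s\n", exe);
            fclose(out);

            puts("wrote job.sbatch; submit with: sbatch job.sbatch");
            return 0;
        }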

    Optimize parallel numerical applications for climate modelling

    This project evaluates the possible benefits of implementing shared-memory parallelism in the most recent version of the NEMO model, which currently uses only distributed-memory parallelism with MPI. Hybrid parallelizations, which exploit both distributed and shared memory, are generally more efficient. With the release of the latest version, NEMO 4.2, and its improvements to scalability, we want to evaluate the performance of OpenMP for implementing hybrid parallelism, with the goals of improving the model's scalability and preparing it for new cluster architectures, which are tending towards more cores per node.
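
    A minimal sketch of the hybrid scheme under evaluation, in generic C rather than NEMO's Fortran: the existing MPI domain decomposition stays, and OpenMP threads are added over the loop nest each rank already owns, so one rank per node can use every core.

        /* Sketch (not NEMO code): hybrid MPI+OpenMP stencil update. The MPI
         * decomposition gives each rank a subdomain; OpenMP threads split
         * the outer loop of that subdomain across the cores of the node. */
        #include <mpi.h>
        #include <stdio.h>

        #define NI 128
        #define NJ 128

        static double t[NJ][NI], tnew[NJ][NI];

        int main(int argc, char **argv) {
            MPI_Init(&argc, &argv);
            int rank;
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);

            /* Threads share this rank's subdomain, so one MPI rank per node
             * can use all cores instead of running one rank per core. */
            #pragma omp parallel for
            for (int j = 1; j < NJ - 1; j++)
                for (int i = 1; i < NI - 1; i++)
                    tnew[j][i] = 0.25 * (t[j-1][i] + t[j+1][i]
                                       + t[j][i-1] + t[j][i+1]);

            /* A halo exchange with neighbouring ranks (MPI_Sendrecv) would
             * precede the next iteration in a real model. */
            if (rank == 0) printf("updated %d interior points\n", (NJ-2)*(NI-2));
            MPI_Finalize();
            return 0;
        }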

    A Comparison of wide area network performance using virtualized and non-virtualized client architectures

    The goal of this thesis is to determine whether there is a significant performance difference between two network computer architecture models. The study measures latency and throughput for both client-server and virtualized client architectures. In the client-server environment, the client computer performs a significant portion of the work and frequently needs to download and upload files to and from a remote location. Virtual client architecture turns the client machine into a terminal, sending only keystrokes and mouse clicks and receiving only display pixel or sound changes. I accomplished the goal of comparing these architectures by comparing completion times for a ping reply, a file download, a small set of common work tasks, and a moderately large SQL database query. I compared these tasks using simulated wide area network, local area network, and virtual client network architectures. The study limits the architecture to one where the virtual client and server are in the same data center.
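
    The method rests on timing task completion under each architecture. Below is a minimal sketch of such a harness in C, an assumption about the instrumentation rather than the thesis's actual tooling; the task() body is a placeholder for the measured operation (ping reply, file download, or the SQL query).

        /* Sketch: time a task with a monotonic clock, repeat, report the mean.
         * Run the same harness under each network setup and compare means. */
        #include <stdio.h>
        #include <time.h>

        static void task(void) {
            /* placeholder for the measured operation */
            for (volatile int i = 0; i < 1000000; i++);
        }

        int main(void) {
            enum { RUNS = 10 };
            double total_ms = 0.0;

            for (int r = 0; r < RUNS; r++) {
                struct timespec t0, t1;
                clock_gettime(CLOCK_MONOTONIC, &t0);
                task();
                clock_gettime(CLOCK_MONOTONIC, &t1);
                total_ms += (t1.tv_sec - t0.tv_sec) * 1e3 +
                            (t1.tv_nsec - t0.tv_nsec) / 1e6;
            }
            printf("mean completion time: %.3f ms over %d runs\n",
                   total_ms / RUNS, RUNS);
            return 0;
        }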

    MURAC: A unified machine model for heterogeneous computers

    Heterogeneous computing enables the performance and energy advantages of multiple distinct processing architectures to be efficiently exploited within a single machine. These systems are capable of delivering large performance increases by matching applications to the architectures best suited to them. The Multiple Runtime-reconfigurable Architecture Computer (MURAC) model has been proposed to tackle the problems commonly found in the design and usage of these machines. This model presents a system-level approach that creates a clear separation of concerns between the system implementer and the application developer. The three key concepts that make up the MURAC model are a unified machine model, a unified instruction stream and a unified memory space. A simple programming model built upon these abstractions provides the user application with a consistent interface for interacting with the underlying machine. This programming model simplifies application partitioning between hardware and software and allows the easy integration of different execution models within the single control flow of a mixed-architecture application. The theoretical and practical trade-offs of the proposed model have been explored through the design of several systems. An instruction-accurate system simulator has been developed that supports the simulated execution of mixed-architecture applications. An embedded System-on-Chip implementation has been used to measure the overhead in hardware resources required to support the model, which was found to be minimal. An implementation of the model within an operating system on a tightly-coupled reconfigurable processor platform has been created. This implementation is used to extend the software scheduler to allow for the full support of mixed-architecture applications in a multitasking environment. Different scheduling strategies have been tested using this scheduler for mixed-architecture applications. The design and implementation of these systems has shown that a unified abstraction model for heterogeneous computers provides important usability benefits to system and application designers. These benefits are achieved through a consistent view of the multiple different architectures to the operating system and user applications. This allows them to focus on achieving their performance and efficiency goals by gaining the benefits of different execution models during runtime without the complex implementation details of system-level synchronisation and coordination.
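
    As a purely hypothetical illustration of the "single control flow" idea, the C sketch below dispatches one call site to either a software routine or a stand-in for a reconfigurable-hardware kernel. The murac_* name and the whole interface are invented for this sketch; they are not the MURAC API.

        /* Hypothetical sketch: one call site, two execution targets. */
        #include <stdio.h>
        #include <stdbool.h>

        typedef void (*kernel_fn)(const int *in, int *out, int n);

        static void kernel_sw(const int *in, int *out, int n) {
            for (int i = 0; i < n; i++) out[i] = in[i] * 2;   /* CPU fallback */
        }

        static void kernel_hw(const int *in, int *out, int n) {
            /* stand-in for offload to a reconfigurable fabric over the
             * unified memory space; same result, different execution model */
            kernel_sw(in, out, n);
        }

        static kernel_fn murac_select(bool hw_available) {
            return hw_available ? kernel_hw : kernel_sw;
        }

        int main(void) {
            int in[4] = {1, 2, 3, 4}, out[4];
            murac_select(/*hw_available=*/false)(in, out, 4);
            printf("%d %d %d %d\n", out[0], out[1], out[2], out[3]);
            return 0;
        }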

    A configurable vector processor for accelerating speech coding algorithms

    The growing demand for voice-over-packet (VoIP) services and multimedia-rich applications has made the efficient, real-time implementation of low-bit-rate speech coders on embedded VLSI platforms increasingly important. Such speech coders are designed to substantially reduce bandwidth requirements, enabling dense multichannel gateways in a small form factor. This, however, comes at a high computational cost, which mandates the use of very high performance embedded processors. This thesis investigates the potential acceleration of two major ITU-T speech coding algorithms, namely G.729A and G.723.1, through their efficient implementation on a configurable, extensible vector embedded CPU architecture. New scalar and vector ISAs were introduced, resulting in up to an 80% reduction in the dynamic instruction count of both workloads. These instructions were subsequently encapsulated into a parametric, hybrid SISD (scalar processor)-SIMD (vector) processor. This work presents the research and implementation of the vector datapath of this vector coprocessor, which is tightly coupled to a Sparc-V8 compliant CPU, the optimization and simulation methodologies employed, and the use of Electronic System Level (ESL) techniques to rapidly design SIMD datapaths.
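
    A sketch of the kind of inner loop such coders spend most of their cycles in: a fixed-point correlation (multiply-accumulate) over a speech frame, the pattern a vector MAC instruction in a speech-oriented ISA covers with far fewer dynamic instructions. This is generic C, not the thesis's ISA or the ITU-T reference code.

        /* Sketch: fixed-point autocorrelation, as in a pitch search. */
        #include <stdint.h>
        #include <stdio.h>

        static int64_t correlate(const int16_t *x, int n, int lag) {
            int64_t acc = 0;
            for (int i = lag; i < n; i++)
                acc += (int32_t)x[i] * x[i - lag];   /* 16x16 -> 32-bit MAC */
            return acc;
        }

        int main(void) {
            int16_t frame[240];   /* one 30 ms frame at 8 kHz */
            for (int i = 0; i < 240; i++) frame[i] = (int16_t)(i % 64 - 32);
            printf("R(40) = %lld\n", (long long)correlate(frame, 240, 40));
            return 0;
        }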

    Optimizing SIMD execution in HW/SW co-designed processors

    SIMD accelerators are ubiquitous in microprocessors across computing domains. Their high compute power and hardware simplicity improve overall performance in an energy-efficient manner. Moreover, their replicated functional units and simple control mechanism make them amenable to scaling to higher vector lengths. However, code generation for these accelerators has been a challenge since their inception. Compilers generate vector code conservatively to ensure correctness; as a result they lose significant vectorization opportunities and fail to extract the maximum benefit from SIMD accelerators. This thesis proposes to vectorize the program binary at runtime in a speculative manner, in addition to compile-time static vectorization. Several environments provide the runtime profiling and optimization support required for dynamic vectorization, the most prominent being 1) dynamic binary translators and optimizers (DBTOs) and 2) hardware/software (HW/SW) co-designed processors. The HW/SW co-designed environment provides several advantages over DBTOs, such as transparent incorporation of new hardware features and binary compatibility. Therefore, we use a HW/SW co-designed environment to assess the potential of speculative dynamic vectorization. Furthermore, we analyze vector code generation for wider vector units and find that, even though SIMD accelerators are amenable to scaling from the hardware point of view, vector code generation at higher vector lengths is even more challenging. The two major factors impeding vectorization for wider SIMD units are 1) reduced dynamic instruction stream coverage for vectorization and 2) a large number of permutation instructions. To solve the first problem we propose Variable Length Vectorization, which iteratively vectorizes for multiple vector lengths to improve dynamic instruction stream coverage. Secondly, to reduce the number of permutation instructions we propose Selective Writing, which selectively writes to different parts of a vector register and avoids permutations. Finally, we tackle the problem of leakage energy in SIMD accelerators. Since SIMD accelerators consume a significant amount of real estate on the chip, they become the principal source of leakage if not utilized judiciously. Power gating is one of the most widely used techniques to reduce the leakage energy of functional units, but it carries its own energy and performance overhead. We propose to selectively devectorize the vector code when the higher SIMD lanes are used intermittently. This keeps the higher SIMD lanes idle and power gated for the maximum duration, resulting in an overall reduction in leakage energy.
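
    A minimal sketch of the Variable Length Vectorization idea as described: cover the iteration space with the widest vector first, then progressively narrower widths, instead of falling straight back to scalar code. Plain C loops stand in for the co-designed processor's internal vector ISA.

        /* Sketch: cover an array operation at widths 8, then 4, then scalar. */
        #include <stdio.h>

        static void add_arrays(const float *a, const float *b, float *c, int n) {
            int i = 0;
            /* widest "vector" first: blocks of 8 */
            for (; i + 8 <= n; i += 8)
                for (int j = 0; j < 8; j++) c[i + j] = a[i + j] + b[i + j];
            /* narrower width next: blocks of 4 pick up medium remainders */
            for (; i + 4 <= n; i += 4)
                for (int j = 0; j < 4; j++) c[i + j] = a[i + j] + b[i + j];
            /* scalar tail covers what no vector width could */
            for (; i < n; i++) c[i] = a[i] + b[i];
        }

        int main(void) {
            float a[13], b[13], c[13];
            for (int i = 0; i < 13; i++) { a[i] = i; b[i] = 2.0f * i; }
            add_arrays(a, b, c, 13);
            printf("c[12] = %.1f\n", c[12]);
            return 0;
        }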