5 research outputs found

    Navigating An Evolutionary Fast Path to Exascale.

    No full text
    Abstract-The computing community is in the midst of a disruptive architectural change. The advent of manycore and heterogeneous computing nodes forces us to reconsider every aspect of the system software and application stack. To address this challenge there is a broad spectrum of approaches, which we roughly classify as either revolutionary or evolutionary. With the former, the entire code base is re-written, perhaps using a new programming language or execution model. The latter, which is the focus of this work, seeks a piecewise path of effective incremental change. The end effect of our approach will be revolutionary in that the control structure of the application will be markedly different in order to utilize single-instruction multipledata/thread (SIMD/SIMT), manycore and heterogeneous nodes, but the physics code fragments will be remarkably similar. Our approach is guided by a set of mission driven applications and their proxies, focused on balancing performance potential with the realities of existing application code bases. Although the specifics of this process have not yet converged, we find that there are several important steps that developers of scientific and engineering application programs can take to prepare for making effective use of these challenging platforms. Aiding an evolutionary approach is the recognition that the performance potential of the architectures is, in a meaningful sense, an extension of existing capabilities: vectorization, threading, and a re-visiting of node interconnect capabilities. Therefore, as architectures, programming models, and programming mechanisms continue to evolve, the preparations described herein will provide significant performance benefits on existing and emerging architectures

    Network-accelerated non-contiguous memory transfers

    No full text
    Applications often communicate data that is non-contiguous in the send- or the receive-buffer, e.g., when exchanging a column of a matrix stored in row-major order. While non-contiguous transfers are well supported in HPC (e.g., MPI derived datatypes), they can still be up to 5x slower than contiguous transfers of the same size. As we enter the era of network acceleration, we need to investigate which tasks to offload to the NIC: In this work we argue that non-contiguous memory transfers can be transparently network-accelerated, truly achieving zero-copy communications. We implement and extend sPIN, a packet streaming processor, within a Portals 4 NIC SST model, and evaluate strategies for NIC-offloaded processing of MPI datatypes, ranging from datatype-specific handlers to general solutions for any MPI datatype. We demonstrate up to 8x speedup in the unpack throughput of real applications, demonstrating that non-contiguous memory transfers are a first-class candidate for network acceleration

    Shared memory programming on the Meiko CS-2

    No full text
    A interesting feature of some recent parallel computers is the fact that the underlying transport mechanism behind the currently dominating message passing interfaces is based on a global address space model. By accessing this global adress space directly most of the inherent delays for administering message buffers and queues can be avoided. Using this interface we have implemented a user level distributed shared memory layer using the virtual memory protection mechanisms of the operating system. The synchronisation required for maintaining the coherency of the memory is addressed by implementing a distributed shared lock which exploits the remote atomic store operations provided by the Meiko CS-2. This allows an asynchronous stype of programming where the load is dynamically distributed over the nodes of a parallel partition. (orig.)SIGLEAvailable from TIB Hannover: RO 3476(180) / FIZ - Fachinformationszzentrum Karlsruhe / TIB - Technische InformationsbibliothekDEGerman
    corecore