2,384 research outputs found

    APEnet+: a 3D toroidal network enabling Petaflops scale Lattice QCD simulations on commodity clusters

    Many scientific computations need multi-node parallelism to meet ever-increasing requirements in both space (memory) and time (speed). The use of GPUs as accelerators introduces yet another level of complexity for the programmer and may potentially result in large overheads due to the complex memory hierarchy. Additionally, top-notch problems may easily require more than a Petaflops of sustained computing power, employing thousands of GPUs orchestrated with some parallel programming model. Here we describe APEnet+, the new generation of our interconnect, which scales up to tens of thousands of nodes with linear cost, thus improving the price/performance ratio on large clusters. The project target is the development of the Apelink+ host adapter featuring a low-latency, high-bandwidth direct network, state-of-the-art wire speeds on the links and a PCIe x8 Gen2 host interface. It features hardware support for the RDMA programming model and experimental acceleration of GPU networking. A Linux kernel driver, a set of low-level RDMA APIs and an OpenMPI library driver are available, allowing for painless porting of standard applications. Finally, we give an insight into future work and intended developments.
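
    To put the Petaflops figure in perspective, here is a minimal back-of-envelope sketch; the per-GPU sustained throughput used below is an assumed, illustrative value, not a number from the paper.

    # Rough estimate of how many GPUs a sustained-Petaflops run needs.
    # The per-GPU sustained rate is an assumption for illustration only.
    target_flops = 1.0e15          # 1 Petaflop/s sustained
    sustained_per_gpu = 0.5e12     # assumed ~0.5 Tflop/s sustained per GPU
    gpus_needed = target_flops / sustained_per_gpu
    print(f"GPUs needed: {gpus_needed:.0f}")  # -> 2000, i.e. thousands of GPUs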

    Range-Point Migration-Based Image Expansion Method Exploiting Fully Polarimetric Data for UWB Short-Range Radar

    Ultrawideband radar with high range resolution is a promising technology for short-range 3-D imaging applications in which optical cameras are not applicable. One of the most efficient 3-D imaging methods is the range-point migration (RPM) method, which has a definite advantage over the synthetic aperture radar approach in terms of computational burden, accuracy, and spatial resolution. However, if an insufficient aperture size or angle is provided, such methods cannot reconstruct the whole target structure, owing to the absence of reflection signals from a large part of the target surface. To expand the 3-D image obtained by RPM, this paper proposes an image expansion method that combines the RPM feature with a machine learning approach based on fully polarimetric data. Following ellipsoid-based scattering analysis and learning with a neural network, this method expresses the target image as an aggregation of parts of ellipsoids, which significantly expands the original RPM image without sacrificing reconstruction accuracy. The results of a numerical simulation based on 3-D finite-difference time-domain analysis verify the effectiveness of the proposed method in terms of image-expansion criteria.
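
    As a rough illustration of the "aggregation of parts of ellipsoids" idea (my own sketch, not the authors' code; the ellipsoid parameters and the visibility cut below are arbitrary assumptions), an expanded target image can be formed by keeping only the visible patch of each estimated ellipsoid and merging the patches into one point cloud.

    import numpy as np

    def ellipsoid_patch(center, radii, n=400, z_min=0.0):
        """Sample points on an ellipsoid surface and keep only a 'visible' patch."""
        rng = np.random.default_rng(0)
        theta = rng.uniform(0.0, np.pi, n)        # polar angle
        phi = rng.uniform(0.0, 2.0 * np.pi, n)    # azimuth
        pts = np.stack([
            radii[0] * np.sin(theta) * np.cos(phi),
            radii[1] * np.sin(theta) * np.sin(phi),
            radii[2] * np.cos(theta),
        ], axis=1) + center
        return pts[pts[:, 2] > z_min]             # crude visibility cut

    # Two hypothetical ellipsoids estimated from scattering analysis; the
    # expanded image is simply the union of their visible patches.
    patches = [
        ellipsoid_patch(np.array([0.0, 0.0, 0.0]), (0.3, 0.2, 0.1)),
        ellipsoid_patch(np.array([0.4, 0.0, 0.0]), (0.2, 0.2, 0.15)),
    ]
    expanded_image = np.vstack(patches)
    print(expanded_image.shape)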

    Mapping applications with collectives over sub-communicators on torus networks

    The placement of tasks in a parallel application on specific nodes of a supercomputer can significantly impact performance. Traditionally, this task mapping has focused on reducing the distance between communicating tasks on the physical network. This minimizes the number of hops that point-to-point messages travel and thus reduces link sharing and contention between messages. However, for applications that use collectives over sub-communicators, this heuristic may not be optimal. Many collectives can benefit from an increase in bandwidth even at the cost of an increase in hop count, especially when sending large messages. For example, placing communicating tasks in a cube configuration rather than a plane or a line on a torus network increases the number of possible paths messages might take. This increases the available bandwidth, which can lead to significant performance gains. We have developed Rubik, a tool that provides a simple and intuitive interface to create a wide variety of mappings for structured communication patterns. Rubik supports a number of elementary operations, such as splits, tilts, and shifts, which can be combined into a large number of unique patterns. Each operation can be applied to disjoint groups of processes involved in collectives to increase the effective bandwidth. We demonstrate the use of Rubik for improving the performance of two parallel codes, pF3D and Qbox, which use collectives over sub-communicators.
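
    The bandwidth argument behind cube-shaped placements can be checked with a small counting sketch (my own illustration, not part of Rubik): for the same 64 tasks on a 3-D torus, a 4x4x4 cube block contains more internal links than an 8x8x1 plane, so a collective confined to the block has more wires available.

    def internal_links(a, b, c):
        """Count torus links with both endpoints inside an a x b x c block
        (wrap-around links on the block boundary are ignored for simplicity)."""
        return (a - 1) * b * c + a * (b - 1) * c + a * b * (c - 1)

    for shape in [(8, 8, 1), (4, 4, 4)]:   # plane vs cube, both 64 nodes
        print(shape, internal_links(*shape))
    # (8, 8, 1) -> 112 internal links, (4, 4, 4) -> 144 internal links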

    Task mapping on a dragonfly supercomputer

    The dragonfly network topology has recently gained traction in the design of high-performance computing (HPC) systems and has been implemented in large-scale supercomputers. The impact of task mapping, i.e., the placement of MPI ranks onto compute cores, on the communication performance of applications on dragonfly networks has not been comprehensively investigated on real large-scale systems. This paper demonstrates that task mapping affects communication overhead significantly in dragonflies, and that the magnitude of this effect is sensitive to the application, the job size, and the OpenMP settings. Among the three task mapping algorithms we study (in-order, random, and recursive coordinate bisection), selecting a suitable task mapper reduces application communication time by up to 47%.
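
    Of the three mappers, recursive coordinate bisection is the least self-explanatory; the sketch below is a simplified reading of the idea (not the paper's implementation): recursively bisect the ranks and the nodes along their longest coordinate axis and pair the resulting halves.

    def rcb_map(ranks, nodes):
        """Recursively bisect ranks and nodes along their longest axis and pair halves.
        `ranks` and `nodes` are equal-length lists of coordinate tuples."""
        if len(ranks) == 1:
            return {ranks[0]: nodes[0]}

        def split(points):
            dims = len(points[0])
            spans = [max(p[d] for p in points) - min(p[d] for p in points)
                     for d in range(dims)]
            axis = spans.index(max(spans))              # longest axis
            ordered = sorted(points, key=lambda p: p[axis])
            half = len(ordered) // 2
            return ordered[:half], ordered[half:]

        r_lo, r_hi = split(ranks)
        n_lo, n_hi = split(nodes)
        mapping = rcb_map(r_lo, n_lo)
        mapping.update(rcb_map(r_hi, n_hi))
        return mapping

    # Toy example: a 4x2 grid of MPI ranks mapped onto 8 nodes laid out in a line.
    ranks = [(x, y) for x in range(4) for y in range(2)]
    nodes = [(i,) for i in range(8)]
    print(rcb_map(ranks, nodes))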

    Visualizing network traffic to understand the performance of massively parallel simulations

    The performance of massively parallel applications is often heavily impacted by the cost of communication among compute nodes. However, determining how to best use the network is a formidable task, made challenging by the ever-increasing size and complexity of modern supercomputers. This paper applies visualization techniques to aid parallel application developers in understanding network activity by enabling a detailed exploration of the flow of packets through the hardware interconnect. In order to visualize this large and complex data, we employ two linked views of the hardware network. The first is a 2D view that represents the network structure as one of several simplified planar projections. This view is designed to allow a user to easily identify trends and patterns in the network traffic. The second is a 3D view that augments the 2D view by preserving the physical network topology and providing a context that is familiar to the application developers. Using the massively parallel multi-physics code pF3D as a case study, we demonstrate that our tool provides valuable insight that we use to explain and optimize pF3D's performance on an IBM Blue Gene/P system.
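
    A minimal sketch of the planar-projection idea (my own simplification, not the tool described here): per-node traffic counters on a 3-D torus can be collapsed along each axis, so every direction of the torus yields one 2-D heat map.

    import numpy as np

    # Hypothetical traffic counters on an 8x8x8 torus (random stand-in data).
    rng = np.random.default_rng(1)
    traffic = rng.integers(0, 1000, size=(8, 8, 8))

    # Collapse the 3-D counters along each axis to obtain the planar projections;
    # each 2-D map aggregates the traffic of one slice direction of the torus.
    projections = {
        "XY": traffic.sum(axis=2),
        "XZ": traffic.sum(axis=1),
        "YZ": traffic.sum(axis=0),
    }
    for name, plane in projections.items():
        print(name, plane.shape, int(plane.max()))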

    Space Shuffle: A Scalable, Flexible, and High-Bandwidth Data Center Network

    Data center applications require the network to be scalable and bandwidth-rich. Current data center network architectures often use rigid topologies to increase network bandwidth. A major limitation is that they can hardly support incremental network growth. Recent work proposes the use of random interconnects to provide growth flexibility. However, routing on a random topology suffers from control-plane and data-plane scalability problems, because routing decisions require global information and forwarding state cannot be aggregated. In this paper we design a novel flexible data center network architecture, Space Shuffle (S2), which applies greedy routing on multiple ring spaces to achieve high throughput, scalability, and flexibility. The proposed greedy routing protocol of S2 effectively exploits the path diversity of densely connected topologies and enables key-based routing. Extensive experimental studies show that S2 provides high bisection bandwidth and throughput, near-optimal routing path lengths, extremely small forwarding state, fairness among concurrent data flows, and resiliency to network failures.
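
    A rough sketch of greedy routing over multiple ring spaces (an illustration of the general idea only; the coordinate assignment and tie-breaking below are assumptions, not the S2 specification): every switch holds one circular coordinate per space, and each hop forwards to the neighbor that most reduces the smallest circular distance to the destination across all spaces.

    import random

    def circ_dist(a, b):
        """Circular distance between two coordinates on a unit ring."""
        d = abs(a - b)
        return min(d, 1.0 - d)

    def score(u, v):
        """Routing metric: smallest circular distance over all ring spaces."""
        return min(circ_dist(cu, cv) for cu, cv in zip(u, v))

    def greedy_route(graph, coords, src, dst, max_hops=32):
        """Forward greedily toward dst; stop if no neighbor makes progress."""
        path, cur = [src], src
        while cur != dst and len(path) <= max_hops:
            nxt = min(graph[cur], key=lambda v: score(coords[v], coords[dst]))
            if score(coords[nxt], coords[dst]) >= score(coords[cur], coords[dst]):
                break                                   # stuck in a local minimum
            path.append(nxt)
            cur = nxt
        return path

    # Tiny example: 8 switches on a ring plus chords, two ring spaces per switch.
    random.seed(2)
    n, spaces = 8, 2
    graph = {i: {(i - 1) % n, (i + 1) % n, (i + n // 2) % n} for i in range(n)}
    coords = {i: [random.random() for _ in range(spaces)] for i in range(n)}
    print(greedy_route(graph, coords, 0, 5))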