5 research outputs found

    GPU peer-to-peer techniques applied to a cluster interconnect

    Modern GPUs support special protocols to exchange data directly across the PCI Express bus. While these protocols could reduce GPU data transmission times by avoiding staging to host memory, they require specific hardware features that are not available on current-generation network adapters. In this paper we describe the architectural modifications required to implement peer-to-peer access to NVIDIA Fermi- and Kepler-class GPUs on an FPGA-based cluster interconnect. We also discuss the current software implementation, which integrates this feature by minimally extending the RDMA programming model, along with some issues that arise when employing it in a higher-level API such as MPI. Finally, we study the current limits of the technique by analyzing the performance improvements on low-level benchmarks and on two GPU-accelerated applications, showing when and how they benefit from the GPU peer-to-peer method. Comment: paper accepted to CASS 201

    G-PETo: A Framework for Direct and Transparent Data Exchange between Filter-Stream Applications on Multi-GPU Architectures

    In this work we propose and evaluate a generic abstraction for direct and transparent data exchange in filter-stream applications running on heterogeneous clusters (interconnected computers) equipped with multiple graphics accelerator cards (GPUs). This abstraction allows all low-level implementation details, both the communication between GPUs and the control related to the location of the filters, to be handled transparently for programmers. The work is consolidated in a framework that we call G-PETo, and our evaluation results show that it can provide programmers with an abstraction layer without compromising overall application performance.

    Solutions for the optimization of the software interface on an FPGA-based NIC

    This research studies solutions for optimizing the software interface of FPGA-based Network Interface Cards. The work was carried out in the APE group at INFN (Istituto Nazionale di Fisica Nucleare), which has historically been active in designing high-performance scalable networks for clusters of hybrid (CPU/GPU) nodes. The results are validated on two projects the APE group is currently working on, both of which allow fast prototyping of solutions and hardware-software co-design: APEnet (a PCIe FPGA-based 3D torus network controller) and NaNet (an FPGA-based family of NICs mainly dedicated to real-time, low-latency computing systems such as fast control systems or High Energy Physics data acquisition systems). NaNet is also used to validate a GPU-controlled device driver that improves network performance, i.e. further lowers communication latency, when used in combination with existing user-space software. This research is also producing results in the Horizon2020 FET-HPC ExaNeSt project, which aims to prototype and develop solutions for some of the crucial problems on the way towards the production of exascale-level supercomputers, and in which the APE group is actively contributing to the development of the network and interconnection infrastructure.

    Communication Architectures for Scalable GPU-centric Computing Systems

    In recent years, power consumption has become the main concern in High Performance Computing (HPC). This has led to heterogeneous computing systems in which Central Processing Units (CPUs) are supported by accelerators, such as Graphics Processing Units (GPUs). While GPUs used to be seen as slave devices to which the main processor offloads computation, today's systems tend to deploy more GPUs than CPUs. Eventually, the GPU will become a first-class processor, bearing increasing responsibilities. Promoting the GPU to a first-class processor comes with many challenges, such as progress guarantees, dynamic memory management, and scheduling. However, one of the main challenges is the GPU's inability to orchestrate communication, which is currently handled entirely by the CPU. This work addresses that issue and presents solutions that allow GPUs to source and sink network traffic independently. Many important aspects are addressed, ranging from the application level to how networking hardware is accessed. First, important large-scale exascale applications are studied to better understand their communication behavior and requirements. Several metrics are presented, including time spent in communication, message sizes, and the length of the queues required to match messages with receive requests. One finding of the analysis is that messages become smaller at scale, which makes the matching of messages and receive requests an important problem to address. The next part analyzes how the GPU can directly access the network, with various communication models being presented and benchmarked. It is shown that a flat address space of distributed GPU memories delivers higher bandwidth than put/get communication or CPU-controlled message passing, but allows less communication to be overlapped with computation. Overall, GPU-controlled communication is always superior, both in terms of time-to-solution and energy consumption.
    The final part addresses communication management on GPUs, which is required to provide high-level communication abstractions. Alongside other fundamental building blocks, a message-matching algorithm is presented that achieves performance similar to that of CPUs. However, it is also shown that the messaging protocol can be relaxed to improve performance significantly, leveraging the massive parallelism provided by the GPU's architecture.