
    OFAR-CM: Efficient Dragonfly networks with simple congestion management

    Dragonfly networks are appealing topologies for large-scale data center and HPC networks that provide high throughput with low diameter and moderate cost. However, they are prone to congestion under certain frequent traffic patterns that saturate specific network links. Adaptive non-minimal routing can be used to avoid such congestion: it employs longer paths to circumvent locally or globally congested links. However, if a distance-based deadlock avoidance mechanism is employed, more Virtual Channels (VCs) are required, which increases design complexity and cost. OFAR (On-the-Fly Adaptive Routing) is a previously proposed routing that decouples VCs from deadlock avoidance, making local and global misrouting affordable. However, congestion is more severe with OFAR, as it relies on an escape subnetwork with low bisection bandwidth. Additionally, OFAR allows unlimited misroutings on the escape subnetwork, leading to unbounded paths in the network and long latencies. In this paper we propose and evaluate OFAR-CM, a variant of OFAR combined with a simple congestion management (CM) mechanism that relies only on local information, specifically the credit count of the output ports in the local router. With simple escape subnetworks such as a Hamiltonian ring or a tree, OFAR-CM outperforms former proposals with distance-based deadlock avoidance. Additionally, although long paths are allowed in theory, in practice packets arrive at their destination in a small number of hops. Altogether, OFAR-CM constitutes the first practicable mechanism to date that supports both local and global misrouting in Dragonfly networks. The research leading to these results has received funding from the European Research Council under the European Union’s Seventh Framework Programme (FP/2007-2013) / ERC Grant Agreement n. ERC-2012-Adg-321253-RoMoL, the Spanish Ministry of Science under contracts TIN2010-21291-C02-02 and TIN2012-34557, and by the European HiPEAC Network of Excellence. M. García participated in this work while affiliated with the University of Cantabria.
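
    The congestion management idea above relies only on the credit count of the local router's output ports. A minimal sketch of such a check (in Python), assuming a hypothetical port model and an illustrative misrouting threshold; the paper's exact policy may differ:

        # Hypothetical sketch: decide whether to take a non-minimal (misrouted) hop
        # using only local credit counts. Threshold and port model are assumptions.
        MISROUTE_THRESHOLD = 0.5  # assumed fraction of credits that must remain free

        class OutputPort:
            def __init__(self, max_credits):
                self.max_credits = max_credits
                self.credits = max_credits  # free buffer slots at the downstream router

        def congested(port):
            """A port counts as congested when few credits remain."""
            return port.credits < MISROUTE_THRESHOLD * port.max_credits

        def choose_port(minimal_port, nonminimal_port):
            """Prefer the minimal path; misroute only if it is congested and the
            non-minimal alternative still has credits available."""
            if not congested(minimal_port):
                return minimal_port
            if not congested(nonminimal_port):
                return nonminimal_port
            return None  # both congested: fall back to the escape subnetwork (not modeled)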

    APEnet+: high bandwidth 3D torus direct network for petaflops scale commodity clusters

    We describe herein the APElink+ board, a PCIe interconnect adapter featuring the latest advances in wire speed and interface technology, plus hardware support for an RDMA programming model and experimental acceleration of GPU networking. This design allows us to build a low latency, high bandwidth PC cluster, the APEnet+ network, the new generation of our cost-effective, tens-of-thousands-scalable cluster network architecture. Some test results and a characterization of data transmission on a complete testbench, based on a commercial development card mounting an Altera FPGA, are provided. Comment: 6 pages, 7 figures, proceedings of CHEP 2010, Taiwan, October 18-2

    APEnet+: a 3D toroidal network enabling Petaflops scale Lattice QCD simulations on commodity clusters

    Many scientific computations need multi-node parallelism to match ever-increasing requirements in both space (memory) and time (speed). The use of GPUs as accelerators introduces yet another level of complexity for the programmer and may potentially result in large overheads due to the complex memory hierarchy. Additionally, top-notch problems may easily employ more than a Petaflops of sustained computing power, requiring thousands of GPUs orchestrated with some parallel programming model. Here we describe APEnet+, the new generation of our interconnect, which scales up to tens of thousands of nodes with linear cost, thus improving the price/performance ratio on large clusters. The project target is the development of the APElink+ host adapter featuring a low latency, high bandwidth direct network, state-of-the-art wire speeds on the links and a PCIe x8 Gen2 host interface. It features hardware support for the RDMA programming model and experimental acceleration of GPU networking. A Linux kernel driver, a set of low-level RDMA APIs and an OpenMPI library driver are available, allowing for painless porting of standard applications. Finally, we give an insight into future work and intended developments.
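
    The RDMA programming model mentioned above lets a sender write directly into a buffer the receiver registered in advance, without the receiver's CPU handling each transfer. The sketch below (in Python) is a generic conceptual model with entirely made-up names; it is not the APEnet+ API or software stack:

        # Generic illustration of one-sided RDMA "put" semantics; all names are
        # hypothetical and unrelated to the actual APEnet+ driver or RDMA APIs.
        class RegisteredRegion:
            """A buffer its owner has pinned and exposed for remote access."""
            def __init__(self, size):
                self.memory = bytearray(size)

        class Nic:
            def __init__(self):
                self.regions = {}  # handle -> RegisteredRegion

            def register(self, handle, size):
                self.regions[handle] = RegisteredRegion(size)

            def put(self, remote_nic, handle, offset, payload):
                """One-sided write: data lands in the remote registered buffer
                without the remote CPU taking part in this transfer."""
                region = remote_nic.regions[handle]
                region.memory[offset:offset + len(payload)] = payload

        # The receiver registers a buffer once; the sender then writes into it.
        receiver, sender = Nic(), Nic()
        receiver.register("rx_buf", 64)
        sender.put(receiver, "rx_buf", 0, b"hello")
        print(bytes(receiver.regions["rx_buf"].memory[:5]))  # b'hello'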

    A Concurrency-Optimal Binary Search Tree

    The paper presents the first \emph{concurrency-optimal} implementation of a binary search tree (BST). The implementation, based on a standard sequential implementation of an internal tree, ensures that every \emph{schedule}, i.e., interleaving of steps of the sequential code, is accepted unless linearizability is violated. To ensure this property, we use a novel read-write locking scheme that protects tree \emph{edges} in addition to nodes. Our implementation outperforms the state-of-the-art BSTs on most basic workloads, which suggests that optimizing the set of accepted schedules of the sequential code can be an adequate design principle for efficient concurrent data structures.
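
    The key idea above is to lock tree edges as well as nodes. A minimal sketch of that idea (in Python) for insertion into an internal BST, assuming plain mutexes instead of the paper's read-write locks and omitting deletion and rebalancing:

        import threading

        class Node:
            def __init__(self, key):
                self.key = key
                self.left = None
                self.right = None
                self.left_lock = threading.Lock()   # protects the edge to the left child
                self.right_lock = threading.Lock()  # protects the edge to the right child

        def insert(root, key):
            """Descend without locks, then lock only the edge being modified and
            re-validate it, so concurrent inserts on other edges never conflict."""
            parent = root
            while True:
                if key == parent.key:
                    return False  # key already present
                go_left = key < parent.key
                child = parent.left if go_left else parent.right
                if child is not None:
                    parent = child
                    continue
                edge_lock = parent.left_lock if go_left else parent.right_lock
                with edge_lock:
                    # Re-check under the lock: another thread may have linked a child.
                    child = parent.left if go_left else parent.right
                    if child is None:
                        if go_left:
                            parent.left = Node(key)
                        else:
                            parent.right = Node(key)
                        return True
                # An edge appeared concurrently; keep descending from it.
                parent = child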

    Application-Aware Deadlock-Free Oblivious Routing

    Conventional oblivious routing algorithms are either not application-aware or assume that each flow has its own private channel to ensure deadlock avoidance. We present a framework for application-aware routing that assures deadlock-freedom under one or more channels by forcing routes to conform to an acyclic channel dependence graph. Arbitrary minimal routes can be made deadlock-free through appropriate static channel allocation when two or more channels are available. Given bandwidth estimates for flows, we present a mixed-integer linear programming (MILP) approach and a heuristic approach for producing deadlock-free routes that minimize maximum channel load. The heuristic algorithm is calibrated using the MILP algorithm and evaluated on a number of benchmarks through detailed network simulation. Our framework can be used to produce application-aware routes that target the minimization of latency, number of flows through a link, bandwidth, or any combination thereof.
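
    The deadlock-freedom argument above hinges on keeping the channel dependence graph acyclic. A small sketch of that check (in Python); the route and channel representation are illustrative assumptions, not the paper's formulation:

        # A "channel" here is a (link, virtual-channel) pair; each route is the
        # ordered list of channels it uses. Consecutive channels on a route add a
        # dependence edge, and a cycle in that graph signals possible deadlock.
        def dependence_graph(routes):
            edges = {}
            for route in routes:
                for a, b in zip(route, route[1:]):
                    edges.setdefault(a, set()).add(b)
            return edges

        def is_acyclic(edges):
            """Depth-first search cycle detection on the channel dependence graph."""
            WHITE, GRAY, BLACK = 0, 1, 2
            color = {}
            def visit(u):
                color[u] = GRAY
                for v in edges.get(u, ()):
                    c = color.get(v, WHITE)
                    if c == GRAY:  # back edge -> cycle -> deadlock is possible
                        return False
                    if c == WHITE and not visit(v):
                        return False
                color[u] = BLACK
                return True
            return all(visit(u) for u in list(edges) if color.get(u, WHITE) == WHITE)

        # Two flows sharing links in opposite directions, kept acyclic by the
        # static channel choice (VC 0 for one flow, VC 1 for the other).
        routes = [[("A->B", 0), ("B->C", 0)],
                  [("C->B", 1), ("B->A", 1)]]
        print(is_acyclic(dependence_graph(routes)))  # True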

    A method of computation for worst-case delay analysis on SpaceWire networks

    SpaceWire is a standard for on-board satellite networks chosen by the ESA as the basis for future data-handling architectures. However, network designers need tools to ensure that the network is able to deliver critical messages on time. Current research only seeks to determine probabilistic results for end-to-end delays on wormhole networks like SpaceWire, which does not provide a sufficient guarantee for critical traffic. Thus, in this paper, we propose a method to compute an upper bound on the worst-case end-to-end delay of a packet in a SpaceWire network.

    Static virtual channel allocation in oblivious routing

    Most virtual channel routers have multiple virtual channels to mitigate the effects of head-of-line blocking. When there are more flows than virtual channels at a link, packets or flows must compete for channels, either in a dynamic way at each link or by static assignment computed before transmission starts. In this paper, we present methods that statically allocate channels to flows at each link when oblivious routing is used, and ensure deadlock freedom for arbitrary minimal routes when two or more virtual channels are available. We then experimentally explore the performance trade-offs of static and dynamic virtual channel allocation for various oblivious routing methods, including DOR, ROMM, Valiant and a novel bandwidth-sensitive oblivious routing scheme (BSORM). Through judicious separation of flows, static allocation schemes often exceed the performance of dynamic allocation schemes.
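
    As a toy illustration of static allocation at a single link (in Python): given bandwidth estimates for the flows crossing the link, greedily assign each flow to the least-loaded virtual channel. The greedy rule is an assumption for illustration and ignores the cross-link deadlock-freedom constraints described above:

        def allocate_vcs(flow_bandwidths, num_vcs):
            """flow_bandwidths: {flow_id: estimated bandwidth on this link}.
            Returns {flow_id: vc_index}, spreading load across channels."""
            load = [0.0] * num_vcs
            assignment = {}
            # Place heavy flows first so they do not pile onto one channel later.
            for flow, bw in sorted(flow_bandwidths.items(), key=lambda kv: -kv[1]):
                vc = min(range(num_vcs), key=lambda i: load[i])
                assignment[flow] = vc
                load[vc] += bw
            return assignment

        print(allocate_vcs({"f1": 0.6, "f2": 0.5, "f3": 0.3, "f4": 0.2}, 2))
        # {'f1': 0, 'f2': 1, 'f3': 1, 'f4': 0} -> both channels loaded at 0.8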

    Worst-case end-to-end delays evaluation for SpaceWire networks

    SpaceWire is a standard for on-board satellite networks chosen by the ESA as the basis for multiplexing payload and control traffic on future data-handling architectures. However, network designers need tools to ensure that the network is able to deliver critical messages on time. Current research fails to address this need for SpaceWire networks. On the one hand, many papers only seek to determine probabilistic results for end-to-end delays on wormhole networks like SpaceWire, which does not provide a sufficient guarantee for critical traffic. On the other hand, a few papers give methods to determine maximum latencies on wormhole networks that, unlike SpaceWire, have dedicated real-time mechanisms built in. Thus, in this paper, we propose an appropriate method to compute an upper bound on the worst-case end-to-end delay of a packet in a SpaceWire network.
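
    Neither SpaceWire abstract spells out the computation, but the flavor of a worst-case bound on a wormhole network can be illustrated with a coarse sketch (in Python): a packet's delay is bounded by its own transmission time plus that of every packet able to block it, directly or transitively through shared links. This generic model is an assumption for illustration, not the method proposed in either paper:

        def worst_case_delay(pkt, flows):
            """flows: {packet_id: (transmission_time, set_of_links_on_its_route)}.
            Returns a crude upper bound: the packet's own time plus the time of
            every packet reachable through chains of shared links."""
            t_own, _ = flows[pkt]
            blockers, frontier = set(), {pkt}
            while frontier:
                nxt = set()
                for p in frontier:
                    _, p_links = flows[p]
                    for q, (_, q_links) in flows.items():
                        if q != pkt and q not in blockers and p_links & q_links:
                            blockers.add(q)
                            nxt.add(q)
                frontier = nxt
            return t_own + sum(flows[q][0] for q in blockers)

        flows = {"p1": (2.0, {"A-B", "B-C"}),   # p1 shares link B-C with p2
                 "p2": (3.0, {"B-C", "C-D"}),   # p2 shares link C-D with p3
                 "p3": (1.0, {"C-D"})}
        print(worst_case_delay("p1", flows))    # 6.0: p1 may wait for p2, and p2 for p3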