5,851 research outputs found
OFAR-CM: Efficient Dragonfly networks with simple congestion management
Dragonfly networks are appealing topologies for large-scale Data center and HPC networks, that provide high throughput with low diameter and moderate cost. However, they are prone to congestion under certain frequent traffic patterns that saturate specific network links. Adaptive non-minimal routing can be used to avoid such congestion. That kind of routing employs longer paths to circumvent local or global congested links. However, if a distance-based deadlock avoidance mechanism is employed, more Virtual Channels (VCs) are required, what increases design complexity and cost. OFAR (On-the-Fly Adaptive Routing) is a previously proposed routing that decouples VCs from deadlock avoidance, making local and global misrouting affordable. However, the severity of congestion with OFAR is higher, as it relies on an escape sub network with low bisection bandwidth. Additionally, OFAR allows for unlimited misroutings on the escape sub network, leading to unbounded paths in the network and long latencies. In this paper we propose and evaluate OFAR-CM, a variant of OFAR combined with a simple congestion management (CM) mechanism which only relies on local information, specifically the credit count of the output ports in the local router. With simple escape sub networks such as a Hamiltonian ring or a tree, OFAR outperforms former proposals with distance-based deadlock avoidance. Additionally, although long paths are allowed in theory, in practice packets arrive at their destination in a small number of hops. Altogether, OFAR-CM constitutes the first practicable mechanism to the date that supports both local and global misrouting in Dragonfly networks.The research leading to these results has received funding from the European Research Council under the European Union’s Seventh Framework Programme (FP/2007-2013) / ERC Grant Agreement n. ERC-2012-Adg-321253-
RoMoL, the Spanish Ministry of Science under contracts TIN2010-21291-C02-02, TIN2012-34557, and by the European
HiPEAC Network of Excellence. M. García participated in this work while affiliated with the University of Cantabria.Peer ReviewedPostprint (author's final draft
Recommended from our members
Data-Mining Synthesised Schedulers for Hard Real-Time Systems
The analysis of hard real-time systems, traditionally performed using RMA/PCP or simulation, is nowadays also studied as a scheduler synthesis problem, where one automatically constructs a scheduler which can guarantee avoidance of deadlock and deadline-miss system states. Even though this approach has the potential for a finer control of a hard real-time system, using fewer resources and easily adapting to further quality aspects (memory/energy consumption, jitter minimisation, etc.), synthesised schedulers are usually extremely large and difficult to understand. Their big size is a consequence of their inherent precision, since they attempt to describe exactly the frontier among the safe and unsafe system states. It nevertheless hinders their application in practise, since it is extremely difficult to validate them or to use them for better understanding the behaviour of the system. In this paper, we show how one can adapt data-mining techniques to decrease the size of a synthesised scheduler and force its inherent structure to appear, thus giving the system designer a wealth of additional information for understanding and optimising the scheduler and the underlying system. We present, in particular, how it can be used for obtaining hints for a good task distribution to different processing units, for optimising the scheduler itself (sometimes even removing it altogether in a safe manner) and obtaining both per-task and per-system views of the schedulability of the system
APEnet+: a 3D toroidal network enabling Petaflops scale Lattice QCD simulations on commodity clusters
Many scientific computations need multi-node parallelism for matching up both
space (memory) and time (speed) ever-increasing requirements. The use of GPUs
as accelerators introduces yet another level of complexity for the programmer
and may potentially result in large overheads due to the complex memory
hierarchy. Additionally, top-notch problems may easily employ more than a
Petaflops of sustained computing power, requiring thousands of GPUs
orchestrated with some parallel programming model. Here we describe APEnet+,
the new generation of our interconnect, which scales up to tens of thousands of
nodes with linear cost, thus improving the price/performance ratio on large
clusters. The project target is the development of the Apelink+ host adapter
featuring a low latency, high bandwidth direct network, state-of-the-art wire
speeds on the links and a PCIe X8 gen2 host interface. It features hardware
support for the RDMA programming model and experimental acceleration of GPU
networking. A Linux kernel driver, a set of low-level RDMA APIs and an OpenMPI
library driver are available, allowing for painless porting of standard
applications. Finally, we give an insight of future work and intended
developments
Composable Models for Timing and Liveness Analysis in Distributed Real-Time Embedded Systems Middleware
Middleware for distributed real-time embedded (DRE) systems has grown increasingly complex, to address functional and temporal requirements of diverse applications. While current approaches to modeling middleware have eased the task of assembling, deploying and configuring middleware and the applications that use it, a lower-level set of formal models is needed to uncover subtle timing and liveness hazards introduced by interference between and within distributed computations, particularly in the face of alternative middleware concurrency strategies. In this paper, we propose timed automata as a formal model of low-level middleware building blocks from which a variety different middleware configurations can be constructed. When combined with analysis techniques such as model checking, this formal model can help developers in verifying the correctness of various middleware configurations with respect to the timing and liveness constraints of each particular application
- …