24 research outputs found

    Randomized Routing on Fat-trees

    Get PDF
    Fat-trees are a class of routing networks for hardware-efficient parallel computation. This paper presents a randomized algorithm for routing messages on a fat-tree. The quality of the algorithm is measured in terms of the load factor of a set of messages to be routed, which is a lower bound on the time required to deliver the messages. We show that if a set of messages has load factor lambda on a fat-tree with n processors, the number of delivery cycles (routing attempts) that the algorithm requires is O(lambda+lgnlglgn) with probability 1-O(1/n). The best previous bound was O(lambdalgn) for the off-line problem in which the set of messages is known in advance. In the context of a VLSI model that equates hardware cost with physical volume, the routing algorithm can be used to demonstrate that fat-trees are universal routing networks. Specifically, we prove that any routing network can be efficiently simulated by a fat-tree of comparable hardware cost

    Randomized Routing on Fat-Trees

    Get PDF
    Fat-trees are a class of routing networks for hardware-efficient parallel computation. This paper presents a randomized algorithm for routing messages on a fat-tree. The quality of the algorithm is measured in terms of the load factor of a set of messages to be routed, which is a lower bound on the time required to deliver the messages. We show that if a set of messages has load factor lambda on a fat-tree with n processors, the number of delivery cycles (routing attempts) that the algorithm requires is O(lambda + lg n lg lg n) with probability 1-O(1/n). The best previous bound was O(lambda lg n) for the offline problem in which the set of messages is known in advance. In the context of a VLSI model that equates hardware cost with physical volume, the routing algorithm can be used to demonstrate that fat-trees are universal routing networks. Specifically, we prove that any routing network can be efficiently simulated by a fat-tree of comparable hardware cost

    Adversarial Analyses of Window Backoff Strategies for Simple Multiple-Access Channels

    Get PDF
    Backoff strategies have typically been analyzed by making statistical assumptions on the distribution of problem inputs. Although these analyses have provided valuable insights into the efficacy of various backoff strategies, they leave open the question as to which backoff algorithms perform best in the worst case or on inputs, such as bursty inputs, that are not covered by the statistical models. This paper analyzes randomized backoff strategies using worst-case assumptions on the inputs. Specifically, we analyze algorithms for simple multiple-access channels, where the only feedback from each attempt to send a packet is a single bit indicating whether the transmission succeeded or the packet collided with another packet. We analyze a class of strategies, called window strategies, where each packet partitions time into a sequence (W₁, W₂,...) of windows. Within each window, the packet makes an access attempt during a single randomly selected slot. If its transmission is unsuccessful, it waits for its slot in the next window before retrying. We use delay-sequence arguments to show that for the batch problem, in which n packets all arrive at time 0, if every window has size W = Θ(n), then with high probability, all packets successfully transmit with makespan n lg lg n ± O(n). We use this result to analyze window backoff strategies with varying window sizes. Specifically, we show that the familiar binary exponential backoff algorithm, where Wk = Θ(2k), has makespan Θ(n lg n), and that more generally, for any constant r > 1, the r-exponential backoff algorithm in which Wk = Θ(rk) has makespan Θ(n lglg rn). We also show that for any constant r > 1, the r-polynomial backoff algorithm, in which Wk = Θ(kr), has makespan Θ((n/lg n)¹⁺¹/r). All of these batch strategies are monotonic, in the sense that the window size monotonically increases over time. We exhibit a monotonic backoff algorithm that achieves makespan Θ(n lg lg n/lg lg lg n). We prove that this algorithm, whose backoff is superpolynomial and subexponential, is optimal over all monotonic backoff schemes. In addition, we exhibit a simple backoff/backon algorithm, having window sizes that vary nonmonotonically according to a "sawtooth" pattern, that achieves the optimal makespan of Θ(n). We study the online setting using an adversarial queueing model. We define a (λ,T)-stream to be an input stream of packets in which at most n = λT packets arrive during any time interval of size T. In this model, to evaluate a given backoff algorithm (which does not know λ or T), we analyze the worst-case behavior of the algorithm over the class of (λ,T)-streams. Our results for the online setting focus on exponential backoff. We show that for any arrival rate λ, there exists a sufficiently large interval size T such that the throughput goes to 0 for some (λ,T)-stream. Moreover, there exists a sufficiently large constant c such that for any interval size T, if λ ⥠c lg lg n/lg n, the system is unstable in the sense that the arrival rate exceeds the throughput in the worst case. If, on the other hand, we have λ ⤠c/lg n for a sufficiently small constant c, then the system is stable. Surprisingly, the algorithms that guarantee smaller makespans in the batch setting require lower arrival rates to achieve stability than does exponential backoff, but when they are stable, they have better response times.Singapore-MIT Alliance (SMA

    CLEX: Yet Another Supercomputer Architecture?

    Get PDF
    We propose the CLEX supercomputer topology and routing scheme. We prove that CLEX can utilize a constant fraction of the total bandwidth for point-to-point communication, at delays proportional to the sum of the number of intermediate hops and the maximum physical distance between any two nodes. Moreover, % applying an asymmetric bandwidth assignment to the links, all-to-all communication can be realized (1+o(1))(1+o(1))-optimally both with regard to bandwidth and delays. This is achieved at node degrees of nεn^{\varepsilon}, for an arbitrary small constant ε(0,1]\varepsilon\in (0,1]. In contrast, these results are impossible in any network featuring constant or polylogarithmic node degrees. Through simulation, we assess the benefits of an implementation of the proposed communication strategy. Our results indicate that, for a million processors, CLEX can increase bandwidth utilization and reduce average routing path length by at least factors 1010 respectively 55 in comparison to a torus network. Furthermore, the CLEX communication scheme features several other properties, such as deadlock-freedom, inherent fault-tolerance, and canonical partition into smaller subsystems

    Unifying mesh- and tree-based programmable interconnect

    Get PDF
    We examine the traditional, symmetric, Manhattan mesh design for field-programmable gate-array (FPGA) routing along with tree-of-meshes (ToM) and mesh-of-trees (MoT) based designs. All three networks can provide general routing for limited bisection designs (Rent's rule with p<1) and allow locality exploitation. They differ in their detailed topology and use of hierarchy. We show that all three have the same asymptotic wiring requirements. We bound this tightly by providing constructive mappings between routes in one network and routes in another. For example, we show that a (c,p) MoT design can be mapped to a (2c,p) linear population ToM and introduce a corner turn scheme which will make it possible to perform the reverse mapping from any (c,p) linear population ToM to a (2c,p) MoT augmented with a particular set of corner turn switches. One consequence of this latter mapping is a multilayer layout strategy for N-node, linear population ToM designs that requires only /spl Theta/(N) two-dimensional area for any p when given sufficient wiring layers. We further show upper and lower bounds for global mesh routes based on recursive bisection width and show these are within a constant factor of each other and within a constant factor of MoT and ToM layout area. In the process we identify the parameters and characteristics which make the networks different, making it clear there is a unified design continuum in which these networks are simply particular regions

    Is Our Model for Contention Resolution Wrong?

    Full text link
    Randomized binary exponential backoff (BEB) is a popular algorithm for coordinating access to a shared channel. With an operational history exceeding four decades, BEB is currently an important component of several wireless standards. Despite this track record, prior theoretical results indicate that under bursty traffic (1) BEB yields poor makespan and (2) superior algorithms are possible. To date, the degree to which these findings manifest in practice has not been resolved. To address this issue, we examine one of the strongest cases against BEB: nn packets that simultaneously begin contending for the wireless channel. Using Network Simulator 3, we compare against more recent algorithms that are inspired by BEB, but whose makespan guarantees are superior. Surprisingly, we discover that these newer algorithms significantly underperform. Through further investigation, we identify as the culprit a flawed but common abstraction regarding the cost of collisions. Our experimental results are complemented by analytical arguments that the number of collisions -- and not solely makespan -- is an important metric to optimize. We believe that these findings have implications for the design of contention-resolution algorithms.Comment: Accepted to the 29th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA 2017

    The Fat-Pyramid and Universal Parallel Computation Independent of Wire Delay

    Get PDF
    This paper shows that a fat-pyramid of area Θ(A) requires only O(log A) slowdown to simulate any competing network of area A under very general conditions. The result holds regardless of the processor size (amount of attached memory) and number of processors in the competing networks as long as the limitation on total area is met. Furthermore, the result is valid regardless of the relationship between wire length and wire delay. We especially focus on elimination of the common simplifying assumption that unit time suffices to traverse a wire regardless of its length, since the assumption becomes more and more untenable as the size of parallel systems increases. This paper concentrates on simulation using transmission lines (wires along which bits can be pipelined) with the message routing schedule set up off line, but it also discusses the extension to on-line simulation. This paper also examines the capabilities of a fat-pyramid when matched against a substantially larger network and points out the surprising difficulty of doing such a comparison without the unit wire delay assumption

    Packet Switched vs. Time Multiplexed FPGA Overlay Networks

    Get PDF
    Dedicated, spatially configured FPGA interconnect is efficient for applications that require high throughput connections between processing elements (PEs) but with a limited degree of PE interconnectivity (e.g. wiring up gates and datapaths). Applications which virtualize PEs may require a large number of distinct PE-to-PE connections (e.g. using one PE to simulate 100s of operators, each requiring input data from thousands of other operators), but with each connection having low throughput compared with the PE’s operating cycle time. In these highly interconnected conditions, dedicating spatial interconnect resources for all possible connections is costly and inefficient. Alternatively, we can time share physical network resources by virtualizing interconnect links, either by statically scheduling the sharing of resources prior to runtime or by dynamically negotiating resources at runtime. We explore the tradeoffs (e.g. area, route latency, route quality) between time-multiplexed and packet-switched networks overlayed on top of commodity FPGAs. We demonstrate modular and scalable networks which operate on a Xilinx XC2V6000-4 at 166MHz. For our applications, time-multiplexed, offline scheduling offers up to a 63% performance increase over online, packet-switched scheduling for equivalent topologies. When applying designs to equivalent area, packet-switching is up to 2× faster for small area designs while time-multiplexing is up to 5× faster for larger area designs. When limited to the capacity of a XC2V6000, if all communication is known, time-multiplexed routing outperforms packet-switching; however when the active set of links drops below 40% of the potential links, packet-switched routing can outperform time-multiplexing

    Singletons for Simpletons: Revisiting Windowed Backoff with Chernoff Bounds

    Get PDF
    Backoff algorithms are used in many distributed systems where multiple devices contend for a shared resource. For the classic balls-into-bins problem, the number of singletons - those bins with a single ball - is important to the analysis of several backoff algorithms; however, existing analyses employ advanced probabilistic tools to obtain concentration bounds. Here, we show that standard Chernoff bounds can be used instead, and the simplicity of this approach is illustrated by re-analyzing some well-known backoff algorithms
    corecore