341 research outputs found

    Packet Dispatching Schemes for Three-Stage Buffered Clos-Network Switches

    Get PDF
    Non

    Load balancing and scalable clos-network packet switches

    Get PDF
    In this dissertation three load-balancing Clos-network packet switches that attain 100% throughput and forward cells in sequence are introduced. The configuration schemes and the in-sequence forwarding mechanisms devised for these switches are also introduced. Also proposed is the use of matrix analysis as a tool for throughput analysis. In Chapter 2, a configuration scheme for a load-balancing Clos-network packet switch that has split central modules and buffers in between the split modules is introduced. This switch is called split-central-buffered Load-Balancing Clos-network (LBC) switch and it is cell based. The switch has four stages, namely input, central-input, central-output, and output stages. The proposed configuration scheme uses a pre-determined and periodic interconnection pattern in the input and split central modules to load-balance and route traffic. The LBC switch has low configuration complexity. The operation of the switch includes a mechanism applied at input and split-central modules to forward cells in sequence. The switch achieves 100% throughput under uniform and nonuniform admissible traffic with independent and identical distributions (i.i.d.). The high switching performance and low complexity of the switch are achieved while performing in-sequence forwarding and without resorting to memory speedup or central-stage expansion. This discussion includes both throughput analysis, where the operations that the configuration mechanism performs on the traffic traversing the switch are described, and a proof of in-sequence forwarding. Simulation analysis is presented as a practical demonstration of the switch performance on uniform and nonuniform i.i.d. traffic.In Chapter 3, a three-stage load balancing packet switch and its configuration scheme are introduced. The input- and central-stage switches are bufferless crossbars and the output-stage switches are buffered crossbars. This switch is called ThRee-stage Clos-network swItch and has queues at the middle stage and DEtermiNisTic scheduling (TRIDENT) and it is cell based. The proposed configuration scheme uses a pre-determined and periodic interconnection pattern in the input and central modules to load-balance and route traffic; therefore, it has low configuration complexity. The operation of the switch includes a mechanism applied at input and output modules to forward cells in sequence. In Chapter 4, a highly scalable load balancing three-stage Clos-network switch with Virtual Input-module output queues at ceNtral stagE (VINE) and crosspoint-buffers at output modules and its configuration scheme are introduced. VINE uses space switching in the first stage and buffered crossbars in the second and third stages. The proposed configuration scheme uses pre-determined and periodic interconnection patterns in the input modules for load balancing. The mechanism applied at the inputs, used to forward cells in sequence, is also introduced. VINE achieves 100% throughput under uniform and nonuniform admissible i.i.d. traffic. VINE achieves high switching performance, low configuration complexity, and in-sequence forwarding without resorting to memory speedup. In Chapter 5, matrix analysis is introduced as a tool for modeling, describing the internal operations, and analyzing the throughput of a packet switch

    Non-minimal adaptive routing for efficient interconnection networks

    Get PDF
    RESUMEN: La red de interconexión es un concepto clave de los sistemas de computación paralelos. El primer aspecto que define una red de interconexión es su topología. Habitualmente, las redes escalables y eficientes en términos de coste y consumo energético tienen bajo diámetro y se basan en topologías que encaran el límite de Moore y en las que no hay diversidad de caminos mínimos. Una vez definida la topología, quedando implícitamente definidos los límites de rendimiento de la red, es necesario diseñar un algoritmo de enrutamiento que se acerque lo máximo posible a esos límites y debido a la ausencia de caminos mínimos, este además debe explotar los caminos no mínimos cuando el tráfico es adverso. Estos algoritmos de enrutamiento habitualmente seleccionan entre rutas mínimas y no mínimas en base a las condiciones de la red. Las rutas no mínimas habitualmente se basan en el algoritmo de balanceo de carga propuesto por Valiant, esto implica que doblan la longitud de las rutas mínimas y por lo tanto, la latencia soportada por los paquetes se incrementa. En cuanto a la tecnología, desde su introducción en entornos HPC a principios de los años 2000, Ethernet ha sido usado en un porcentaje representativo de los sistemas. Esta tesis introduce una implementación realista y competitiva de una red escalable y sin pérdidas basada en dispositivos de red Ethernet commodity, considerando topologías de bajo diámetro y bajo consumo energético y logrando un ahorro energético de hasta un 54%. Además, propone un enrutamiento sobre la citada arquitectura, en adelante QCN-Switch, el cual selecciona entre rutas mínimas y no mínimas basado en notificaciones de congestión explícitas. Una vez implementada la decisión de enrutar siguiendo rutas no mínimas, se introduce un enrutamiento adaptativo en fuente capaz de adaptar el número de saltos en las rutas no mínimas. Este enrutamiento, en adelante ACOR, es agnóstico de la topología y mejora la latencia en hasta un 28%. Finalmente, se introduce un enrutamiento dependiente de la topología, en adelante LIAN, que optimiza el número de saltos de las rutas no mínimas basado en las condiciones de la red. Los resultados de su evaluación muestran que obtiene una latencia cuasi óptima y mejora el rendimiento de algoritmos de enrutamiento actuales reduciendo la latencia en hasta un 30% y obteniendo un rendimiento estable y equitativo.ABSTRACT: Interconnection network is a key concept of any parallel computing system. The first aspect to define an interconnection network is its topology. Typically, power and cost-efficient scalable networks with low diameter rely on topologies that approach the Moore bound in which there is no minimal path diversity. Once the topology is defined, the performance bounds of the network are determined consequently, so a suitable routing algorithm should be designed to accomplish as much as possible of those limits and, due to the lack of minimal path diversity, it must exploit non-minimal paths when the traffic pattern is adversarial. These routing algorithms usually select between minimal and non-minimal paths based on the network conditions, where the non-minimal paths are built according to Valiant load-balancing algorithm. This implies that these paths double the length of minimal ones and then the latency supported by packets increases. Regarding the technology, from its introduction in HPC systems in the early 2000s, Ethernet has been used in a significant fraction of the systems. This dissertation introduces a realistic and competitive implementation of a scalable lossless Ethernet network for HPC environments considering low-diameter and low-power topologies. This allows for up to 54% power savings. Furthermore, it proposes a routing upon the cited architecture, hereon QCN-Switch, which selects between minimal and non-minimal paths per packet based on explicit congestion notifications instead of credits. Once the miss-routing decision is implemented, it introduces two mechanisms regarding the selection of the intermediate switch to develop a source adaptive routing algorithm capable of adapting the number of hops in the non-minimal paths. This routing, hereon ACOR, is topology-agnostic and improves average latency in all cases up to 28%. Finally, a topology-dependent routing, hereon LIAN, is introduced to optimize the number of hops in the non-minimal paths based on the network live conditions. Evaluations show that LIAN obtains almost-optimal latency and outperforms state-of-the-art adaptive routing algorithms, reducing latency by up to 30.0% and providing stable throughput and fairness.This work has been supported by the Spanish Ministry of Education, Culture and Sports under grant FPU14/02253, the Spanish Ministry of Economy, Industry and Competitiveness under contracts TIN2010-21291-C02-02, TIN2013-46957-C2-2-P, and TIN2013-46957-C2-2-P (AEI/FEDER, UE), the Spanish Research Agency under contract PID2019-105660RBC22/AEI/10.13039/501100011033, the European Union under agreements FP7-ICT-2011- 7-288777 (Mont-Blanc 1) and FP7-ICT-2013-10-610402 (Mont-Blanc 2), the University of Cantabria under project PAR.30.P072.64004, and by the European HiPEAC Network of Excellence through an internship grant supported by the European Union’s Horizon 2020 research and innovation program under grant agreement No. H2020-ICT-2015-687689

    Algorithmic aspects of high speed switching

    Get PDF
    Thesis (Ph.D.)--Massachusetts Institute of Technology, Dept. of Civil and Environmental Engineering, 2002.Includes bibliographical references (p. 145-148).A major drawback of the traditional output queuing technique is that it requires a switch speedup of N, where N is the size of the switch. This dependence on N makes the switch non-scalable at high speeds. Input queuing has been suggested instead. The introduction of input queuing creates the necessity for developing switching algorithms to decide which packets to keep waiting at the input, and which packets to forward across the switch. In this thesis, we address various algorithmic aspects of switching. We prove in this thesis, that many of the practical switching algorithms still require a speedup to achieve even a weak notion of throughput. We propose two switching algorithms that belong to a family to which we refer in this thesis as priority switching. These two algorithms overcome some of the disadvantages in existing priority switching algorithms, such as the excessive amount of state information that needs to be maintained. We also develop a practical algorithm that belongs to a family to which we refer in this thesis as iterative switching. This algorithm achieves high throughput in practice and offers the advantage of not requiring more than one iteration, unlike other existing iterative switching algorithms which require multiple iterations to achieve high throughput. Finally, we address the issue of using switches in parallel to accommodate for the need of speedup. We study two settings of parallel switches, one with standard packet switching, and one with flow scheduling, in which flows cannot be split across multiple switches.by Saadeddine Mneimneh.Ph.D

    Bandwith allocation and scheduling in photonic networks

    Get PDF
    This thesis describes a framework for bandwidth allocation and scheduling in the Agile All-Photonic Network (AAPN). This framework is also applicable to any single-hop communication network with significant signalling delay (such as satellite-TDMA systems). Slot-by-slot scheduling approaches do not provide adequate performance for wide-area networks, so we focus on frame-based scheduling. We propose three novel fixed-length frame scheduling algorithms (Minimum Cost Search, Fair Matching and Minimum Rejection) and a feedback control system for stabilization.MCS is a greedy algorithm, which allocates time-slots sequentially using a cost function. This function is defined such that the time-slots with higher blocking probability are assigned first. MCS does not guarantee 100% throughput, thought it has a low blocking percentage. Our optimum scheduling approach is based on modifying the demand matrix such that the network resources are fully utilized, while the requests are optimally served. The Fair Matching Algorithm (FMA) uses the weighted max-min fairness criterion to achieve a fair share of resources amongst the connections in the network. When rejection is inevitable, FMA selects rejections such that the maximum percentage rejection experienced in the network is minimized. In another approach we formulate the rejection task as an optimization problem and propose the Minimum Rejection Algorithm (MRA), which minimizes total rejection. The minimum rejection problem is a special case of maximum flow problem. Due to the complexity of the algorithms that solve the max-flow problem we propose a heuristic algorithm with lower complexity.Scheduling in wide-area networks must be based on predictions of traffic demand and the resultant errors can lead to instability and unfairness. We design a feedback control system based on Smith's principle, which removes the destabilizing delays from the feedback loop by using a "loop cancelation" technique. The feedback control system we propose reduces the effect of prediction errors, increasing the speed of the response to sudden changes in traffic arrival rates and improving the fairness in the network through equalization of queue-lengths

    System architecture and hardware implementations for a reconfigurable MPLS router

    Get PDF
    With extremely wide bandwidth and good channel properties, optical fibers have brought fast and reliable data transmission to today’s data communications. However, to handle heavy traffic flowing through optical physical links, much faster processing speed is required or else congestion can take place at network nodes. Also, to provide people with voice, data and all categories of multimedia services, distinguishing between different data flows is a requirement. To address these router performance, Quality of Service /Class of Service and traffic engineering issues, Multi-Protocol Label Switching (MPLS) was proposed for IP-based Internetworks. In addition, routers flexible in hardware architecture in order to support ever-evolving protocols and services without causing big infrastructure modification or replacement are also desirable. Therefore, reconfigurable hardware implementation of MPLS was proposed in this project to obtain the overall fast processing speed at network nodes. The long-term goal of this project is to develop a reconfigurable MPLS router, which uniquely integrates the best features of operations being conducted in software and in run-time-reconfigurable hardware. The scope of this thesis includes system architecture and service algorithm considerations, Verilog coding and testing for an actual device. The hardware and software co-design technique was used to partition and schedule the protocol code for execution on both a general-purpose processor and stream-based hardware. A novel RPS scheme that is practically easy to build and can realize pipelined packet-by-packet data transfer at each output was proposed to take the place of the traditional crossbar switching. In RPS, packets with variable lengths can be switched intelligently without performing packet segmentation and reassembly. Primary theoretical analysis of queuing issues was discussed and an improved multiple queue service scheduling policy UD-WRR was proposed, which can reduce packet-waiting time without sacrificing the performance. In order to have the tests carried out appropriately, dedicated circuitry for the MPLS functional block to interface a specific MAC chip was implemented as well. The hardware designs for all functions were realized with a single Field Programmable Gate Array (FPGA) device in this project. The main result presented in this thesis was the MPLS function implementation realizing a major part of layer three routing at the reconfigurable hardware level, which advanced a great step towards the goal of building a router that is both fast and flexible

    Efficient scheduling algorithms for quality-of-service guarantees in the Internet

    Get PDF
    Thesis (Ph.D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2000.Includes bibliographical references (p. 167-172).The unifying theme of this thesis is the design of packet schedulers to provide quality-of- service (QoS) guarantees for various networking problem settings. There is a dual emphasis on both theoretical justification and simulation evaluation. We have worked on several widely different problem settings - optical networks, input-queued crossbar switches, and CDMA wireless networks - and we found that the same set of scheduling techniques can be applied successfully in all these cases to provide per-flow bandwidth, delay and max-min fairness guarantees. We formulated the abstract scheduling problems as a sum of two aspects. First, the particular problem setting imposes constraints which dictate what kinds of transmission patterns are allowed by the physical hardware resources, i.e., what are the feasible solutions. Second, the users require some form of QoS guarantees, which translate into optimality criteria judging the feasible solutions. The abstract problem is how to design an algorithm that finds an optimal (or near-optimal) solution among the feasible ones. Our schedulers are based on a credit scheme. Specifically, flows receive credits at their guaranteed rate, and the arrival stream is compared to the credit stream acting as a reference. From this comparison, we derive various parameters such as the amount of unspent credits of a flow and the waiting time of a packet since its corresponding credit arrived. We then design algorithms which prioritize flows based on these parameters. We demonstrate, both by rigorous theoretical proofs and by simulations, that these parameters can be bounded. By bounding these parameters, our schedulers provide various per-flow QoS guarantees on average rate, packet delay, queue length and fairness.by Anthony Chi-Kong Kam.Ph.D

    Spatial parallelism in the routers of asynchronous on-chip networks

    Get PDF
    State-of-the-art multi-processor systems-on-chip use on-chip networks as their communication fabric. Although most on-chip networks are implemented synchronously, asynchronous on-chip networks have several advantages over their synchronous counterparts. Timing division multiplexing (TDM) flow control methods have been utilized in asynchronous on-chip networks extensively. The synchronization required by TDM leads to significant speed penalties. Compared with using TDM methods, spatial parallelism methods, such as the spatial division multiplexing (SDM) flow control method, achieve better network throughput with less area overhead.This thesis proposes several techniques to increase spatial parallelism in the routers of asynchronous on-chip networks.Channel slicing is a new pipeline structure that alleviates the speed penalty by removing the synchronization among bit-level data pipelines. It is also found out that the lookahead pipeline using early evaluated acknowledgement can be used in routers to further improve speed.SDM is a new flow control method proposed for asynchronous on-chip networks. It improves network throughput without introducing synchronization among buffers of different frames, which is required by TDM methods. It is also found that the area overhead of SDM is smaller than the virtual channel (VC) flow control method -- the most used TDM method. The major design problem of SDM is the area consuming crossbars. A novel 2-stage Clos switch structure is proposed to replace the crossbar in SDM routers, which significantly reduces the area overhead. This Clos switch is dynamically reconfigured by a new asynchronous Clos scheduler.Several asynchronous SDM routers are implemented using these new techniques. An asynchronous VC router is also reproduced for comparison. Performance analyses show that the SDM routers outperform the VC router in throughput, area overhead and energy efficiency.EThOS - Electronic Theses Online ServiceGBUnited Kingdo

    Spatial parallelism in the routers of asynchronous on-chip networks

    Get PDF
    State-of-the-art multi-processor systems-on-chip use on-chip networks as their communication fabric. Although most on-chip networks are implemented synchronously, asynchronous on-chip networks have several advantages over their synchronous counterparts. Timing division multiplexing (TDM) flow control methods have been utilized in asynchronous on-chip networks extensively. The synchronization required by TDM leads to significant speed penalties. Compared with using TDM methods, spatial parallelism methods, such as the spatial division multiplexing (SDM) flow control method, achieve better network throughput with less area overhead.This thesis proposes several techniques to increase spatial parallelism in the routers of asynchronous on-chip networks.Channel slicing is a new pipeline structure that alleviates the speed penalty by removing the synchronization among bit-level data pipelines. It is also found out that the lookahead pipeline using early evaluated acknowledgement can be used in routers to further improve speed.SDM is a new flow control method proposed for asynchronous on-chip networks. It improves network throughput without introducing synchronization among buffers of different frames, which is required by TDM methods. It is also found that the area overhead of SDM is smaller than the virtual channel (VC) flow control method -- the most used TDM method. The major design problem of SDM is the area consuming crossbars. A novel 2-stage Clos switch structure is proposed to replace the crossbar in SDM routers, which significantly reduces the area overhead. This Clos switch is dynamically reconfigured by a new asynchronous Clos scheduler.Several asynchronous SDM routers are implemented using these new techniques. An asynchronous VC router is also reproduced for comparison. Performance analyses show that the SDM routers outperform the VC router in throughput, area overhead and energy efficiency.EThOS - Electronic Theses Online ServiceGBUnited Kingdo
    corecore