9 research outputs found

    Energy Saving and Virtualization Technologies in Switching

    Get PDF
    Switching is the key functionality for many devices like electronic Router and Switch, optical Router, Network on Chips (NoCs) and so on. Basically, switching is responsible for moving data unit from one port/location to another (or multiple) port(s)/location(s). In past years, the high capacity, low delay were the main concerns when designing high-end switching unit. As new demands, requests and technologies emerge, flexibility and low power cost switching design become to weight the same as throughput and delay. On one hand, highly flexible (i.e, programming ability) switching can cope with variable needs stem from new applications (i.e, VoIP) and popular user behavior (i.e, p2p downloading); on the other hand, reduce the energy and power dissipation for switching could not only save bills and build echo system but also expand components life time. Many research efforts have been devoted to increase switching flexibility and reduce its power cost. In this thesis work, we consider to exploit virtualization as the main technique to build flexible software router in the first part, then in the second part we draw our attention on energy saving in NoC (i.e, a switching fabric designed to handle the on chip data transmission) and software router. In the first part of the thesis, we consider the virtualization inside Software Routers (SRs). SR, i.e, routers running in commodity Personal Computers (PCs), become an appealing solution compared to traditional Proprietary Routing Devices (PRD) for various reasons such as cost (the multi-vendor hardware used by SRs can be cheap, while the equipment needed by PRDs is more expensive and their training cost is higher), openness (SRs can make use of a large number of open source networking applications, while PRDs are more closed) and flexibility. The forwarding performance provided by SRs has been an obstacle to their deployment in real networks. For this reason, we proposed to aggregate multiple routing units that form an powerful SR known as the Multistage Software Router (MSR) to overcome the performance limitation for a single SR. Our results show that the throughput can increase almost linearly as the number of the internal routing devices. But some other features related to flexibility (such as power saving, programmability, router migration or easy management) have been investigated less than performance previously. We noticed that virtualization techniques become reality thanks to the quick development of the PC architectures, which are now able to easily support several logical PCs running in parallel on the same hardware. Virtualization could provide many flexible features like hardware and software decoupling, encapsulation of virtual machine state, failure recovery and security, to name a few. Virtualization permits to build multiple SRs inside one physical host and a multistage architecture exploiting only logical devices. By doing so, physical resources can be used in a more efficient way, energy savings features (switching on and off device when needed) can be introduced and logical resources could be rented on-demand instead of being owned. Since virtualization techniques are still difficult to deploy, several challenges need to be faced when trying to integrate them into routers. The main aim of the first part in this thesis is to find out the feasibility of the virtualization approach, to build and test virtualized SR (VSR), to implement the MSR exploiting logical, i.e. virtualized, resources, to analyze virtualized routing performance and to propose improvement techniques to VSR and virtual MSR (VMSR). More specifically, we considered different virtualization solutions like VMware, XEN, KVM to build VSR and VMSR, being VMware a closed source solution but with higher performance and XEN/KVM open source solutions. Firstly we built and tested each single component of our multistage architecture (i.e, back-end router, load balancer )inside the virtual infrastructure, then and we extended the performance experiments with more complex scenarios like multiple Back-end Router (BR) or Load Balancer (LB) which cooperate to route packets. Our results show that virtualization could introduce 40~\% performance penalty compare with the hardware only solution. Keep the performance limitation in mind, we developed the whole VMSR and we obtained low throughput with 64B packet flow as expected. To increase the VMSR throughput, two directions could be considered, the first one is to improve the single component ( i.e, VSR) performance and the other is to work from the topology (i.e, best allocation of the VMs into the hardware ) point of view. For the first method, we considered to tune the VSR inside the KVM and we studied closely such as Linux driver, scheduler, interconnect methodology which could impact the performance significantly with proper configuration; then we proposed two ways for the VMs allocation into physical servers to enhance the VMSR performance. Our results show that with good tuning and allocation of VMs, we could minimize the virtualization penalty and get reasonable throughput for running SRs inside virtual infrastructure and add flexibility functionalities into SRs easily. In the second part of the thesis, we consider the energy efficient switching design problem and we focus on two main architecture, the NoC and MSR. As many research works suggest, the energy cost in the Communication Technologies ( ICT ) is constantly increasing. Among the main ICT sectors, a large portion of the energy consumption is contributed by the telecommunication infrastructure and their devices, i.e, router, switch, cell phone, ip TV settle box, storage home gateway etc. More in detail, the linecards, links, System on Chip (SoC) including the transmitter/receiver on these variate devices are the main power consuming units. We firstly present the work on the power reduction of the data transmission in SoC, which is carried out by the NoC. NoC is an approach to design the communication subsystem between different Processing Units (PEs) in a SoC. PEs could be different elements such as CPU, memory, digital signal/analog signal processor etc. Different PEs performs specific tasks depending on the applications running on the chip. Different tasks need to exchange data information among each other, thus flits ( chopped packet with limited header information ) are generated by PEs. The flits are injected into the NoC by the proper interface and routed until reach the destination PEs. For the whole procedure, the NoC behaves as a packet switch network. Studies show that in general the information processing in the PEs only consume 60~\% energy while the remaining 40~\% are consumed by the NoC. More importantly, as the current network designing principle, the NoC capacity is devised to handle the peak load. This is a clear sign for energy saving when the network load is low. In our work, we considered to exploit Dynamic Voltage and Frequency Scaling (DVFS) technique, which can jointly decrease or increase the system voltage and frequency when necessary, i.e, decrease the voltage and frequency at low load scenario to save energy and reduce power dissipation. More precisely, we studied two different NoC architectures for energy saving, namely single plane chip and multi-plane chip architecture. In both cases we have a very strict constraint to be that all the links and transmitter/receivers on the same plane work at the same frequency/voltage to avoid synchronization problem. This is the main difference with many existing works in the literature which usually assume different links can work at different frequency, that is hard to be implemented in reality. For the single plane NoC, we exploited different routing schemas combined with DVFS to reduce the power for the whole chip. Our results haven been compared with the optimal value obtained by modeling the power saving formally as a quadratic programming problem. Results suggest that just by using simple load balancing routing algorithm, we can save considerable energy for the single chip NoC architecture. Furthermore, we noticed that in the single plane NoC architecture, the bottleneck link could limit the DVFS effectiveness. Then we discovered that multiplane NoC architecture is fairly easy to be implemented and it could help with the energy saving. Thus we focus on the multiplane architecture and we found out that DVFS could be more efficient when we concentrate more traffic into one plane and send the remaining flows to other planes. We compared load concentration and load balancing with different power modeling and all simulation results show that load concentration is better compared with load balancing for multiplan NoC architecture. Finally, we also present one of the the energy efficient MSR design technique, which permits the MSR to follow the day-night traffic pattern more efficiently with our on-line energy saving algorithm

    Multistage Packet-Switching Fabrics for Data Center Networks

    Get PDF
    Recent applications have imposed stringent requirements within the Data Center Network (DCN) switches in terms of scalability, throughput and latency. In this thesis, the architectural design of the packet-switches is tackled in different ways to enable the expansion in both the number of connected endpoints and traffic volume. A cost-effective Clos-network switch with partially buffered units is proposed and two packet scheduling algorithms are described. The first algorithm adopts many simple and distributed arbiters, while the second approach relies on a central arbiter to guarantee an ordered packet delivery. For an improved scalability, the Clos switch is build using a Network-on-Chip (NoC) fabric instead of the common crossbar units. The Clos-UDN architecture made with Input-Queued (IQ) Uni-Directional NoC modules (UDNs) simplifies the input line cards and obviates the need for the costly Virtual Output Queues (VOQs). It also avoids the need for complex, and synchronized scheduling processes, and offers speedup, load balancing, and good path diversity. Under skewed traffic, a reliable micro load-balancing contributes to boosting the overall network performance. Taking advantage of the NoC paradigm, a wrapped-around multistage switch with fully interconnected Central Modules (CMs) is proposed. The architecture operates with a congestion-aware routing algorithm that proactively distributes the traffic load across the switching modules, and enhances the switch performance under critical packet arrivals. The implementation of small on-chip buffers has been made perfectly feasible using the current technology. This motivated the implementation of a large switching architecture with an Output-Queued (OQ) NoC fabric. The design merges assets of the output queuing, and NoCs to provide high throughput, and smooth latency variations. An approximate analytical model of the switch performance is also proposed. To further exploit the potential of the NoC fabrics and their modularity features, a high capacity Clos switch with Multi-Directional NoC (MDN) modules is presented. The Clos-MDN switching architecture exhibits a more compact layout than the Clos-UDN switch. It scales better and faster in port count and traffic load. Results achieved in this thesis demonstrate the high performance, expandability and programmability features of the proposed packet-switches which makes them promising candidates for the next-generation data center networking infrastructure

    Multistage Packet-Switching Fabrics for Data Center Networks

    Get PDF
    Recent applications have imposed stringent requirements within the Data Center Network (DCN) switches in terms of scalability, throughput and latency. In this thesis, the architectural design of the packet-switches is tackled in different ways to enable the expansion in both the number of connected endpoints and traffic volume. A cost-effective Clos-network switch with partially buffered units is proposed and two packet scheduling algorithms are described. The first algorithm adopts many simple and distributed arbiters, while the second approach relies on a central arbiter to guarantee an ordered packet delivery. For an improved scalability, the Clos switch is build using a Network-on-Chip (NoC) fabric instead of the common crossbar units. The Clos-UDN architecture made with Input-Queued (IQ) Uni-Directional NoC modules (UDNs) simplifies the input line cards and obviates the need for the costly Virtual Output Queues (VOQs). It also avoids the need for complex, and synchronized scheduling processes, and offers speedup, load balancing, and good path diversity. Under skewed traffic, a reliable micro load-balancing contributes to boosting the overall network performance. Taking advantage of the NoC paradigm, a wrapped-around multistage switch with fully interconnected Central Modules (CMs) is proposed. The architecture operates with a congestion-aware routing algorithm that proactively distributes the traffic load across the switching modules, and enhances the switch performance under critical packet arrivals. The implementation of small on-chip buffers has been made perfectly feasible using the current technology. This motivated the implementation of a large switching architecture with an Output-Queued (OQ) NoC fabric. The design merges assets of the output queuing, and NoCs to provide high throughput, and smooth latency variations. An approximate analytical model of the switch performance is also proposed. To further exploit the potential of the NoC fabrics and their modularity features, a high capacity Clos switch with Multi-Directional NoC (MDN) modules is presented. The Clos-MDN switching architecture exhibits a more compact layout than the Clos-UDN switch. It scales better and faster in port count and traffic load. Results achieved in this thesis demonstrate the high performance, expandability and programmability features of the proposed packet-switches which makes them promising candidates for the next-generation data center networking infrastructure

    Runtime Adaptive System-on-Chip Communication Architecture

    Get PDF
    The adaptive system provides adaptivity both in the system-level and in the architecture-level. The system-level adaptation is provided using a runtime application mapping. The architecture-level adaptation is implemented by using several novel methodologies to increase the resource utilization of the underlying silicon fabric, i.e. sharing the Virtual Channel Buffers among different output ports. To achieve successful runtime adaptation, a runtime observability infrastructure is included

    Técnicas de caminhos disjuntos para roteamento em systems-on-chip.

    Get PDF
    Resumo: Systems-on-chip (SoCs) são sistemas compostos contidos em um único substrato de silício. Os SoCs foram introduzidos nas metodologias de projeto para atender `a crescente demanda de aplicações complexas que requerem um grande poder computacional para sua execução. A utilização de SoCs contribui para uma diminuição de consumo de potência, pela ausência de um clock global, e para uma diminuição da área utilizada, visto que os componentes contidos em blocos são resultantes de projetos otimizados. As aplicações são compostas por subsistemas presentes em blocos distintos de lógica, cuja interação requer meios de comuni cação eficientes para seu adequado funcionamento. Devido `a demanda por uma maneira eficiente de comunicação interna aos SoCs, surgiram as chamadas Networks-on-chip (NoCs). Assim como nas redes tradicionais, as NoCs possuem problemas a serem resolvidos, dentre eles, a criação de técnicas eficientes de roteamento. Apesar da maioria das NoCs implementadas comercialmente utilizarem uma técnica conhecida como wormhole, na qual um n´o (ou vértice) estabelece um caminho direto até o nó alvo, ainda faz-se necessário evitar a competição por rotas já utilizadas. Dessa forma generalizamos o problema em se obter caminhos disjuntos internos aos SoCs em um problema de grafos conhecido como árvores geradoras independentes, que pode ser descrito resumidamente como: Dado um grafo G, um conjunto de árvores geradoras enraizadas em um vértice r em G é dito vértice/aresta independente se, par a cada vértice v em G, v != r, os caminhos de r a v em qualquer par de árvores são vértice/aresta disjuntos. Se a conectividade de G ´e k, o problema resume-se `a construção de k árvores geradoras independentes com cada vértice do grafo como raiz de tais árvores. Esse problema permanece em aberto para grafos em geral com conectividade k " 4. No entanto, foi demonstrado que, para um hipercubo de dimensão k, denotado por Qk, existem k árvores geradoras enraizadas em um vértice arbitrário de G. Neste trabalho é proposto um algoritmo para gerar k árvores geradoras independentes sobre hipercubos com o intuito de utilizá-lo na elaboração de técnicas de roteamento eficientes entre os núcleos distintos dos SoCs , assim como emprocessadores de múltiplos núcleos. Dentre as contribuições deste trabalho enfatizamos o consumo reduzido de recursos computacionais utilizados pelo algoritmo proposto, como: memória e processamento; assim como a modificação do algoritmo ECUBE para se construir rotas disjuntas sem que seja necessária a construção completa das árvores geradoras independentes. O algoritmo proposto se comporta de forma similar aos algoritmos comumente utilizados em roteamentos em NoCs, conforme mostrado pela utilização de um simulador de NoCs: Noxim. A construção de árvores geradoras disjuntas tende a seguir um dos dois objetivos: construção eficiente e altura ótima. Este trabalho tem o foco na construção eficiente, sendo que a altura ótima foi um resultado inerente ao método de construção proposto

    Simulation Modelling of Distributed-Shared Memory Multiprocessors

    Get PDF
    Institute for Computing Systems ArchitectureDistributed shared memory (DSM) systems have been recognised as a compelling platform for parallel computing due to the programming advantages and scalability. DSM systems allow applications to access data in a logically shared address space by abstracting away the distinction of physical memory location. As the location of data is transparent, the sources of overhead caused by accessing the distant memories are difficult to analyse. This memory locality problem has been identified as crucial to DSM performance. Many researchers have investigated the problem using simulation as a tool for conducting experiments resulting in the progressive evolution of DSM systems. Nevertheless, both the diversity of architectural configurations and the rapid advance of DSM implementations impose constraints on simulation model designs in two issues: the limitation of the simulation framework on model extensibility and the lack of verification applicability during a simulation run causing the delay in verification process. This thesis studies simulation modelling techniques for memory locality analysis of various DSM systems implemented on top of a cluster of symmetric multiprocessors. The thesis presents a simulation technique to promote model extensibility and proposes a technique for verification applicability, called a Specification-based Parameter Model Interaction (SPMI). The proposed techniques have been implemented in a new interpretation-driven simulation called DSiMCLUSTER on top of a discrete event simulation (DES) engine known as HASE. Experiments have been conducted to determine which factors are most influential on the degree of locality and to determine the possibility to maximise the stability of performance. DSiMCLUSTER has been validated against a SunFire 15K server and has achieved similarity of cache miss results, an average of +-6% with the worst case less than 15% of difference. These results confirm that the techniques used in developing the DSiMCLUSTER can contribute ways to achieve both (a) a highly extensible simulation framework to keep up with the ongoing innovation of the DSM architecture, and (b) the verification applicability resulting in an efficient framework for memory analysis experiments on DSM architecture
    corecore