135 research outputs found

    MoonGen: A Scriptable High-Speed Packet Generator

    Full text link
    We present MoonGen, a flexible high-speed packet generator. It can saturate 10 GbE links with minimum sized packets using only a single CPU core by running on top of the packet processing framework DPDK. Linear multi-core scaling allows for even higher rates: We have tested MoonGen with up to 178.5 Mpps at 120 Gbit/s. We move the whole packet generation logic into user-controlled Lua scripts to achieve the highest possible flexibility. In addition, we utilize hardware features of Intel NICs that have not been used for packet generators previously. A key feature is the measurement of latency with sub-microsecond precision and accuracy by using hardware timestamping capabilities of modern commodity NICs. We address timing issues with software-based packet generators and apply methods to mitigate them with both hardware support on commodity NICs and with a novel method to control the inter-packet gap in software. Features that were previously only possible with hardware-based solutions are now provided by MoonGen on commodity hardware. MoonGen is available as free software under the MIT license at https://github.com/emmericp/MoonGenComment: Published at IMC 201

    NC-G-SIM: A Parameterized Generic Simulator for 2D-Mesh, 3D-Mesh

    Get PDF
    As chip density keeps doubling during each course of generation, the use of NoC has become an integral part of modern microprocessors and a very prevalent architectural feature of all types of SoCs. To meet the ever expanding communication challenges, diverse and novel NoC solutions are being developed which rely on accurate modeling and simulations to evaluate the impact and analyze their performances. Consequently, this aggravates the need to rely on simulation tools to probe and optimize these NoC architectures. In this work, we present NC-G-SIM (Network on Chip-Generic-SIMulator), a highly flexible, modular, cycle-accurate, configurable simulator for NoCs. To make NC-G-SIM suitable for advanced NoC exploration, it is made highly generic that supports extensive range of cores in any kind of topology whether 2D, 3D or irregular. Simulation results have been evaluated in terms of latencies, throughput and the amount of energy consumed during the simulation period at different levels

    Monitoring-aware network-on-chip design

    Get PDF

    Performance Analysis of RR and FQ Algorithms in Reconfigurable Routers

    Get PDF
    Currently, we are witnessing a trend in network routers to include reconfigurable hardware structures to provide flexibility at improved performance levels when compared to software-only implementations. This permits the run-time reconfiguration of the hardware resources, i.e., to change their functionality (for example, from one scheduling algorithm to another), to adapt to changing network scenarios. In particular, different scheduling algorithms are more efficient in handling a specific mix of incoming packet traffic in terms of various criteria (e.g., delay, jitter, throughput, and packet loss). Therefore, reconfigurable hardware is able to provide improved performance levels and to allow more efficient algorithms to be utilized when different incoming packet traffic patterns are encountered. This project investigates the possibilities to improve upon end-to-end delays, jitter, throughput, and packet loss by exploiting the availability of a flexible hardware structure such as an field-programmable gate array (FPGA). The aim of the project is to provide an overview on adaptive scheduling using reconfigurable hardware. Consequently, we investigate different scheduling algorithms that provide QoS provisioning for traffic streams that are sensitive to packet delay and jitter, e.g., mpeg video traffic. The investigation utilizes the NS-2 simulator for which we generate realistic network scenarios. Our approach is based on understanding which kind of traffic is passing in the network, and subsequently change the scheduling algorithm accordingly in the core router to meet specific performance requirements. The investigated scheduling algorithms are taken from two well-known families, i.e., Round Robin (RR) and Fair Queuing (FQ). Our investigation confirmed the idea on the behavior of the two investigated scheduling algorithm: WFQ outperforms WRR in terms of end-to-end delay, jitter and throughput but it is more expensive than it at a computational level. Nonetheless, it is possible to find a tradeoff between the required area in FPGA and the level of performance desired for a kind of stream

    Performance evaluation of different routing algorithms in network on chip

    Get PDF
    Network on Chip (NoC) is one of the efficient on-chip communication architecture for System on Chip (SoC) where a large number of computational and storage blocks are integrated on a single chip. NoCs have tackled the disadvantages of SoCs as well as they are scalable. But an efficient routing algorithm can enhance the performance of NoC. In one chapter of the thesis three different types of routing algorithms are compared i.e. XY, OE, and DyAD. XY routing algorithm is a distributed deterministic algorithm. Odd-Even (OE) routing algorithm is distributed adaptive routing algorithm with deadlock-free ability. DyAD is a smart routing algorithm which combines the features of both deterministic and adaptive routing. In another chapter of thesis three different types of deadlock free routing algorithms are compared i.e. one deterministic routing (XY routing algorithm), three partially adaptive routing (West first, North last and Negative first) and two adaptive routing (DyXY, OE) are being compared with % of load for various traffic patterns. In another chapter of thesis, a fault tolerant algorithm is described and its performance is compared with all the deadlock free routing algorithms in a NoC having link faults and node faults. All these simulation is done in NIRGAM 2.1 simulator which is a cycle accurate systemC based simulator

    Ethernet Networks for Real-Time Use in the ATLAS Experiment

    Get PDF
    Ethernet became today's de-facto standard technology for local area networks. Defined by the IEEE 802.3 and 802.1 working groups, the Ethernet standards cover technologies deployed at the first two layers of the OSI protocol stack. The architecture of modern Ethernet networks is based on switches. The switches are devices usually built using a store-and-forward concept. At the highest level, they can be seen as a collection of queues and mathematically modelled by means of queuing theory. However, the traffic profiles on modern Ethernet networks are rather different from those assumed in classical queuing theory. The standard recommendations for evaluating the performance of network devices define the values that should be measured but do not specify a way of reconciling these values with the internal architecture of the switches. The introduction of the 10 Gigabit Ethernet standard provided a direct gateway from the LAN to the WAN by the means of the WAN PHY. Certain aspects related to the actual use of WAN PHY technology were vaguely defined by the standard. The ATLAS experiment at CERN is scheduled to start operation at CERN in 2007. The communication infrastructure of the Trigger and Data Acquisition System will be built using Ethernet networks. The real-time operational needs impose a requirement for predictable performance on the network part. In view of the diversity of the architectures of Ethernet devices, testing and modelling is required in order to make sure the full system will operate predictably. This thesis focuses on the testing part of the problem and addresses issues in determining the performance for both LAN and WAN connections. The problem of reconciling results from measurements to architectural details of the switches will also be tackled. We developed a scalable traffic generator system based on commercial-off-the-shelf Gigabit Ethernet network interface cards. The generator was able to transmit traffic at the nominal Gigabit Ethernet line rate for all frame sizes specified in the Ethernet standard. The calculation of latency was performed with accuracy in the range of +/- 200 ns. We indicate how certain features of switch architectures may be identified through accurate throughput and latency values measured for specific traffic distributions. At this stage, we present a detailed analysis of Ethernet broadcast support in modern switches. We use a similar hands-on approach to address the problem of extending Ethernet networks over long distances. Based on the 1 Gbit/s traffic generator used in the LAN, we develop a methodology to characterise point-to-point connections over long distance networks. At higher speeds, a combination of commercial traffic generators and high-end servers is employed to determine the performance of the connection. We demonstrate that the new 10 Gigabit Ethernet technology can interoperate with the installed base of SONET/SDH equipment through a series of experiments on point-to-point circuits deployed over long-distance network infrastructure in a multi-operator domain. In this process, we provide a holistic view of the end-to-end performance of 10 Gigabit Ethernet WAN PHY connections through a sequence of measurements starting at the physical transmission layer and continuing up to the transport layer of the OSI protocol stack

    The Customizable Virtual FPGA: Generation, System Integration and Configuration of Application-Specific Heterogeneous FPGA Architectures

    Get PDF
    In den vergangenen drei Jahrzehnten wurde die Entwicklung von Field Programmable Gate Arrays (FPGAs) stark von Moore’s Gesetz, Prozesstechnologie (Skalierung) und kommerziellen Märkten beeinflusst. State-of-the-Art FPGAs bewegen sich einerseits dem Allzweck näher, aber andererseits, da FPGAs immer mehr traditionelle Domänen der Anwendungsspezifischen integrierten Schaltungen (ASICs) ersetzt haben, steigen die Effizienzerwartungen. Mit dem Ende der Dennard-Skalierung können Effizienzsteigerungen nicht mehr auf Technologie-Skalierung allein zurückgreifen. Diese Facetten und Trends in Richtung rekonfigurierbarer System-on-Chips (SoCs) und neuen Low-Power-Anwendungen wie Cyber Physical Systems und Internet of Things erfordern eine bessere Anpassung der Ziel-FPGAs. Neben den Trends für den Mainstream-Einsatz von FPGAs in Produkten des täglichen Bedarfs und Services wird es vor allem bei den jüngsten Entwicklungen, FPGAs in Rechenzentren und Cloud-Services einzusetzen, notwendig sein, eine sofortige Portabilität von Applikationen über aktuelle und zukünftige FPGA-Geräte hinweg zu gewährleisten. In diesem Zusammenhang kann die Hardware-Virtualisierung ein nahtloses Mittel für Plattformunabhängigkeit und Portabilität sein. Ehrlich gesagt stehen die Zwecke der Anpassung und der Virtualisierung eigentlich in einem Konfliktfeld, da die Anpassung für die Effizienzsteigerung vorgesehen ist, während jedoch die Virtualisierung zusätzlichen Flächenaufwand hinzufügt. Die Virtualisierung profitiert aber nicht nur von der Anpassung, sondern fügt auch mehr Flexibilität hinzu, da die Architektur jederzeit verändert werden kann. Diese Besonderheit kann für adaptive Systeme ausgenutzt werden. Sowohl die Anpassung als auch die Virtualisierung von FPGA-Architekturen wurden in der Industrie bisher kaum adressiert. Trotz einiger existierenden akademischen Werke können diese Techniken noch als unerforscht betrachtet werden und sind aufstrebende Forschungsgebiete. Das Hauptziel dieser Arbeit ist die Generierung von FPGA-Architekturen, die auf eine effiziente Anpassung an die Applikation zugeschnitten sind. Im Gegensatz zum üblichen Ansatz mit kommerziellen FPGAs, bei denen die FPGA-Architektur als gegeben betrachtet wird und die Applikation auf die vorhandenen Ressourcen abgebildet wird, folgt diese Arbeit einem neuen Paradigma, in dem die Applikation oder Applikationsklasse fest steht und die Zielarchitektur auf die effiziente Anpassung an die Applikation zugeschnitten ist. Dies resultiert in angepassten anwendungsspezifischen FPGAs. Die drei Säulen dieser Arbeit sind die Aspekte der Virtualisierung, der Anpassung und des Frameworks. Das zentrale Element ist eine weitgehend parametrierbare virtuelle FPGA-Architektur, die V-FPGA genannt wird, wobei sie als primäres Ziel auf jeden kommerziellen FPGA abgebildet werden kann, während Anwendungen auf der virtuellen Schicht ausgeführt werden. Dies sorgt für Portabilität und Migration auch auf Bitstream-Ebene, da die Spezifikation der virtuellen Schicht bestehen bleibt, während die physische Plattform ausgetauscht werden kann. Darüber hinaus wird diese Technik genutzt, um eine dynamische und partielle Rekonfiguration auf Plattformen zu ermöglichen, die sie nicht nativ unterstützen. Neben der Virtualisierung soll die V-FPGA-Architektur auch als eingebettetes FPGA in ein ASIC integriert werden, das effiziente und dennoch flexible System-on-Chip-Lösungen bietet. Daher werden Zieltechnologie-Abbildungs-Methoden sowohl für Virtualisierung als auch für die physikalische Umsetzung adressiert und ein Beispiel für die physikalische Umsetzung in einem 45 nm Standardzellen Ansatz aufgezeigt. Die hochflexible V-FPGA-Architektur kann mit mehr als 20 Parametern angepasst werden, darunter LUT-Grösse, Clustering, 3D-Stacking, Routing-Struktur und vieles mehr. Die Auswirkungen der Parameter auf Fläche und Leistung der Architektur werden untersucht und eine umfangreiche Analyse von über 1400 Benchmarkläufen zeigt eine hohe Parameterempfindlichkeit bei Abweichungen bis zu ±95, 9% in der Fläche und ±78, 1% in der Leistung, was die hohe Bedeutung von Anpassung für Effizienz aufzeigt. Um die Parameter systematisch an die Bedürfnisse der Applikation anzupassen, wird eine parametrische Entwurfsraum-Explorationsmethode auf der Basis geeigneter Flächen- und Zeitmodellen vorgeschlagen. Eine Herausforderung von angepassten Architekturen ist der Entwurfsaufwand und die Notwendigkeit für angepasste Werkzeuge. Daher umfasst diese Arbeit ein Framework für die Architekturgenerierung, die Entwurfsraumexploration, die Anwendungsabbildung und die Evaluation. Vor allem ist der V-FPGA in einem vollständig synthetisierbaren generischen Very High Speed Integrated Circuit Hardware Description Language (VHDL) Code konzipiert, der sehr flexibel ist und die Notwendigkeit für externe Codegeneratoren eliminiert. Systementwickler können von verschiedenen Arten von generischen SoC-Architekturvorlagen profitieren, um die Entwicklungszeit zu reduzieren. Alle notwendigen Konstruktionsschritte für die Applikationsentwicklung und -abbildung auf den V-FPGA werden durch einen Tool-Flow für Entwurfsautomatisierung unterstützt, der eine Sammlung von vorhandenen kommerziellen und akademischen Werkzeugen ausnutzt, die durch geeignete Modelle angepasst und durch ein neues Werkzeug namens V-FPGA-Explorer ergänzt werden. Dieses neue Tool fungiert nicht nur als Back-End-Tool für die Anwendungsabbildung auf dem V-FPGA sondern ist auch ein grafischer Konfigurations- und Layout-Editor, ein Bitstream-Generator, ein Architekturdatei-Generator für die Place & Route Tools, ein Script-Generator und ein Testbenchgenerator. Eine Besonderheit ist die Unterstützung der Just-in-Time-Kompilierung mit schnellen Algorithmen für die In-System Anwendungsabbildung. Die Arbeit schliesst mit einigen Anwendungsfällen aus den Bereichen industrielle Prozessautomatisierung, medizinische Bildgebung, adaptive Systeme und Lehre ab, in denen der V-FPGA eingesetzt wird

    Microarchitectural techniques to reduce energy consumption in the memory hierarchy

    Get PDF
    This thesis states that dynamic profiling of the memory reference stream can improve energy and performance in the memory hierarchy. The research presented in this theses provides multiple instances of using lightweight hardware structures to profile the memory reference stream. The objective of this research is to develop microarchitectural techniques to reduce energy consumption at different levels of the memory hierarchy. Several simple and implementable techniques were developed as a part of this research. One of the techniques identifies and eliminates redundant refresh operations in DRAM and reduces DRAM refresh power. Another, reduces leakage energy in L2 and higher level caches for multiprocessor systems. The emphasis of this research has been to develop several techniques of obtaining energy savings in caches using a simple hardware structure called the counting Bloom filter (CBF). CBFs have been used to predict L2 cache misses and obtain energy savings by not accessing the L2 cache on a predicted miss. A simple extension of this technique allows CBFs to do way-estimation of set associative caches to reduce energy in cache lookups. Another technique using CBFs track addresses in a Virtual Cache and reduce false synonym lookups. Finally this thesis presents a technique to reduce dynamic power consumption in level one caches using significance compression. The significant energy and performance improvements demonstrated by the techniques presented in this thesis suggest that this work will be of great value for designing memory hierarchies of future computing platforms.Ph.D.Committee Chair: Lee, Hsien-Hsin S.; Committee Member: Cahtterjee,Abhijit; Committee Member: Mukhopadhyay, Saibal; Committee Member: Pande, Santosh; Committee Member: Yalamanchili, Sudhaka

    Novel techniques in large scaleable ATM switches

    Get PDF
    Bibliography: p. 172-178.This dissertation explores the research area of large scale ATM switches. The requirements for an ATM switch are determined by overviewing the ATM network architecture. These requirements lead to the discussion of an abstract ATM switch which illustrates the components of an ATM switch that automatically scale with increasing switch size (the Input Modules and Output Modules) and those that do not (the Connection Admission Control and Switch Management systems as well as the Cell Switch Fabric). An architecture is suggested which may result in a scalable Switch Management and Connection Admission Control function. However, the main thrust of the dissertation is confined to the cell switch fabric. The fundamental mathematical limits of ATM switches and buffer placement is presented next emphasising the desirability of output buffering. This is followed by an overview of the possible routing strategies in a multi-stage interconnection network. A variety of space division switches are then considered which leads to a discussion of the hypercube fabric, (a novel switching technique). The hypercube fabric achieves good performance with an O(N.log₂N)²) scaling. The output module, resequencing, cell scheduling and output buffering technique is presented leading to a complete description of the proposed ATM switch. Various traffic models are used to quantify the switch's performance. These include a simple exponential inter-arrival time model, a locality of reference model and a self-similar, bursty, multiplexed Variable Bit Rate (VBR) model. FIFO queueing is simple to implement in an ATNI switch, however, more responsive queueing strategies can result in an improved performance. An associative memory is presented which allows the separate queues in the ATM switch to be effectively logically combined into a single FIFO queue. The associative memory is described in detail and its feasibility is shown by laying out the Integrated Circuit masks and performing an analogue simulation of the IC's performance is SPICE3. Although optimisations were required to the original design, the feasibility of the approach is shown with a 15Ƞs write time and a 160Ƞs read time for a 32 row, 8 priority bit, 10 routing bit version of the memory. This is achieved with 2µm technology, more advanced technologies may result in even better performance. The various traffic models and switch models are simulated in a number of runs. This shows the performance of the hypercube which outperforms a Clos network of equivalent technology and approaches the performance of an ideal reference fabric. The associative memory leverages a significant performance advantage in the hypercube network and a modest advantage in the Clos network. The performance of the switches is shown to degrade with increasing traffic density, increasing locality of reference, increasing variance in the cell rate and increasing burst length. Interestingly, the fabrics show no real degradation in response to increasing self similarity in the fabric. Lastly, the appendices present suggestions on how redundancy, reliability and multicasting can be achieved in the hypercube fabric. An overview of integrated circuits is provided. A brief description of commercial ATM switching products is given. Lastly, a road map to the simulation code is provided in the form of descriptions of the functionality found in all of the files within the source tree. This is intended to provide the starting ground for anyone wishing to modify or extend the simulation system developed for this thesis
    corecore