188 research outputs found

    A RECONFIGURABLE AND EXTENSIBLE EXPLORATION PLATFORM FOR FUTURE HETEROGENEOUS SYSTEMS

    Accelerator-based, or heterogeneous, computing has become increasingly important in a variety of scenarios, ranging from High-Performance Computing (HPC) to embedded systems. While some solutions use custom-made components, most of today’s systems rely on commodity high-end CPUs and/or GPU devices, which deliver adequate performance while ensuring programmability, productivity, and application portability. Unfortunately, pure general-purpose hardware is affected by inherently limited power-efficiency, that is, low GFLOPS-per-Watt, now considered a primary metric. The many-core model and architectural customization can play a key role here, as they enable unprecedented levels of power-efficiency compared to CPUs/GPUs. However, such paradigms are still immature and deeper exploration is indispensable. This dissertation investigates customizability and proposes novel solutions for heterogeneous architectures, focusing on mechanisms related to coherence and network-on-chip (NoC). First, the work presents a non-coherent scratchpad memory with a configurable bank remapping system to reduce bank conflicts. The experimental results show the benefits of using both a customizable hardware bank remapping function and non-coherent memories for some types of algorithms. Next, we demonstrate how a distributed synchronization master suits many-cores better than standard centralized solutions. This solution, inspired by the directory-based coherence mechanism, supports concurrent synchronizations without relying on memory transactions. The results collected for different NoC sizes indicate the area overheads incurred by our solution and demonstrate the benefits of dedicated hardware synchronization support. Finally, this dissertation proposes an advanced coherence subsystem, based on the sparse directory approach, with a selective coherence maintenance system that allows coherence to be deactivated for blocks that do not require it. Experimental results show that the use of a hybrid coherent and non-coherent architectural mechanism along with an extended coherence protocol can enhance performance. The above results were all collected by means of a modular and customizable heterogeneous many-core system developed to support the exploration of power-efficient high-performance computing architectures. The system is based on a NoC and a customizable GPU-like accelerator core, as well as a reconfigurable coherence subsystem, ensuring application-specific configuration capabilities. All the explored solutions were evaluated on this real heterogeneous system, which, together with the above methodological results, forms part of the contribution of this dissertation. As a key benefit, the experimental platform enables users to integrate novel hardware/software solutions on a full-system scale, whereas existing platforms do not always support a comprehensive heterogeneous architecture exploration.

    Control Plane Hardware Design for Optical Packet Switched Data Centre Networks

    Optical packet switching for intra-data centre networks is key to addressing traffic requirements. Photonic integration and wavelength division multiplexing (WDM) can overcome bandwidth limits in switching systems. A promising technology to build a nanosecond-reconfigurable photonic-integrated switch, compatible with WDM, is the semiconductor optical amplifier (SOA). SOAs are typically used as gating elements in a broadcast-and-select (B&S) configuration, to build an optical crossbar switch. For larger-size switching, a three-stage Clos network, based on crossbar nodes, is a viable architecture. However, the design of the switch control plane is one of the barriers to packet switching; it should run on packet timescales, which becomes increasingly challenging as line rates get higher. The scheduler, used for the allocation of switch paths, limits control clock speed. To this end, the research contribution was the design of highly parallel hardware schedulers for crossbar and Clos network switches. On a field-programmable gate array (FPGA), the minimum scheduler clock period achieved was 5.0 ns and 5.4 ns, for a 32-port crossbar and Clos switch, respectively. By using parallel path allocation modules, one per Clos node, a minimum clock period of 7.0 ns was achieved for a 256-port switch. For scheduler application-specific integrated circuit (ASIC) synthesis, this reduces to 2.0 ns, a record result enabling scalable packet switching. Furthermore, the control plane was demonstrated experimentally. Moreover, a cycle-accurate network emulator was developed to evaluate switch performance. Results showed a switch saturation throughput at a traffic load of 60% of capacity, with sub-microsecond packet latency, for a 256-port Clos switch, outperforming state-of-the-art optical packet switches.

    Design, Implementation and Evaluation of a Configurable NoC for AcENoCs FPGA Accelerated Emulation Platform

    The heterogeneous nature and the demand for extensive parallel processing in modern applications have resulted in widespread use of Multicore System-on-Chip (SoC) architectures. The emerging Network-on-Chip (NoC) architecture provides an energy-efficient and scalable communication solution for Multicore SoCs, serving as a powerful replacement for traditional bus-based solutions. The key to successful realization of such architectures is a flexible, fast and robust emulation platform for rapid design space exploration. In this research, we present the design and evaluation of a highly configurable NoC used in AcENoCs (Accelerated Emulation platform for NoCs), a flexible and cycle-accurate field programmable gate array (FPGA) emulation platform for validating NoC architectures. Along with the implementation details, we also discuss the various design optimizations and tradeoffs, and assess the performance improvements of AcENoCs over existing simulators and emulators. We design a hardware library consisting of routers and links using the Verilog hardware description language (HDL). The router is parameterized and has a configurable number of physical ports, virtual channels (VCs) and pipeline depth. A packet-switched NoC is constructed by connecting the routers in either a 2D-Mesh or a 2D-Torus topology. The NoC is integrated in the AcENoCs platform and prototyped on a Xilinx Virtex-5 FPGA. The NoC was evaluated under various synthetic and realistic workloads generated by AcENoCs' traffic generators implemented on the Xilinx MicroBlaze embedded processor. In order to validate the NoC design, performance metrics like average latency and throughput were measured and compared against the results obtained using standard network simulators. FPGA implementation of the NoC using Xilinx tools indicated a 76% LUT utilization for a 5x5 2D-Mesh network. The VC allocator was found to be the single largest consumer of hardware resources within a router. The router design synthesized at a frequency of 135 MHz, 124 MHz and 109 MHz for 3-port, 4-port and 5-port configurations, respectively. The operational frequency of the router in the AcENoCs environment was limited only by the software execution latency, even though the hardware itself could be clocked at a much higher rate. The AcENoCs emulator showed speedup improvements of 10000-12000X over HDL simulators and 5-15X over software simulators, without sacrificing cycle accuracy.

    Advances in Architectures and Tools for FPGAs and their Impact on the Design of Complex Systems for Particle Physics

    The continual improvement of semiconductor technology has provided rapid advancements in device frequency and density. Designers of electronics systems for high-energy physics (HEP) have benefited from these advancements, transitioning many designs from fixed-function ASICs to more flexible FPGA-based platforms. Today’s FPGA devices provide significantly more resources than those available during the initial Large Hadron Collider design phase. To take advantage of the capabilities of future FPGAs in the next generation of HEP experiments, designers must not only anticipate further improvements in FPGA hardware, but must also adopt design tools and methodologies that can scale along with that hardware. In this paper, we outline the major trends in FPGA hardware, describe the design challenges these trends will present to developers of HEP electronics, and discuss a range of techniques that can be adopted to overcome these challenges.

    A real-time FPGA-based implementation of a high-performance MIMO-OFDM mobile WiMAX transmitter

    Multiple Input Multiple Output (MIMO)-Orthogonal Frequency Division Multiplexing (OFDM) is considered a key technology in modern wireless-access communication systems. The IEEE 802.16e standard, also denoted as mobile WiMAX, utilizes MIMO-OFDM technology and was one of the first initiatives on the roadmap towards fourth generation systems. This paper presents the PHY-layer design, implementation and validation of a high-performance real-time 2x2 MIMO mobile WiMAX transmitter that accounts for low-level deployment issues and signal impairments. The focus is mainly on the impact of the selected high bandwidth, which scales the implementation complexity of the baseband signal processing algorithms. The latter also requires an advanced pipelined memory architecture to service, in a timely manner, the datapath operations that involve high memory utilization. We present in this paper a first evaluation of the extracted results that demonstrate the performance of the system using a 2x2 MIMO channel emulation.

    HW/SW Codesign and Design, Evaluation of Software Framework for AcENoCs : An FPGA-Accelerated NoC Emulation Platform

    The majority of modern-day compute-intensive applications are heterogeneous in nature. To support their ever-increasing computational requirements, present-day System-on-Chip (SoC) architectures have adopted a multicore style of modeling, thereby incorporating multiple, heterogeneous processing cores on a single chip. The emerging Network-on-Chip (NoC) interconnect paradigm provides a scalable and power-efficient solution for communication among multiple cores, serving as a powerful replacement for traditional bus-based architectures. A fast, robust and flexible emulation platform is the key to successful realization and validation of such architectures within a very short span of time. This research focuses on various aspects of Hardware/Software (HW/SW) codesign for AcENoCs (Accelerated Emulation Platform for NoCs), a Field Programmable Gate Array (FPGA) accelerated, configurable, cycle-accurate platform for emulation and validation of NoC architectures. This work also details the design, implementation and evaluation of AcENoCs' software framework, along with the various design optimizations carried out and tradeoffs considered in AcENoCs' HW/SW codesign for achieving an optimum balance between emulated network dimensions and emulation performance. The AcENoCs emulation platform is realized on a Xilinx Virtex-5 FPGA. AcENoCs' hardware framework consists of the NoC built using configurable hardware library components, while the software framework consists of Traffic Generators (TGs) and their associated source queues, Traffic Receptors (TRs), a statistics analysis module and a dynamically controlled emulation clock generator. The software framework is implemented on the on-chip Xilinx MicroBlaze processor. This report also describes the interaction between various HW/SW events in an emulation cycle and assesses AcENoCs' performance speedup and tradeoffs relative to existing FPGA emulators and software simulators. FPGA synthesis results showed that networks with dimensions up to 5x5 could be accommodated inside the device. Varying synthetic traffic workloads, generated by TGs, were used to evaluate the network. Real application-based traces were also run on the AcENoCs platform to evaluate the performance improvement achieved in comparison to software simulators. To improve emulator performance, software profiling was carried out to identify and optimize the software components consuming the highest number of processor cycles in an emulation cycle. Emulation test cases were run and latency values recorded for varying traffic patterns in order to evaluate the AcENoCs platform. Experimental results showed emulation speedups on the order of 10000-12000X over HDL (Hardware Description Language) simulators and 14-47X over software simulators, without sacrificing cycle accuracy.

    Performance Aspects of Synthesizable Computing Systems
