188 research outputs found
A RECONFIGURABLE AND EXTENSIBLE EXPLORATION PLATFORM FOR FUTURE HETEROGENEOUS SYSTEMS
Accelerator-based -or heterogeneous- computing has become increasingly
important in a variety of scenarios, ranging from High-Performance Computing (HPC) to embedded systems. While most solutions use sometimes
custom-made components, most of today’s systems rely on commodity highend CPUs and/or GPU devices, which deliver adequate performance while
ensuring programmability, productivity, and application portability. Unfortunately, pure general-purpose hardware is affected by inherently limited
power-efficiency, that is, low GFLOPS-per-Watt, now considered as a primary metric. The many-core model and architectural customization can
play here a key role, as they enable unprecedented levels of power-efficiency
compared to CPUs/GPUs. However, such paradigms are still immature and
deeper exploration is indispensable.
This dissertation investigates customizability and proposes novel solutions
for heterogeneous architectures, focusing on mechanisms related to coherence and network-on-chip (NoC). First, the work presents a non-coherent
scratchpad memory with a configurable bank remapping system to reduce
bank conflicts. The experimental results show the benefits of both using a
customizable hardware bank remapping function and non-coherent memories for some types of algorithms. Next, we demonstrate how a distributed
synchronization master better suits many-cores than standard centralized
solutions. This solution, inspired by the directory-based coherence mechanism, supports concurrent synchronizations without relying on memory
transactions. The results collected for different NoC sizes provided indications about the area overheads incurred by our solution and demonstrated
the benefits of using a dedicated hardware synchronization support. Finally, this dissertation proposes an advanced coherence subsystem, based
on the sparse directory approach, with a selective coherence maintenance
system which allows coherence to be deactivated for blocks that do not require it. Experimental results show that the use of a hybrid coherent and
non-coherent architectural mechanism along with an extended coherence
protocol can enhance performance.
The above results were all collected by means of a modular and customizable heterogeneous many-core system developed to support the exploration
of power-efficient high-performance computing architectures. The system is
based on a NoC and a customizable GPU-like accelerator core, as well as
a reconfigurable coherence subsystem, ensuring application-specific configuration capabilities. All the explored solutions were evaluated on this real heterogeneous system, which comes along with the above methodological
results as part of the contribution in this dissertation. In fact, as a key
benefit, the experimental platform enables users to integrate novel hardware/software solutions on a full-system scale, whereas existing platforms
do not always support a comprehensive heterogeneous architecture exploration
Control Plane Hardware Design for Optical Packet Switched Data Centre Networks
Optical packet switching for intra-data centre networks is key to addressing traffic requirements. Photonic integration and wavelength division multiplexing (WDM) can overcome bandwidth limits in switching systems. A promising technology to build a nanosecond-reconfigurable photonic-integrated switch, compatible with WDM, is the semiconductor optical amplifier (SOA). SOAs are typically used as gating elements in a broadcast-and-select (B\&S) configuration, to build an optical crossbar switch. For larger-size switching, a three-stage Clos network, based on crossbar nodes, is a viable architecture. However, the design of the switch control plane, is one of the barriers to packet switching; it should run on packet timescales, which becomes increasingly challenging as line rates get higher. The scheduler, used for the allocation of switch paths, limits control clock speed. To this end, the research contribution was the design of highly parallel hardware schedulers for crossbar and Clos network switches. On a field-programmable gate array (FPGA), the minimum scheduler clock period achieved was 5.0~ns and 5.4~ns, for a 32-port crossbar and Clos switch, respectively. By using parallel path allocation modules, one per Clos node, a minimum clock period of 7.0~ns was achieved, for a 256-port switch. For scheduler application-specific integrated circuit (ASIC) synthesis, this reduces to 2.0~ns; a record result enabling scalable packet switching. Furthermore, the control plane was demonstrated experimentally. Moreover, a cycle-accurate network emulator was developed to evaluate switch performance. Results showed a switch saturation throughput at a traffic load 60\% of capacity, with sub-microsecond packet latency, for a 256-port Clos switch, outperforming state-of-the-art optical packet switches
Design, Implementation and Evaluation of a Configurable NoC for AcENoCs FPGA Accelerated Emulation Platform
The heterogenous nature and the demand for extensive parallel processing in modern applications have resulted in widespread use of Multicore System-on-Chip (SoC) architectures. The emerging Network-on-Chip (NoC) architecture provides an energy-efficient and scalable communication solution for Multicore SoCs, serving as a powerful replacement for traditional bus-based solutions. The key to successful realization of such architectures is a flexible, fast and robust emulation platform for fast design space exploration. In this research, we present the design and evaluation of a highly configurable NoC used in AcENoCs (Accelerated Emulation platform for NoCs), a flexible and cycle accurate field programmable gate array (FPGA) emulation platform for validating NoC architectures. Along with the implementation details, we also discuss the various design optimizations and tradeoffs, and assess the performance improvements of AcENoCs over existing simulators and emulators. We design a hardware library consisting of routers and links using verilog hardware description language (HDL). The router is parameterized and has a configurable number of physical ports, virtual channels (VCs) and pipeline depth. A packet switched NoC is constructed by connecting the routers in either 2D-Mesh or 2D-Torus topology. The NoC is integrated in the AcENoCs platform and prototyped on Xilinx Virtex-5 FPGA. The NoC was evaluated under various synthetic and realistic workloads generated by AcENoCs' traffic generators implemented on the Xilinx MicroBlaze embedded processor. In order to validate the NoC design, performance metrics like average latency and throughput were measured and compared against the results obtained using standard network simulators. FPGA implementation of the NoC using Xilinx tools indicated a 76% LUT utilization for a 5x5 2D-Mesh network. A VC allocator was found to be the single largest consumer of hardware resources within a router. The router design synthesized at a frequency of 135MHz, 124MHz and 109MHz for 3-port, 4-port and 5-port configurations, respectively. The operational frequency of the router in the AcENoCs environment was limited only by the software execution latency even though the hardware itself could be clocked at a much higher rate. An AcENoCs emulator showed speedup improvements of 10000-12000X over HDL simulators and 5-15X over software simulators, without sacrificing cycle accuracy
Advances in Architectures and Tools for FPGAs and their Impact on the Design of Complex Systems for Particle Physics
The continual improvement of semiconductor technology has provided rapid advancements in device frequency and density. Designers of electronics systems for high-energy physics (HEP) have benefited from these advancements, transitioning many designs from fixed-function ASICs to more flexible FPGA-based platforms. Today’s FPGA devices provide a significantly higher amount of resources than those available during the initial Large Hadron Collider design phase. To take advantage of the capabilities of future FPGAs in the next generation of HEP experiments, designers must not only anticipate further improvements in FPGA hardware, but must also adopt design tools and methodologies that can scale along with that hardware. In this paper, we outline the major trends in FPGA hardware, describe the design challenges these trends will present to developers of HEP electronics, and discuss a range of techniques that can be adopted to overcome these challenges
A real-time FPGA-based implementation of a high-performance MIMO-OFDM mobile WiMAX transmitter
The Multiple Input Multiple Output (MIMO)-Orthogonal
Frequency Division Multiplexing (OFDM) is considered a key technology
in modern wireless-access communication systems. The IEEE 802.16e
standard, also denoted as mobile WiMAX, utilizes the MIMO-OFDM
technology and it was one of the first initiatives towards the roadmap of
fourth generation systems. This paper presents the PHY-layer design, implementation
and validation of a high-performance real-time 2x2 MIMO
mobile WiMAX transmitter that accounts for low-level deployment issues
and signal impairments. The focus is mainly laid on the impact of
the selected high bandwidth, which scales the implementation complexity
of the baseband signal processing algorithms. The latter also requires
an advanced pipelined memory architecture to timely address the datapath
operations that involve high memory utilization. We present in this
paper a first evaluation of the extracted results that demonstrate the
performance of the system using a 2x2 MIMO channel emulation.Postprint (published version
HW/SW Codesign and Design, Evaluation of Software Framework for AcENoCs : An FPGA-Accelerated NoC Emulation Platform
Majority of the modern day compute intensive applications are heterogeneous
in nature. To support their ever increasing computational requirements, present
day System-on-Chip (SoC) architectures have adapted multicore style of modeling,
thereby incorporating multiple, heterogeneous processing cores on a single chip. The
emerging Network-On-Chip (NoC) interconnect paradigm provides a scalable and
power-efficient solution for communication among multiple cores, serving as a powerful
replacement for traditional bus based architectures. A fast, robust and
exible
emulation platform is the key to successful realization and validation of such architectures
within a very short span of time.
This research focuses on various aspects of Hardware/Software (HW/SW) codesign
for AcENoCs (Accelerated Emulation Platform for NoCs), a Field Programmable
Gate Array (FPGA) accelerated, con gurable, cycle accurate platform for emulation
and validation of NoC architectures. This work also details the design, implementation
and evaluation of AcENoCs' software framework along with the various design
optimizations carried out and tradeoffs considered in AcENoCs' HW/SW codesign
for achieving an optimum balance between emulated network dimensions and emulation
performance. AcENoCs emulation platform is realized on a Xilinx Virtex-5
FPGA. AcENoCs' hardware framework consists of the NoC built using configurable
hardware library components, while the software framework consists of Traffic Generators
(TGs) and their associated source queues, Traffic Receptors (TRs) along with statistics analysis module and dynamically controlled emulation clock generator. The
software framework is implemented using on-chip Xilinx MicroBlaze processor. This
report also describes the interaction between various HW/SW events in an emulation
cycle and assesses AcENoCs' performance speedup and tradeoffs over existing FPGA
emulators and software simulators.
FPGA synthesis results showed that networks with dimensions upto 5x5 could be
accommodated inside the device. Varying synthetic traffic workloads, generated by
TGs, were used to evaluate the network. Real application based traces were also run
on AcENoCs platform to evaluate the performance improvement achieved in comparison
to software simulators. For improving the emulator performance, software
profiling was carried out to identify and optimize the software components consuming
highest number of processor cycles in an emulation cycle. Emulation testcases
were run and latency values recorded for varying traffic patterns in order to evaluate
AcENoCs platform. Experimental results showed emulation speedups in order
of 10000-12000X over HDL (Hardware Description Language) simulators and 14-47X
over software simulators, without sacri cing cycle accuracy
Customisable arithmetic hardware designs
Imperial Users onl
- …