276 research outputs found
Vector support for multicore processors with major emphasis on configurable multiprocessors
It has recently become increasingly difficult to build higher-speed uniprocessor chips because of performance degradation and high power consumption. The quadratically increasing circuit complexity forbade the exploration of more instruction-level parallelism (ILP). To continue raising performance, processor designers then focused on thread-level parallelism (TLP) to realize a new architecture design paradigm. Multicore processor design is the result of this trend. It has proven quite capable of increasing performance and provides new opportunities in power management and system scalability. But current multicore processors do not provide powerful vector architecture support, which could yield significant speedups for array operations while maintaining area/power efficiency.
This dissertation proposes and presents the realization of an FPGA-based prototype of a multicore architecture with a shared vector unit (MCwSV). FPGA stands for Field-Programmable Gate Array. The idea is that rather than improving only scalar or TLP performance, some hardware budget could be used to realize a vector unit to greatly speed up applications abundant in data-level parallelism (DLP). Realistically, because of the limited parallelism in the application itself and the compiler's vectorizing abilities, most general-purpose programs can only be partially vectorized. Thus, for efficient resource usage, one vector unit should be shared by several scalar processors. This approach could also keep the overall budget within acceptable limits. We suggest that this type of vector-unit sharing be established in future multicore chips.
The design, implementation and evaluation of an MCwSV system with two scalar processors and a shared vector unit are presented for FPGA prototyping. The MicroBlaze processor, which is a commercial IP (Intellectual Property) core from Xilinx, is used as the scalar processor; in the experiments the vector unit is connected to a pair of MicroBlaze processors through standard bus interfaces. The overall system is organized in a decoupled and multi-banked structure. This organization provides substantial system scalability and better vector performance. For a given area budget, benchmarks from several areas show that the MCwSV system can provide a significant performance increase compared to a multicore system without a vector unit.
However, an MCwSV system with two MicroBlazes and a shared vector unit is not always an optimized system configuration for various applications with different percentages of vectorization. On the other hand, the MCwSV framework was designed for easy scalability to potentially incorporate various numbers of scalar/vector units and various function units. Also, the flexibility inherent to FPGAs can aid the task of matching target applications. These benefits can be taken into account to create optimized MCwSV systems for various applications. So the work eventually focused on building an architecture design framework incorporating performance and resource management for application-specific MCwSV (AS-MCwSV) systems. For embedded system design, resource usage, power consumption and execution latency are three metrics to be used in design tradeoffs. The product of these metrics is used here to choose the MCwSV system with the smallest value.
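The selection rule described above (pick the configuration whose product of resource usage, power consumption and execution latency is smallest) can be sketched as follows. This is only an illustrative sketch: the `Candidate` class, the `pick_best` helper, the units, and all numbers are hypothetical and do not come from the dissertation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One hypothetical MCwSV configuration with its three design metrics."""
    name: str
    resource_slices: float  # assumed unit: FPGA slices
    power_watts: float      # assumed unit: watts
    latency_ms: float       # assumed unit: milliseconds

    def cost(self) -> float:
        # Product of the three metrics; smaller is better.
        return self.resource_slices * self.power_watts * self.latency_ms


def pick_best(candidates):
    """Return the candidate with the smallest metric product."""
    return min(candidates, key=Candidate.cost)


# Made-up example values, for illustration only.
candidates = [
    Candidate("2 scalar + 1 vector", 9000, 2.0, 5.0),
    Candidate("4 scalar + 1 vector", 14000, 2.8, 3.0),
    Candidate("2 scalar + 2 vector", 12000, 2.5, 2.5),
]

best = pick_best(candidates)
print(best.name, best.cost())
```

Because the three metrics are multiplied rather than summed, their units do not need to be normalized against each other: rescaling any one metric rescales every candidate's product by the same factor and leaves the ranking unchanged.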
REAL-TIME ADAPTIVE PULSE COMPRESSION ON RECONFIGURABLE, SYSTEM-ON-CHIP (SOC) PLATFORMS
New radar applications need to perform complex algorithms and process a large quantity of data to generate useful information for the users. This situation has motivated the search for better processing solutions that include low-power high-performance processors, efficient algorithms, and high-speed interfaces. In this work, a hardware implementation of adaptive pulse compression algorithms for real-time transceiver optimization is presented, based on a System-on-Chip architecture for reconfigurable hardware devices. This study also evaluates the performance of dedicated coprocessors as hardware accelerator units to speed up and improve the computation of computing-intensive tasks such as matrix multiplication and matrix inversion, which are essential units to solve the covariance matrix. The tradeoffs between latency and hardware utilization are also presented. Moreover, the system architecture takes advantage of the embedded processor, which is interconnected with the logic resources through high-performance buses, to perform floating-point operations, control the processing blocks, and communicate with an external PC through a customized software interface. The overall system functionality is demonstrated and tested for real-time operations using a Ku-band testbed together with a low-cost channel emulator for different types of waveforms.
Customisable arithmetic hardware designs
Application-Specific Number Representation
Reconfigurable devices, such as Field Programmable Gate Arrays (FPGAs), enable
application-specific number representations. Well-known number formats include
fixed-point, floating-point, logarithmic number system (LNS), and residue number
system (RNS). Such different number representations lead to different arithmetic
designs and error behaviours, thus producing implementations with different
performance, accuracy, and cost.
To investigate the design options in number representations, the first part of this thesis presents
a platform that enables automated exploration of the number representation design space. The
second part of the thesis shows case studies that optimise the designs for area, latency or
throughput from the perspective of number representations.
Automated design space exploration in the first part addresses the following two major issues:
- Automation requires arithmetic unit generation. This thesis provides optimised
arithmetic library generators for logarithmic and residue arithmetic units, which support
a wide range of bit widths and achieve significant improvement over previous designs.
- Generation of arithmetic units requires specifying the bit widths for each
variable. This thesis describes an automatic bit-width optimisation tool called R-Tool,
which combines dynamic and static analysis methods, and supports different number
systems (fixed-point, floating-point, and LNS numbers).
Putting it all together, the second part explores the effects of application-specific number
representation on practical benchmarks, such as radiative Monte Carlo simulation, and seismic
imaging computations. Experimental results show that customising the number representations
brings benefits to hardware implementations: by selecting a more appropriate number format,
we can reduce the area cost by up to 73.5% and improve the throughput by 14.2% to 34.1%; by
performing the bit-width optimisation, we can further reduce the area cost by 9.7% to 17.3%.
On the performance side, hardware implementations with customised number formats achieve
5 to potentially over 40 times speedup over software implementations.
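As a rough illustration of what bit-width optimisation for a fixed-point format involves, the sketch below searches for the smallest number of fractional bits that keeps the rounding error on a set of sample values within a tolerance. It is a simulation-style (dynamic) search only and is not the R-Tool method described in the thesis; the function names, sample values, and tolerance are all assumptions made for this example.

```python
def quantize(x: float, frac_bits: int) -> float:
    """Round x to the nearest fixed-point value with frac_bits fractional bits."""
    scale = 1 << frac_bits
    return round(x * scale) / scale


def min_frac_bits(samples, tolerance: float, max_bits: int = 32) -> int:
    """Smallest fractional bit width whose worst-case error on samples is within tolerance."""
    for bits in range(max_bits + 1):
        worst = max(abs(x - quantize(x, bits)) for x in samples)
        if worst <= tolerance:
            return bits
    raise ValueError("no bit width up to max_bits meets the tolerance")


# Hypothetical sample values observed for one variable, and an assumed error budget.
samples = [0.1, 0.3333, 2.71828, -1.41421]
bits = min_frac_bits(samples, tolerance=1e-3)
print("fractional bits needed:", bits)
```

Round-to-nearest with b fractional bits has an error of at most 2^-(b+1), so for a tolerance of 1e-3 the search cannot return more than 9 bits; it may return fewer when the samples happen to lie close to representable values. A purely sample-driven search like this gives no guarantee for inputs outside the samples, which is one reason a tool might combine it with static analysis.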
Design of an FPGA-based parallel SIMD machine for power flow analysis
Power flow analysis consists of computationally intensive calculations on large matrices, consumes several hours of computational time, and has shown the need for the implementation of application-specific parallel machines. The potential of Single-Instruction stream Multiple-Data stream (SIMD) parallel architectures for efficient operations on large matrices has been demonstrated, as seen in the case of many existing supercomputers. The unsuitability of existing parallel machines for low-cost power system applications, their long design cycles, and the difficulty in using them show the need for application-specific SIMD machines. Advances in VLSI technology and Field-Programmable Gate Arrays (FPGAs) enable the implementation of Custom Computing Machines (CCMs), which can yield better performance for specific applications. The advent of Soft-Core processors made it possible to integrate reconfigurable logic as a slave to a peripheral bus and has demonstrated the ability to rapidly prototype complete systems on programmable chips. This thesis aims at designing and implementing an FPGA-based SIMD machine for power flow analysis. It presents the architecture of an SIMD machine that consists of an array of processing elements with mesh interconnection and a Soft-Core processor; the latter is used as the host. The FPGA-based SIMD machine is implemented on the Annapolis Microsystems Wildstar-II board that contains multiple Virtex-II FPGAs. The Soft-Core processor used is the Xilinx MicroBlaze and the application targeted is matrix multiplication.
A Survey on Design Methodologies for Accelerating Deep Learning on Heterogeneous Architectures
In recent years, the field of Deep Learning has seen many disruptive and
impactful advancements. Given the increasing complexity of deep neural
networks, the need for efficient hardware accelerators has become more and more
pressing to design heterogeneous HPC platforms. The design of Deep Learning
accelerators requires a multidisciplinary approach, combining expertise from
several areas, spanning from computer architecture to approximate computing,
computational models, and machine learning algorithms. Several methodologies
and tools have been proposed to design accelerators for Deep Learning,
including hardware-software co-design approaches, high-level synthesis methods,
specific customized compilers, and methodologies for design space exploration,
modeling, and simulation. These methodologies aim to maximize the exploitable
parallelism and minimize data movement to achieve high performance and energy
efficiency. This survey provides a holistic review of the most influential
design methodologies and EDA tools proposed in recent years to implement Deep
Learning accelerators, offering the reader a wide perspective in this rapidly
evolving field. In particular, this work complements the previous survey
proposed by the same authors in [203], which focuses on Deep Learning hardware
accelerators for heterogeneous HPC platforms.
hls4ml: An Open-Source Codesign Workflow to Empower Scientific Low-Power Machine Learning Devices
Accessible machine learning algorithms, software, and diagnostic tools for
energy-efficient devices and systems are extremely valuable across a broad
range of application domains. In scientific domains, real-time near-sensor
processing can drastically improve experimental design and accelerate
scientific discoveries. To support domain scientists, we have developed hls4ml,
an open-source software-hardware codesign workflow to interpret and translate
machine learning algorithms for implementation with both FPGA and ASIC
technologies. We expand on previous hls4ml work by extending capabilities and
techniques towards low-power implementations and increased usability: new
Python APIs, quantization-aware pruning, end-to-end FPGA workflows, long
pipeline kernels for low power, and new device backends including an ASIC
workflow. Taken together, these and continued efforts in hls4ml will arm a new
generation of domain scientists with accessible, efficient, and powerful tools
for machine-learning-accelerated discovery.
Comment: 10 pages, 8 figures, TinyML Research Symposium 202