1,402 research outputs found
Recommended from our members
Efficient FPGA implementation and power modelling of image and signal processing IP cores
This thesis was submitted for the degree of Doctor of Philosophy and awarded by Brunel University.Field Programmable Gate Arrays (FPGAs) are the technology of choice in a number ofimage
and signal processing application areas such as consumer electronics, instrumentation,
medical data processing and avionics due to their reasonable energy consumption, high performance, security, low design-turnaround time and reconfigurability. Low power FPGA
devices are also emerging as competitive solutions for mobile and thermally constrained platforms. Most computationally intensive image and signal processing algorithms also consume a lot of power leading to a number of issues including reduced mobility, reliability concerns and increased design cost among others. Power dissipation has become one of the most important challenges, particularly for FPGAs. Addressing this problem requires optimisation and awareness at all levels in the design flow. The key achievements of the
work presented in this thesis are summarised here. Behavioural level optimisation strategies have been used for implementing matrix product and inner product through the use of mathematical techniques such as Distributed Arithmetic (DA) and its variations including offset binary coding, sparse factorisation and novel vector level transformations. Applications to test the impact of these algorithmic and arithmetic transformations include the fast Hadamard/Walsh transforms and Gaussian mixture models. Complete design space exploration has been performed on these cores, and where appropriate, they have been shown to clearly outperform comparable existing implementations. At the architectural level, strategies such as parallelism, pipelining and systolisation have been successfully applied for the design and optimisation of a number of
cores including colour space conversion, finite Radon transform, finite ridgelet transform and circular convolution. A pioneering study into the influence of supply voltage scaling for FPGA based designs, used in conjunction with performance enhancing strategies such as parallelism and pipelining has been performed. Initial results are very promising and indicated significant potential for future research in this area.
A key contribution of this work includes the development of a novel high level power macromodelling technique for design space exploration and characterisation of custom IP cores for FPGAs, called Functional Level Power Analysis and Modelling (FLPAM). FLPAM
is scalable, platform independent and compares favourably with existing approaches. A hybrid, top-down design flow paradigm integrating FLPAM with commercially available design tools for systematic optimisation of IP cores has also been developed
An Experimental Study of Reduced-Voltage Operation in Modern FPGAs for Neural Network Acceleration
We empirically evaluate an undervolting technique, i.e., underscaling the
circuit supply voltage below the nominal level, to improve the power-efficiency
of Convolutional Neural Network (CNN) accelerators mapped to Field Programmable
Gate Arrays (FPGAs). Undervolting below a safe voltage level can lead to timing
faults due to excessive circuit latency increase. We evaluate the
reliability-power trade-off for such accelerators. Specifically, we
experimentally study the reduced-voltage operation of multiple components of
real FPGAs, characterize the corresponding reliability behavior of CNN
accelerators, propose techniques to minimize the drawbacks of reduced-voltage
operation, and combine undervolting with architectural CNN optimization
techniques, i.e., quantization and pruning. We investigate the effect of
environmental temperature on the reliability-power trade-off of such
accelerators. We perform experiments on three identical samples of modern
Xilinx ZCU102 FPGA platforms with five state-of-the-art image classification
CNN benchmarks. This approach allows us to study the effects of our
undervolting technique for both software and hardware variability. We achieve
more than 3X power-efficiency (GOPs/W) gain via undervolting. 2.6X of this gain
is the result of eliminating the voltage guardband region, i.e., the safe
voltage region below the nominal level that is set by FPGA vendor to ensure
correct functionality in worst-case environmental and circuit conditions. 43%
of the power-efficiency gain is due to further undervolting below the
guardband, which comes at the cost of accuracy loss in the CNN accelerator. We
evaluate an effective frequency underscaling technique that prevents this
accuracy loss, and find that it reduces the power-efficiency gain from 43% to
25%.Comment: To appear at the DSN 2020 conferenc
Low power techniques for video compression
This paper gives an overview of low-power techniques proposed in the literature for mobile multimedia and Internet applications. Exploitable aspects are discussed in the behavior of different video compression tools. These power-efficient solutions are then classified by synthesis domain and level of abstraction. As this paper is meant to be a starting point for further research in the area, a lowpower hardware & software co-design methodology is outlined in the end as a possible scenario for video-codec-on-a-chip implementations on future mobile multimedia platforms
Achieving High Speed CFD simulations: Optimization, Parallelization, and FPGA Acceleration for the unstructured DLR TAU Code
Today, large scale parallel simulations are fundamental tools to handle complex problems. The number of processors in current computation platforms has been recently increased and therefore it is necessary to optimize the application performance and to enhance the scalability of massively-parallel systems. In addition, new heterogeneous architectures, combining conventional processors with specific hardware, like FPGAs, to accelerate the most time consuming functions are considered as a strong alternative to boost the performance.
In this paper, the performance of the DLR TAU code is analyzed and optimized. The improvement of the code efficiency is addressed through three key activities: Optimization, parallelization and hardware acceleration. At first, a profiling analysis of the most time-consuming processes of the Reynolds Averaged Navier Stokes flow solver on a three-dimensional unstructured mesh is performed. Then, a study of the code scalability with new partitioning algorithms are tested to show the most suitable partitioning algorithms for the selected applications. Finally, a feasibility study on the application of FPGAs and GPUs for the hardware acceleration of CFD simulations is presented
Recommended from our members
Efficient architectures and power modelling of multiresolution analysis algorithms on FPGA
This thesis was submitted for the degree of Doctor of Philosophy and awarded by Brunel University.In the past two decades, there has been huge amount of interest in Multiresolution Analysis Algorithms (MAAs) and their applications. Processing some of their applications such as medical imaging are computationally intensive, power hungry and requires large amount of memory which cause a high demand for efficient algorithm implementation, low power architecture and acceleration. Recently, some MAAs such as Finite Ridgelet Transform (FRIT) Haar Wavelet Transform (HWT) are became very popular and they are suitable for a number of image processing applications such as detection of line singularities and contiguous edges, edge detection (useful for compression and feature detection), medical image denoising and segmentation. Efficient hardware implementation and acceleration of these algorithms particularly when addressing large problems are becoming very chal-lenging and consume lot of power which leads to a number of issues including mobility, reliability concerns. To overcome the computation problems, Field Programmable Gate Arrays (FPGAs) are the technology of choice for accelerating computationally intensive applications due to their high performance. Addressing the power issue requires optimi- sation and awareness at all level of abstractions in the design flow.
The most important achievements of the work presented in this thesis are summarised
here.
Two factorisation methodologies for HWT which are called HWT Factorisation Method1 and (HWTFM1) and HWT Factorasation Method2 (HWTFM2) have been explored to increase number of zeros and reduce hardware resources. In addition, two novel efficient and optimised architectures for proposed methodologies based on Distributed Arithmetic (DA) principles have been proposed. The evaluation of the architectural results have shown that the proposed architectures results have reduced the arithmetics calculation (additions/subtractions) by 33% and 25% respectively compared to direct implementa-tion of HWT and outperformed existing results in place. The proposed HWTFM2 is implemented on advanced and low power FPGA devices using Handel-C language. The FPGAs implementation results have outperformed other existing results in terms of area and maximum frequency. In addition, a novel efficient architecture for Finite Radon Trans-form (FRAT) has also been proposed. The proposed architecture is integrated with the developed HWT architecture to build an optimised architecture for FRIT. Strategies such as parallelism and pipelining have been deployed at the architectural level for efficient im-plementation on different FPGA devices. The proposed FRIT architecture performance has been evaluated and the results outperformed some other existing architecture in place. Both FRAT and FRIT architectures have been implemented on FPGAs using Handel-C language. The evaluation of both architectures have shown that the obtained results out-performed existing results in place by almost 10% in terms of frequency and area. The proposed architectures are also applied on image data (256 Ā£ 256) and their Peak Signal to Noise Ratio (PSNR) is evaluated for quality purposes.
Two architectures for cyclic convolution based on systolic array using parallelism and pipelining which can be used as the main building block for the proposed FRIT architec-ture have been proposed. The first proposed architecture is a linear systolic array with pipelining process and the second architecture is a systolic array with parallel process. The second architecture reduces the number of registers by 42% compare to first architec-ture and both architectures outperformed other existing results in place. The proposed pipelined architecture has been implemented on different FPGA devices with vector size (N) 4,8,16,32 and word-length (W=8). The implementation results have shown a signifi-cant improvement and outperformed other existing results in place.
Ultimately, an in-depth evaluation of a high level power macromodelling technique for design space exploration and characterisation of custom IP cores for FPGAs, called func-tional level power modelling approach have been presented. The mathematical techniques that form the basis of the proposed power modeling has been validated by a range of custom IP cores. The proposed power modelling is scalable, platform independent and compares favorably with existing approaches. A hybrid, top-down design flow paradigm integrating functional level power modelling with commercially available design tools for systematic optimisation of IP cores has also been developed. The in-depth evaluation of this tool enables us to observe the behavior of different custom IP cores in terms of power consumption and accuracy using different design methodologies and arithmetic techniques on virous FPGA platforms. Based on the results achieved, the proposed model accuracy is almost 99% true for all IP core's Dynamic Power (DP) components.Thomas Gerald Gray Charitable Trus
Automated Hardware Prototyping for 3D Network on Chips
Vor mehr als 50 Jahren stellte IntelĀ® MitbegrĆ¼nder Gordon Moore eine Prognose zum Entwicklungsprozess der Transistortechnologie auf. Er prognostizierte, dass sich die Zahl der Transistoren in integrierten Schaltungen alle zwei Jahre verdoppeln wird. Seine Aussage ist immer noch gĆ¼ltig, aber ein Ende von Moores Gesetz ist in Sicht. Mit dem Ende von Mooreās Gesetz mĆ¼ssen neue Aspekte untersucht werden, um weiterhin die Leistung von integrierten Schaltungen zu steigern. Zwei mƶgliche AnsƤtze fĆ¼r "More than Mooreā sind 3D-Integrationsverfahren und heterogene Systeme. Gleichzeitig entwickelt sich ein Trend hin zu Multi-Core Prozessoren, basierend auf Networks on chips (NoCs).
Neben dem Ende des Mooreschen Gesetzes ergeben sich bei immer kleiner werdenden TechnologiegrƶĆen, vor allem jenseits der 60 nm, neue Herausforderungen. Eine Schwierigkeit ist die WƤrmeableitung in groĆskalierten integrierten Schaltkreisen und die daraus resultierende Ćberhitzung des Chips. Um diesem Problem in modernen Multi-Core Architekturen zu begegnen, muss auch die Verlustleistung der Netzwerkressourcen stark reduziert werden. Diese Arbeit umfasst eine durch Hardware gesteuerte Kombination aus Frequenzskalierung und Power Gating fĆ¼r 3D On-Chip Netzwerke, einschlieĆlich eines FPGA Prototypen. DafĆ¼r wurde ein Takt-synchrones 2D Netzwerk auf ein dreidimensionales asynchrones Netzwerk mit mehreren Frequenzbereichen erweitert. ZusƤtzlich wurde ein skalierbares Online-Power-Management System mit geringem Ressourcenaufwand entwickelt.
Die Verifikation neuer Hardwarekomponenten ist einer der zeitaufwendigsten Schritte im Entwicklungsprozess hochintegrierter digitaler Schaltkreise. Um diese Aufgabe zu beschleunigen und um eine parallele Softwareentwicklung zu ermƶglichen, wurde im Rahmen dieser Arbeit ein automatisiertes und benutzerfreundliches Tool fĆ¼r den Entwurf neuer Hardware Projekte entwickelt. Eine grafische BenutzeroberflƤche zum Erstellen des gesamten Designablaufs, vom Erstellen der Architektur, Parameter Deklaration, Simulation, Synthese und Test ist Teil dieses Werkzeugs. Zudem stellt die GrƶĆe der Architektur fĆ¼r die Erstellung eines Prototypen eine besondere Herausforderung dar. FrĆ¼here Arbeiten haben es versƤumt, eine schnelles und unkompliziertes Prototyping, insbesondere von Architekturen mit mehr als 50 Prozessorkernen, zu realisieren. Diese Arbeit umfasst eine Design Space Exploration und FPGA-basierte Prototypen von verschiedenen 3D-NoC Implementierungen mit mehr als 80 Prozessoren
Circuit design and analysis for on-FPGA communication systems
On-chip communication system has emerged as a prominently important subject in Very-Large-
Scale-Integration (VLSI) design, as the trend of technology scaling favours logics more than interconnects.
Interconnects often dictates the system performance, and, therefore, research for new
methodologies and system architectures that deliver high-performance communication services
across the chip is mandatory. The interconnect challenge is exacerbated in Field-Programmable
Gate Array (FPGA), as a type of ASIC where the hardware can be programmed post-fabrication.
Communication across an FPGA will be deteriorating as a result of interconnect scaling. The programmable
fabrics, switches and the specific routing architecture also introduce additional latency
and bandwidth degradation further hindering intra-chip communication performance.
Past research efforts mainly focused on optimizing logic elements and functional units in FPGAs.
Communication with programmable interconnect received little attention and is inadequately understood.
This thesis is among the first to research on-chip communication systems that are built on
top of programmable fabrics and proposes methodologies to maximize the interconnect throughput
performance. There are three major contributions in this thesis: (i) an analysis of on-chip
interconnect fringing, which degrades the bandwidth of communication channels due to routing
congestions in reconfigurable architectures; (ii) a new analogue wave signalling scheme that significantly
improves the interconnect throughput by exploiting the fundamental electrical characteristics
of the reconfigurable interconnect structures. This new scheme can potentially mitigate
the interconnect scaling challenges. (iii) a novel Dynamic Programming (DP)-network to provide
adaptive routing in network-on-chip (NoC) systems. The DP-network architecture performs runtime
optimization for route planning and dynamic routing which, effectively utilizes the in-silicon
bandwidth. This thesis explores a new horizon in reconfigurable system design, in which new
methodologies and concepts are proposed to enhance the on-FPGA communication throughput
performance that is of vital importance in new technology processes
- ā¦