83 research outputs found
Recommended from our members
The Advantage of Custom Microprocessors for Stochastic Gradient Descent in Graph-Based Robot Localization and Mapping
Simultaneous Localization and Mapping (SLAM) describes a class of problems facing a large and growing field of autonomous systems -- from self-driving cars, to interplanetary rovers, to home automation products. Unfortunately this is a complex task where sophisticated algorithms and data structures are required to navigate a wide range of uncharted environments. Furthermore, most mobile robots need to run these tasks near real-time onboard an embedded controller with limited power and compute resources. To address this problem we explore the stochastic gradient descent (SGD) variant of graph solvers for SLAM and observe a tradeoff between various execution architectures and overall execution speed. Based on these observations, we propose a custom multiprocessor design that relaxes memory-coherency constraints between parallel cores while avoiding divergent behavior. We introduce a specialized streaming-tree interconnect that provides increased performance while using fewer resources compared to state-of-art GPU/CPU implementations of SGD. Finally, we discuss applications of unconventional architectural paradigms like over-provisioned ādark processorsā and specialized data partitioning that provided a unique performance advantage for our particular design
Recommended from our members
Efficient architectures and power modelling of multiresolution analysis algorithms on FPGA
This thesis was submitted for the degree of Doctor of Philosophy and awarded by Brunel University.In the past two decades, there has been huge amount of interest in Multiresolution Analysis Algorithms (MAAs) and their applications. Processing some of their applications such as medical imaging are computationally intensive, power hungry and requires large amount of memory which cause a high demand for efficient algorithm implementation, low power architecture and acceleration. Recently, some MAAs such as Finite Ridgelet Transform (FRIT) Haar Wavelet Transform (HWT) are became very popular and they are suitable for a number of image processing applications such as detection of line singularities and contiguous edges, edge detection (useful for compression and feature detection), medical image denoising and segmentation. Efficient hardware implementation and acceleration of these algorithms particularly when addressing large problems are becoming very chal-lenging and consume lot of power which leads to a number of issues including mobility, reliability concerns. To overcome the computation problems, Field Programmable Gate Arrays (FPGAs) are the technology of choice for accelerating computationally intensive applications due to their high performance. Addressing the power issue requires optimi- sation and awareness at all level of abstractions in the design flow.
The most important achievements of the work presented in this thesis are summarised
here.
Two factorisation methodologies for HWT which are called HWT Factorisation Method1 and (HWTFM1) and HWT Factorasation Method2 (HWTFM2) have been explored to increase number of zeros and reduce hardware resources. In addition, two novel efficient and optimised architectures for proposed methodologies based on Distributed Arithmetic (DA) principles have been proposed. The evaluation of the architectural results have shown that the proposed architectures results have reduced the arithmetics calculation (additions/subtractions) by 33% and 25% respectively compared to direct implementa-tion of HWT and outperformed existing results in place. The proposed HWTFM2 is implemented on advanced and low power FPGA devices using Handel-C language. The FPGAs implementation results have outperformed other existing results in terms of area and maximum frequency. In addition, a novel efficient architecture for Finite Radon Trans-form (FRAT) has also been proposed. The proposed architecture is integrated with the developed HWT architecture to build an optimised architecture for FRIT. Strategies such as parallelism and pipelining have been deployed at the architectural level for efficient im-plementation on different FPGA devices. The proposed FRIT architecture performance has been evaluated and the results outperformed some other existing architecture in place. Both FRAT and FRIT architectures have been implemented on FPGAs using Handel-C language. The evaluation of both architectures have shown that the obtained results out-performed existing results in place by almost 10% in terms of frequency and area. The proposed architectures are also applied on image data (256 Ā£ 256) and their Peak Signal to Noise Ratio (PSNR) is evaluated for quality purposes.
Two architectures for cyclic convolution based on systolic array using parallelism and pipelining which can be used as the main building block for the proposed FRIT architec-ture have been proposed. The first proposed architecture is a linear systolic array with pipelining process and the second architecture is a systolic array with parallel process. The second architecture reduces the number of registers by 42% compare to first architec-ture and both architectures outperformed other existing results in place. The proposed pipelined architecture has been implemented on different FPGA devices with vector size (N) 4,8,16,32 and word-length (W=8). The implementation results have shown a signifi-cant improvement and outperformed other existing results in place.
Ultimately, an in-depth evaluation of a high level power macromodelling technique for design space exploration and characterisation of custom IP cores for FPGAs, called func-tional level power modelling approach have been presented. The mathematical techniques that form the basis of the proposed power modeling has been validated by a range of custom IP cores. The proposed power modelling is scalable, platform independent and compares favorably with existing approaches. A hybrid, top-down design flow paradigm integrating functional level power modelling with commercially available design tools for systematic optimisation of IP cores has also been developed. The in-depth evaluation of this tool enables us to observe the behavior of different custom IP cores in terms of power consumption and accuracy using different design methodologies and arithmetic techniques on virous FPGA platforms. Based on the results achieved, the proposed model accuracy is almost 99% true for all IP core's Dynamic Power (DP) components.Thomas Gerald Gray Charitable Trus
Techniques for Processing TCP/IP Flow Content in Network Switches at Gigabit Line Rates
The growth of the Internet has enabled it to become a critical component used by businesses, governments and individuals. While most of the trafļ¬c on the Internet is legitimate, a proportion of the trafļ¬c includes worms, computer viruses, network intrusions, computer espionage, security breaches and illegal behavior. This rogue trafļ¬c causes computer and network outages, reduces network throughput, and costs governments and companies billions of dollars each year. This dissertation investigates the problems associated with TCP stream processing in high-speed networks. It describes an architecture that simpliļ¬es the processing of TCP data streams in these environments and presents a hardware circuit capable of TCP stream processing on multi-gigabit networks for millions of simultaneous network connections. Live Internet trafļ¬c is analyzed using this new TCP processing circuit
Re-targetable tools and methodologies for the efficient deployment of high-level source code on coarse-grained dynamically reconfigurable architectures
Reconfigurable computing traditionally consists of a data path machine (such as an FPGA)
acting as a co-processor to a conventional microprocessor. This involves partitioning the application such that the data path intensive parts are implemented on the reconfigurable fabric, and
the control flow intensive parts are implemented on the microprocessor. Often the two parts
have to be written in different languages. New highly parallel data path architectures allow parallelism approaching that of FPGAs, but are able to be reconfigured very rapidly. As a result, it
is possible to use these architectures to perform control flow in a manner similar to a microprocessor, and thus a complete program can be described from an unmodified high-level language
(in particular C). This overcomes the historical instruction-level parallelism (ILP) wall.To make full use of the available parallelism , existing microprocessor tool flows are insufficient.
Data path machines are typically programmed via HDL tools from the ASIC design world.
This expresses algorithm s at a low er level than the application algorithm s are typically developed in. The work in this thesis builds upon earlier work to allow applications to be described
from high-level languages, by employing low-level optimisations in the compiler back-end and
working from the assembly, to maximise parallel efficiency. This consists of scheduling, where
known techniques are used to pack instructions into basic blocks that map well to the reconfigurable core (optimising spatial efficiency); then automatic pipelining is applied to dramatically
improve the achievable throughput (optimising temporal efficiency). Together these can be
thought of as āinstruction-level parallelism done rightā. Speed-ups of more than an order of
magnitude were achieved, yielding throughputs of 180-380M Pixels/s on typical image signal
processing tasks, matching the performance of hard-wired ASICs.Furthermore, conventional software-based simulation technologies for data path machines are
too slow for use in application verification. This thesis demonstrates how a high-speed software
emulator can be created for self-controlled dynamically reconfigurable data path machines,
using a static serialisation of the data paths in each configuration context. This yields run-time
performance several orders of magnitude higher than existing techniques, making it suitable for
use in feedback-directed optimisation
Increasing the efficacy of automated instruction set extension
The use of Instruction Set Extension (ISE) in customising embedded processors for a specific
application has been studied extensively in recent years. The addition of a set of complex
arithmetic instructions to a baseline core has proven to be a cost-effective means of meeting
design performance requirements. This thesis proposes and evaluates a reconfigurable ISE
implementation called āConfigurable Flow Acceleratorsā (CFAs), a number of refinements to
an existing Automated ISE (AISE) algorithm called āISEGENā, and the effects of source form
on AISE.
The CFA is demonstrated repeatedly to be a cost-effective design for ISE implementation.
A temporal partitioning algorithm called āstaggeringā is proposed and demonstrated on average
to reduce the area of CFA implementation by 37% for only an 8% reduction in acceleration.
This thesis then turns to concerns within the ISEGEN AISE algorithm. A methodology
for finding a good static heuristic weighting vector for ISEGEN is proposed and demonstrated.
Up to 100% of merit is shown to be lost or gained through the choice of vector. ISEGEN
early-termination is introduced and shown to improve the runtime of the algorithm by up to
7.26x, and 5.82x on average. An extension to the ISEGEN heuristic to account for pipelining
is proposed and evaluated, increasing acceleration by up to an additional 1.5x. An energyaware
heuristic is added to ISEGEN, which reduces the energy used by a CFA implementation
of a set of ISEs by an average of 1.6x, up to 3.6x. This result directly contradicts the frequently
espoused notion that ābigger is betterā in ISE.
The last stretch of work in this thesis is concerned with source-level transformation: the effect
of changing the representation of the application on the quality of the combined hardwaresoftware
solution. A methodology for combined exploration of source transformation and ISE
is presented, and demonstrated to improve the acceleration of the result by an average of 35%
versus ISE alone. Floating point is demonstrated to perform worse than fixed point, for all
design concerns and applications studied here, regardless of ISEs employed
Efficient architectures and power modelling of multiresolution analysis algorithms on FPGA
In the past two decades, there has been huge amount of interest in Multiresolution Analysis Algorithms (MAAs) and their applications. Processing some of their applications such as medical imaging are computationally intensive, power hungry and requires large amount of memory which cause a high demand for efficient algorithm implementation, low power architecture and acceleration. Recently, some MAAs such as Finite Ridgelet Transform (FRIT) Haar Wavelet Transform (HWT) are became very popular and they are suitable for a number of image processing applications such as detection of line singularities and contiguous edges, edge detection (useful for compression and feature detection), medical image denoising and segmentation. Efficient hardware implementation and acceleration of these algorithms particularly when addressing large problems are becoming very chal-lenging and consume lot of power which leads to a number of issues including mobility, reliability concerns. To overcome the computation problems, Field Programmable Gate Arrays (FPGAs) are the technology of choice for accelerating computationally intensive applications due to their high performance. Addressing the power issue requires optimi- sation and awareness at all level of abstractions in the design flow. The most important achievements of the work presented in this thesis are summarised here. Two factorisation methodologies for HWT which are called HWT Factorisation Method1 and (HWTFM1) and HWT Factorasation Method2 (HWTFM2) have been explored to increase number of zeros and reduce hardware resources. In addition, two novel efficient and optimised architectures for proposed methodologies based on Distributed Arithmetic (DA) principles have been proposed. The evaluation of the architectural results have shown that the proposed architectures results have reduced the arithmetics calculation (additions/subtractions) by 33% and 25% respectively compared to direct implementa-tion of HWT and outperformed existing results in place. The proposed HWTFM2 is implemented on advanced and low power FPGA devices using Handel-C language. The FPGAs implementation results have outperformed other existing results in terms of area and maximum frequency. In addition, a novel efficient architecture for Finite Radon Trans-form (FRAT) has also been proposed. The proposed architecture is integrated with the developed HWT architecture to build an optimised architecture for FRIT. Strategies such as parallelism and pipelining have been deployed at the architectural level for efficient im-plementation on different FPGA devices. The proposed FRIT architecture performance has been evaluated and the results outperformed some other existing architecture in place. Both FRAT and FRIT architectures have been implemented on FPGAs using Handel-C language. The evaluation of both architectures have shown that the obtained results out-performed existing results in place by almost 10% in terms of frequency and area. The proposed architectures are also applied on image data (256 Ā£ 256) and their Peak Signal to Noise Ratio (PSNR) is evaluated for quality purposes. Two architectures for cyclic convolution based on systolic array using parallelism and pipelining which can be used as the main building block for the proposed FRIT architec-ture have been proposed. The first proposed architecture is a linear systolic array with pipelining process and the second architecture is a systolic array with parallel process. The second architecture reduces the number of registers by 42% compare to first architec-ture and both architectures outperformed other existing results in place. The proposed pipelined architecture has been implemented on different FPGA devices with vector size (N) 4,8,16,32 and word-length (W=8). The implementation results have shown a signifi-cant improvement and outperformed other existing results in place. Ultimately, an in-depth evaluation of a high level power macromodelling technique for design space exploration and characterisation of custom IP cores for FPGAs, called func-tional level power modelling approach have been presented. The mathematical techniques that form the basis of the proposed power modeling has been validated by a range of custom IP cores. The proposed power modelling is scalable, platform independent and compares favorably with existing approaches. A hybrid, top-down design flow paradigm integrating functional level power modelling with commercially available design tools for systematic optimisation of IP cores has also been developed. The in-depth evaluation of this tool enables us to observe the behavior of different custom IP cores in terms of power consumption and accuracy using different design methodologies and arithmetic techniques on virous FPGA platforms. Based on the results achieved, the proposed model accuracy is almost 99% true for all IP core's Dynamic Power (DP) components.EThOS - Electronic Theses Online ServiceThomas Gerald Gray Charitable TrustGBUnited Kingdo
- ā¦