Digital VLSI Implementation of Piecewise-Affine Controllers Based on Lattice Approach
This paper presents a small, fast, low-power consumption solution for piecewise-affine (PWA) controllers. To achieve this goal, a digital architecture for very-large-scale integration (VLSI) circuits is proposed. The implementation is based on the simplest lattice form, which eliminates the point location problem of other PWA representations and is able to provide continuous PWA controllers defined over generic partitions of the input domain. The architecture is parameterized in terms of number of inputs, outputs, signal resolution, and features of the controller to be generated. The design flows for field-programmable gate arrays and application-specific integrated circuits are detailed. Several application examples of explicit model predictive controllers (such as an adaptive cruise control and the control of a buck-boost dc-dc converter) are included to illustrate the performance of the VLSI solution obtained with the proposed lattice-based architecture.
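To illustrate why a lattice form avoids point location, here is a minimal software sketch of one common lattice representation, a minimum over groups of maxima of affine terms. The function names, the data layout, and the saturation example are assumptions made for illustration; the paper's exact lattice form and its hardware datapath may differ.

```python
from typing import List, Sequence, Tuple

# One affine term is (a, b), representing a^T x + b.
AffineTerm = Tuple[Sequence[float], float]


def affine(a: Sequence[float], b: float, x: Sequence[float]) -> float:
    """Evaluate a^T x + b."""
    return sum(ai * xi for ai, xi in zip(a, x)) + b


def lattice_pwa(x: Sequence[float],
                terms: List[AffineTerm],
                groups: List[List[int]]) -> float:
    """Evaluate min over groups of (max over the group's affine terms).

    No search for the region containing x is needed: every term is
    evaluated and the result is selected with min/max comparisons only.
    """
    return min(max(affine(*terms[j], x) for j in group) for group in groups)


# Hypothetical example: 1-D saturation, sat(x) = min(1, max(-1, x)).
SAT_TERMS: List[AffineTerm] = [([1.0], 0.0), ([0.0], -1.0), ([0.0], 1.0)]
SAT_GROUPS: List[List[int]] = [[0, 1], [2]]

if __name__ == "__main__":
    for value in (-3.0, 0.5, 3.0):
        print(value, lattice_pwa([value], SAT_TERMS, SAT_GROUPS))
```

The saturation controller has three regions, yet the evaluation never asks which region the input lies in; that is the property the abstract refers to.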
High-Performance Architecture for Binary-Tree-Based Finite State Machines
A binary-tree-based finite state machine (BT-FSM)
is a state machine with a 1-bit input signal whose state transition
graph is a binary tree. BT-FSMs are useful in those
application areas where searching in a binary tree is required,
such as computer networks, compression, automatic control, or
cryptography. This paper presents a new architecture for implementing
BT-FSMs, based on the finite virtual state
machine (FVSM) model. The proposed architecture has been compared
with the general FVSM and conventional approaches by using
both synthetic test benches and very large BT-FSMs obtained
from a real application. In synthetic test benches, the average
speed improvement of the proposed architecture with respect to the
best results of the other approaches reaches 41% (in
some cases the speed more than doubles). In the
case of the real application, the average speed improvement
reaches 155%.
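As a behavioural illustration only, a BT-FSM can be modelled in software as a walk down a binary tree driven by a 1-bit input. The sketch below uses a small prefix-code decoder as the example tree; the tree, the symbol table, and the reset-to-root behaviour at leaves are assumptions for illustration and say nothing about the FVSM-based hardware architecture itself.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical tree: internal node -> (next state on input 0, next state on input 1).
CHILDREN: Dict[int, Tuple[int, int]] = {0: (1, 2), 2: (3, 4)}
# Leaves emit a symbol; here the codes are a=0, b=10, c=11.
LEAF_SYMBOL: Dict[int, str] = {1: "a", 3: "b", 4: "c"}


def run_bt_fsm(bits: Iterable[int], root: int = 0) -> str:
    """Consume one input bit per step and follow the matching tree edge.

    On reaching a leaf, emit its symbol and return to the root
    (an assumed convention for this decoder example).
    """
    state = root
    out = []
    for bit in bits:
        state = CHILDREN[state][bit]
        if state in LEAF_SYMBOL:
            out.append(LEAF_SYMBOL[state])
            state = root
    return "".join(out)


if __name__ == "__main__":
    print(run_bt_fsm([0, 1, 0, 1, 1, 0]))
```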
Customisable arithmetic hardware designs
Improved Generation of Identifiers, Secret Keys, and Random Numbers From SRAMs
This paper presents a method to simultaneously improve the quality of the identifiers, secret keys, and random numbers that can be generated from the start-up values of standard static random access memories (SRAMs). The method is based on classifying memory cells after evaluating their start-up values at multiple measurements in a registration phase. The registration can be done without unplugging the device from its application context, and with no need for a complex laboratory setup. The method has been validated experimentally with standard low-power SRAM modules in two different application specific integrated circuits (ASICs) fabricated with the 90-nm TSMC technology. The results show that with a simple registration the length of the identifiers can be reduced by 45%, the worst case bit error probability (which defines the complexity of the error correcting code needed to recover a secret key) can be reduced by 64%, and the worst case minimum entropy value is improved, thus reducing the number of bits that have to be processed to obtain full entropy by 81%. The method can be applied to standard digital designs by controlling the external power supply to the SRAM using software or by incorporating simple circuitry in the design. In the latter case, a module for implementing the method in an ASIC designed in the 90-nm TSMC technology occupies an active area of 42,025 µm².
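The registration idea, classifying cells from repeated start-up measurements, can be sketched as follows. This is a simplified illustration under stated assumptions: the two-class split, the `stable_margin` threshold, and the function name are hypothetical and are not taken from the paper, whose actual classification criteria are not given in this abstract.

```python
from typing import List, Sequence, Tuple


def classify_cells(measurements: Sequence[Sequence[int]],
                   stable_margin: float = 0.0) -> Tuple[List[int], List[int]]:
    """Split cell indices into (stable, unstable) from repeated power-ups.

    measurements[k][c] is the start-up bit of cell c at power-up k.
    A cell whose fraction of ones is within `stable_margin` of 0 or 1 is
    treated as stable (a candidate for identifiers and keys); the rest are
    treated as unstable (candidates for random number generation).
    """
    n_runs = len(measurements)
    n_cells = len(measurements[0])
    stable, unstable = [], []
    for cell in range(n_cells):
        ones_fraction = sum(run[cell] for run in measurements) / n_runs
        if ones_fraction <= stable_margin or ones_fraction >= 1 - stable_margin:
            stable.append(cell)
        else:
            unstable.append(cell)
    return stable, unstable


if __name__ == "__main__":
    # Made-up toy data: three power-ups of a four-cell memory.
    runs = [[0, 1, 1, 0],
            [0, 1, 0, 0],
            [0, 1, 1, 1]]
    print(classify_cells(runs))
```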
Loop Transformations for the Optimized Generation of Reconfigurable Hardware
Current high-level design environments offer little support for implementing data-intensive applications on heterogeneous-memory systems; they focus instead on parallelism. This thesis addresses the memory hierarchy problem through high-level transformations of loop structures. The composition of long transformation sequences by combining shorter subsequences is studied, together with the influence of the order in which transformation steps are applied. Several methods are presented to estimate bounds on Ehrhart quasi-polynomials, which can be used to statically evaluate program properties, such as memory usage. Since loop transformations influence not only the data access pattern but also the control complexity, we present a hardware loop controller architecture which supports hardware generation from the polyhedral representation used for loop transformations. The techniques are demonstrated by the semi-automatic generation of an FPGA implementation of an inverse discrete wavelet transform.
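As a small worked illustration of how an Ehrhart quasi-polynomial describes a program property statically, consider counting the iterations of a loop nest whose inner bound involves an integer division. The loop nest and function names below are hypothetical examples, not taken from the thesis; the closed form depends on the parity of n, which is what makes it a quasi-polynomial rather than a plain polynomial.

```python
def count_by_enumeration(n: int) -> int:
    """Count iterations of: for i in [0, n): for j in [0, i // 2)."""
    count = 0
    for i in range(n):
        for _ in range(i // 2):
            count += 1
    return count


def count_by_quasi_polynomial(n: int) -> int:
    """Closed form for the same count, with one polynomial per parity of n.

    n even: n * (n - 2) / 4
    n odd:  (n - 1)^2 / 4
    Both divisions are exact for the stated parity.
    """
    if n % 2 == 0:
        return n * (n - 2) // 4
    return (n - 1) ** 2 // 4


if __name__ == "__main__":
    for n in range(8):
        print(n, count_by_enumeration(n), count_by_quasi_polynomial(n))
```

The closed form gives the iteration count (and, by the same reasoning, quantities such as the number of touched memory elements) without running the loop.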
Accelerating Reconfigurable Financial Computing
This thesis proposes novel approaches to the design, optimisation, and management of reconfigurable
computer accelerators for financial computing. There are three contributions. First, we propose novel
reconfigurable designs for derivative pricing using both Monte-Carlo and quadrature methods. Such
designs involve exploring techniques such as control variate optimisation for Monte-Carlo, and multi-dimensional
analysis for quadrature methods. Significant speedups and energy savings are achieved
using our Field-Programmable Gate Array (FPGA) designs over both Central Processing Unit (CPU)
and Graphics Processing Unit (GPU) designs. Second, we propose a framework for distributing computing
tasks on multi-accelerator heterogeneous clusters. In this framework, different computational
devices including FPGAs, GPUs and CPUs work collaboratively on the same financial problem based
on a dynamic scheduling policy. The trade-off in speed and in energy consumption of different accelerator
allocations is investigated. Third, we propose a mixed precision methodology for optimising
Monte-Carlo designs, and a reduced precision methodology for optimising quadrature designs. These
methodologies enable us to optimise throughput of reconfigurable designs by using datapaths with
minimised precision, while maintaining the same accuracy of the results as in the original designs.
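For readers unfamiliar with control variates in Monte-Carlo pricing, the following is a generic textbook sketch in software, not the thesis's FPGA design. It prices a European call under geometric Brownian motion and uses the discounted terminal price, whose expectation is known to equal the spot price, as the control variate. All parameter values and function names are assumptions for illustration.

```python
import math
import random


def black_scholes_call(s0: float, k: float, r: float, sigma: float, t: float) -> float:
    """Closed-form Black-Scholes price, used here only as a reference value."""
    d1 = (math.log(s0 / k) + (r + 0.5 * sigma ** 2) * t) / (sigma * math.sqrt(t))
    d2 = d1 - sigma * math.sqrt(t)

    def cdf(z: float) -> float:
        return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

    return s0 * cdf(d1) - k * math.exp(-r * t) * cdf(d2)


def mc_call_control_variate(s0: float, k: float, r: float, sigma: float,
                            t: float, n: int, seed: int = 0) -> float:
    """Monte-Carlo call price with the discounted terminal price as control."""
    rng = random.Random(seed)
    disc = math.exp(-r * t)
    payoffs, controls = [], []
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)
        s_t = s0 * math.exp((r - 0.5 * sigma ** 2) * t + sigma * math.sqrt(t) * z)
        payoffs.append(disc * max(s_t - k, 0.0))
        controls.append(disc * s_t)  # E[disc * S_T] = s0 under the pricing measure
    mean_p = sum(payoffs) / n
    mean_c = sum(controls) / n
    cov = sum((p - mean_p) * (c - mean_c) for p, c in zip(payoffs, controls)) / (n - 1)
    var_c = sum((c - mean_c) ** 2 for c in controls) / (n - 1)
    beta = cov / var_c  # sample estimate of the variance-minimising coefficient
    return mean_p - beta * (mean_c - s0)


if __name__ == "__main__":
    args = (100.0, 100.0, 0.05, 0.2, 1.0)
    print(black_scholes_call(*args), mc_call_control_variate(*args, n=20000))
```

The correction term subtracts the part of the sampling error that is explained by the control, so fewer paths are needed for a given accuracy.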
High-level automation of custom hardware design for high-performance computing
This dissertation focuses on efficient generation of custom processors from high-level language descriptions. Our work exploits compiler-based optimizations and transformations in tandem with high-level synthesis (HLS) to build high-performance custom processors. The goal is to offer a common multiplatform high-abstraction programming interface for heterogeneous compute systems where the benefits of custom reconfigurable (or fixed) processors can be exploited by the application developers.
The research presented in this dissertation supports the following thesis: In an increasingly heterogeneous compute environment it is important to leverage the compute capabilities of each heterogeneous processor efficiently. In the case of FPGA and ASIC accelerators this can be achieved through HLS-based flows that (i) extract parallelism at coarser than basic block granularities, (ii) leverage common high-level parallel programming languages, and (iii) employ high-level source-to-source transformations to generate high-throughput custom processors.
First, we propose a novel HLS flow that extracts instruction level parallelism beyond the boundary of basic blocks from C code. Subsequently, we describe FCUDA, an HLS-based framework for mapping fine-grained and coarse-grained parallelism from parallel CUDA kernels onto spatial parallelism. FCUDA provides a common programming model for acceleration on heterogeneous devices (i.e. GPUs and FPGAs). Moreover, the FCUDA framework balances multilevel granularity parallelism synthesis using efficient techniques that leverage fast and accurate estimation models (i.e. do not rely on lengthy physical implementation tools). Finally, we describe an advanced source-to-source transformation framework for throughput-driven parallelism synthesis (TDPS), which appropriately restructures CUDA kernel code to maximize throughput on FPGA devices. We have integrated the TDPS framework into the FCUDA flow to enable automatic performance porting of CUDA kernels designed for the GPU architecture onto the FPGA architecture.