621 research outputs found
LUXOR: An FPGA Logic Cell Architecture for Efficient Compressor Tree Implementations
We propose two tiers of modifications to FPGA logic cell architecture to
deliver a variety of performance and utilization benefits with only minor area
overheads. In the first tier, we augment existing commercial logic cell
datapaths with a 6-input XOR gate in order to improve the expressiveness of
each element, while maintaining backward compatibility. This new architecture
is vendor-agnostic, and we refer to it as LUXOR. We also consider a secondary
tier of vendor-specific modifications to both Xilinx and Intel FPGAs, which we
refer to as X-LUXOR+ and I-LUXOR+ respectively. We demonstrate that compressor
tree synthesis using generalized parallel counters (GPCs) is further improved
with the proposed modifications. Using both the Intel adaptive logic module and
the Xilinx slice at the 65nm technology node for a comparative study, it is
shown that the silicon area overhead is less than 0.5% for LUXOR and 5-6% for
LUXOR+, while the delay increments are 1-6% and 3-9% respectively. We
demonstrate that LUXOR can deliver an average reduction of 13-19% in logic
utilization on micro-benchmarks from a variety of domains. BNN benchmarks
benefit the most with an average reduction of 37-47% in logic utilization,
which is due to the highly efficient mapping of the XnorPopcount operation on
our proposed LUXOR+ logic cells. Comment: In Proceedings of the 2020 ACM/SIGDA International Symposium on
Field-Programmable Gate Arrays (FPGA'20), February 23-25, 2020, Seaside, CA,
US
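As a rough illustration of the XnorPopcount operation named in the abstract above (not the authors' implementation), the following Python sketch computes a binarized dot product. It assumes the common BNN encoding in which bit 1 stands for +1 and bit 0 for -1; the function name and the bit-packing into integers are hypothetical.

```python
def xnor_popcount_dot(a_bits: int, w_bits: int, n: int) -> int:
    """Dot product of two n-element {-1, +1} vectors packed as n-bit integers.

    Assumed encoding (not from the source): bit 1 means +1, bit 0 means -1.
    """
    mask = (1 << n) - 1
    # XNOR marks the positions where the two vectors agree.
    agree = ~(a_bits ^ w_bits) & mask
    # Popcount: in hardware, this summation is the job of a compressor tree.
    matches = bin(agree).count("1")
    # Each agreement contributes +1 and each disagreement contributes -1.
    return 2 * matches - n
```

For example, `xnor_popcount_dot(0b1011, 0b1101, 4)` compares two 4-element vectors that agree in two positions and disagree in two, so the contributions cancel.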
Evaluation of advanced techniques for structural FPGA self-test
This thesis presents a comprehensive test generation framework for FPGA logic elements and interconnects. It is based on and extends the current state of the art. The purpose of FPGA testing in this work is to achieve reliable reconfiguration for an FPGA-based runtime reconfigurable system. A pre-configuration test is performed on a portion of the FPGA before it is reconfigured as part of the system to ensure that the FPGA fabric is fault-free. The implementation platform is the Xilinx Virtex-5 FPGA family.
Existing literature in FPGA testing is evaluated and reviewed thoroughly. The various approaches are compared against one another qualitatively and the approach most suitable to the target platform is chosen. The array testing method is employed in testing the FPGA logic for its low hardware overhead and optimal test time. All tests are additionally pipelined to reduce test application time and use a high test clock frequency. A hybrid fault model including both structural and functional faults is assumed.
An algorithm for the optimization of the number of required FPGA test configurations is developed and implemented in Java using a pseudo-random set-covering heuristic. Optimal solutions are obtained for Virtex-5 logic slices. The algorithm effort is parameterizable with the number of loop iterations, each of which takes approximately one second for a Virtex-5 sliceL circuit.
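The thesis implements its heuristic in Java and the abstract does not give its details, so the following Python sketch only illustrates the general idea of a pseudo-random set-covering heuristic whose effort is set by an iteration count. The names `random_set_cover`, `universe`, and `candidates` are hypothetical: each candidate stands for a test configuration and the set of faults it would detect.

```python
import random


def random_set_cover(universe, candidates, iterations=100, seed=0):
    """Return the smallest cover found over `iterations` randomized greedy passes.

    `candidates` maps a configuration name to the set of items it covers.
    """
    rng = random.Random(seed)
    best = None
    for _ in range(iterations):
        uncovered = set(universe)
        chosen = []
        while uncovered:
            # Score every candidate by how many uncovered items it adds.
            gains = {name: len(uncovered & items)
                     for name, items in candidates.items()}
            top = max(gains.values())
            if top == 0:
                raise ValueError("candidates cannot cover the universe")
            # Break ties pseudo-randomly so that passes can differ.
            pick = rng.choice(sorted(n for n, g in gains.items() if g == top))
            chosen.append(pick)
            uncovered -= candidates[pick]
        if best is None or len(chosen) < len(best):
            best = chosen
    return best
```

More iterations give the random tie-breaking more chances to find a smaller cover, which mirrors the parameterizable effort described above.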
A flexible test architecture for interconnects is developed. Arbitrary wire types can be tested in the same test configuration with no hardware overhead. Furthermore, a routing algorithm is integrated with the test template generation to select the wires under test and route them appropriately.
Nine test configurations are required to achieve full test coverage for the FPGA logic. For interconnect testing, a local router based on depth-first graph traversal is implemented in Java as the basis for creating systematic interconnect test templates. Pent wire testing is additionally implemented as a proof of concept. The test clock frequency for all tests exceeds 170 MHz and the hardware overhead is always lower than seven CLBs. All implemented tests are parameterizable such that they can be applied to any portion of the FPGA regardless of size or position.
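A router based on depth-first graph traversal can be sketched as below. This is a generic Python illustration, not the thesis's Java router: the routing-graph representation (a dictionary from node to successor list), the `used` set of already occupied wires, and the function name are all assumptions.

```python
def dfs_route(graph, source, sink, used=frozenset()):
    """Return one path from source to sink as a list of nodes, or None.

    `graph` maps each node to a list of successor nodes; nodes in `used`
    are treated as occupied and are never entered.
    """
    stack = [(source, [source])]
    visited = set()
    while stack:
        node, path = stack.pop()
        if node in visited:
            continue
        visited.add(node)
        if node == sink:
            return path
        # Push successors in reverse so the first listed one is explored first.
        for nxt in reversed(graph.get(node, ())):
            if nxt not in visited and nxt not in used:
                stack.append((nxt, path + [nxt]))
    return None
```

Passing the wires claimed by earlier routes in `used` forces later routes onto different resources, which is one simple way to place several wires under test in the same configuration.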
Social Insect-Inspired Adaptive Hardware
Modern VLSI transistor densities allow large systems to be implemented within a single chip. As technologies get smaller, fundamental limits of silicon devices are reached, resulting in lower design yields and post-deployment failures. Many-core systems provide a platform for leveraging the computing resources offered by deep sub-micron technologies and also offer high-level capabilities for mitigating the issues with small feature sizes. However, designing many-core systems that can adapt to in-field failures and operational variability involves an extremely large multi-objective optimisation space. When a many-core system reaches the size supported by the densities of modern technologies (thousands of processing cores), finding design solutions in this problem space becomes extremely difficult.
Many biological systems show properties that are adaptive and scalable. This thesis proposes a self-optimising and adaptive, yet scalable, design approach for many-core systems based on the emergent behaviours of social-insect colonies. In these colonies there are many thousands of individuals with low intelligence who contribute, without any centralised control, to complete a wide range of tasks to build and maintain the colony. The experiments presented translate biological models of social-insect intelligence into simple embedded intelligence circuits. These circuits sense low-level system events and use this information to manage the parameters of the many-core's Network-on-Chip (NoC) during runtime.
Centurion, a 128-node many-core, was created to investigate these models at large scale in hardware. The results show that, by monitoring a small number of signals within each NoC router, task allocation emerges from the social-insect intelligence models that can self-configure to support representative applications. It is demonstrated that emergent task allocation supports fault tolerance with no extra hardware overhead. The response-threshold decision-making circuitry uses a negligible amount of hardware resources relative to the size of the many-core and is an ideal technology for implementing embedded intelligence for system runtime management of large-complexity single-chip systems.
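The abstract does not spell out the response-threshold model used in the circuits. As a point of reference only, the sketch below shows the classic fixed response-threshold function from the social-insect literature, in which an individual's probability of taking up a task rises with the task stimulus relative to its own threshold. The function name, parameter names, and the default steepness of 2 are assumptions, and the hardware in the thesis may differ.

```python
def response_probability(stimulus: float, threshold: float,
                         steepness: int = 2) -> float:
    """Probability of engaging in a task under a fixed response threshold.

    Computes s^n / (s^n + theta^n), where s is the stimulus, theta is the
    individual's threshold, and n is the steepness.
    """
    if stimulus <= 0:
        return 0.0
    s = stimulus ** steepness
    return s / (s + threshold ** steepness)
```

When the stimulus equals the threshold the probability is exactly one half; nodes with lower thresholds respond earlier, which is how division of labour can emerge without central control.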
Strategies towards high performance (high-resolution/linearity) time-to-digital converters on field-programmable gate arrays
Time-correlated single-photon counting (TCSPC) technology has become popular in scientific research and industrial applications, such as high-energy physics, bio-sensing, non-invasive health monitoring, and 3D imaging. Because of the increasing demand for high-precision time measurements, time-to-digital converters (TDCs) have attracted attention since the 1970s. As a fully digital solution, TDCs are portable and have great potential for multichannel applications compared to bulky and expensive time-to-amplitude converters (TACs). A TDC can be implemented in ASIC and FPGA devices. Due to the low cost, flexibility,
and short development cycle, FPGA-TDCs have become promising. Starting with a literature review, three original FPGA-TDCs with outstanding performance are introduced. The first design is the first efficient wave union (WU) based TDC implemented in Xilinx UltraScale (20 nm) FPGAs with a bubble-free sub-TDL structure. Combined with other existing methods, the resolution is further enhanced to 1.23 ps. The second TDC has been designed for LiDAR applications, especially in
driverless vehicles. Using the proposed new calibration method, the resolution is adjustable (50, 80, and 100 ps), and the linearity is exceptionally high (INL pk-pk and INL pk-pk are lower than 0.05 LSB). Meanwhile, a software tool has been open-sourced with a graphical user interface (GUI) to predict TDCs’ performance. In the third TDC, an
onboard automatic calibration (AC) function has been realized by exploiting Xilinx ZYNQ SoC architectures. The test results show the robustness of the proposed method. Without the manual calibration, the AC function enables FPGA-TDCs to be applied in commercial products where mass production is required.
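Linearity figures such as the peak-to-peak values quoted above are commonly obtained from a code-density test, in which uniformly distributed random hits are histogrammed per TDC bin. The sketch below shows that standard calculation in Python; it is not the thesis's calibration method, and the function names and the input histogram are hypothetical.

```python
def dnl_inl(counts):
    """Compute per-bin DNL and INL (in LSB) from a code-density histogram.

    `counts[i]` is the number of random hits recorded in bin i.
    """
    ideal = sum(counts) / len(counts)  # expected hits for an ideal bin
    # DNL: relative deviation of each bin width from the ideal width.
    dnl = [c / ideal - 1.0 for c in counts]
    # INL: running sum of the DNL values.
    inl = []
    acc = 0.0
    for d in dnl:
        acc += d
        inl.append(acc)
    return dnl, inl


def peak_to_peak(values):
    """Peak-to-peak spread of a list of DNL or INL values."""
    return max(values) - min(values)
```

A perfectly uniform histogram yields zero DNL and INL in every bin, so any spread in the histogram translates directly into non-linearity.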
Dynamic Partial Reconfiguration for Dependable Systems
Moore’s law has served as a goal and motivation for consumer electronics manufacturers in the last decades. The gains in processing power in consumer electronics devices have been achieved mainly through cost reduction and technology shrinking. However, reducing physical geometries affects the dependability of electronic devices, making them more sensitive to soft errors like Single Event Transient (SET) or Single Event Upset (SEU) and to hard (permanent) faults, e.g. due to aging effects.
Accordingly, safety-critical systems often rely on the adoption of old technology nodes, even if they introduce longer design time w.r.t. consumer electronics. In fact, functional safety requirements are increasingly pushing industry to develop innovative methodologies for designing highly dependable systems with the required diagnostic coverage. On the other hand, the adoption of commercial off-the-shelf (COTS) devices has begun to be considered for safety-related systems due to real-time requirements, the need to implement computationally hungry algorithms, and lower design costs. In this field, FPGA market share has steadily increased, thanks to their flexibility and low non-recurring engineering costs, making them suitable for a set of safety-critical applications with low production volumes.
The work presented in this thesis addresses new dependability issues in modern reconfigurable systems, exploiting their special features, namely Dynamic Partial Reconfiguration, to take proper counteractions with low impact on performance.
Delay Measurements and Self Characterisation on FPGAs
This thesis examines new timing measurement methods for self delay characterisation of Field-Programmable Gate Array (FPGA) components and delay measurement of complex circuits
on FPGAs. Two novel measurement techniques based on analysis of a circuit's output failure
rate and transition probability are proposed for accurate, precise and efficient measurement of
propagation delays. The transition probability based method is especially attractive, since
it requires no modifications in the circuit-under-test and requires little hardware resources,
making it an ideal method for physical delay analysis of FPGA circuits.
The relentless advancements in process technology have led to smaller and denser transistors
in integrated circuits. While FPGA users benefit from this in terms of increased hardware
resources for more complex designs, the actual productivity with FPGA in terms of timing
performance (operating frequency, latency and throughput) has lagged behind the potential
improvements from the improved technology due to delay variability in FPGA components
and the inaccuracy of timing models used in FPGA timing analysis. The ability to measure
delay of any arbitrary circuit on FPGA offers many opportunities for on-chip characterisation
and physical timing analysis, allowing delay variability to be accurately tracked and variation-aware optimisations to be developed, reducing the productivity gap observed in today's FPGA
designs.
The measurement techniques are developed into complete self measurement and characterisation platforms in this thesis, demonstrating their practical uses in actual FPGA hardware for
cross-chip delay characterisation and accurate delay measurement of both complex combinatorial and sequential circuits, further reinforcing their positions in solving the delay variability
problem in FPGAs.
A configurable decoder for pin-limited applications
Pin limitation is the restriction imposed on an IC chip by the unavailability of a sufficient number of I/O pins. This impacts the design and performance of the chip, as the amount of information that can be passed through the boundary of the chip becomes limited. One area that would benefit from a reduction of the effect of pin limitation is reconfigurable architectures. In this work, we consider reconfigurable devices called Field Programmable Gate Arrays (FPGAs). Due to pin limitation, current FPGAs use a form of 1-hot decoder to select elements (one frame at a time) during partial reconfiguration. This results in a slow and coarse selection of elements for reconfiguration. We propose a module that performs a focused selection of only those elements that require reconfiguration. This reduces reconfiguration overheads and enables the speeds needed for dynamic reconfiguration.
The problem is that of selecting subsets of an n-element set in a fast, focused and inexpensive manner. This thesis proposes such a configurable decoder that bridges the gap between the inexpensive, but inflexible, fixed 1-hot decoder, and the expensive, but flexible, pure LUT-based decoder. Our configurable decoder uses a LUT with a narrow output and a low cost in tandem with a special fixed decoder called a mapping unit that expands the output of the LUT to a desired n-bit output. We demonstrate several implementations of the mapping unit, each with different capabilities and trade-offs. A key result of this work is that for any gate cost G = O(n log^k n) (where k is a constant), if a pure LUT-based solution produces λ independent subsets, then our method produces Ω(λ log n / log log n) independent subsets for the same cost. Our decoder also produces many more dependent subsets (that depend on the choice of the Ω(λ log n / log log n) independent subsets).
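The two extremes that the configurable decoder sits between can be illustrated with a small Python sketch. It models a subset of an n-element set as an n-bit mask; the function names and the mask representation are assumptions, and the mapping unit itself is not modelled because the abstract does not describe its construction.

```python
def one_hot_decode(index: int, n: int) -> int:
    """Fixed 1-hot decoder: selects exactly one of n elements."""
    if not 0 <= index < n:
        raise ValueError("index out of range")
    return 1 << index


def lut_decode(address: int, table: list) -> int:
    """Pure LUT-based decoder: returns whichever n-bit subset mask is stored.

    Any subset can be stored, but the table needs n bits per entry.
    """
    return table[address]
```

The 1-hot decoder is cheap but can only pick single elements, whereas the LUT can return any stored subset at the price of a full n-bit word per entry; this is the gap the text above describes.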
We provide simulation results for the configurable decoder and predict future trends from the simulation data; these confirm the theoretical advantages of the proposed decoder. We illustrate the implementation of important subset classes on our configurable decoder and make key observations on a generalized variant.
Hybrid FPGA: Architecture and Interface
Hybrid FPGAs (Field Programmable Gate Arrays) are composed of general-purpose logic resources
with different granularities, together with domain-specific coarse-grained units. This thesis proposes
a novel hybrid FPGA architecture with embedded coarse-grained Floating Point Units (FPUs) to
improve the floating point capability of FPGAs. Based on the proposed hybrid FPGA architecture,
we examine three aspects to optimise the speed and area for domain-specific applications.
First, we examine the interface between large coarse-grained embedded blocks (EBs) and fine-grained
elements in hybrid FPGAs. The interface includes parameters for varying: (1) aspect ratio of EBs,
(2) position of the EBs in the FPGA, (3) I/O pin arrangement of EBs, (4) interconnect flexibility of
EBs, and (5) location of additional embedded elements such as memory.
Second, we examine the interconnect structure for hybrid FPGAs. We investigate how large and high-density
EBs affect the routing demand for hybrid FPGAs over a set of domain-specific applications.
We then propose three routing optimisation methods to meet the additional routing demand introduced
by large EBs: (1) identifying the best separation distance between EBs, (2) adding routing switches on
EBs to increase routing flexibility, and (3) introducing wider channel width near the edge of EBs. We
study and compare the trade-offs in delay, area and routability of these three optimisation methods.
Finally, we employ common subgraph extraction to determine the number of floating point adders/subtractors,
multipliers and wordblocks in the FPUs. The wordblocks include registers and can implement fixed
point operations. We study the area, speed and utilisation trade-offs of the selected FPU subgraphs
in a set of floating point benchmark circuits. We develop an optimised coarse-grained FPU, taking
into account both architectural and system-level issues. Furthermore, we investigate the trade-offs
between granularities and performance by composing small FPUs into a large FPU.
The results of this thesis would help design a domain-specific hybrid FPGA to meet user requirements,
by optimising for speed, area or a combination of speed and area.