180 research outputs found
Interconnect yield analysis and fault tolerance for field programmable gate arrays
Imperial Users onl
Mapping a guided image filter on the HARP reconfigurable architecture using OpenCL
Intel recently introduced the Heterogeneous Architecture Research Platform, HARP. In this platform, the Central Processing Unit and a Field-Programmable Gate Array are connected through a high-bandwidth, low-latency interconnect and both share DRAM memory. For this platform, Open Computing Language (OpenCL), a High-Level Synthesis (HLS) language, is made available. By making use of HLS, a faster design cycle can be achieved compared to programming in a traditional hardware description language. This, however, comes at the cost of having less control over the hardware implementation. We will investigate how OpenCL can be applied to implement a real-time guided image filter on the HARP platform. In the first phase, the performance-critical parameters of the OpenCL programming model are defined using several specialized benchmarks. In a second phase, the guided image filter algorithm is implemented using the insights gained in the first phase. Both a floating-point and a fixed-point implementation were developed for this algorithm, based on a sliding window implementation. This resulted in a maximum floating-point performance of 135 GFLOPS, a maximum fixed-point performance of 430 GOPS and a throughput of HD color images at 74 frames per second
Fault-tolerant fpga for mission-critical applications.
One of the devices that play a great role in electronic circuits design, specifically safety-critical design applications, is Field programmable Gate Arrays (FPGAs). This is because of its high performance, re-configurability and low development cost. FPGAs are used in many applications such as data processing, networks, automotive, space and industrial applications. Negative impacts on the reliability of such applications result from moving to smaller feature sizes in the latest FPGA architectures. This increases the need for fault-tolerant techniques to improve reliability and extend system lifetime of FPGA-based applications. In this thesis, two fault-tolerant techniques for FPGA-based applications are proposed with a built-in fault detection region. A low cost fault detection scheme is proposed for detecting faults using the fault detection region used in both schemes. The fault detection scheme primarily detects open faults in the programmable interconnect resources in the FPGAs. In addition, Stuck-At faults and Single Event Upsets (SEUs) fault can be detected. For fault recovery, each scheme has its own fault recovery approach. The first approach uses a spare module and a 2-to-1 multiplexer to recover from any fault detected. On the other hand, the second approach recovers from any fault detected using the property of Partial Reconfiguration (PR) in the FPGAs. It relies on identifying a Partially Reconfigurable block (P_b) in the FPGA that is used in the recovery process after the first faulty module is identified in the system. This technique uses only one location to recover from faults in any of the FPGA’s modules and the FPGA interconnects. Simulation results show that both techniques can detect and recover from open faults. In addition, Stuck-At faults and Single Event Upsets (SEUs) fault can also be detected. Finally, both techniques require low area overhead
Producing Random Bits with Delay-Line Based Ring Oscillators
One of the sources of randomness for a random bit generator (RBG) is jitter present in rectangular signals produced by ring oscillators (ROs). This paper presents a novel approach for the design of delays used in these oscillators. We suggest using delay elements made on carry4 primitives instead of series of inverters or latches considered in the literature. It enables the construction of many high frequency ring oscillators with different nominal frequencies in the same field programmable gate array (FPGA). To assess the unpredictability of bits produced by RO-based RBG, the restarts mechanism, proposed in earlier papers, was used. The output sequences pass all NIST 800-22 statistical tests for smaller number of ring oscillators than the constructions described in the literature. Due to the number of ROs with different nominal frequencies and the method of construction of carry4 primitives, it is expected that the proposed RBG is more robust to cryptographic attacks than RBGs using inverters or latches as delay element
Rethinking FPGA Architectures for Deep Neural Network applications
The prominence of machine learning-powered solutions instituted an unprecedented trend of integration into virtually all applications with a broad range of deployment constraints from tiny embedded systems to large-scale warehouse computing machines. While recent research confirms the edges of using contemporary FPGAs to deploy or accelerate machine learning applications, especially where the latency and energy consumption are strictly limited, their pre-machine learning optimised architectures remain a barrier to the overall efficiency and performance.
Realizing this shortcoming, this thesis demonstrates an architectural study aiming at solutions that enable hidden potentials in the FPGA technology, primarily for machine learning algorithms. Particularly, it shows how slight alterations to the state-of-the-art architectures could significantly enhance the FPGAs toward becoming more machine learning-friendly while maintaining the near-promised performance for the rest of the applications. Eventually, it presents a novel systematic approach to deriving new block architectures guided by designing limitations and machine learning algorithm characteristics through benchmarking.
First, through three modifications to Xilinx DSP48E2 blocks, an enhanced digital signal processing (DSP) block for important computations in embedded deep neural network (DNN) accelerators is described. Then, two tiers of modifications to FPGA logic cell architecture are explained that deliver a variety of performance and utilisation benefits with only minor area overheads. Eventually, with the goal of exploring this new design space in a methodical manner, a problem formulation involving computing nested loops over multiply-accumulate (MAC) operations is first proposed. A quantitative methodology for deriving efficient coarse-grained compute block architectures from benchmarks is then suggested together with a family of new embedded blocks, called MLBlocks
Using embedded hardware monitor cores in critical computer systems
The integration of FPGA devices in many different architectures and services
makes monitoring and real time detection of errors an important concern in FPGA
system design. A monitor is a tool, or a set of tools, that facilitate analytic
measurements in observing a given system. The goal of these observations is
usually the performance analysis and optimisation, or the surveillance of the system.
However, System-on-Chip (SoC) based designs leave few points to attach external
tools such as logic analysers. Thus, an embedded error detection core that allows
observation of critical system nodes (such as processor cores and buses) should
enforce the operation of the FPGA-based system, in order to prevent system
failures. The core should not interfere with system performance and must ensure
timely detection of errors.
This thesis is an investigation onto how a robust hardware-monitoring module
can be efficiently integrated in a target PCI board (with FPGA-based application processing
features) which is part of a critical computing system. [Continues.
FPgrep and FPsed: Packet Payload Processors for Managing the Flow of Digital Content on Local Area Networks and the Internet
As computer networks increase in speed, it becomes difficult to monitor and manage the transmitted digital content. To alleviate these problems, hardware-based search (FPgrep) and search-and-replace (FPsed) modules have been developed. FP-grep has the ability to scan packet payloads for a given set of regular expressions and pass or drop packets based on the payload contents. FPsed also scans packet payloads for a set of regular expressions and adds the ability to modify the payload if desired. The hardware circuits that implement the FPgrep and FPsed modules can be generated, compiled, and synthesized using a simple web interface. Once a module is created it is programmed into logic on a Field Programmable Gate Array (FPGA). The FPgrep and FPsed modules use FPGAs to process packets at the full rate of Gigabit-speed networks. Both modules, along with several supporting applications were developed and tested using the Field Programmable Port Extender (FPX) platform. Applications developed for the modules currently include a spam filter, virus protection, an information security filter, as well as a copyright enforcement function
Generic low power reconfigurable distributed arithmetic processor
Higher performance, lower cost, increasingly minimizing integrated circuit components, and
higher packaging density of chips are ongoing goals of the microelectronic and computer
industry. As these goals are being achieved, however, power consumption and flexibility are
increasingly becoming bottlenecks that need to be addressed with the new technology in Very
Large-Scale Integrated (VLSI) design.
For modern systems, more energy is required to support the powerful computational capability
which accords with the increasing requirements, and these requirements cause the change of
standards not only in audio and video broadcasting but also in communication such as wireless
connection and network protocols. Powerful flexibility and low consumption are repellent, but
their combination in one system is the ultimate goal of designers.
A generic domain-specific low-power reconfigurable processor for the distributed
arithmetic algorithm is presented in this dissertation. This domain reconfigurable processor
features high efficiency in terms of area, power and delay, which approaches the
performance of an ASIC design, while retaining the flexibility of programmable platforms.
The architecture not only supports typical distributed arithmetic algorithms which can be
found in most still picture compression standards and video conferencing standards, but
also offers implementation ability for other distributed arithmetic algorithms found in
digital signal processing, telecommunication protocols and automatic control.
In this processor, a simple reconfigurable low power control unit is implemented with
good performance in area, power and timing. The generic characteristic of the architecture
makes it applicable for any small and medium size finite state machines which can be used
as control units to implement complex system behaviour and can be found in almost all
engineering disciplines. Furthermore, to map target applications efficiently onto the
proposed architecture, a new algorithm is introduced for searching for the best common
sharing terms set and it keeps the area and power consumption of the implementation at
low level. The software implementation of this algorithm is presented, which can be used
not only for the proposed architecture in this dissertation but also for all the
implementations with adder-based distributed arithmetic algorithms. In addition, some low
power design techniques are applied in the architecture, such as unsymmetrical design
style including unsymmetrical interconnection arranging, unsymmetrical PTBs selection
and unsymmetrical mapping basic computing units. All these design techniques achieve
extraordinary power consumption saving. It is believed that they can be extended to more
low power designs and architectures.
The processor presented in this dissertation can be used to implement complex, high
performance distributed arithmetic algorithms for communication and image processing
applications with low cost in area and power compared with the traditional
methods
Optimising runtime reconfigurable designs for high performance applications
This thesis proposes novel optimisations for high performance runtime reconfigurable designs.
For a reconfigurable design,
the proposed approach investigates idle resources introduced by static design approaches,
and exploits runtime reconfiguration to eliminate the inefficient resources.
The approach covers the circuit level, the function level, and the system level.
At the circuit level, a method is proposed for tuning reconfigurable designs with two analytical models:
a resource model for computational and memory resources and memory bandwidth,
and a performance model for estimating execution time.
This method is applied to tuning implementations of finite-difference algorithms,
optimising arithmetic operators and memory bandwidth based on algorithmic parameters,
and eliminating idle resources by runtime reconfiguration.
At the function level, a method is proposed to automatically identify and exploit runtime
reconfiguration opportunities while optimising resource utilisation.
The method is based on Reconfiguration Data Flow Graph,
a new hierarchical graph structure enabling runtime reconfigurable designs to be synthesised in three steps:
function analysis, configuration organisation, and runtime solution generation.
At the system level, a method is proposed for optimising reconfigurable designs by dynamically adapting the designs to available runtime resources in a reconfigurable system. This method includes two steps: compile-time optimisation and runtime scaling, which enable efficient workload distribution, asynchronous communication scheduling, and domain-specific optimisations. It can be used in developing effective servers for high performance applications.Open Acces
- …