1,190 research outputs found
Approximate and timing-speculative hardware design for high-performance and energy-efficient video processing
Since the end of transistor scaling in 2-D appeared on the horizon, innovative circuit design paradigms have been on the rise to go beyond the well-established and ultraconservative exact computing. Many compute-intensive applications â such as video processing â exhibit an intrinsic error resilience and do not necessarily require perfect accuracy in their numerical operations. Approximate computing (AxC) is emerging as a design alternative to improve the performance and energy-efficiency requirements for many applications by trading its intrinsic error tolerance with algorithm and circuit efficiency. Exact computing also imposes a worst-case timing to the conventional design of hardware accelerators to ensure reliability, leading to an efficiency loss. Conversely, the timing-speculative (TS) hardware design paradigm allows increasing the frequency or decreasing the voltage beyond the limits determined by static timing analysis (STA), thereby narrowing pessimistic safety margins that conventional design methods implement to prevent hardware timing errors. Timing errors should be evaluated by an accurate gate-level simulation, but a significant gap remains: How these timing errors propagate from the underlying hardware all the way up to the entire algorithm behavior, where they just may degrade the performance and quality of service of the application at stake? This thesis tackles this issue by developing and demonstrating a cross-layer framework capable of performing investigations of both AxC (i.e., from approximate arithmetic operators, approximate synthesis, gate-level pruning) and TS hardware design (i.e., from voltage over-scaling, frequency over-clocking, temperature rising, and device aging). The cross-layer framework can simulate both timing errors and logic errors at the gate-level by crossing them dynamically, linking the hardware result with the algorithm-level, and vice versa during the evolution of the applicationâs runtime. Existing frameworks perform investigations of AxC and TS techniques at circuit-level (i.e., at the output of the accelerator) agnostic to the ultimate impact at the application level (i.e., where the impact is truly manifested), leading to less optimization. Unlike state of the art, the framework proposed offers a holistic approach to assessing the tradeoff of AxC and TS techniques at the application-level. This framework maximizes energy efficiency and performance by identifying the maximum approximation levels at the application level to fulfill the required good enough quality. This thesis evaluates the framework with an 8-way SAD (Sum of Absolute Differences) hardware accelerator operating into an HEVC encoder as a case study. Application-level results showed that the SAD based on the approximate adders achieve savings of up to 45% of energy/operation with an increase of only 1.9% in BD-BR. On the other hand, VOS (Voltage Over-Scaling) applied to the SAD generates savings of up to 16.5% in energy/operation with around 6% of increase in BD-BR. The framework also reveals that the boost of about 6.96% (at 50°) to 17.41% (at 75° with 10- Y aging) in the maximum clock frequency achieved with TS hardware design is totally lost by the processing overhead from 8.06% to 46.96% when choosing an unreliable algorithm to the blocking match algorithm (BMA). We also show that the overhead can be avoided by adopting a reliable BMA. This thesis also shows approximate DTT (Discrete Tchebichef Transform) hardware proposals by exploring a transform matrix approximation, truncation and pruning. The results show that the approximate DTT hardware proposal increases the maximum frequency up to 64%, minimizes the circuit area in up to 43.6%, and saves up to 65.4% in power dissipation. The DTT proposal mapped for FPGA shows an increase of up to 58.9% on the maximum frequency and savings of about 28.7% and 32.2% on slices and dynamic power, respectively compared with stat
HARDWARE ATTACK DETECTION AND PREVENTION FOR CHIP SECURITY
Hardware security is a serious emerging concern in chip designs and applications. Due to the globalization of the semiconductor design and fabrication process, integrated circuits (ICs, a.k.a. chips) are becoming increasingly vulnerable to passive and active hardware attacks. Passive attacks on chips result in secret information leaking while active attacks cause IC malfunction and catastrophic system failures. This thesis focuses on detection and prevention methods against active attacks, in particular, hardware Trojan (HT). Existing HT detection methods have limited capability to detect small-scale HTs and are further challenged by the increased process variation. We propose to use differential Cascade Voltage Switch Logic (DCVSL) method to detect small HTs and achieve a success rate of 66% to 98%. This work also presents different fault tolerant methods to handle the active attacks on symmetric-key cipher SIMON, which is a recent lightweight cipher. Simulation results show that our Even Parity Code SIMON consumes less area and power than double modular redundancy SIMON and Reversed-SIMON, but yields a higher fault -detection-failure rate as the number of concurrent faults increases. In addition, the emerging technology, memristor, is explored to protect SIMON from passive attacks. Simulation results indicate that the memristor-based SIMON has a unique power characteristic that adds new challenges on secrete key extraction
Harnessing resilience: biased voltage overscaling for probabilistic signal processing
A central component of modern computing is the idea that computation requires
determinism. Contrary to this belief, the primary contribution of this work shows that
useful computation can be accomplished in an error-prone fashion. Focusing on low-power
computing and the increasing push toward energy conservation, the work seeks to sacrifice
accuracy in exchange for energy savings.
Probabilistic computing forms the basis for this error-prone computation by diverging from the requirement of determinism and allowing for randomness within computing.
Implemented as probabilistic CMOS (PCMOS), the approach realizes enormous energy sav-
ings in applications that require probability at an algorithmic level. Extending probabilistic
computing to applications that are inherently deterministic, the biased voltage overscaling
(BIVOS) technique presented here constrains the randomness introduced through PCMOS.
Doing so, BIVOS is able to limit the magnitude of any resulting deviations and realizes
energy savings with minimal impact to application quality.
Implemented for a ripple-carry adder, array multiplier, and finite-impulse-response (FIR)
filter; a BIVOS solution substantially reduces energy consumption and does so with im-
proved error rates compared to an energy equivalent reduced-precision solution. When
applied to H.264 video decoding, a BIVOS solution is able to achieve a 33.9% reduction in
energy consumption while maintaining a peak-signal-to-noise ratio of 35.0dB (compared to
14.3dB for a comparable reduced-precision solution).
While the work presented here focuses on a specific technology, the technique realized
through BIVOS has far broader implications. It is the departure from the conventional
mindset that useful computation requires determinism that represents the primary innovation of this work. With applicability to emerging and yet to be discovered technologies,
BIVOS has the potential to contribute to computing in a variety of fashions.PhDCommittee Chair: Anderson, David; Committee Member: Conte, Thomas; Committee Member: Ferri, Bonnie; Committee Member: Hasler, Paul; Committee Member: Mooney, Vincen
On the Resilience of RTL NN Accelerators: Fault Characterization and Mitigation
Machine Learning (ML) is making a strong resurgence in tune with the massive
generation of unstructured data which in turn requires massive computational
resources. Due to the inherently compute- and power-intensive structure of
Neural Networks (NNs), hardware accelerators emerge as a promising solution.
However, with technology node scaling below 10nm, hardware accelerators become
more susceptible to faults, which in turn can impact the NN accuracy. In this
paper, we study the resilience aspects of Register-Transfer Level (RTL) model
of NN accelerators, in particular, fault characterization and mitigation. By
following a High-Level Synthesis (HLS) approach, first, we characterize the
vulnerability of various components of RTL NN. We observed that the severity of
faults depends on both i) application-level specifications, i.e., NN data
(inputs, weights, or intermediate), NN layers, and NN activation functions, and
ii) architectural-level specifications, i.e., data representation model and the
parallelism degree of the underlying accelerator. Second, motivated by
characterization results, we present a low-overhead fault mitigation technique
that can efficiently correct bit flips, by 47.3% better than state-of-the-art
methods.Comment: 8 pages, 6 figure
Design of approximate overclocked datapath
Embedded applications can often demand stringent latency requirements. While high degrees of parallelism within custom FPGA-based accelerators may help to some extent, it may also be necessary to limit the precision used in the datapath to boost the operating frequency of the implementation. However, by reducing the precision, the engineer introduces quantisation error into the design.
In this thesis, we describe an alternative circuit design methodology when considering trade-offs between accuracy, performance and silicon area. We compare two different approaches that could trade accuracy for performance. One is the traditional approach where the precision used in the datapath is limited to meet a target latency. The other is a proposed new approach which simply allows the datapath to operate without timing closure. We demonstrate analytically and experimentally that for many applications it would be preferable to simply overclock the design and accept that timing violations may arise. Since the errors introduced by timing violations occur rarely, they will cause less noise than quantisation errors.
Furthermore, we show that conventional forms of computer arithmetic do not fail gracefully when pushed beyond the deterministic clocking region. In this thesis we take a fresh look at Online Arithmetic, originally proposed for digit serial operation, and synthesize unrolled digit parallel online arithmetic operators to allow for graceful degradation. We quantify the impact of timing violations on key arithmetic primitives, and show that substantial performance benefits can be obtained in comparison to binary arithmetic. Since timing errors are caused by long carry chains, these result in errors in least significant digits with online arithmetic, causing less impact than conventional implementations.Open Acces
Fixed-Point Arithmetic in FPGA
Arithmetic operations are among the most frequently-used operations in contemporary digital integrated circuits. Various structures have been designed, utilizing different features of IC architectures. Nevertheless, there are very few studies that consider the design of arithmetic operations in Field Programmable Gate Arrays (FPGAs), a re-programmable type of digital integrated circuit. This text compares the results achieved when implementation of basic fixed-point arithmetic units in FPGA.
- âŠ