AxOMaP: Designing FPGA-based Approximate Arithmetic Operators using Mathematical Programming
With the increasing application of machine learning (ML) algorithms in
embedded systems, there is a rising necessity to design low-cost computer
arithmetic for these resource-constrained systems. As a result, emerging models
of computation, such as approximate and stochastic computing, that leverage the
inherent error-resilience of such algorithms are being actively explored for
implementing ML inference on resource-constrained systems. Approximate
computing (AxC) aims to provide disproportionate gains in the power,
performance, and area (PPA) of an application by allowing some level of
reduction in its behavioral accuracy (BEHAV). Using approximate operators
(AxOs) for computer arithmetic forms one of the more prevalent methods of
implementing AxC. AxOs provide additional scope for finer-grained optimization
compared to precision scaling of computer arithmetic alone. To
this end, designing platform-specific and cost-efficient approximate operators
forms an important research goal. Recently, multiple works have reported using
AI/ML-based approaches for synthesizing novel FPGA-based AxOs. However, most
such works limit the use of AI/ML to designing ML-based surrogate functions used
during iterative optimization processes. In contrast, we propose a novel data
analysis-driven mathematical programming-based approach to synthesizing
approximate operators for FPGAs. Specifically, we formulate mixed integer
quadratically constrained programs based on the results of correlation analysis
of the characterization data and use the solutions to enable a more directed
search approach for evolutionary optimization algorithms. Compared to
traditional evolutionary algorithms-based optimization, we report up to 21%
improvement in the hypervolume, for joint optimization of PPA and BEHAV, in the
design of signed 8-bit multipliers.
Comment: 23 pages, under review at ACM TRET
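To make the directed-search idea concrete, below is a minimal Python sketch, with entirely synthetic data and made-up names, of one way correlation analysis over operator-characterization data could bias the initial population of an evolutionary search. It illustrates the general approach only, not the paper's actual MIQCP formulation; a solver-derived seed would replace the simple correlation heuristic shown here.

# Illustrative sketch with synthetic data; not the paper's MIQCP formulation.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical characterization data: each row is an operator configuration
# (bit-vector of design decisions), with a measured PPA cost and error (BEHAV).
n_cfg, n_bits = 200, 16
configs = rng.integers(0, 2, size=(n_cfg, n_bits))
ppa = configs @ rng.uniform(0.1, 1.0, n_bits) + rng.normal(0, 0.1, n_cfg)
err = (n_bits - configs.sum(axis=1)) * 0.5 + rng.normal(0, 0.2, n_cfg)

# Scalarize the joint PPA/BEHAV objective and correlate it with each decision bit.
objective = ppa / ppa.max() + err / err.max()
corr = np.array([np.corrcoef(configs[:, j], objective)[0, 1] for j in range(n_bits)])

# Bias initial solutions: bits negatively correlated with the objective are
# more likely to be set. In the actual flow, solutions of the mathematical
# program would provide this seed instead of the heuristic below.
p_set = np.clip(0.5 - corr, 0.05, 0.95)
seeded_population = (rng.random((32, n_bits)) < p_set).astype(int)
print("bit-set probabilities:", np.round(p_set, 2))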
On the Resilience of RTL NN Accelerators: Fault Characterization and Mitigation
Machine Learning (ML) is making a strong resurgence in tune with the massive
generation of unstructured data, which in turn requires massive computational
resources. Due to the inherently compute- and power-intensive structure of
Neural Networks (NNs), hardware accelerators emerge as a promising solution.
However, with technology node scaling below 10 nm, hardware accelerators become
more susceptible to faults, which in turn can impact the NN accuracy. In this
paper, we study the resilience aspects of Register-Transfer Level (RTL) model
of NN accelerators, in particular, fault characterization and mitigation. By
following a High-Level Synthesis (HLS) approach, first, we characterize the
vulnerability of various components of RTL NN. We observed that the severity of
faults depends on both i) application-level specifications, i.e., NN data
(inputs, weights, or intermediate), NN layers, and NN activation functions, and
ii) architectural-level specifications, i.e., data representation model and the
parallelism degree of the underlying accelerator. Second, motivated by
characterization results, we present a low-overhead fault mitigation technique
that can efficiently correct bit flips, performing 47.3% better than
state-of-the-art methods.
Comment: 8 pages, 6 figures
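As a rough illustration of why fault severity depends on the data representation and the bit position struck, the following self-contained Python sketch (hypothetical values, not the paper's RTL fault-injection framework) flips single bits in a fixed-point weight and reports the induced error.

# Toy bit-flip injection into a signed fixed-point weight; illustrative only.
import numpy as np

def to_fixed(x, frac_bits=8, total_bits=16):
    """Quantize a float to signed fixed-point stored as an unsigned integer."""
    return int(np.round(x * (1 << frac_bits))) & ((1 << total_bits) - 1)

def from_fixed(v, frac_bits=8, total_bits=16):
    """Recover the float, interpreting the stored integer as two's complement."""
    if v >= 1 << (total_bits - 1):
        v -= 1 << total_bits
    return v / (1 << frac_bits)

def flip_bit(v, pos):
    return v ^ (1 << pos)

w = 0.75
stored = to_fixed(w)
for pos in (0, 7, 14, 15):  # LSB, mid-magnitude, top magnitude bit, sign bit
    faulty = from_fixed(flip_bit(stored, pos))
    print(f"bit {pos:2d} flipped: {w} -> {faulty:+.4f} (error {abs(faulty - w):.4f})")

Running this shows the expected asymmetry: a low-order flip perturbs the weight negligibly, while a sign-bit flip corrupts it catastrophically, which is why representation choices shape accelerator resilience.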
On Multicast in Asynchronous Networks-on-Chip: Techniques, Architectures, and FPGA Implementation
In this era of exascale computing, conventional synchronous design techniques are facing unprecedented challenges. The consumer electronics market is replete with many-core systems ranging from 16 cores to thousands of cores on chip, integrating multi-billion transistors. However, with this ever-increasing complexity, traditional design approaches face key issues such as increasing chip power, process variability, aging, thermal problems, and scalability. An alternative paradigm that has gained significant interest in the last decade is asynchronous design. Asynchronous designs have several potential advantages: they are naturally energy proportional, burning power only when active; they do not require complex clock distribution; they are robust to different forms of variability; and they provide ease of composability for heterogeneous platforms.

Networks-on-chip (NoCs) are an interconnect paradigm introduced to deal with ever-increasing system complexity. NoCs provide a distributed, scalable, and efficient interconnect solution for today's many-core systems. Moreover, NoCs are a natural match for asynchronous design techniques, as they separate the communication infrastructure and its timing from the computational elements. To this end, globally-asynchronous locally-synchronous (GALS) systems, which interconnect multiple processing cores operating at different clock speeds using an asynchronous NoC, have gained significant interest.

While asynchronous NoCs have several advantages, they also face a key challenge in supporting new types of traffic patterns. One such pattern is multicast communication, where a source sends packets to an arbitrary number of destinations. Multicast is common not only in parallel computing, such as for cache coherency, but also in emerging areas such as neuromorphic computing. This important capability has been largely missing from asynchronous NoCs.

This thesis introduces several efficient multicast solutions for these interconnects. In particular, techniques and network architectures are introduced to support high-performance and low-power multicast. Two leading network topologies are the focus: a variant mesh-of-trees (MoT) and a 2D mesh. In addition, for more realistic implementation and analysis, as well as to significantly advance the field of asynchronous NoCs, this thesis also targets synthesis of these NoCs on commercial FPGAs. While there have been significant advances in FPGA technologies, there has been only limited research on implementing asynchronous NoCs on FPGAs. To this end, a systematic computer-aided design (CAD) methodology has been introduced to efficiently and safely map asynchronous NoCs onto FPGAs. Overall, this thesis makes the following three contributions.

The first contribution is a multicast solution for a variant MoT network topology. This topology consists of simple low-radix switches and has been used in high-performance computing platforms. A novel local speculation technique is introduced, in which a subset of the network's switches are speculative and always broadcast every packet. These switches are very simple and have high performance. Speculative switches are surrounded by non-speculative ones that route packets based on their destinations and also throttle any redundant copies created by the former. This hybrid network architecture achieved significant performance and power benefits over other multicast approaches.
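The following toy Python model (illustrative only, not the thesis's asynchronous hardware) captures the division of labor in local speculation: a speculative switch broadcasts every packet unconditionally, while downstream non-speculative switches forward only the copies whose destination set they serve and drop the rest.

# Toy model of local speculation; all structures here are hypothetical.
def speculative_switch(packet, n_outputs):
    """Broadcast unconditionally: cheap and fast, but creates redundant copies."""
    return [packet] * n_outputs

def nonspeculative_switch(packet, served_destinations):
    """Route by destination; throttle copies with no destination in this subtree."""
    wanted = packet["dests"] & served_destinations
    return {**packet, "dests": wanted} if wanted else None

packet = {"payload": "data", "dests": {2, 5}}
copies = speculative_switch(packet, n_outputs=2)
subtrees = [{0, 1, 2, 3}, {4, 5, 6, 7}]  # destinations reachable per output
delivered = [nonspeculative_switch(c, s) for c, s in zip(copies, subtrees)]
print(delivered)  # each surviving copy carries only the destinations it serves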
The second contribution is a multicast solution for a 2D-mesh topology, which is more complex, with higher-radix switches, and is also more commonly used. A novel continuous-time replication strategy is introduced to optimize the critical multi-way forking operation of a multicast transmission. In this technique, a multicast packet is first stored in an input port of a switch, from where it is sent through distinct output ports towards different destinations concurrently, at each output's own rate and in continuous time. This strategy is shown to have significant latency and energy benefits over an approach that performs multicast using multiple distinct serial unicasts to each destination.

Finally, a systematic CAD methodology is introduced to synthesize asynchronous NoCs on commercial FPGAs. A two-fold goal is targeted: correctness and high performance. For ease of implementation, only existing FPGA synthesis tools are used. Moreover, since asynchronous NoCs involve special asynchronous components, a comprehensive guide is introduced to map these elements correctly and efficiently. Two asynchronous NoC switches are synthesized using the proposed approach on a leading Xilinx FPGA in 28 nm: one that handles only unicast, and another that also supports multicast. Both showed significant energy benefits, with some performance gains, over a state-of-the-art synchronous switch.
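As a back-of-the-envelope intuition for the second contribution's continuous-time replication, the sketch below (hypothetical latencies, not measured results) contrasts a fork that drives all output ports concurrently, each at its own rate, with serial unicasts: the fork's latency tracks the slowest branch rather than the sum of the branches.

# First-order latency comparison at a 3-way multicast fork; numbers are made up.
port_latencies_ns = [4.0, 6.5, 5.2]  # per-output transfer time for each branch

serial_unicast = sum(port_latencies_ns)           # copies sent one after another
continuous_replication = max(port_latencies_ns)   # copies depart concurrently

print(f"serial unicasts:        {serial_unicast:.1f} ns")
print(f"continuous replication: {continuous_replication:.1f} ns")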
AxOCS: Scaling FPGA-based Approximate Operators using Configuration Supersampling
The rising usage of AI and ML-based processing across application domains has
exacerbated the need for low-cost ML implementation, specifically for
resource-constrained embedded systems. To this end, approximate computing, an
approach that explores the power, performance, area (PPA), and behavioral
accuracy (BEHAV) trade-offs, has emerged as a possible solution for
implementing embedded machine learning. Due to the predominance of MAC
operations in ML, designing platform-specific approximate arithmetic operators
forms one of the major research problems in approximate computing. Recently,
there has been a rising usage of AI/ML-based design space exploration
techniques for implementing approximate operators. However, most of these
approaches are limited to using ML-based surrogate functions for predicting the
PPA and BEHAV impact of a set of related design decisions. While this approach
leverages the regression capabilities of ML methods, it does not exploit the
more advanced approaches in ML. To this end, we propose AxOCS, a methodology
for designing approximate arithmetic operators through ML-based supersampling.
Specifically, we present a method to leverage the correlation of PPA and BEHAV
metrics across operators of varying bit-widths for generating larger bit-width
operators. The proposed approach involves traversing the relatively smaller
design space of smaller bit-width operators and employing its associated
Design-PPA-BEHAV relationship to generate initial solutions for
metaheuristics-based optimization for larger operators. The experimental
evaluation of AxOCS for FPGA-optimized approximate operators shows that the
proposed approach significantly improves the quality, i.e., the resulting
hypervolume for multi-objective optimization, of 8x8 signed approximate
multipliers.
Comment: 11 pages, under review with IEEE TCAS-
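A hedged sketch of the supersampling idea follows, using entirely synthetic data and a plain linear surrogate as a stand-in for AxOCS's actual models: the configuration-to-quality relationship is learned on an exhaustively explorable small bit-width design space, then used to rank candidate configurations of a larger operator and seed its metaheuristic search.

# Synthetic illustration of cross-bit-width supersampling; not AxOCS itself.
import numpy as np

rng = np.random.default_rng(1)

# Exhaustive characterization of a small operator's design space:
# 8 binary design decisions -> scalar quality (e.g., dominated hypervolume).
small_bits = 8
small_cfgs = np.array([[(i >> b) & 1 for b in range(small_bits)]
                       for i in range(2 ** small_bits)], dtype=float)
true_w = rng.normal(0, 1, small_bits)
small_quality = small_cfgs @ true_w + rng.normal(0, 0.1, len(small_cfgs))

# Fit a simple linear surrogate on the small design space.
w_hat, *_ = np.linalg.lstsq(small_cfgs, small_quality, rcond=None)

# Score random configurations of a larger operator by tiling the learned
# per-decision weights -- a crude stand-in for the cross-bit-width transfer.
large_bits = 16
candidates = rng.integers(0, 2, size=(1000, large_bits)).astype(float)
scores = candidates @ np.tile(w_hat, 2)
seeds = candidates[np.argsort(scores)[-32:]]  # top candidates seed the optimizer
print("best predicted quality:", scores.max())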
Design Space Exploration of Sparsity-Aware Application-Specific Spiking Neural Network Accelerators
Spiking Neural Networks (SNNs) offer a promising alternative to Artificial
Neural Networks (ANNs) for deep learning applications, particularly in
resource-constrained systems. This is largely due to their inherent sparsity,
influenced by factors such as the input dataset, the length of the spike train,
and the network topology. While a few prior works have demonstrated the
advantages of incorporating sparsity into the hardware design, especially in
terms of reducing energy consumption, the impact on hardware resources has not
yet been explored. This is where design space exploration (DSE) becomes
crucial, as it allows for the optimization of hardware performance by tailoring
both the hardware and model parameters to suit specific application needs.
However, DSE can be extremely challenging given the potentially large design
space and the interplay of hardware architecture design choices and
application-specific model parameters.
In this paper, we propose a flexible hardware design that leverages the
sparsity of SNNs to identify highly efficient, application-specific accelerator
designs. We develop a high-level, cycle-accurate simulation framework for this
hardware and demonstrate the framework's benefits in enabling detailed and
fine-grained exploration of SNN design choices, such as the layer-wise
logical-to-hardware ratio (LHR). Our experimental results show that our design
can (i) achieve a substantial reduction in hardware resources and (ii) deliver
a considerable speed increase while requiring fewer hardware resources than
sparsity-oblivious designs. We further showcase the robustness of our framework
by varying spike train lengths with different neuron population sizes to find
the optimal trade-off points between accuracy and hardware latency.
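To illustrate how sparsity and the layer-wise LHR might interact in such an exploration, here is a first-order Python cycle model (hypothetical numbers and a simplified time-multiplexing assumption, not the paper's cycle-accurate simulator): per-layer latency scales with the spikes actually processed divided by the physical neurons available, where the LHR sets how many logical neurons share one processing element.

# First-order latency estimate for a sparsity-aware SNN layer; illustrative only.
def layer_cycles(spike_count, logical_neurons, lhr, cycles_per_spike=1):
    """Estimate cycles for one layer under time-multiplexed physical neurons."""
    hw_neurons = max(1, logical_neurons // lhr)
    # Spikes are spread over the physical neurons; sparsity directly cuts work.
    return -(-spike_count * cycles_per_spike // hw_neurons)  # ceiling division

# A hypothetical 3-layer SNN with per-timestep spike counts.
layers = [  # (spikes, logical neurons, LHR)
    (1200, 784, 8),
    (300, 256, 4),
    (40, 10, 1),
]
total = sum(layer_cycles(s, n, r) for s, n, r in layers)
print("estimated cycles per timestep:", total)

Raising a layer's LHR shrinks its hardware footprint but lengthens its latency, which is exactly the kind of trade-off point a DSE over LHR settings would sweep.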