How to train accurate BNNs for embedded systems?
A key enabler of deploying convolutional neural networks on
resource-constrained embedded systems is the binary neural network (BNN). BNNs
save on memory and simplify computation by binarizing both features and
weights. Unfortunately, binarization is inevitably accompanied by a severe
decrease in accuracy. To reduce the accuracy gap between binary and
full-precision networks, many repair methods have been proposed in the recent
past, which we have classified and put into a single overview in this chapter.
The repair methods are divided into two main branches, training techniques and
network topology changes, which can further be split into smaller categories.
The latter category introduces additional cost (energy consumption or
additional area) for an embedded system, while the former does not. From our
overview, we observe that progress has been made in reducing the accuracy gap,
but BNN papers are not aligned on what repair methods should be used to get
highly accurate BNNs. Therefore, this chapter contains an empirical review that
evaluates the benefits of many repair methods in isolation over the
ResNet-20 & CIFAR10 and ResNet-18 & CIFAR100 benchmarks. We found three repair
categories most beneficial: feature binarizer, feature normalization, and
double residual. Based on this review we discuss future directions and research
opportunities. We also sketch the benefits and costs associated with BNNs on
embedded systems, because it remains to be seen whether BNNs can close the
accuracy gap while remaining highly energy-efficient on resource-constrained
embedded systems.
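
As an illustration of why binarization simplifies computation (this is not code from the chapter), the sketch below shows in plain NumPy how a dot product over sign-binarized features and weights reduces to an XNOR followed by a popcount; all values are randomly generated.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 64
x = rng.standard_normal(n)          # full-precision activations
w = rng.standard_normal(n)          # full-precision weights

# Binarize with the sign function (the usual BNN feature/weight binarizer).
xb = np.where(x >= 0, 1, -1)
wb = np.where(w >= 0, 1, -1)

# Reference: ordinary dot product on the binarized +/-1 values.
ref = int(xb @ wb)

# Bit-level equivalent: encode +1 as bit 1, -1 as bit 0, then XNOR + popcount.
x_bits = (xb > 0)
w_bits = (wb > 0)
matches = np.count_nonzero(~(x_bits ^ w_bits))   # popcount of XNOR
xnor_dot = 2 * matches - n                       # matches minus mismatches

assert ref == xnor_dot
print(ref, xnor_dot)
```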
How Flexible is Your Computing System?
In literature, computer architectures are frequently claimed to be highly flexible, typically implying the existence of trade-offs between flexibility and performance or energy efficiency. Processor flexibility, however, is not sharply defined, and consequently these claims cannot be validated, nor can such hypothetical relations be fully understood and exploited in the design of computing systems. This paper is an attempt to introduce scientific rigour to the notion of flexibility in computing systems. A survey is conducted to provide an overview of references to flexibility in literature, both in the computer architecture domain and in related fields. A classification is introduced to categorize different views on flexibility, which ultimately form the foundation for a qualitative definition of flexibility. Building on this qualitative definition, a generic quantifiable metric is proposed, enabling valid quantitative comparison of the flexibility of various architectures. To validate the proposed method, and to evaluate the relation between the proposed metric and the general notion of flexibility, the flexibility metric is measured for 25 computing systems, including CPUs, GPUs, DSPs, and FPGAs, as well as 40 ASIPs taken from literature. The obtained results provide insights into some of the speculative trade-offs between flexibility and properties such as energy efficiency and area efficiency. Overall, the proposed quantitative flexibility metric proves commensurate with some generally accepted qualitative notions of flexibility collected in the survey, although some surprising discrepancies can also be observed. The proposed metric and the obtained results are placed into the context of the state of the art on compute flexibility, and an extensive reflection provides not only a complete overview of the field but also discusses possible alternative approaches and open issues. Note that this work does not aim to provide a final answer to the definition of flexibility, but rather provides a framework to initiate a broader discussion in the computer architecture community on defining, understanding, and ultimately taking advantage of flexibility.
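
The metric proposed in the paper is not reproduced here; purely as a toy illustration of what a quantitative flexibility score could look like, the sketch below scores hypothetical architectures by how evenly they retain normalized performance across a workload set (the architecture names and numbers are invented).

```python
import numpy as np

# Toy example only -- NOT the paper's metric. Assume normalized performance
# per workload (1.0 = best-in-class) was measured for each architecture.
perf = {
    "CPU":  [0.6, 0.7, 0.5, 0.6, 0.7],
    "GPU":  [1.0, 0.2, 0.9, 0.1, 0.8],
    "FPGA": [0.8, 0.8, 0.7, 0.9, 0.8],
}

for arch, p in perf.items():
    p = np.asarray(p)
    # Geometric mean rewards architectures that do reasonably well everywhere.
    flexibility = float(np.exp(np.mean(np.log(p))))
    print(f"{arch:5s} toy flexibility score: {flexibility:.2f}")
```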
CGRA-EAM - Rapid Energy and Area Estimation for Coarse-grained Reconfigurable Architectures
Reconfigurable architectures are quickly gaining popularity due to their flexibility and ability to provide high energy efficiency. However, reconfigurable systems allow for a huge design space. Iterative design space exploration (DSE) is often required to achieve good Pareto points with respect to some combination of performance, area, and/or energy. DSE tools depend on information about hardware characteristics in these aspects. These characteristics can be obtained from hardware synthesis and netlist simulation, but this is very time-consuming. Therefore, architecture models are common. This work introduces CGRA-EAM (Coarse-Grained Reconfigurable Architecture - Energy & Area Model), an energy and area estimation framework for coarse-grained reconfigurable architectures. The model is evaluated for the Blocks CGRA. The results demonstrate that the mean absolute percentage error is 15.5% and 2.1% for energy and area, respectively, while the model achieves a speedup of close to three orders of magnitude compared to synthesis.
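
For reference, the accuracy figures quoted above are mean absolute percentage errors; the snippet below shows that standard calculation on hypothetical per-benchmark energy numbers (the values are not from the paper).

```python
import numpy as np

def mape(actual, predicted):
    """Mean absolute percentage error between reference and estimated values."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return 100.0 * np.mean(np.abs(actual - predicted) / actual)

# Hypothetical per-benchmark energy numbers: synthesis reference vs. model estimate.
energy_synth = [12.0, 8.5, 20.1, 5.3]
energy_model = [13.1, 7.9, 23.0, 5.0]
print(f"energy MAPE: {mape(energy_synth, energy_model):.1f}%")
```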
Delay Prediction for ASIC HLS: Comparing Graph-Based and Nongraph-Based Learning Models
While high-level synthesis (HLS) tools offer faster design of hardware accelerators with different area versus delay tradeoffs, HLS-based delay estimates often deviate significantly from results obtained from ASIC logic synthesis (LS) tools. Current HLS tools rely on simple additive delay models which fail to capture the downstream optimizations performed during LS and technology mapping. Inaccurate delay estimates prevent fast and accurate design-space exploration without performing time-consuming LS tasks. In this work, we exploit different machine learning models which automatically learn to map the different downstream optimizations onto the HLS critical paths. In particular, we compare graph-based and nongraph-based learning models to investigate their efficacy, and devise hybrid models to get the best of both worlds. To carry out our learning-assisted methodology, we create a dataset of different HLS benchmarks and develop an automated framework, which extends a commercial HLS toolchain, to extract essential information from LS critical paths and automatically match this information to HLS paths. This is a nontrivial task to perform manually due to the difference in levels of abstraction. Finally, we train the proposed hybrid models through inductive learning and integrate them into the commercial HLS toolchain to improve delay prediction accuracy. Experimental results demonstrate significant improvements in delay estimation accuracy across a wide variety of benchmark designs. We demonstrate that the graph-based models can infer essential structural features from the input design, while incorporating them into traditional nongraph-based models can significantly improve model accuracy. Such 'hybrid' models can improve delay prediction accuracy by 93% compared to simple additive models and provide a 175× speedup compared to LS. Furthermore, we discuss key insights from our experiments, identifying the influence of different HLS features on model performance.
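
To make the hybrid idea concrete, here is a hedged sketch on synthetic data (not the paper's framework, features, or dataset): structural features of the kind a graph-based model would extract, such as path depth and fan-out, are concatenated with the HLS tool's additive delay estimate and fed to a conventional nongraph regressor that predicts post-synthesis path delay.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(1)
n_paths = 200

additive_delay = rng.uniform(1.0, 10.0, n_paths)   # HLS additive estimate (ns)
path_depth     = rng.integers(2, 30, n_paths)       # structural feature (toy)
max_fanout     = rng.integers(1, 16, n_paths)       # structural feature (toy)

# Toy ground truth: logic synthesis optimizes away part of the additive delay.
true_delay = (0.7 * additive_delay + 0.05 * path_depth + 0.02 * max_fanout
              + rng.normal(0, 0.2, n_paths))

# Concatenate nongraph (additive) and graph-derived (structural) features.
X = np.column_stack([additive_delay, path_depth, max_fanout])
model = GradientBoostingRegressor().fit(X[:150], true_delay[:150])
pred = model.predict(X[150:])
print("mean abs error (ns):", np.mean(np.abs(pred - true_delay[150:])))
```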
Memory and Parallelism Analysis Using a Platform-Independent Approach
Emerging computing architectures such as near-memory computing (NMC) promise
improved performance for applications by reducing the data movement between CPU
and memory. However, detecting such applications is not a trivial task. In this
ongoing work, we extend the state-of-the-art platform-independent software
analysis tool with NMC-related metrics such as memory entropy, spatial
locality, data-level, and basic-block-level parallelism. These metrics help to
identify the applications more suitable for NMC architectures. (22nd ACM
International Workshop on Software and Compilers for Embedded Systems, SCOPES '19, May 2019.)
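
A minimal sketch of one of the listed metrics (memory entropy), computed over toy address traces rather than the tool's real instrumentation: the Shannon entropy of the accessed-address distribution, where higher entropy roughly indicates the more irregular, data-movement-heavy access patterns that NMC targets.

```python
import numpy as np
from collections import Counter

def memory_entropy(addresses):
    """Shannon entropy (bits) of the distribution of accessed addresses."""
    counts = np.array(list(Counter(addresses).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

# Toy traces of equal length: heavy reuse of two cache lines vs. random accesses.
reuse_trace  = [0x1000, 0x1040] * 8
random_trace = np.random.default_rng(0).integers(0, 1 << 20, 16).tolist()

print("reuse trace entropy :", memory_entropy(reuse_trace))
print("random trace entropy:", memory_entropy(random_trace))
```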
MTTR reduction of FPGA scrubbing: Exploring SEU sensitivity
SRAM-based FPGAs are widely used for developing many critical systems due to their huge capacity, flexibility, and high performance. However, their applicability in the presence of Single Event Upsets (SEUs) must be ensured using mitigation techniques for specific critical systems, e.g. space applications or the automotive industry. Among all existing mitigation techniques, the scrubbing scheme is considered the most reliable, as it avoids SEU accumulation in Configuration Memory (CM), the most SEU-vulnerable component of an SRAM-based FPGA. In spite of that, the error repair time attained by scrubbing, expressed as the Mean Time to Repair (MTTR), is a pressing concern for real-time systems. To reduce MTTR, state-of-the-art scrubbing methods consider the impact of SEUs in CM bits on correct circuit operation. In this paper, we examine MTTR reduction taking into account multiple proposed precision levels for identifying CM sensitive bits, i.e. bits that have an adverse impact on correct circuit operation when affected by an SEU. Two scrubbing methods are proposed based on these precision levels. Experimental results show an average MTTR reduction of 20%, 45%, and 46.5% when the proposed precision-aware scrubbing methods take into account sensitive bits identified with low, medium, and high precision, respectively, compared to very low precision. Experiments also show that a higher MTTR reduction is achievable for a non-uniform structure like FFT (about 68%). Thus, the cost of distinguishing sensitive bits with higher precision is worthwhile only when the circuit has a non-uniform structure in CM.
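
A back-of-the-envelope sketch of why targeting only sensitive configuration frames lowers MTTR (all numbers below are hypothetical and this is not the paper's model or data): with periodic scrubbing, an upset is on average repaired after roughly half a scrub cycle, so scrubbing fewer frames shortens the cycle and hence the MTTR proportionally.

```python
# All values are hypothetical, for illustration only.
total_frames = 10_000          # configuration frames in the device
frame_scrub_time_us = 10.0     # time to scrub one frame (microseconds)

def mean_time_to_repair_us(frames_scrubbed):
    scrub_cycle = frames_scrubbed * frame_scrub_time_us
    return scrub_cycle / 2.0   # upset hits a uniformly random point in the cycle

blind = mean_time_to_repair_us(total_frames)
# Hypothetical fractions of frames flagged as sensitive at each precision level.
for label, sensitive_fraction in [("low", 0.7), ("medium", 0.5), ("high", 0.3)]:
    aware = mean_time_to_repair_us(int(total_frames * sensitive_fraction))
    print(f"{label:6s} precision: {100 * (1 - aware / blind):.0f}% MTTR reduction")
```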