Network-on-Chip Synchronization
Technology scaling has enabled the number of cores within a System on Chip (SoC) to increase significantly. Globally Asynchronous Locally Synchronous (GALS) systems using Dynamic Voltage and Frequency Scaling (DVFS) operate each of these cores in a distinct, dynamically varying clock domain. The main communication method between these cores is increasingly likely to be a Network-on-Chip (NoC). Typically, the interfaces between these clock domains incur multi-cycle synchronization latencies because they use “brute-force” synchronizers. This dissertation aims to improve the performance of NoCs, and thereby of SoCs as a whole, by reducing this synchronization latency.
First, a survey of NoC improvement techniques is presented. One such technique, a multi-layer NoC, has been successfully simulated. Because DVFS is one of the most commonly used techniques, a thorough analysis and simulation of brute-force synchronizer circuits in both current and future process technologies is presented. Unfortunately, a multi-cycle latency is unavoidable when using brute-force synchronizers, so predictive synchronizers, which require only a single cycle of latency, have been proposed.
To demonstrate the impact of these predictive synchronizer circuits at a high level, multi-core system simulations incorporating these circuits have been completed. Multiple forms of GALS NoC configurations have been simulated, including multi-synchronous, NoC-synchronous, and single-synchronizer. Speedup on the SPLASH benchmark suite was measured to directly quantify the performance benefit of predictive synchronizers in a full system. Additionally, Mean Time Between Failures (MTBF) has been calculated for each NoC synchronizer configuration to determine the reliability benefit possible when using predictive synchronizers.
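The abstract does not state how MTBF was computed. As a rough illustration of why brute-force synchronizers trade latency for reliability, the sketch below uses the widely cited textbook metastability model, MTBF = exp(t_r / tau) / (T_w * f_c * f_d). It is not the dissertation's method, and every parameter value and function name here is a hypothetical placeholder rather than a figure from the work.

```python
import math


def synchronizer_mtbf(resolve_time_s, tau_s, window_s, clock_hz, data_hz):
    """Textbook model: MTBF = exp(t_r / tau) / (T_w * f_c * f_d), in seconds."""
    failure_rate = (
        window_s * clock_hz * data_hz * math.exp(-resolve_time_s / tau_s)
    )
    return 1.0 / failure_rate


def brute_force_mtbf(stages, clock_hz, data_hz, tau_s, window_s):
    """MTBF of an N-flop brute-force synchronizer.

    Simplifying assumption: every stage after the first contributes one full
    clock period of resolution time (setup time and wire delay are ignored).
    """
    if stages < 2:
        raise ValueError("a brute-force synchronizer needs at least two stages")
    resolve_time_s = (stages - 1) / clock_hz
    return synchronizer_mtbf(resolve_time_s, tau_s, window_s, clock_hz, data_hz)


# Hypothetical parameters, chosen only to make the example concrete.
CLOCK_HZ = 1e9      # receiving clock frequency
DATA_HZ = 1e8       # rate of asynchronous data transitions
TAU_S = 20e-12      # metastability resolution time constant
WINDOW_S = 30e-12   # metastability window

two_flop = brute_force_mtbf(2, CLOCK_HZ, DATA_HZ, TAU_S, WINDOW_S)
three_flop = brute_force_mtbf(3, CLOCK_HZ, DATA_HZ, TAU_S, WINDOW_S)

print(f"two-flop MTBF:   {two_flop:.3e} s")
print(f"three-flop MTBF: {three_flop:.3e} s")
```

Under this model each added stage multiplies MTBF by exp(T_clk / tau) at the cost of one more cycle of latency, which is the trade-off the predictive synchronizers aim to avoid.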
Refresh Triggered Computation: Improving the Energy Efficiency of Convolutional Neural Network Accelerators
To employ a Convolutional Neural Network (CNN) in an energy-constrained
embedded system, it is critical for the CNN implementation to be highly energy
efficient. Many recent studies propose CNN accelerator architectures with
custom computation units that try to improve energy-efficiency and performance
of CNNs by minimizing data transfers from DRAM-based main memory. However, in
these architectures, DRAM is still responsible for half of the overall energy
consumption of the system, on average. A key factor of the high energy
consumption of DRAM is the refresh overhead, which is estimated to consume 40%
of the total DRAM energy. In this paper, we propose a new mechanism, Refresh
Triggered Computation (RTC), that exploits the memory access patterns of CNN
applications to reduce the number of refresh operations. We propose three RTC
designs (min-RTC, mid-RTC, and full-RTC), each of which requires a different
level of aggressiveness in terms of customization to the DRAM subsystem. All of
our designs have small overhead. Even the most aggressive RTC design (i.e.,
full-RTC) imposes an area overhead of only 0.18% in a 16 Gb DRAM chip and can
have less overhead for denser chips. Our experimental evaluation on six
well-known CNNs shows that RTC reduces average DRAM energy consumption by 24.4%
and 61.3%, for the least aggressive and the most aggressive RTC
implementations, respectively. Besides CNNs, we also evaluate our RTC mechanism
on three workloads from other domains. We show that RTC saves 31.9% and 16.9%
DRAM energy for Face Recognition and Bayesian Confidence Propagation Neural
Network (BCPNN), respectively. We believe RTC can be applied to other
applications whose memory access patterns remain predictable for a sufficiently
long time.
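The abstract reports savings relative to DRAM energy only. A back-of-envelope sketch can translate them into a share of total system energy. It assumes, purely for illustration, that the "half of the overall energy" average applies to the evaluated systems and that non-DRAM energy is unchanged by RTC; the function name is hypothetical and the resulting percentages are estimates, not results from the paper.

```python
def system_saving(dram_share, dram_saving):
    """Fraction of total system energy saved when only DRAM energy shrinks."""
    if not (0.0 <= dram_share <= 1.0 and 0.0 <= dram_saving <= 1.0):
        raise ValueError("shares and savings must be fractions in [0, 1]")
    return dram_share * dram_saving


# Figures quoted in the abstract.
DRAM_SHARE = 0.5               # DRAM share of system energy, on average
REFRESH_SHARE_OF_DRAM = 0.40   # estimated refresh share of DRAM energy
DRAM_SAVINGS = {
    "least aggressive RTC": 0.244,
    "most aggressive RTC": 0.613,
}

# Assumption: the average DRAM share holds and non-DRAM energy is unchanged.
refresh_share_of_system = DRAM_SHARE * REFRESH_SHARE_OF_DRAM
estimates = {
    name: system_saving(DRAM_SHARE, saving)
    for name, saving in DRAM_SAVINGS.items()
}

print(f"refresh: about {refresh_share_of_system:.0%} of system energy")
for name, value in estimates.items():
    print(f"{name}: about {value:.1%} of system energy")
```

Under these assumptions the reported DRAM savings would correspond to roughly 12% and 31% of total system energy.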
Automated Debugging Methodology for FPGA-based Systems
Electronic devices make up a vital part of our lives, from mobile phones, laptops, and computers to home automation, to name a few. Modern designs comprise billions of transistors. However, with this evolution, ensuring that the devices fulfill the designer’s expectations under variable conditions has also become a great challenge. This requires a lot of design time and effort. Whenever an error is encountered, the process is restarted. Hence, it is desirable to minimize the number of spins required to achieve an error-free product, as each spin results in a loss of time and effort.
Software-based simulation is the main technique for verifying a design before fabrication. However, a few design errors (bugs) are likely to escape the simulation process. Such bugs subsequently appear during the post-silicon phase. Finding such bugs is time-consuming due to the inherent invisibility of the hardware. Instead of software simulation of the design in the pre-silicon phase, post-silicon techniques permit the designers to verify the functionality through physical implementations of the design. The main benefit of this methodology is that the implemented design in the post-silicon phase runs many orders of magnitude faster than its pre-silicon counterpart. This allows the designers to validate their design more exhaustively.
This thesis presents five main contributions to enable a fast and automated debugging solution for reconfigurable hardware. During the research work, we used an obstacle avoidance system for robotic vehicles as a use case to illustrate how to apply the proposed debugging solution in practical environments.
The first contribution presents a debugging system capable of providing a lossless trace of debugging data, which permits a cycle-accurate replay. This methodology ensures that permanent as well as intermittent errors in the implemented design are captured. The contribution also describes a solution to enhance hardware observability. It is proposed to utilize processor-configurable concentration networks, to employ debug data compression to transmit the data more efficiently, and to partially reconfigure the debugging system at run-time, which saves the time required for design re-compilation and preserves timing closure.
The second contribution presents a solution for communication-centric designs. Solutions for designs with multiple clock domains are also discussed.
The third contribution presents a priority-based signal selection methodology to identify the signals which can be more helpful during the debugging process. A connectivity generation tool is also presented which can map the identified signals to the debugging system.
The fourth contribution presents an automated error detection solution which can help in capturing permanent as well as intermittent errors without continuous monitoring of debugging data. The proposed solution works even for designs that lack a golden reference.
The fifth contribution proposes to use artificial intelligence for post-silicon debugging. We present a novel idea of using a recurrent neural network for debugging when a golden reference is available for training the network. Furthermore, the idea is extended to designs where a golden reference is not available.