Search CORE

5 research outputs found

Energy-Efficient Digital Signal Processing Hardware Design.

Author: Jeon Dongsuk
Publication venue
Publication date: 01/01/2014
Field of study

As CMOS technology has developed considerably in the last few decades, many SoCs have been implemented across different application areas due to reduced area and power consumption. Digital signal processing (DSP) algorithms are frequently employed in these systems to achieve more accurate operation or faster computation. However, CMOS technology scaling started to slow down recently and relatively large systems consume too much power to rely only on the scaling effect while system power budget such as battery capacity improves slowly. In addition, there exist increasing needs for miniaturized computing systems including sensor nodes that can accomplish similar operations with significantly smaller power budget. Voltage scaling is one of the most promising power saving techniques due to quadratic switching power reduction effect, making it necessary feature for even high-end processors. However, in order to achieve maximum possible energy efficiency, systems should operate in near or sub-threshold regimes where leakage takes significant portion of power. In this dissertation, a few key energy-aware design approaches are described. Considering prominent leakage and larger PVT variability in low operating voltages, multi-level energy saving techniques to be described are applied to key building blocks in DSP applications: architecture study, algorithm-architecture co-optimization, and robust yet low-power memory design. Finally, described approaches are applied to design examples including a visual navigation accelerator, ultra-low power biomedical SoC and face detection/recognition processor, resulting in 2~100 times power savings than state-of-the-art.PhDElectrical EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/110496/1/djeon_1.pd

Deep Blue Documents at the University of Michigan

Design of Energy Efficient and Dependable Health Monitoring Systems under Unreliable Nanometer Technologies

Author: Aly Mohamed Mostafa Sabry
Atienza Alonso David
Burg Andreas Peter
Karakonstantis Georgios
Publication venue: 'European Alliance for Innovation n.o.'
Publication date: 01/01/2012
Field of study

In this paper we investigate the impact of potential hardware misbehavior induced by reliability issues and scaled voltages in wireless body sensor network (WBSN) nodes. Our study reveals the inherent resilience of popular algorithms in cardiac monitoring applications and argues that by exploiting the unique characteristics of such algorithms the energy efficiency and reliability of such systems can be significantly improved. This is achieved by developing a cross-layer design paradigm that utilizes low cost techniques at the hardware and software layers and by optimizing the synergy between them in order to provide intelligent trade-offs between energy, performance and quality. The main idea of the proposed approach is the selective application of costly robust techniques only to the most critical tasks identified at the application layer that are detrimental for obtaining sufficient output quality. Our results show that by ensuring the correct operation of only 37% of the total computations in an electrocardiogram (ECG) monitoring WBSN node we can achieve up to 70% power savings with only 9% degradation in ECG output quality

Infoscience - École polytechnique fédérale de Lausanne

Crossref

Design of robust ultra-low power platform for in-silicon machine learning

Author: Zhang Sai
Publication venue
Publication date: 01/12/2016
Field of study

The rapid development of machine learning plays a key role in enabling next generation computing systems with enhanced intelligence. Present day machine learning systems adopt an "intelligence in the cloud" paradigm, resulting in heavy energy cost despite state-of-the-art performance. It is therefore of great interest to design embedded ultra-low power (ULP) platforms with in-silicon machine learning capability. A self-contained ULP platform consists of the energy delivery, sensing and information processing subsystems. This dissertation proposes techniques to design and optimize the ULP platform for in-silicon machine learning by exploring a trade-off that exists between energy-efficiency and robustness. This trade-off arises when the information processing functionality is integrated into the energy delivery, sensing, or emerging stochastic fabrics (e.g., CMOS operating in near-threshold voltage or voltage overscaling, and beyond CMOS devices). This dissertation presents the Compute VRM (C-VRM) to embed the information processing into the energy delivery subsystem. The C-VRM employs multiple voltage domain stacking and core swapping to achieve high total system energy efficiency in near/sub-threshold region. A prototype IC of the C-VRM is implemented in a 1.2 V, 130 nm CMOS process. Measured results indicate that the C-VRM has up to 44.8% savings in system-level energy per operation compared to the conventional system, and an efficiency ranging from 79% to 83% over an output voltage range of 0.52 V to 0.6 V. This dissertation further proposes the Compute Sensor approach to embed information processing into the sensing subsystem. The Compute Sensor eliminates both the traditional sensor-processor interface, and the high-SNR/high-energy digital processing by moving feature extraction and classification functions into the analog domain. Simulation results in 65 nm CMOS show that the proposed Compute Sensor can achieve a detection accuracy greater than 94.7% using the Caltech101 dataset, which is within 0.5% of that achieved by an ideal digital implementation. The performance is achieved with 7x to 17x lower energy than the conventional architecture for the same level of accuracy. To further explore the energy-efficiency vs. robustness trade-off, this dissertation explores the use of highly energy efficient but unreliable stochastic fabrics to implement in-silicon machine learning kernels. In order to perform reliable computation on the stochastic fabrics, this dissertation proposes to employ statistical error compensation (SEC) as an effective error compensation technique. This dissertation makes a contribution to the portfolio of SEC by proposing embedded algorithmic noise tolerance (E-ANT) for low overhead error compensation. E-ANT operates by reusing part of the main block as estimator and thus embedding the estimator into the main block. System level simulation results in a commercial 45 nm CMOS process show that E-ANT achieves up to 38% error tolerance and up to 51% energy savings compared with an uncompensated system. This dissertation makes a contribution to the theoretical understanding of stochastic fabrics by proposing a class of probabilistic error models that can accurately model the hardware errors on the stochastic fabrics. The models are validated in a commercial 45 nm CMOS process and employed to evaluate the performance of machine learning kernels in the presence of hardware errors. Performance prediction of a support vector machine (SVM) based classifier using these models indicates that the probability of detection P_{det} estimated using the proposed model is within 3% for timing errors due to voltage overscaling when the error rate p_η ≤ 80%, within 5% for timing errors due to process variation in near threshold-voltage (NTV) region (0.3 V-0.7 V) and within 2% for defect errors when the defect rate p_{saf} is between 10^{-3} and 20%, compared with HDL simulation results. Employing the proposed error model and evaluation methodology, this dissertation explores the use of distributed machine learning architectures, named classifier ensemble, to enhance the robustness of in-silicon machine learning kernels. Comparative study of distributed architectures (i.e., random forest (RF)) and centralized architectures (i.e., SVM) is performed in a commercial 45 nm CMOS process. Employing the UCI machine learning repository as input, it is determined that RF-based architectures are significantly more robust than SVM architectures in presence of timing errors in the NTV region (0.3 V- 0.7 V). Additionally, an error weighted voting technique that incorporates the timing error statistics of the NTV circuit fabric is proposed to further enhance the robustness of RF architectures. Simulation results confirm that the error weighted voting technique achieves a P_{det} that varies by only 1.4%, which is 12x lower compared to centralized architectures

Illinois Digital Environment for Access to Learning and Scholarship Repository

Approximate and timing-speculative hardware design for high-performance and energy-efficient video processing

Author: Paim Guilherme Pereira
Publication venue
Publication date: 01/01/2021
Field of study

Since the end of transistor scaling in 2-D appeared on the horizon, innovative circuit design paradigms have been on the rise to go beyond the well-established and ultraconservative exact computing. Many compute-intensive applications – such as video processing – exhibit an intrinsic error resilience and do not necessarily require perfect accuracy in their numerical operations. Approximate computing (AxC) is emerging as a design alternative to improve the performance and energy-efficiency requirements for many applications by trading its intrinsic error tolerance with algorithm and circuit efficiency. Exact computing also imposes a worst-case timing to the conventional design of hardware accelerators to ensure reliability, leading to an efficiency loss. Conversely, the timing-speculative (TS) hardware design paradigm allows increasing the frequency or decreasing the voltage beyond the limits determined by static timing analysis (STA), thereby narrowing pessimistic safety margins that conventional design methods implement to prevent hardware timing errors. Timing errors should be evaluated by an accurate gate-level simulation, but a significant gap remains: How these timing errors propagate from the underlying hardware all the way up to the entire algorithm behavior, where they just may degrade the performance and quality of service of the application at stake? This thesis tackles this issue by developing and demonstrating a cross-layer framework capable of performing investigations of both AxC (i.e., from approximate arithmetic operators, approximate synthesis, gate-level pruning) and TS hardware design (i.e., from voltage over-scaling, frequency over-clocking, temperature rising, and device aging). The cross-layer framework can simulate both timing errors and logic errors at the gate-level by crossing them dynamically, linking the hardware result with the algorithm-level, and vice versa during the evolution of the application’s runtime. Existing frameworks perform investigations of AxC and TS techniques at circuit-level (i.e., at the output of the accelerator) agnostic to the ultimate impact at the application level (i.e., where the impact is truly manifested), leading to less optimization. Unlike state of the art, the framework proposed offers a holistic approach to assessing the tradeoff of AxC and TS techniques at the application-level. This framework maximizes energy efficiency and performance by identifying the maximum approximation levels at the application level to fulfill the required good enough quality. This thesis evaluates the framework with an 8-way SAD (Sum of Absolute Differences) hardware accelerator operating into an HEVC encoder as a case study. Application-level results showed that the SAD based on the approximate adders achieve savings of up to 45% of energy/operation with an increase of only 1.9% in BD-BR. On the other hand, VOS (Voltage Over-Scaling) applied to the SAD generates savings of up to 16.5% in energy/operation with around 6% of increase in BD-BR. The framework also reveals that the boost of about 6.96% (at 50°) to 17.41% (at 75° with 10- Y aging) in the maximum clock frequency achieved with TS hardware design is totally lost by the processing overhead from 8.06% to 46.96% when choosing an unreliable algorithm to the blocking match algorithm (BMA). We also show that the overhead can be avoided by adopting a reliable BMA. This thesis also shows approximate DTT (Discrete Tchebichef Transform) hardware proposals by exploring a transform matrix approximation, truncation and pruning. The results show that the approximate DTT hardware proposal increases the maximum frequency up to 64%, minimizes the circuit area in up to 43.6%, and saves up to 65.4% in power dissipation. The DTT proposal mapped for FPGA shows an increase of up to 58.9% on the maximum frequency and savings of about 28.7% and 32.2% on slices and dynamic power, respectively compared with stat

Lume 5.8

Microarchitectural Low-Power Design Techniques for Embedded Microprocessors

Author: Constantin Jeremy Hugues-Felix
Publication venue: Lausanne, EPFL
Publication date: 09/11/2016
Field of study

With the omnipresence of embedded processing in all forms of electronics today, there is a strong trend towards wireless, battery-powered, portable embedded systems which have to operate under stringent energy constraints. Consequently, low power consumption and high energy efficiency have emerged as the two key criteria for embedded microprocessor design. In this thesis we present a range of microarchitectural low-power design techniques which enable the increase of performance for embedded microprocessors and/or the reduction of energy consumption, e.g., through voltage scaling. In the context of cryptographic applications, we explore the effectiveness of instruction set extensions (ISEs) for a range of different cryptographic hash functions (SHA-3 candidates) on a 16-bit microcontroller architecture (PIC24). Specifically, we demonstrate the effectiveness of light-weight ISEs based on lookup table integration and microcoded instructions using finite state machines for operand and address generation. On-node processing in autonomous wireless sensor node devices requires deeply embedded cores with extremely low power consumption. To address this need, we present TamaRISC, a custom-designed ISA with a corresponding ultra-low-power microarchitecture implementation. The TamaRISC architecture is employed in conjunction with an ISE and standard cell memories to design a sub-threshold capable processor system targeted at compressed sensing applications. We furthermore employ TamaRISC in a hybrid SIMD/MIMD multi-core architecture targeted at moderate to high processing requirements (> 1 MOPS). A range of different microarchitectural techniques for efficient memory organization are presented. Specifically, we introduce a configurable data memory mapping technique for private and shared access, as well as instruction broadcast together with synchronized code execution based on checkpointing. We then study an inherent suboptimality due to the worst-case design principle in synchronous circuits, and introduce the concept of dynamic timing margins. We show that dynamic timing margins exist in microprocessor circuits, and that these margins are to a large extent state-dependent and that they are correlated to the sequences of instruction types which are executed within the processor pipeline. To perform this analysis we propose a circuit/processor characterization flow and tool called dynamic timing analysis. Moreover, this flow is employed in order to devise a high-level instruction set simulation environment for impact-evaluation of timing errors on application performance. The presented approach improves the state of the art significantly in terms of simulation accuracy through the use of statistical fault injection. The dynamic timing margins in microprocessors are then systematically exploited for throughput improvements or energy reductions via our proposed instruction-based dynamic clock adjustment (DCA) technique. To this end, we introduce a 6-stage 32-bit microprocessor with cycle-by-cycle DCA. Besides a comprehensive design flow and simulation environment for evaluation of the DCA approach, we additionally present a silicon prototype of a DCA-enabled OpenRISC microarchitecture fabricated in 28 nm FD-SOI CMOS. The test chip includes a suitable clock generation unit which allows for cycle-by-cycle DCA over a wide range with fine granularity at frequencies exceeding 1 GHz. Measurement results of speedups and power reductions are provided

Infoscience - École polytechnique fédérale de Lausanne