
    Lower-order compensation chain threshold-reduction technique for multi-stage voltage multipliers

    This paper presents a novel threshold-compensation technique for multi-stage voltage multipliers employed in low-power applications such as passive and autonomous wireless sensing nodes (WSNs) powered by energy harvesters. The proposed threshold-reduction technique enables a topological design methodology which, through optimum control of the trade-off between transistor conductivity and leakage losses, aims to maximize the voltage conversion efficiency (VCE) for a given ac input signal and chip area. The simulations confirm the validity of the proposed design methodology and highlight the exploitable design space yielded by the transistor connection scheme in the voltage multiplier chain. An experimental validation and comparison of threshold-compensation techniques was performed, using 2N5247 N-channel junction field-effect transistors (JFETs) to build the voltage multiplier prototypes. The measurements clearly support the effectiveness of the proposed threshold-reduction approach, which can significantly reduce the chip area for a given target output performance and ac input signal.

    Novel Front-end Electronics for Time Projection Chamber Detectors

    This work was carried out at the European Organization for Nuclear Research (CERN) and is part of the European research project for future linear accelerators (EUDET). Particle physics uses several categories of particle detectors. The design presented here focuses on one particular type of particle tracking detector, the TPC (Time Projection Chamber), which provides a three-dimensional image of the electrically charged particles that cross its gas volume. The thesis includes a study of the goals for future detectors, summarizing the parameters that a data acquisition system must meet in those cases. These requirements are then compared with the readout systems currently used in different TPC detectors, and it is concluded that none of them meets the restrictive conditions. Some of the main goals for future TPC detectors are a very high level of integration, a larger number of channels, faster electronics, and very low power. The main drawback of the previous state-of-the-art systems is the use of several integrated circuits in the acquisition chain, which makes it impossible to reach the very high integration level required for future detectors. Moreover, increasing the number of channels and the sampling frequency would raise the power consumption to unacceptable values and, consequently, increase the cooling required (where that is possible at all). One of the novelties presented is the integration of the whole acquisition chain (analog input filters, analog-to-digital converter (ADC), and digital signal processing) in a single integrated circuit in 130 nm technology. This chip is the first to achieve such a high level of integration for TPC detectors. In addition, a detailed analysis of the signal processing filters is presented.
The most important objective is the reduction…
García García, EJ. (2012). Novel Front-end Electronics for Time Projection Chamber Detectors [Unpublished doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/16980

    Techniques of Energy-Efficient VLSI Chip Design for High-Performance Computing

    How to deliver high-quality computing within a limited power budget is the key factor in moving very large scale integration (VLSI) chip design forward. This work introduces various low-power VLSI design techniques used for state-of-the-art computing. From the viewpoint of power supply, conventional in-chip voltage regulators based on analog blocks impose a large overhead in both power and area on computational chips. Motivated by this, a digital switchable-pin method that dynamically regulates power at low circuit cost is proposed so that computation runs on a stable voltage supply. For the multiplier, one of the most widely used and time-consuming arithmetic units, operation in the logarithmic domain offers advantages over the binary domain in computation latency, power, and area. However, the conversion error that it introduces reduces the reliability of the subsequent computation (e.g., multiplication and division). In this work, a fast calibration method that suppresses the conversion error is proposed together with its VLSI implementation. The proposed logarithmic converter can be supplied by dc power to achieve fast conversion, or by clocked power to reduce the power dissipated during conversion. Moving beyond traditional computation methods and widely used static logic, a neuron-like cell is also studied in this work. Using logic based on multiple-input floating-gate (MIFG) metal-oxide-semiconductor field-effect transistors (MOSFETs), a 32-bit, 16-operation arithmetic logic unit (ALU) with zipped decoding and a feedback loop is designed. Compared to an ALU based on static logic, the proposed ALU reduces the switching power and has a strong driven-in capability due to its coupling capacitors. In addition, recent neural computations pose serious challenges to digital VLSI implementation because of the heavy load of matrix multiplications and non-linear functions.
An analog VLSI design that is compatible with an external digital environment is proposed for a long short-term memory (LSTM) network. The entirely analog network computes much faster and has higher energy efficiency than the digital one.
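
The conversion error of a logarithmic-domain multiplier can be illustrated with a small software model. The sketch below uses Mitchell's classic linear approximation log2(1 + x) ≈ x, which is an assumption chosen for illustration: the abstract does not state which converter or calibration method the work uses, and all function names here are hypothetical.

```python
def mitchell_log2(n: int) -> float:
    """Approximate log2(n) for n >= 1 as k + x, where n = 2**k * (1 + x)."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    k = n.bit_length() - 1          # position of the leading one
    x = (n - (1 << k)) / (1 << k)   # fractional part, 0 <= x < 1
    return k + x                    # linear approximation: log2(1 + x) ~ x


def mitchell_antilog2(v: float) -> float:
    """Approximate 2**v with the inverse linear rule 2**(k + x) ~ 2**k * (1 + x)."""
    k = int(v)                      # integer part (v is non-negative here)
    x = v - k
    return (1 << k) * (1 + x)


def log_multiply(a: int, b: int) -> float:
    """Multiply by adding approximate logarithms, then converting back."""
    return mitchell_antilog2(mitchell_log2(a) + mitchell_log2(b))


def worst_relative_error(limit: int) -> float:
    """Largest relative error of log_multiply over 1 <= a, b < limit."""
    worst = 0.0
    for a in range(1, limit):
        for b in range(1, limit):
            exact = a * b
            worst = max(worst, abs(log_multiply(a, b) - exact) / exact)
    return worst
```

Powers of two are multiplied exactly, while `log_multiply(3, 3)` returns 8.0 instead of 9, a relative error of 1/9. Errors of this kind are what a calibration step would need to suppress before the result feeds later computations.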

    근사 컴퓨팅을 이용한 회로 노화 보상과 에너지 효율적인 신경망 구현

    Thesis (Ph.D.) -- Seoul National University Graduate School: College of Engineering, Department of Electrical and Computer Engineering, August 2020. 이혁재. Approximate computing reduces the cost (energy and/or latency) of computations by relaxing their correctness (i.e., precision) up to a level that depends on the type of application. Moreover, it can be realized at various levels of computing system design, from the circuit level to the application level. This dissertation presents methodologies that apply approximate computing across these levels: compensating aging-induced delay in logic circuits by dynamic computation approximation (Chapter 1), designing an energy-efficient neural network by combining low-power and low-latency approximate neuron models (Chapter 2), and co-designing an in-memory gradient descent module with a neural processing unit to address the memory bottleneck incurred by memory I/O for high-precision data (Chapter 3). The first chapter of this dissertation presents a novel design methodology that turns the timing violation caused by aging into computation approximation error without a reliability guardband or an increase in supply voltage. It can be realized by accurately monitoring the critical path delay at run-time. The proposal is evaluated at two levels: the RTL component level and the system level. The experimental results at the RTL component level show a significant improvement in the (normalized) mean squared error caused by the timing violation. At the system level, they show that the proposed approach successfully transforms the aging-induced timing violation errors into much less harmful computation approximation errors, and therefore recovers image quality up to perceptually acceptable levels. It reduces the dynamic and static power consumption by 21.45% and 10.78%, respectively, with 0.8% area overhead compared to the conventional approach.
The second chapter of this dissertation presents an energy-efficient neural network consisting of alternative neuron models; Stochastic-Computing (SC) and Spiking (SP) neuron models. SC has been adopted in various fields to improve the power efficiency of systems by performing arithmetic computations stochastically, which approximates binary computation in conventional computing systems. Moreover, a recent work showed that deep neural network (DNN) can be implemented in the manner of stochastic computing and it greatly reduces power consumption. However, Stochastic DNN (SC-DNN) suffers from problem of high latency as it processes only a bit per cycle. To address such problem, it is proposed to adopt Spiking DNN (SP-DNN) as an input interface for SC-DNN since SP effectively processes more bits per cycle than SC-DNN. Moreover, this chapter resolves the encoding mismatch problem, between two different neuron models, without hardware cost by compensating the encoding mismatch with synapse weight calibration. A resultant hybrid DNN (SPSC-DNN) consists of SP-DNN as bottom layers and SC-DNN as top layers. Exploiting the reduced latency from SP-DNN and low-power consumption from SC-DNN, the proposed SPSC-DNN achieves improved energy-efficiency with lower error-rate compared to SC-DNN and SP-DNN in same network configuration. The third chapter of this dissertation proposes GradPim architecture, which accelerates the parameter updates by in-memory processing which is codesigned with 8-bit floating-point training in Neural Processing Unit (NPU) for deep neural networks. By keeping the high precision processing algorithms in memory, such as the parameter update incorporating high-precision weights in its computation, the GradPim architecture can achieve high computational efficiency using 8-bit floating point in NPU and also gain power efficiency by eliminating massive high-precision data transfers between NPU and off-chip memory. 
A simple extension of DDR4 SDRAM utilizing bank-group parallelism makes the operation designs in the processing-in-memory (PIM) module efficient in terms of hardware cost and performance. The experimental results show that the proposed architecture can improve the performance of the parameter update phase in training by up to 40% and greatly reduce the memory bandwidth requirement while adding only a minimal amount of overhead to the protocol and the DRAM area.
Approximate computing reduces the cost of computation (energy or latency) by tolerating a loss of computational accuracy up to a level appropriate for each application. Moreover, it can be applied at various layers of computing system design, from the circuit layer to the application layer. This dissertation proposes methods that apply approximate computing at several layers of system design to obtain gains in power and energy: a method that uses computation approximation to compensate for the delay increase caused by circuit aging without additional power consumption (Chapter 1), a method that builds an energy-efficient neural network from approximate neuron models (Chapter 2), and a method that alleviates the memory-bandwidth bottleneck by performing computations on high-precision data inside the memory (Chapter 3). The first chapter proposes a design methodology that compensates for timing violations caused by circuit aging through computation approximation error, without a reliability guardband or an increase in supply power. This requires accurately measuring the critical path delay at run time. The proposed methodology was evaluated at the RTL component level and at the system level. The RTL component-level results show that the proposed approach considerably reduces the normalized mean squared error. At the system level, an image processing system recovers image quality to a perceptually sufficient degree, confirming that the timing-violation errors caused by circuit aging are converted into computation errors of smaller magnitude. In conclusion, the proposed methodology reduces dynamic power consumption by 21.45% and static power consumption by 10.78% at the cost of 0.8% additional area. The second chapter proposes a highly energy-efficient neural network that uses approximate neuron models. The two approximate neuron models used are based on stochastic computing and spiking neuron theory. Stochastic computing performs arithmetic operations stochastically and thereby carries out binary computation at low power. Recent work has shown that a deep neural network can be implemented with stochastic-computing neuron models. However, when stochastic computing is used for neuron modeling, the deep neural network processes only one bit per clock cycle, so latency is inevitably very poor.
To solve this problem, this dissertation combines a spiking deep neural network built from spiking neuron models with the stochastic-computing deep neural network structure. Because a spiking neuron model can process several bits per clock cycle, using it as the input interface of the deep neural network reduces latency. However, the stochastic-computing and spiking neuron models use different encoding schemes. The dissertation resolves this encoding mismatch by accounting for it when training the model parameters, so that the parameter values are calibrated with the mismatch taken into account. As a result of this analysis, a hybrid neural network is proposed that places the spiking deep neural network at the front and the stochastic-computing deep neural network at the back. By exploiting both the latency reduction that comes from the larger number of bits processed per clock cycle in the spiking network and the low power consumption of the stochastic-computing network, the hybrid network achieves better energy efficiency than either network used alone, with similar or better accuracy. The third chapter proposes the GradPIM architecture, which uses in-memory processing to accelerate the parameter update of a neural processing unit that trains deep neural networks with 8-bit floating-point arithmetic. GradPIM leaves the low-precision 8-bit computation in the neural processing unit and places the computation that uses high-precision data (the parameter update) inside the memory, which reduces the amount of data transferred between the neural processing unit and the memory and achieves high computational and power efficiency. GradPIM also achieves bank-group-level parallelism and exploits the high internal bandwidth, which greatly extends the memory bandwidth. Because the changes to the memory structure are minimal, the additional hardware cost is minimal as well.
The experimental results show that GradPIM improves the time required for parameter updates during deep neural network training by 40%, with minimal changes to the DRAM protocol and minimal area use within the DRAM chip.
Table of contents:
Chapter I: Dynamic Computation Approximation for Aging Compensation
1.1 Introduction: 1.1.1 Chip Reliability; 1.1.2 Reliability Guardband; 1.1.3 Approximate Computing in Logic Circuits; 1.1.4 Computation approximation for Aging Compensation; 1.1.5 Motivational Case Study
1.2 Previous Work: 1.2.1 Aging-induced Delay; 1.2.2 Delay-Configurable Circuits
1.3 Proposed System: 1.3.1 Overview of the Proposed System; 1.3.2 Proposed Adder; 1.3.3 Proposed Multiplier; 1.3.4 Proposed Monitoring Circuit; 1.3.5 Aging Compensation Scheme
1.4 Design Methodology
1.5 Evaluation: 1.5.1 Experimental setup; 1.5.2 RTL component level Adder/Multiplier; 1.5.3 RTL component level Monitoring circuit; 1.5.4 System level
1.6 Summary
Chapter II: Energy-Efficient Neural Network by Combining Approximate Neuron Models
2.1 Introduction: 2.1.1 Deep Neural Network (DNN); 2.1.2 Low-power designs for DNN; 2.1.3 Stochastic-Computing Deep Neural Network; 2.1.4 Spiking Deep Neural Network
2.2 Hybrid of Stochastic and Spiking DNNs: 2.2.1 Stochastic-Computing vs Spiking Deep Neural Network; 2.2.2 Combining Spiking Layers and Stochastic Layers; 2.2.3 Encoding Mismatch
2.3 Evaluation: 2.3.1 Latency and Test Error; 2.3.2 Energy Efficiency
2.4 Summary
Chapter III: GradPIM: In-memory Gradient Descent in Mixed-Precision DNN Training
3.1 Introduction: 3.1.1 Neural Processing Unit; 3.1.2 Mixed-precision Training; 3.1.3 Mixed-precision Training with In-memory Gradient Descent; 3.1.4 DNN Parameter Update Algorithms; 3.1.5 Modern DRAM Architecture; 3.1.6 Motivation
3.2 Previous Work: 3.2.1 Processing-In-Memory; 3.2.2 Co-design Neural Processing Unit and Processing-In-Memory; 3.2.3 Low-precision Computation in NPU
3.3 GradPIM: 3.3.1 GradPIM Architecture; 3.3.2 GradPIM Operations; 3.3.3 Timing Considerations; 3.3.4 Update Phase Procedure; 3.3.5 Commanding GradPIM
3.4 NPU Co-design with GradPIM: 3.4.1 NPU Architecture; 3.4.2 Data Placement
3.5 Evaluation: 3.5.1 Evaluation Methodology; 3.5.2 Experimental Results; 3.5.3 Sensitivity Analysis; 3.5.4 Layer Characterizations; 3.5.5 Distributed Data Parallelism
3.6 Summary: 3.6.1 Discussion
Bibliography
Abstract in Korean (요약)
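
The latency problem of stochastic computing described in this abstract (one bit processed per cycle) can be seen in a minimal software model. The sketch below assumes the textbook unipolar encoding, in which a value in [0, 1] is the probability of a 1 in a bitstream and multiplication reduces to a bitwise AND of two independent streams. It is not the neuron model of the dissertation, and the function names are hypothetical.

```python
import random


def encode(p: float, length: int, rng: random.Random) -> list[int]:
    """Unipolar stochastic stream: each bit is 1 with probability p."""
    return [1 if rng.random() < p else 0 for _ in range(length)]


def decode(stream: list[int]) -> float:
    """Estimate the encoded value as the fraction of ones."""
    return sum(stream) / len(stream)


def sc_multiply(a: float, b: float, length: int, seed: int = 0) -> float:
    """Multiply two values in [0, 1] with one AND operation per cycle."""
    rng = random.Random(seed)
    stream_a = encode(a, length, rng)
    stream_b = encode(b, length, rng)   # drawn independently of stream_a
    product = [x & y for x, y in zip(stream_a, stream_b)]
    return decode(product)
```

The stream length plays the role of the cycle count: a longer stream gives a more accurate estimate of the product but takes proportionally more cycles, which is the latency cost that motivates pairing stochastic layers with a faster input interface.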

    Low-voltage CMOS log-companding techniques for audio applications

    This paper presents a collection of novel current-mode circuit techniques for the integration of very low-voltage (down to 1 V), low-power (a few hundred μA) complete SoCs in CMOS technologies. The new design proposal is based on both log-companding theory and the MOSFET operating in subthreshold. Several basic building blocks for audio amplification, AGC and arbitrary filtering are given. The feasibility of the proposed CMOS circuits is illustrated through experimental data for different design case studies in 1.2 and 0.35 μm VLSI technologies.
Comisión Interministerial de Ciencia y Tecnología TIC97-1159, TIC99-1084; European Union ESPRIT-FUSE-2306

    Electron multiplier supply and signal processing: working towards a system on chip solution

    Despite the primitive operation and challenging practicalities of electron multipliers, they still outperform solid state equivalents in professional level equipment that requires single electron or photon resolution. The advent of the Micro Electronic and Mechanical (MEMs) fabrication process has the potential to miniaturise electron multipliers to allow mass production, reduce physical volume, and minimise part to part variation. The potential impact of MEMs is greatly reduced if secondary electronics associated with such devices cannot be reduced by a similar magnitude. The primary purpose of this research project was to develop the secondary electronics (power supply, divider and decoupling) to enable electron multiplier-based detectors to rival solid-state counterparts in terms of size and power consumption for use in a device the size of a mobile phone. To be comparable with solid state alternatives a System in Package (SiP) specification was targeted, with all specialised circuitry occupying the same package as the detector. To realise the reduction in size required, a number of practical limitations were identified and addressed, including standard capacitor values, behaviour under DC bias and dark discharge across PCBs. These were characterized through hardware measurement, fed into theoretical models and finally electronic assemblies were then designed around these. This bottom-up methodology was shown to have performance advantages when optimising proven topologies under restrictive design limitations. To demonstrate the size and power reduction available to new detectors, two existing topologies were optimized and evaluated using this bottom-up method. A third new topology was synthesised to better overcome identified shortcomings at a conceptual level. Performance of all three designs is reported. This proof of concept project was based around a scintillation detector employing a photo multiplier tube. 
However, it is equally applicable to any discrete dynode or microchannel plate electron multiplier, such as high gain pixelated imaging systems. Devices were tested in a spectroscopic scintillation radiation detection system to evaluate performance deficiencies introduced by reduction of both size and power consumption. As MEMS manufactured devices are still in an early stage of development, this work did not attempt to demonstrate any overall comparison against solid state equivalents’ performance but demonstrated that the secondary electronics would not be the limiting factor in terms of cost or performance in the application to MEMs manufactured electron multipliers. The project delivered three prototypes that performed against the specification, with limitations highlighted, and a brief for a SoC solution was constructed.

    Improving the Hardware Performance of Arithmetic Circuits using Approximate Computing

    An application that can produce a useful result despite some level of computational error is said to be error resilient. Approximate computing can be applied to error resilient applications by intentionally introducing error to the computation in order to improve performance, and it has been shown that approximation is especially well-suited for application in arithmetic computing hardware. In this thesis, novel approximate arithmetic architectures are proposed for three different operations, namely multiplication, division, and the multiply accumulate (MAC) operation. For all designs, accuracy is evaluated in terms of mean relative error distance (MRED) and normalized mean error distance (NMED), while hardware performance is reported in terms of critical path delay, area, and power consumption. Three approximate Booth multipliers (ABM-M1, ABM-M2, ABM-M3) are designed in which two novel inexact partial product generators are used to reduce the dimensions of the partial product matrix. The proposed multipliers are compared to other state-of-the-art designs in terms of both accuracy and hardware performance, and are found to reduce power consumption by up to 56% when compared to the exact multiplier. The function of the multipliers is verified in several image processing applications. Two approximate restoring dividers (AXRD-M1, AXRD-M2) are proposed along with a novel inexact restoring divider cell. In the first divider, the conventional cells are replaced with the proposed inexact cells in several columns. The second divider computes only a subset of the trial subtractions, after which the divisor and partial remainder are rounded and encoded so that they may be used to estimate the remaining quotient bits. The proposed dividers are evaluated for accuracy and hardware performance alongside several benchmarking designs, and their function is verified using change detection and foreground extraction applications. 
An approximate MAC unit is presented in which the multiplication is implemented using a modified version of ABM-M3. The delay is reduced by using a fused architecture where the accumulator is summed as part of the multiplier compression. The accuracy and hardware savings of the MAC unit are measured against several works from the literature, and the design is utilized in a number of convolution operations.
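
The two accuracy metrics named in this abstract can be computed exhaustively for small operand widths. The sketch below assumes the definitions commonly used in the approximate-arithmetic literature: MRED is the mean of |approx − exact| / exact over inputs with a nonzero exact result, and NMED is the mean of |approx − exact| divided by the largest exact output. The truncating multiplier is a hypothetical stand-in for illustration, not one of the ABM designs.

```python
from itertools import product


def truncated_multiply(a: int, b: int, dropped_bits: int = 2) -> int:
    """Hypothetical approximate multiplier: clear the low bits of each operand."""
    mask = ~((1 << dropped_bits) - 1)
    return (a & mask) * (b & mask)


def error_metrics(approx_mul, width: int = 8) -> tuple[float, float]:
    """Return (MRED, NMED) over all unsigned operand pairs of the given width."""
    max_operand = (1 << width) - 1
    max_output = max_operand * max_operand
    total_ed = 0.0     # sum of error distances |approx - exact|
    total_red = 0.0    # sum of relative error distances
    nonzero = 0        # pairs with a nonzero exact product
    count = 0
    for a, b in product(range(max_operand + 1), repeat=2):
        exact = a * b
        ed = abs(approx_mul(a, b) - exact)
        total_ed += ed
        count += 1
        if exact != 0:
            total_red += ed / exact
            nonzero += 1
    mred = total_red / nonzero
    nmed = (total_ed / count) / max_output
    return mred, nmed
```

Calling `error_metrics(truncated_multiply, width=8)` sweeps all 65,536 operand pairs. MRED weights errors on small products heavily, while NMED normalizes by the full output range, so the two metrics can rank designs differently.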

    Harnessing resilience: biased voltage overscaling for probabilistic signal processing

    A central component of modern computing is the idea that computation requires determinism. Contrary to this belief, the primary contribution of this work shows that useful computation can be accomplished in an error-prone fashion. Focusing on low-power computing and the increasing push toward energy conservation, the work seeks to sacrifice accuracy in exchange for energy savings. Probabilistic computing forms the basis for this error-prone computation by diverging from the requirement of determinism and allowing for randomness within computing. Implemented as probabilistic CMOS (PCMOS), the approach realizes enormous energy savings in applications that require probability at an algorithmic level. Extending probabilistic computing to applications that are inherently deterministic, the biased voltage overscaling (BIVOS) technique presented here constrains the randomness introduced through PCMOS. Doing so, BIVOS is able to limit the magnitude of any resulting deviations and realizes energy savings with minimal impact to application quality. Implemented for a ripple-carry adder, array multiplier, and finite-impulse-response (FIR) filter, a BIVOS solution substantially reduces energy consumption and does so with improved error rates compared to an energy-equivalent reduced-precision solution. When applied to H.264 video decoding, a BIVOS solution is able to achieve a 33.9% reduction in energy consumption while maintaining a peak signal-to-noise ratio of 35.0 dB (compared to 14.3 dB for a comparable reduced-precision solution). While the work presented here focuses on a specific technology, the technique realized through BIVOS has far broader implications. It is the departure from the conventional mindset that useful computation requires determinism that represents the primary innovation of this work.
With applicability to emerging and yet to be discovered technologies, BIVOS has the potential to contribute to computing in a variety of fashions.
PhD. Committee Chair: Anderson, David; Committee Member: Conte, Thomas; Committee Member: Ferri, Bonnie; Committee Member: Hasler, Paul; Committee Member: Mooney, Vincen

    Trading off Energy versus Accuracy in Modern Computing Systems: From Digital Circuit Design to Programming Techniques

    The slowdown of Moore's law, which has been the driving force of the electronics industry over the last five decades, is causing serious problems for the improvement of integrated circuits (ICs). Technology scaling is becoming more and more complex, and fabrication costs are growing exponentially. Furthermore, the energy gains associated with technology scaling are slowing down. Meanwhile, the expected boom of Internet of Things (IoT) devices requires ultra-low-power ICs that can operate for several years without any user intervention, and energy-efficient computing systems on the server side to process all the gathered data. Approximate computing has emerged as an alternative way to improve the energy efficiency of both high-performance and low-power computing systems by tolerating small and occasional errors. This energy-accuracy tradeoff can be applied to a wide range of over-engineered applications, particularly those involving human senses such as video and image processing. This thesis first presents an approximate circuit design technique called Gate-Level Pruning (GLP), which consists of selectively removing logic gates from any conventional circuit in order to reduce energy consumption, critical path delay, and silicon area. A Computer Aided Design (CAD) tool has been developed, integrated in the standard digital flow, and evaluated on several arithmetic circuits, achieving up to 78% energy-delay-area savings. It is then shown how this methodology can be applied to a more complex system made of multiple arithmetic blocks and memory: the Discrete Cosine Transform (DCT), which is a key building block for image and video processing applications. Then, the speculative adder technique is presented. It consists of cutting carry chains to significantly relax the circuit timing constraints, and therefore drastically reduce energy consumption, area and delay.
It is shown that this technique leads to errors of a different nature than those produced by gate-level pruning. It is therefore worth combining GLP and speculative adders to obtain even higher savings. This has been verified on IEEE-754 floating-point units integrated in a 65 nm process within a low-power multi-core processor. Silicon measurements show up to 27% power, 36% area and 53% power-area savings. The second part of this thesis introduces software techniques to achieve similar energy-accuracy tradeoffs on commercially available processors. By switching from the double-precision to the single-precision floating-point data type and by exploiting the vectorization capabilities of modern processors, a factor of 2 in energy can be saved on a Newton method for solving nonlinear equations. To further investigate the origins of these savings, an energy model based on Energy Per Instruction (EPI) has been built. It turns out that less than 6% of the total energy is consumed by arithmetic operations and that savings are achieved mainly by reducing the amount of data transferred between registers, cache and main memory. One way to reduce those power-hungry data movements is to use application-specific hardware accelerators. Unfortunately, a commercial processor cannot embed accelerators for every possible application. To that end, hardware accelerators are implemented on a Field Programmable Gate Array (FPGA) interconnected with a general-purpose processor to further reduce the energy consumption.
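
The idea of cutting carry chains can be modeled in a few lines. The sketch below is a simplified illustration under stated assumptions, not the thesis's adder architecture: operands are split into fixed-size blocks, and the carry into each block is guessed from the previous block alone, as if that block had received no carry itself. The function names, block sizes, and the choice to drop the final carry-out are all hypothetical.

```python
def speculative_add(a: int, b: int, width: int = 16, block: int = 4) -> int:
    """Add two unsigned integers with the carry chain cut at block boundaries.

    The carry into each block is guessed from the previous block alone,
    assuming that block received no carry itself. The result is taken
    modulo 2**width (the final carry-out is dropped).
    """
    mask = (1 << block) - 1
    result = 0
    speculated_carry = 0
    for shift in range(0, width, block):
        a_blk = (a >> shift) & mask
        b_blk = (b >> shift) & mask
        total = a_blk + b_blk + speculated_carry
        result |= (total & mask) << shift
        # Guess for the next block: carry generated by this block in isolation.
        speculated_carry = (a_blk + b_blk) >> block
    return result


def error_rate(width: int = 8, block: int = 2) -> float:
    """Fraction of operand pairs whose speculative sum differs from the exact sum."""
    modulus = 1 << width
    wrong = sum(
        speculative_add(a, b, width, block) != (a + b) % modulus
        for a in range(modulus)
        for b in range(modulus)
    )
    return wrong / (modulus * modulus)
```

For example, `speculative_add(0x0FF, 0x001, width=12, block=4)` returns 0x000 rather than 0x100, because the carry generated in the lowest block would have to ripple across two block boundaries. Errors of this kind are rare but large, which differs from the small, frequent errors one would expect from removing low-significance gates.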