48 research outputs found

    Low power digital signal processing

    Get PDF

    Can my chip behave like my brain?

    Get PDF
    Many decades ago, Carver Mead established the foundations of neuromorphic systems. Neuromorphic systems are analog circuits that emulate biology. These circuits utilize subthreshold dynamics of CMOS transistors to mimic the behavior of neurons. The objective is to not only simulate the human brain, but also to build useful applications using these bio-inspired circuits for ultra low power speech processing, image processing, and robotics. This can be achieved using reconfigurable hardware, like field programmable analog arrays (FPAAs), which enable configuring different applications on a cross platform system. As digital systems saturate in terms of power efficiency, this alternate approach has the potential to improve computational efficiency by approximately eight orders of magnitude. These systems, which include analog, digital, and neuromorphic elements combine to result in a very powerful reconfigurable processing machine.Ph.D

    YodaNN: An Architecture for Ultra-Low Power Binary-Weight CNN Acceleration

    Get PDF
    Convolutional neural networks (CNNs) have revolutionized the world of computer vision over the last few years, pushing image classification beyond human accuracy. The computational effort of today's CNNs requires power-hungry parallel processors or GP-GPUs. Recent developments in CNN accelerators for system-on-chip integration have reduced energy consumption significantly. Unfortunately, even these highly optimized devices are above the power envelope imposed by mobile and deeply embedded applications and face hard limitations caused by CNN weight I/O and storage. This prevents the adoption of CNNs in future ultra-low power Internet of Things end-nodes for near-sensor analytics. Recent algorithmic and theoretical advancements enable competitive classification accuracy even when limiting CNNs to binary (+1/-1) weights during training. These new findings bring major optimization opportunities in the arithmetic core by removing the need for expensive multiplications, as well as reducing I/O bandwidth and storage. In this work, we present an accelerator optimized for binary-weight CNNs that achieves 1510 GOp/s at 1.2 V on a core area of only 1.33 MGE (Million Gate Equivalent) or 0.19 mm2^2 and with a power dissipation of 895 {\mu}W in UMC 65 nm technology at 0.6 V. Our accelerator significantly outperforms the state-of-the-art in terms of energy and area efficiency achieving 61.2 TOp/s/[email protected] V and 1135 GOp/s/[email protected] V, respectively

    KAVUAKA: a low-power application-specific processor architecture for digital hearing aids

    Get PDF
    The power consumption of digital hearing aids is very restricted due to their small physical size and the available hardware resources for signal processing are limited. However, there is a demand for more processing performance to make future hearing aids more useful and smarter. Future hearing aids should be able to detect, localize, and recognize target speakers in complex acoustic environments to further improve the speech intelligibility of the individual hearing aid user. Computationally intensive algorithms are required for this task. To maintain acceptable battery life, the hearing aid processing architecture must be highly optimized for extremely low-power consumption and high processing performance.The integration of application-specific instruction-set processors (ASIPs) into hearing aids enables a wide range of architectural customizations to meet the stringent power consumption and performance requirements. In this thesis, the application-specific hearing aid processor KAVUAKA is presented, which is customized and optimized with state-of-the-art hearing aid algorithms such as speaker localization, noise reduction, beamforming algorithms, and speech recognition. Specialized and application-specific instructions are designed and added to the baseline instruction set architecture (ISA). Among the major contributions are a multiply-accumulate (MAC) unit for real- and complex-valued numbers, architectures for power reduction during register accesses, co-processors and a low-latency audio interface. With the proposed MAC architecture, the KAVUAKA processor requires 16 % less cycles for the computation of a 128-point fast Fourier transform (FFT) compared to related programmable digital signal processors. The power consumption during register file accesses is decreased by 6 %to 17 % with isolation and by-pass techniques. The hardware-induced audio latency is 34 %lower compared to related audio interfaces for frame size of 64 samples.The final hearing aid system-on-chip (SoC) with four KAVUAKA processor cores and ten co-processors is integrated as an application-specific integrated circuit (ASIC) using a 40 nm low-power technology. The die size is 3.6 mm2. Each of the processors and co-processors contains individual customizations and hardware features with a varying datapath width between 24-bit to 64-bit. The core area of the 64-bit processor configuration is 0.134 mm2. The processors are organized in two clusters that share memory, an audio interface, co-processors and serial interfaces. The average power consumption at a clock speed of 10 MHz is 2.4 mW for SoC and 0.6 mW for the 64-bit processor.Case studies with four reference hearing aid algorithms are used to present and evaluate the proposed hardware architectures and optimizations. The program code for each processor and co-processor is generated and optimized with evolutionary algorithms for operation merging,instruction scheduling and register allocation. The KAVUAKA processor architecture is com-pared to related processor architectures in terms of processing performance, average power consumption, and silicon area requirements

    Neuromorphic audio processing through real-time embedded spiking neural networks.

    Get PDF
    In this work novel speech recognition and audio processing systems based on a spiking artificial cochlea and neural networks are proposed and implemented. First, the biological behavior of the animal’s auditory system is analyzed and studied, along with the classical mechanisms of audio signal processing for sound classification, including Deep Learning techniques. Based on these studies, novel audio processing and automatic audio signal recognition systems are proposed, using a bio-inspired auditory sensor as input. A desktop software tool called NAVIS (Neuromorphic Auditory VIsualizer) for post-processing the information obtained from spiking cochleae was implemented, allowing to analyze these data for further research. Next, using a 4-chip SpiNNaker hardware platform and Spiking Neural Networks, a system is proposed for classifying different time-independent audio signals, making use of a Neuromorphic Auditory Sensor and frequency studies obtained with NAVIS. To prove the robustness and analyze the limitations of the system, the input audios were disturbed, simulating extreme noisy environments. Deep Learning mechanisms, particularly Convolutional Neural Networks, are trained and used to differentiate between healthy persons and pathological patients by detecting murmurs from heart recordings after integrating the spike information from the signals using a neuromorphic auditory sensor. Finally, a similar approach is used to train Spiking Convolutional Neural Networks for speech recognition tasks. A novel SCNN architecture for timedependent signals classification is proposed, using a buffered layer that adapts the information from a real-time input domain to a static domain. The system was deployed on a 48-chip SpiNNaker platform. Finally, the performance and efficiency of these systems were evaluated, obtaining conclusions and proposing improvements for future works.Premio Extraordinario de Doctorado U

    Compiler techniques for scalable performance of stream programs on multicore architectures

    Get PDF
    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2010.Cataloged from PDF version of thesis.Includes bibliographical references (p. 211-222).Given the ubiquity of multicore processors, there is an acute need to enable the development of scalable parallel applications without unduly burdening programmers. Currently, programmers are asked not only to explicitly expose parallelism but also concern themselves with issues of granularity, load-balancing, synchronization, and communication. This thesis demonstrates that when algorithmic parallelism is expressed in the form of a stream program, a compiler can effectively and automatically manage the parallelism. Our compiler assumes responsibility for low-level architectural details, transforming implicit algorithmic parallelism into a mapping that achieves scalable parallel performance for a given multicore target. Stream programming is characterized by regular processing of sequences of data, and it is a natural expression of algorithms in the areas of audio, video, digital signal processing, networking, and encryption. Streaming computation is represented as a graph of independent computation nodes that communicate explicitly over data channels. Our techniques operate on contiguous regions of the stream graph where the input and output rates of the nodes are statically determinable. Within a static region, the compiler first automatically adjusts the granularity and then exploits data, task, and pipeline parallelism in a holistic fashion. We introduce techniques that data-parallelize nodes that operate on overlapping sliding windows of their input, translating serializing state into minimal and parametrized inter-core communication. Finally, for nodes that cannot be data-parallelized due to state, we are the first to automatically apply software-pipelining techniques at a coarse granularity to exploit pipeline parallelism between stateful nodes. Our framework is evaluated in the context of the StreamIt programming language. StreamIt is a high-level stream programming language that has been shown to improve programmer productivity in implementing streaming algorithms. We employ the StreamIt Core benchmark suite of 12 real-world applications to demonstrate the effectiveness of our techniques for varying multicore architectures. For a 16-core distributed memory multicore, we achieve a 14.9x mean speedup. For benchmarks that include sliding-window computation, our sliding-window data-parallelization techniques are required to enable scalable performance for a 16-core SMP multicore (14x mean speedup) and a 64-core distributed shared memory multicore (52x mean speedup).by Michael I. Gordon.Ph.D

    Enhanced applicability of loop transformations

    Get PDF

    Efficient implementation of resource-constrained cyber-physical systems using multi-core parallelism

    Get PDF
    The quest for more performance of applications and systems became more challenging in the recent years. Especially in the cyber-physical and mobile domain, the performance requirements increased significantly. Applications, previously found in the high-performance domain, emerge in the area of resource-constrained domain. Modern heterogeneous high-performance MPSoCs provide a solid foundation to satisfy the high demand. Such systems combine general processors with specialized accelerators ranging from GPUs to machine learning chips. On the other side of the performance spectrum, the demand for small energy efficient systems exposed by modern IoT applications increased vastly. Developing efficient software for such resource-constrained multi-core systems is an error-prone, time-consuming and challenging task. This thesis provides with PA4RES a holistic semiautomatic approach to parallelize and implement applications for such platforms efficiently. Our solution supports the developer to find good trade-offs to tackle the requirements exposed by modern applications and systems. With PICO, we propose a comprehensive approach to express parallelism in sequential applications. PICO detects data dependencies and implements required synchronization automatically. Using a genetic algorithm, PICO optimizes the data synchronization. The evolutionary algorithm considers channel capacity, memory mapping, channel merging and flexibility offered by the channel implementation with respect to execution time, energy consumption and memory footprint. PICO's communication optimization phase was able to generate a speedup almost 2 or an energy improvement of 30% for certain benchmarks. The PAMONO sensor approach enables a fast detection of biological viruses using optical methods. With a sophisticated virus detection software, a real-time virus detection running on stationary computers was achieved. Within this thesis, we were able to derive a soft real-time capable virus detection running on a high-performance embedded system, commonly found in today's smart phones. This was accomplished with smart DSE algorithm which optimizes for execution time, energy consumption and detection quality. Compared to a baseline implementation, our solution achieved a speedup of 4.1 and 87\% energy savings and satisfied the soft real-time requirements. Accepting a degradation of the detection quality, which still is usable in medical context, led to a speedup of 11.1. This work provides the fundamentals for a truly mobile real-time virus detection solution. The growing demand for processing power can no longer satisfied following well-known approaches like higher frequencies. These so-called performance walls expose a serious challenge for the growing performance demand. Approximate computing is a promising approach to overcome or at least shift the performance walls by accepting a degradation in the output quality to gain improvements in other objectives. Especially for a safe integration of approximation into existing application or during the development of new approximation techniques, a method to assess the impact on the output quality is essential. With QCAPES, we provide a multi-metric assessment framework to analyze the impact of approximation. Furthermore, QCAPES provides useful insights on the impact of approximation on execution time and energy consumption. With ApproxPICO we propose an extension to PICO to consider approximate computing during the parallelization of sequential applications

    ACOTES project: Advanced compiler technologies for embedded streaming

    Get PDF
    Streaming applications are built of data-driven, computational components, consuming and producing unbounded data streams. Streaming oriented systems have become dominant in a wide range of domains, including embedded applications and DSPs. However, programming efficiently for streaming architectures is a challenging task, having to carefully partition the computation and map it to processes in a way that best matches the underlying streaming architecture, taking into account the distributed resources (memory, processing, real-time requirements) and communication overheads (processing and delay). These challenges have led to a number of suggested solutions, whose goal is to improve the programmer’s productivity in developing applications that process massive streams of data on programmable, parallel embedded architectures. StreamIt is one such example. Another more recent approach is that developed by the ACOTES project (Advanced Compiler Technologies for Embedded Streaming). The ACOTES approach for streaming applications consists of compiler-assisted mapping of streaming tasks to highly parallel systems in order to maximize cost-effectiveness, both in terms of energy and in terms of design effort. The analysis and transformation techniques automate large parts of the partitioning and mapping process, based on the properties of the application domain, on the quantitative information about the target systems, and on programmer directives. This paper presents the outcomes of the ACOTES project, a 3-year collaborative work of industrial (NXP, ST, IBM, Silicon Hive, NOKIA) and academic (UPC, INRIA, MINES ParisTech) partners, and advocates the use of Advanced Compiler Technologies that we developed to support Embedded Streaming.Peer ReviewedPostprint (published version

    Design and Implementation of Complexity Reduced Digital Signal Processors for Low Power Biomedical Applications

    Get PDF
    Wearable health monitoring systems can provide remote care with supervised, inde-pendent living which are capable of signal sensing, acquisition, local processing and transmission. A generic biopotential signal (such as Electrocardiogram (ECG), and Electroencephalogram (EEG)) processing platform consists of four main functional components. The signals acquired by the electrodes are amplified and preconditioned by the (1) Analog-Front-End (AFE) which are then digitized via the (2) Analog-to-Digital Converter (ADC) for further processing. The local digital signal processing is usually handled by a custom designed (3) Digital Signal Processor (DSP) which is responsible for either anyone or combination of signal processing algorithms such as noise detection, noise/artefact removal, feature extraction, classification and compres-sion. The digitally processed data is then transmitted via the (4) transmitter which is renown as the most power hungry block in the complete platform. All the afore-mentioned components of the wearable systems are required to be designed and fitted into an integrated system where the area and the power requirements are stringent. Therefore, hardware complexity and power dissipation of each functional component are crucial aspects while designing and implementing a wearable monitoring platform. The work undertaken focuses on reducing the hardware complexity of a biosignal DSP and presents low hardware complexity solutions that can be employed in the aforemen-tioned wearable platforms. A typical state-of-the-art system utilizes Sigma Delta (Σ∆) ADCs incorporating a Σ∆ modulator and a decimation filter whereas the state-of-the-art decimation filters employ linear phase Finite-Impulse-Response (FIR) filters with high orders that in-crease the hardware complexity [1–5]. In this thesis, the novel use of minimum phase Infinite-Impulse-Response (IIR) decimators is proposed where the hardware complexity is massively reduced compared to the conventional FIR decimators. In addition, the non-linear phase effects of these filters are also investigated since phase non-linearity may distort the time domain representation of the signal being filtered which is un-desirable effect for biopotential signals especially when the fiducial characteristics carry diagnostic importance. In the case of ECG monitoring systems the effect of the IIR filter phase non-linearity is minimal which does not affect the diagnostic accuracy of the signals. The work undertaken also proposes two methods for reducing the hardware complexity of the popular biosignal processing tool, Discrete Wavelet Transform (DWT). General purpose multipliers are known to be hardware and power hungry in terms of the number of addition operations or their underlying building blocks like full adders or half adders required. Higher number of adders leads to an increase in the power consumption which is directly proportional to the clock frequency, supply voltage, switching activity and the resources utilized. A typical Field-Programmable-Gate-Array’s (FPGA) resources are Look-up Tables (LUTs) whereas a custom Digital Signal Processor’s (DSP) are gate-level cells of standard cell libraries that are used to build adders [6]. One of the proposed methods is the replacement of the hardware and power hungry general pur-pose multipliers and the coefficient memories with reconfigurable multiplier blocks that are composed of simple shift-add networks and multiplexers. This method substantially reduces the resource utilization as well as the power consumption of the system. The second proposed method is the design and implementation of the DWT filter banks using IIR filters which employ less number of arithmetic operations compared to the state-of-the-art FIR wavelets. This reduces the hardware complexity of the analysis filter bank of the DWT and can be employed in applications where the reconstruction is not required. However, the synthesis filter bank for the IIR wavelet transform has a higher computational complexity compared to the conventional FIR wavelet synthesis filter banks since re-indexing of the filtered data sequence is required that can only be achieved via the use of extra registers. Therefore, this led to the proposal of a novel design which replaces the complex IIR based synthesis filter banks with FIR fil-ters which are the approximations of the associated IIR filters. Finally, a comparative study is presented where the hybrid IIR/FIR and FIR/FIR wavelet filter banks are de-ployed in a typical noise reduction scenario using the wavelet thresholding techniques. It is concluded that the proposed hybrid IIR/FIR wavelet filter banks provide better denoising performance, reduced computational complexity and power consumption in comparison to their IIR/IIR and FIR/FIR counterparts
    corecore