45 research outputs found

    Simultaneous floating-point sine and cosine for VLIW integer processors

    Accepted for publication in the proceedings of the 23rd IEEE International Conference on Application-specific Systems, Architectures and Processors (ASAP 2012). Graphics and signal processing applications often require that sines and cosines be evaluated at the same floating-point argument, and in such cases a very fast computation of the pair of values is desirable. This paper studies how 32-bit VLIW integer architectures can be exploited to perform this task accurately for IEEE single precision. We describe software implementations of sinf, cosf, and sincosf over [-pi/4,pi/4] that have a proven 1-ulp accuracy and whose latency on STMicroelectronics' ST231 VLIW integer processor is 19, 18, and 19 cycles, respectively. This performance is obtained by introducing a novel algorithm for simultaneous sine and cosine that combines univariate and bivariate polynomial evaluation schemes.
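    As a rough illustration of why computing the pair together can be cheaper than two separate calls, the sketch below evaluates both functions over [-pi/4, pi/4] from a single shared x*x. It is not the paper's integer-only algorithm: it uses ordinary floating-point arithmetic, plain Taylor coefficients, and Horner's rule rather than the univariate/bivariate schemes the abstract mentions. The names `sincos_small`, `horner`, `SIN_COEFFS`, and `COS_COEFFS` are hypothetical.

```python
import math

# Taylor coefficients in powers of x*x (an assumption for illustration;
# a tuned minimax polynomial would normally be preferred).
SIN_COEFFS = [1.0, -1.0 / 6, 1.0 / 120, -1.0 / 5040, 1.0 / 362880]
COS_COEFFS = [1.0, -1.0 / 2, 1.0 / 24, -1.0 / 720, 1.0 / 40320]


def horner(coeffs, t):
    """Evaluate coeffs[0] + coeffs[1]*t + coeffs[2]*t**2 + ... by Horner's rule."""
    acc = 0.0
    for c in reversed(coeffs):
        acc = acc * t + c
    return acc


def sincos_small(x):
    """Return (sin x, cos x) for |x| <= pi/4, sharing one computation of x*x."""
    if abs(x) > math.pi / 4:
        raise ValueError("argument must lie in [-pi/4, pi/4]")
    t = x * x                      # shared by both polynomials
    s = x * horner(SIN_COEFFS, t)  # odd polynomial: x * P(x^2)
    c = horner(COS_COEFFS, t)      # even polynomial: Q(x^2)
    return s, c
```

    With these truncation orders the error stays below about 3e-8 on the interval, which is in the neighbourhood of single-precision resolution but carries no proven ulp bound.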

    Non-generic floating-point software support for embedded media processing

    This paper presents work in progress on the design and implementation of efficient floating-point software support for embedded integer processors. We provide quantitative evidence of the benefits of supporting various non-generic (that is, specialized, fused, or simultaneous) operations in addition to the five basic arithmetic operations: for individual calls, speedups range from 1.12 to 4.86, while on DSP kernels and benchmarks, our approach allows us to be up to 1.34x faster.

    Digital signal processor fundamentals and system design

    Digital Signal Processors (DSPs) have been used in accelerator systems for more than fifteen years and have largely contributed to the evolution towards digital technology of many accelerator systems, such as machine protection, diagnostics and control of beams, power supplies and motors. This paper aims at familiarising the reader with DSP fundamentals, namely DSP characteristics and processing development. Several DSP examples are given, in particular on Texas Instruments DSPs, as they are used in the DSP laboratory companion of the lectures this paper is based upon. The typical system design flow is described; common difficulties, problems and choices faced by DSP developers are outlined; and hints are given on the best solution.

    Reviewing GPU architectures to build efficient back projection for parallel geometries

    Back-Projection is the major algorithm in Computed Tomography to reconstruct images from a set of recorded projections. It is used for both fast analytical methods and high-quality iterative techniques. X-ray imaging facilities rely on Back-Projection to reconstruct internal structures in material samples and living organisms with high spatial and temporal resolution. Fast image reconstruction is also essential to track and control processes under study in real time. In this article, we present efficient implementations of the Back-Projection algorithm for parallel hardware. We survey a range of parallel architectures presented by the major hardware vendors during the last 10 years. Similarities and differences between these architectures are analyzed, and we highlight how specific features can be used to enhance the reconstruction performance. In particular, we build a performance model to find hardware hotspots and propose several optimizations to balance the load between texture engine, computational and special function units, as well as different types of memory, maximizing the utilization of all GPU subsystems in parallel. We further show that targeting architecture-specific features allows one to boost the performance 2–7 times compared to the current state-of-the-art algorithms used in standard reconstruction codes. The suggested load-balancing approach is not limited to Back-Projection but can be used as a general optimization strategy for implementing parallel algorithms.
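    For readers unfamiliar with the operation being optimized, the following is a minimal CPU reference sketch of unfiltered back-projection for parallel-beam geometry. It is not the article's GPU implementation; the function name `back_project`, the detector layout (centred on the rotation axis, one bin per pixel width), and the use of linear interpolation are assumptions made for illustration. The interpolation step is the kind of work a GPU texture engine can take over.

```python
import math


def back_project(sinogram, angles, size):
    """Unfiltered parallel-beam back-projection onto a size x size grid.

    sinogram[a][d] is the value recorded at angle angles[a] (radians) by
    detector bin d; the detector is assumed centred on the rotation axis.
    """
    n_det = len(sinogram[0])
    centre = (size - 1) / 2.0
    det_centre = (n_det - 1) / 2.0
    image = [[0.0] * size for _ in range(size)]
    for proj, theta in zip(sinogram, angles):
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        for row in range(size):
            y = row - centre
            for col in range(size):
                x = col - centre
                # Detector coordinate hit by the ray through pixel (x, y).
                pos = x * cos_t + y * sin_t + det_centre
                lo = math.floor(pos)
                frac = pos - lo
                # Linear interpolation between neighbouring detector bins.
                if 0 <= lo < n_det - 1:
                    image[row][col] += (1 - frac) * proj[lo] + frac * proj[lo + 1]
                elif lo == n_det - 1 and frac == 0.0:
                    image[row][col] += proj[lo]
    return image
```

    Every pixel performs one interpolated read per projection angle, so the cost grows with size * size * len(angles); this access pattern is what makes the algorithm a natural target for parallel hardware.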

    KAVUAKA: a low-power application-specific processor architecture for digital hearing aids

    The power consumption of digital hearing aids is very restricted due to their small physical size, and the available hardware resources for signal processing are limited. However, there is a demand for more processing performance to make future hearing aids more useful and smarter. Future hearing aids should be able to detect, localize, and recognize target speakers in complex acoustic environments to further improve the speech intelligibility of the individual hearing aid user. Computationally intensive algorithms are required for this task. To maintain acceptable battery life, the hearing aid processing architecture must be highly optimized for extremely low power consumption and high processing performance. The integration of application-specific instruction-set processors (ASIPs) into hearing aids enables a wide range of architectural customizations to meet the stringent power consumption and performance requirements. In this thesis, the application-specific hearing aid processor KAVUAKA is presented, which is customized and optimized with state-of-the-art hearing aid algorithms such as speaker localization, noise reduction, beamforming algorithms, and speech recognition. Specialized and application-specific instructions are designed and added to the baseline instruction set architecture (ISA). Among the major contributions are a multiply-accumulate (MAC) unit for real- and complex-valued numbers, architectures for power reduction during register accesses, co-processors and a low-latency audio interface. With the proposed MAC architecture, the KAVUAKA processor requires 16% fewer cycles for the computation of a 128-point fast Fourier transform (FFT) compared to related programmable digital signal processors. The power consumption during register file accesses is decreased by 6% to 17% with isolation and by-pass techniques. The hardware-induced audio latency is 34% lower compared to related audio interfaces for a frame size of 64 samples. The final hearing aid system-on-chip (SoC) with four KAVUAKA processor cores and ten co-processors is integrated as an application-specific integrated circuit (ASIC) using a 40 nm low-power technology. The die size is 3.6 mm². Each of the processors and co-processors contains individual customizations and hardware features, with a datapath width varying between 24 and 64 bits. The core area of the 64-bit processor configuration is 0.134 mm². The processors are organized in two clusters that share memory, an audio interface, co-processors and serial interfaces. The average power consumption at a clock speed of 10 MHz is 2.4 mW for the SoC and 0.6 mW for the 64-bit processor. Case studies with four reference hearing aid algorithms are used to present and evaluate the proposed hardware architectures and optimizations. The program code for each processor and co-processor is generated and optimized with evolutionary algorithms for operation merging, instruction scheduling and register allocation. The KAVUAKA processor architecture is compared to related processor architectures in terms of processing performance, average power consumption, and silicon area requirements.

    Methods for Optimizing OpenCL Applications on Heterogeneous Multicore Architectures


    Neural network computing using on-chip accelerators

    The use of neural networks, machine learning, or artificial intelligence, in its broadest and most controversial sense, has been a tumultuous journey involving three distinct hype cycles and a history dating back to the 1960s. Resurgent, enthusiastic interest in machine learning and its applications bolsters the case for machine learning as a fundamental computational kernel. Furthermore, researchers have demonstrated that machine learning can be utilized as an auxiliary component of applications to enhance or enable new types of computation such as approximate computing or automatic parallelization. In our view, machine learning becomes not the underlying application, but a ubiquitous component of applications. This view necessitates a different approach towards the deployment of machine learning computation that spans not only hardware design of accelerator architectures, but also user and supervisor software to enable the safe, simultaneous use of machine learning accelerator resources. In this dissertation, we propose a multi-transaction model of neural network computation to meet the needs of future machine learning applications. We demonstrate that this model, encompassing a backend accelerator for inference and learning that is decoupled from the hardware and software for managing neural network transactions, can be achieved with low overhead and integrated with a modern RISC-V microprocessor. Our extensions span user and supervisor software and data structures and, coupled with our hardware, enable multiple transactions from different address spaces to execute simultaneously, yet safely. Together, our system demonstrates the utility of a multi-transaction model to improve energy efficiency and overall accelerator throughput for machine learning applications.

    Soft-core dataflow processor architecture optimized for radar signal processing

    Current radar signal processors (RSPs) lack either performance or flexibility. Custom soft-core processors exhibit potential in high-performance signal processing applications, yet remain relatively unexplored in research literature. In this paper, we use an iterative design methodology to propose a novel soft-core streaming processor architecture. The datapaths of this architecture are arranged in a circular pattern, with multiple operands simultaneously flowing between switching multiplexers and functional units each cycle. By explicitly specifying instruction-level parallelism and software pipelining, applications can fully exploit the available computational resources. The proposed architecture exceeds the clock cycle performance of a commercial high-end digital signal processor (DSP) by an average factor of 14 over a range of typical operating parameters in an RSP application. http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=6928471

    Efficient software development for microprocessor based embedded system.

    Tang Tze Yeung Eric. Thesis submitted in: July 2003. Thesis (M.Phil.)--Chinese University of Hong Kong, 2004. Includes bibliographical references (leaves 69-75). Abstracts in English and Chinese.
    ABSTRACT
    ACKNOWLEDGMENT
    Chapter 1 --- INTRODUCTION
        1.1 --- Embedded System
        1.2 --- Embedded Processor
        1.3 --- Embedded System Design
            1.3.1 --- Current Embedded System Design Challenges
            1.3.2 --- Embedded System Design Trend
        1.4 --- Efficient Software Development for Microprocessor
            1.4.1 --- Efficient Software Development Methodology
        1.5 --- Thesis Organization
    Chapter 2 --- SOURCE CODE OPTIMIZATION
        2.1 --- Source Code Optimization Strategy
        2.2 --- Source Code Transformations
            2.2.1 --- Strength Reduction
            2.2.2 --- Function Inlining
            2.2.3 --- Table Lookup
            2.2.4 --- Loop Transformations
            2.2.5 --- Software Pipelining
            2.2.6 --- Register Allocation
        2.3 --- Case Study: Source Code Optimization on the StrongARM (SA1110) Platform
            2.3.1 --- StrongARM architecture
            2.3.2 --- StrongARM pipeline hazard illustration
            2.3.3 --- Source Code Optimization on StrongARM
            2.3.4 --- Instruction Set Optimization of StrongARM
        2.4 --- Conclusion
    Chapter 3 --- FLOAT-TO-FIXED OPTIMIZATION
        3.1 --- Introduction to Fixed-point
            3.1.1 --- Fixed-point representation
            3.1.2 --- Fixed-point implementation
            3.1.3 --- Mathematical functions implementation
        3.2 --- Case Study: Fingerprint Minutiae Extraction Algorithms on the Strong ARM platform
            3.2.1 --- Fingerprint Verification Overview
            3.2.2 --- Fixed-point Implementation of Fingerprint Minutiae Extraction Algorithm
            3.2.3 --- Experimental Results
        3.3 --- Conclusion
    Chapter 4 --- DOMAIN SPECIFIC OPTIMIZATION
        4.1 --- Case Study: Font Rasterization on the Strong ARM platform
            4.1.1 --- Outline Font
            4.1.2 --- Font Rasterization
            4.1.3 --- Experiments
        4.2 --- Conclusion
    Chapter 5 --- CONCLUSION
    BIBLIOGRAPHY

    A FPGA/DSP design for real-time fracture detection using low transient pulse

    This work presents the hardware and software architecture for the detection of fractures and edges in materials. While the detection method is based on the novel concept of Low Transient Pulse (LTP), the overall system implementation is based on two digital microelectronics technologies widely used for signal processing: Digital Signal Processor (DSP) and Field Programmable Gate Array (FPGA). Under the proposed architecture, the DSP carries out the analysis of the received baseband signal at a lower rate and hence can be used for a large number of signal channels. The FPGA's master clock runs at a higher frequency (62.5 MHz) to generate the LTP signal and to demodulate the passband ultrasonic signals sampled at 1 MHz, which interrupts the DSP every 1 µs. This research elaborates on designing a Quadrature Amplitude Modulator-demodulator (QAM) on the FPGA for the signal received from the ultrasound, and edge detection on the DSP to detect the presence of edges/fractures on a test Sawbone plate. In this work, the LTP technology is applied to determine the location of the Sawbone plate edges based on the signals reflected to the receivers. This signal is then passed through a QAM to obtain the maxima (peaks) of the received signal, whose parameters are studied in the DSP. This work successfully demonstrates the feasibility of a modular programming approach across the two platforms. The dual time scale platform readily accommodates the higher temporal resolution needed for the generation of Low Transient Pulses and the processing of real-time baseband signals on the DSP for various test conditions.
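    The demodulation step described above, recovering a baseband envelope from a passband signal so that peaks can be located, can be sketched in software as follows. This is a generic I/Q (quadrature) demodulator, not the FPGA design from the work itself. The function name `quadrature_envelope` and the carrier frequency used in the usage note are hypothetical; only the 1 MHz sampling rate comes from the abstract. The sketch assumes the sampling rate is an integer multiple of the carrier frequency.

```python
import math


def quadrature_envelope(samples, carrier_hz, sample_hz):
    """Estimate the envelope of a passband signal by I/Q demodulation.

    Assumes sample_hz / carrier_hz is an integer so that a moving average
    over one carrier cycle cancels the double-frequency mixing product.
    """
    period = round(sample_hz / carrier_hz)  # samples per carrier cycle
    # Mix with in-phase (cosine) and quadrature (negative sine) references.
    i_mix = [s * math.cos(2 * math.pi * carrier_hz * n / sample_hz)
             for n, s in enumerate(samples)]
    q_mix = [-s * math.sin(2 * math.pi * carrier_hz * n / sample_hz)
             for n, s in enumerate(samples)]
    envelope = []
    for start in range(len(samples) - period + 1):
        # Moving average acts as a simple low-pass filter.
        i_avg = sum(i_mix[start:start + period]) / period
        q_avg = sum(q_mix[start:start + period]) / period
        # Mixing halves the amplitude, so scale by 2 to recover it.
        envelope.append(2.0 * math.hypot(i_avg, q_avg))
    return envelope
```

    For a steady tone, for example `0.5 * cos(2*pi*1e5*n/1e6 + 0.3)` sampled at 1 MHz with an assumed 100 kHz carrier, the returned envelope is 0.5 at every position regardless of the phase offset; echoes from an edge would appear as local maxima in the same output.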