37 research outputs found

    Low power digital signal processing

    Get PDF

    Towards adaptive balanced computing (ABC) using reconfigurable functional caches (RFCs)

    Get PDF
    The general-purpose computing processor performs a wide range of functions. Although the performance of general-purpose processors has been steadily increasing, certain software technologies like multimedia and digital signal processing applications demand ever more computing power. Reconfigurable computing has emerged to combine the versatility of general-purpose processors with the customization ability of ASICs. The basic premise of reconfigurability is to provide better performance and higher computing density than fixed configuration processors. Most of the research in reconfigurable computing is dedicated to on-chip functional logic. If computing resources are adaptable to the computing requirement, the maximum performance can be achieved. To overcome the gap between processor and memory technology, the size of on-chip cache memory has been consistently increasing. The larger cache memory capacity, though beneficial in general, does not guarantee a higher performance for all the applications as they may not utilize all of the cache efficiently. To utilize on-chip resources effectively and to accelerate the performance of multimedia applications specifically, we propose a new architecture---Adaptive Balanced Computing (ABC). ABC uses dynamic resource configuration of on-chip cache memory by integrating Reconfigurable Functional Caches (RFC). RFC can work as a conventional cache or as a specialized computing unit when necessary. In order to convert a cache memory to a computing unit, we include additional logic to embed multi-bit output LUTs into the cache structure. We add the reconfigurability of cache memory to a conventional processor with minimal modification to the load/store microarchitecture and with minimal compiler assistance. ABC architecture utilizes resources more efficiently by reconfiguring the cache memory to computing units dynamically. The area penalty for this reconfiguration is about 50--60% of the memory cell cache array-only area with faster cache access time. In a base array cache (parallel decoding caches), the area penalty is 10--20% of the data array with 1--2% increase in the cache access time. However, we save 27% for FIR and 44% for DCT/IDCT in area with respect to memory cell array cache and about 80% for both applications with respect to base array cache if we were to implement all these units separately (such as ASICs). The simulations with multimedia and DSP applications (DCT/IDCT and FIR/IIR) show that the resource configuration with the RFC speedups ranging from 1.04X to 3.94X in overall applications and from 2.61X to 27.4X in the core computations. The simulations with various parameters indicate that the impact of reconfiguration can be minimized if an appropriate cache organization is selected

    Embedded electronic systems driven by run-time reconfigurable hardware

    Get PDF
    Abstract This doctoral thesis addresses the design of embedded electronic systems based on run-time reconfigurable hardware technology –available through SRAM-based FPGA/SoC devices– aimed at contributing to enhance the life quality of the human beings. This work does research on the conception of the system architecture and the reconfiguration engine that provides to the FPGA the capability of dynamic partial reconfiguration in order to synthesize, by means of hardware/software co-design, a given application partitioned in processing tasks which are multiplexed in time and space, optimizing thus its physical implementation –silicon area, processing time, complexity, flexibility, functional density, cost and power consumption– in comparison with other alternatives based on static hardware (MCU, DSP, GPU, ASSP, ASIC, etc.). The design flow of such technology is evaluated through the prototyping of several engineering applications (control systems, mathematical coprocessors, complex image processors, etc.), showing a high enough level of maturity for its exploitation in the industry.Resumen Esta tesis doctoral abarca el diseño de sistemas electrónicos embebidos basados en tecnología hardware dinámicamente reconfigurable –disponible a través de dispositivos lógicos programables SRAM FPGA/SoC– que contribuyan a la mejora de la calidad de vida de la sociedad. Se investiga la arquitectura del sistema y del motor de reconfiguración que proporcione a la FPGA la capacidad de reconfiguración dinámica parcial de sus recursos programables, con objeto de sintetizar, mediante codiseño hardware/software, una determinada aplicación particionada en tareas multiplexadas en tiempo y en espacio, optimizando así su implementación física –área de silicio, tiempo de procesado, complejidad, flexibilidad, densidad funcional, coste y potencia disipada– comparada con otras alternativas basadas en hardware estático (MCU, DSP, GPU, ASSP, ASIC, etc.). Se evalúa el flujo de diseño de dicha tecnología a través del prototipado de varias aplicaciones de ingeniería (sistemas de control, coprocesadores aritméticos, procesadores de imagen, etc.), evidenciando un nivel de madurez viable ya para su explotación en la industria.Resum Aquesta tesi doctoral està orientada al disseny de sistemes electrònics empotrats basats en tecnologia hardware dinàmicament reconfigurable –disponible mitjançant dispositius lògics programables SRAM FPGA/SoC– que contribueixin a la millora de la qualitat de vida de la societat. S’investiga l’arquitectura del sistema i del motor de reconfiguració que proporcioni a la FPGA la capacitat de reconfiguració dinàmica parcial dels seus recursos programables, amb l’objectiu de sintetitzar, mitjançant codisseny hardware/software, una determinada aplicació particionada en tasques multiplexades en temps i en espai, optimizant així la seva implementació física –àrea de silici, temps de processat, complexitat, flexibilitat, densitat funcional, cost i potència dissipada– comparada amb altres alternatives basades en hardware estàtic (MCU, DSP, GPU, ASSP, ASIC, etc.). S’evalúa el fluxe de disseny d’aquesta tecnologia a través del prototipat de varies aplicacions d’enginyeria (sistemes de control, coprocessadors aritmètics, processadors d’imatge, etc.), demostrant un nivell de maduresa viable ja per a la seva explotació a la indústria

    KAVUAKA: a low-power application-specific processor architecture for digital hearing aids

    Get PDF
    The power consumption of digital hearing aids is very restricted due to their small physical size and the available hardware resources for signal processing are limited. However, there is a demand for more processing performance to make future hearing aids more useful and smarter. Future hearing aids should be able to detect, localize, and recognize target speakers in complex acoustic environments to further improve the speech intelligibility of the individual hearing aid user. Computationally intensive algorithms are required for this task. To maintain acceptable battery life, the hearing aid processing architecture must be highly optimized for extremely low-power consumption and high processing performance.The integration of application-specific instruction-set processors (ASIPs) into hearing aids enables a wide range of architectural customizations to meet the stringent power consumption and performance requirements. In this thesis, the application-specific hearing aid processor KAVUAKA is presented, which is customized and optimized with state-of-the-art hearing aid algorithms such as speaker localization, noise reduction, beamforming algorithms, and speech recognition. Specialized and application-specific instructions are designed and added to the baseline instruction set architecture (ISA). Among the major contributions are a multiply-accumulate (MAC) unit for real- and complex-valued numbers, architectures for power reduction during register accesses, co-processors and a low-latency audio interface. With the proposed MAC architecture, the KAVUAKA processor requires 16 % less cycles for the computation of a 128-point fast Fourier transform (FFT) compared to related programmable digital signal processors. The power consumption during register file accesses is decreased by 6 %to 17 % with isolation and by-pass techniques. The hardware-induced audio latency is 34 %lower compared to related audio interfaces for frame size of 64 samples.The final hearing aid system-on-chip (SoC) with four KAVUAKA processor cores and ten co-processors is integrated as an application-specific integrated circuit (ASIC) using a 40 nm low-power technology. The die size is 3.6 mm2. Each of the processors and co-processors contains individual customizations and hardware features with a varying datapath width between 24-bit to 64-bit. The core area of the 64-bit processor configuration is 0.134 mm2. The processors are organized in two clusters that share memory, an audio interface, co-processors and serial interfaces. The average power consumption at a clock speed of 10 MHz is 2.4 mW for SoC and 0.6 mW for the 64-bit processor.Case studies with four reference hearing aid algorithms are used to present and evaluate the proposed hardware architectures and optimizations. The program code for each processor and co-processor is generated and optimized with evolutionary algorithms for operation merging,instruction scheduling and register allocation. The KAVUAKA processor architecture is com-pared to related processor architectures in terms of processing performance, average power consumption, and silicon area requirements

    Vector processor virtualization: distributed memory hierarchy and simultaneous multithreading

    Get PDF
    Taking advantage of DLP (Data-Level Parallelism) is indispensable in most data streaming and multimedia applications. Several architectures have been proposed to improve both the performance and energy consumption for such applications. Superscalar and VLIW (Very Long Instruction Word) processors, along with SIMD (Single-Instruction Multiple-Data) and vector processor (VP) accelerators, are among the available options for designers to accomplish their desired requirements. On the other hand, these choices turn out to be large resource and energy consumers, while also not being always used efficiently due to data dependencies among instructions and limited portion of vectorizable code in single applications that deploy them. This dissertation proposes an innovative architecture for a multithreaded VP which separates the path for performing data shuffle and memory-indexed accesses from the data path for executing other vector instructions that access the memory. This separation speeds up the most common memory access operations by avoiding extra delays and unnecessary stalls. In this multilane-based VP design, each vector lane uses its own private memory to avoid any stalls during memory access instructions. More importantly, the proposed VP has an innovative multithreaded architecture which makes it highly suitable for concurrent sharing in multicore environments. To this end, the VP which is developed in VHDL and prototyped on an FPGA (Field-Programmable Gate Array), serves as a coprocessor for one or more scalar cores in various system architectures presented in the dissertation. In the first system architecture, the VP is allocated exclusively to a single scalar core. Benchmarking shows that the VP can achieve very high performance. The inclusion of distributed data shuffle engines across vector lanes has a spectacular impact on the execution time, primarily for applications like FFT (Fast-Fourier Transform) that require large amounts of data shuffling. In the second system architecture, a VP virtualization technique is presented which, when applied, enables the multithreaded VP to simultaneously execute many threads of various vector lengths. The threads compete simultaneously for the VP resources having as a goal an improved aggregate VP utilization. This approach yields high VP utilization even under low utilization for the individual threads. A vector register file (VRF) virtualization technique dynamically allocates physical vector registers to running threads. The technique is implemented for a multi-core processor embedded in an FPGA. Under the dynamic creation of threads, benchmarking demonstrates large VP speedups and drastic energy savings when compared to the first system architecture. In the last system architecture, further improvements focus on VP virtualization relying exclusively on hardware. Moreover, a pipelined data shuffle network replaces the non-pipelined shuffle engines. The VP can then take advantage of identical instruction flows that may be present in different vector applications by running in a fused instruction mode that increases its utilization. A power dissipation model is introduced as well as two optimization policies towards minimizing the consumed energy, or the product of the energy and runtime for a given application. Benchmarking shows the positive impact of these optimizations

    FPGA implementations for parallel multidimensional filtering algorithms

    Get PDF
    PhD ThesisOne and multi dimensional raw data collections introduce noise and artifacts, which need to be recovered from degradations by an automated filtering system before, further machine analysis. The need for automating wide-ranged filtering applications necessitates the design of generic filtering architectures, together with the development of multidimensional and extensive convolution operators. Consequently, the aim of this thesis is to investigate the problem of automated construction of a generic parallel filtering system. Serving this goal, performance-efficient FPGA implementation architectures are developed to realize parallel one/multi-dimensional filtering algorithms. The proposed generic architectures provide a mechanism for fast FPGA prototyping of high performance computations to obtain efficiently implemented performance indices of area, speed, dynamic power, throughput and computation rates, as a complete package. These parallel filtering algorithms and their automated generic architectures tackle the major bottlenecks and limitations of existing multiprocessor systems in wordlength, input data segmentation, boundary conditions as well as inter-processor communications, in order to support high data throughput real-time applications of low-power architectures using a Xilinx Virtex-6 FPGA board. For one-dimensional raw signal filtering case, mathematical model and architectural development of the generalized parallel 1-D filtering algorithms are presented using the 1-D block filtering method. Five generic architectures are implemented on a Virtex-6 ML605 board, evaluated and compared. A complete set of results on area, speed, power, throughput and computation rates are obtained and discussed as performance indices for the 1-D convolution architectures. A successful application of parallel 1-D cross-correlation is demonstrated. For two dimensional greyscale/colour image processing cases, new parallel 2-D/3-D filtering algorithms are presented and mathematically modelled using input decimation and output image reconstruction by interpolation. Ten generic architectures are implemented on the Virtex-6 ML605 board, evaluated and compared. Key results on area, speed, power, throughput and computation rate are obtained and discussed as performance indices for the 2-D convolution architectures. 2-D image reconfigurable processors are developed and implemented using single, dual and quad MAC FIR units. 3-D Colour image processors are devised to act as 3-D colour filtering engines. A 2-D cross-correlator parallel engine is successfully developed as a parallel 2-D matched filtering algorithm for locating any MRI slice within a MRI data stack library. Twelve 3-D MRI filtering operators are plugged in and adapted to be suitable for biomedical imaging, including 3-D edge operators and 3-D noise smoothing operators. Since three dimensional greyscale/colour volumetric image applications are computationally intensive, a new parallel 3-D/4-D filtering algorithm is presented and mathematically modelled using volumetric data image segmentation by decimation and output reconstruction by interpolation, after simultaneously and independently performing 3-D filtering. Eight generic architectures are developed and implemented on the Virtex-6 board, including 3-D spatial and FFT convolution architectures. Fourteen 3-D MRI filtering operators are plugged and adapted for this particular biomedical imaging application, including 3-D edge operators and 3-D noise smoothing operators. Three successful applications are presented in 4-D colour MRI (fMRI) filtering processors, k-space MRI volume data filter and 3-D cross-correlator.IRAQI Government

    Dynamically reconfigurable architecture for embedded computer vision systems

    Get PDF
    The objective of this research work is to design, develop and implement a new architecture which integrates on the same chip all the processing levels of a complete Computer Vision system, so that the execution is efficient without compromising the power consumption while keeping a reduced cost. For this purpose, an analysis and classification of different mathematical operations and algorithms commonly used in Computer Vision are carried out, as well as a in-depth review of the image processing capabilities of current-generation hardware devices. This permits to determine the requirements and the key aspects for an efficient architecture. A representative set of algorithms is employed as benchmark to evaluate the proposed architecture, which is implemented on an FPGA-based system-on-chip. Finally, the prototype is compared to other related approaches in order to determine its advantages and weaknesses
    corecore