
    KAVUAKA: a low-power application-specific processor architecture for digital hearing aids

    The power consumption of digital hearing aids is severely restricted by their small physical size, and the hardware resources available for signal processing are limited. However, there is a demand for more processing performance to make future hearing aids more useful and smarter. Future hearing aids should be able to detect, localize, and recognize target speakers in complex acoustic environments to further improve the speech intelligibility of the individual hearing aid user. Computationally intensive algorithms are required for this task. To maintain acceptable battery life, the hearing aid processing architecture must be highly optimized for extremely low power consumption and high processing performance.

    The integration of application-specific instruction-set processors (ASIPs) into hearing aids enables a wide range of architectural customizations to meet the stringent power consumption and performance requirements. In this thesis, the application-specific hearing aid processor KAVUAKA is presented, which is customized and optimized for state-of-the-art hearing aid algorithms such as speaker localization, noise reduction, beamforming, and speech recognition. Specialized and application-specific instructions are designed and added to the baseline instruction set architecture (ISA). Among the major contributions are a multiply-accumulate (MAC) unit for real- and complex-valued numbers, architectures for power reduction during register accesses, co-processors, and a low-latency audio interface. With the proposed MAC architecture, the KAVUAKA processor requires 16% fewer cycles to compute a 128-point fast Fourier transform (FFT) than related programmable digital signal processors. The power consumption during register file accesses is decreased by 6% to 17% with isolation and bypass techniques. The hardware-induced audio latency is 34% lower than that of related audio interfaces for a frame size of 64 samples.

    The final hearing aid system-on-chip (SoC) with four KAVUAKA processor cores and ten co-processors is integrated as an application-specific integrated circuit (ASIC) in a 40 nm low-power technology. The die size is 3.6 mm². Each of the processors and co-processors contains individual customizations and hardware features, with datapath widths varying from 24-bit to 64-bit. The core area of the 64-bit processor configuration is 0.134 mm². The processors are organized in two clusters that share memory, an audio interface, co-processors, and serial interfaces. The average power consumption at a clock speed of 10 MHz is 2.4 mW for the SoC and 0.6 mW for the 64-bit processor.

    Case studies with four reference hearing aid algorithms are used to present and evaluate the proposed hardware architectures and optimizations. The program code for each processor and co-processor is generated and optimized with evolutionary algorithms for operation merging, instruction scheduling, and register allocation. The KAVUAKA processor architecture is compared to related processor architectures in terms of processing performance, average power consumption, and silicon area requirements.
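
    As an illustration of what a complex-valued MAC instruction fuses into a single operation, here is a minimal C sketch, assuming generic fixed-point types (the actual instruction encoding and datapath of KAVUAKA are not reproduced here). In an FFT butterfly, each twiddle-factor multiplication and accumulation takes exactly this form, which is why fusing it shortens the 128-point FFT.

        #include <stdint.h>

        /* Complex multiply-accumulate: acc += a * b on fixed-point values.
         * Without a fused instruction this costs four real multiplies and
         * four additions/subtractions per sample. */
        typedef struct { int32_t re; int32_t im; } cplx_t;

        static inline void cmac(cplx_t *acc, cplx_t a, cplx_t b) {
            acc->re += a.re * b.re - a.im * b.im;
            acc->im += a.re * b.im + a.im * b.re;
        }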

    Energy efficient hardware acceleration of multimedia processing tools

    The world of mobile devices is experiencing an ongoing trend of feature enhancement and general-purpose multimedia platform convergence. This trend poses many grand challenges, the most pressing being their limited battery life as a consequence of delivering computationally demanding features. The envisaged mobile application features can be considered to be accelerated by a set of underpinning hardware blocks. Based on the survey that this thesis presents on modern video compression standards and their associated enabling technologies, it is concluded that tight energy and throughput constraints can still be effectively tackled at the algorithmic level in order to design re-usable optimised hardware acceleration cores. To prove these conclusions, the work in this thesis is focused on two of the basic enabling technologies that support mobile video applications, namely the Shape Adaptive Discrete Cosine Transform (SA-DCT) and its inverse, the SA-IDCT. The hardware architectures presented in this work have been designed with energy efficiency in mind. This goal is achieved by employing high-level techniques such as redundant computation elimination, parallelism, and low-switching computation structures. Both architectures compare favourably against the relevant prior art in the literature. The SA-DCT/IDCT technologies are instances of a more general computation: both are Constant Matrix Multiplication (CMM) operations. Thus, this thesis also proposes an algorithm for the efficient hardware design of any general CMM-based enabling technology. The proposed algorithm leverages the effective solution search capability of genetic programming. A bonus feature of the proposed modelling approach is that it is further amenable to hardware acceleration. Another bonus feature is an early exit mechanism that achieves large search space reductions. Results show an improvement on state-of-the-art algorithms, with future potential for even greater savings.
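
    The CMM observation can be made concrete with a small sketch: because one operand matrix is constant, every constant multiplication can be decomposed offline into shifts and additions, the multiplierless structure that CMM hardware designs typically search over. The constants below are illustrative only, not taken from the SA-DCT.

        #include <stdint.h>

        /* Multiplierless constant multiplication: 23*x = 16x + 4x + 2x + x.
         * CMM hardware replaces each multiplier with such a shift-add
         * network, sharing common subexpressions across the matrix rows. */
        static inline int32_t mul_by_23(int32_t x) {
            return (x << 4) + (x << 2) + (x << 1) + x;
        }

        /* One output of a constant-matrix row [23 5] applied to (x0, x1). */
        static inline int32_t row_dot(int32_t x0, int32_t x1) {
            return mul_by_23(x0) + (x1 << 2) + x1;   /* 5*x1 = 4*x1 + x1 */
        }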

    Computational structures for application specific VLSI processors


    Numerical aerodynamic simulation facility preliminary study, volume 1

    A technology forecast was established for the 1980-1985 time frame, and the appropriateness of various logic and memory technologies for the design of the numerical aerodynamic simulation facility was assessed. Flow models and their characteristics were analyzed and matched against candidate processor architectures. Metrics were established for the total facility, and the housing and support requirements of the facility were identified. An overview of the system is presented, with emphasis on the hardware of the Navier-Stokes solver, which is the key element of the system. Software elements of the system are also discussed.

    A Solder-Defined Computer Architecture for Backdoor and Malware Resistance

    This research is about securing control of those devices we most depend on for integrity and confidentiality. An emerging concern is that complex integrated circuits may be subject to exploitable defects or backdoors, and measures for inspection and audit of these chips are neither supported nor scalable. One approach for providing a “supply chain firewall” may be to forgo such components, and instead to build central processing units (CPUs) and other complex logic from simple, generic parts. This work investigates the capability and speed ceiling when open-source hardware methodologies are fused with maker-scale assembly tools and visible-scale final inspection. The author has designed, and demonstrated in simulation, a 36-bit CPU and protected memory subsystem that use only synchronous static random access memory (SRAM) and trivial glue logic integrated circuits as components. The design presently lacks preemptive multitasking, the ability to load firmware into the SRAMs used as logic elements, and input/output. Strategies are presented for adding these missing subsystems, again using only SRAM and trivial glue logic. A load-store architecture is employed with four clock cycles per instruction. Simulations indicate that a clock speed of at least 64 MHz is probable, corresponding to 16 million instructions per second (16 MIPS), despite the architecture containing no microprocessors, field programmable gate arrays, programmable logic devices, application specific integrated circuits, or other purchased complex logic. The lower speed, larger size, higher power consumption, and higher cost of an “SRAM minicomputer,” compared to traditional microcontrollers, may be offset by the fully open architecture—hardware and firmware—along with more rigorous user control, reliability, transparency, and auditability of the system. SRAM logic is also particularly well suited for building arithmetic logic units, and can implement complex operations such as population count, a hash function for associative arrays, or a pseudorandom number generator with good statistical properties in as few as eight clock cycles per 36-bit word processed. 36-bit unsigned multiplication can be implemented in software in 47 instructions or fewer (188 clock cycles). A general theory is developed for fast SRAM parallel multipliers should they be needed.
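
    As an illustration of the SRAM-as-logic idea, the following C sketch is a software analogue of table-based population count: precompute the answer for every 8-bit value, then sum a few table reads per 36-bit word. The table width and loop structure are illustrative assumptions; the thesis's actual SRAM organization and eight-cycle schedule are not reproduced here.

        #include <stdint.h>

        static uint8_t popcnt8[256];            /* stands in for an SRAM table */

        static void init_popcnt8(void) {
            for (int i = 0; i < 256; i++)
                popcnt8[i] = (uint8_t)(popcnt8[i >> 1] + (i & 1));
        }

        static int popcount36(uint64_t w) {     /* w holds one 36-bit word */
            w &= (1ULL << 36) - 1;
            int n = 0;
            for (int shift = 0; shift < 36; shift += 8)
                n += popcnt8[(w >> shift) & 0xFF];   /* five table lookups */
            return n;
        }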

    Fault Tolerant Integer Data Computations: Algorithms and Applications

    As computing units move to higher transistor integration densities and computing clusters become highly heterogeneous, studies begin to predict that, rather than being exceptions, data corruptions in memory and processor failures are likely to become more prevalent. It has therefore become imperative to improve the reliability of systems in the face of increasing soft error probabilities in the memory and computing logic units of silicon CMOS integrated chips. This thesis introduces a new class of algorithms for fault tolerance in compute-intensive linear and sesquilinear (“one-and-a-half-linear”) data computations on integer data inputs within high-performance computing systems. The key difference between the proposed algorithms and existing fault tolerance methods is the elimination of the traditional requirement for additional hardware resources for system reliability. The first contribution of this thesis is in the detection of hardware-induced errors in integer matrix products. The proposed method of numerical packing for detecting a single error within a quadruple of matrix outputs is described in Chapter 2. The chapter includes analytic calculations of the proposed method’s computational complexity and reliability. Experimental results show that the proposed algorithm incurs execution time overhead comparable to existing algorithms for the detection and correction of a limited number of errors within generic matrix multiplication (GEMM) outputs. On the other hand, numerical packing becomes substantially more efficient in the mitigation of multiple errors. The achieved execution time gain of numerical packing is further analyzed with respect to its energy-saving equivalent, thus paving the way for a new class of silent data corruption (SDC) mitigation methods for integer matrix products that are fast, energy efficient, and highly reliable. A further advancement of the proposed numerical packing approach for the mitigation of core/processor failures in computing clusters (a.k.a. fail-stop failures) is described in Chapter 3. The key advantage of this new packing approach is the ability to tolerate processor failures for all classes of sum-of-product computations. Because multimedia applications running on cloud computing platforms are now required to mitigate an increasing number of failures and outages at runtime, we analyze the efficiency of numerical packing within an image retrieval framework deployed over a cluster of AWS EC2 spot (i.e., low-cost albeit terminable) instances. Our results show that more than a 70% reduction in cost can be achieved in comparison to conventional failure-intolerant processing based on AWS EC2 on-demand (i.e., higher-cost albeit guaranteed) instances. Finally, beyond numerical packing, we present a second approach to reliability for linear and sesquilinear integer data computations by generalizing the recently-proposed concept of numerical entanglement. The proposed approach is capable of recovering from multiple fail-stop failures in a parallel/distributed computing environment. We present theoretical analysis of the computational and bit-width requirements of the proposed method in comparison to existing methods of checksum generation and processing. Our experiments with integer matrix products show that the proposed approach incurs a 1.72%–37.23% reduction in processing throughput in comparison to failure-intolerant processing while allowing for the mitigation of multiple fail-stop failures without the use of additional computing resources.
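
    For contrast with packing, here is a minimal sketch of the classical checksum-based detection that such methods are compared against (this is the conventional ABFT baseline, not the thesis's numerical packing; the matrix size and integer types are illustrative). For C = A*B, the sum of each row of C must equal the dot product of that row of A with the vector of row sums of B, so a mismatch flags a corrupted output row.

        #include <stdint.h>

        #define N 3

        /* Returns nonzero if row 'row' of C = A*B fails its checksum. */
        static int row_has_error(const int64_t A[N][N], const int64_t B[N][N],
                                 const int64_t C[N][N], int row) {
            int64_t bsum[N] = {0}, expect = 0, got = 0;
            for (int j = 0; j < N; j++)          /* row sums of B */
                for (int k = 0; k < N; k++)
                    bsum[j] += B[j][k];
            for (int k = 0; k < N; k++)          /* checksum predicted from inputs */
                expect += A[row][k] * bsum[k];
            for (int j = 0; j < N; j++)          /* checksum of computed outputs */
                got += C[row][j];
            return expect != got;
        }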

    Design and optimization of approximate multipliers and dividers for integer and floating-point arithmetic

    The dawn of the twenty-first century has witnessed an explosion in the number of digital devices and data. While the emerging deep learning algorithms for extracting information from this vast sea of data are becoming increasingly compute-intensive, traditional means of improving computing power are no longer yielding gains at the same rate, due to the diminishing returns from traditional technology scaling. To minimize the increasing gap between computational demands and the available resources, the paradigm of approximate computing is emerging as one of the potential solutions. Specifically, resource-efficient approximate arithmetic units promise overall system efficiency, since most compute-intensive applications are dominated by arithmetic operations. This thesis primarily presents design techniques for approximate hardware multipliers and dividers. The thesis presents the design of two approximate integer multipliers and an approximate integer divider: an error-configurable minimally-biased approximate integer multiplier (MBM), an error-configurable reduced-error approximate log-based multiplier (REALM), and an error-configurable approximate integer divider (INZeD). The two multiplier designs and the divider design are based on coupling novel, mathematically formulated error-reduction mechanisms with the classical approximate log-based multiplier and divider, respectively. They exhibit very low error bias and offer Pareto-optimal error vs. resource-efficiency trade-offs when compared with the state-of-the-art approximate integer multipliers/dividers. Further, the thesis also presents the design of approximate floating-point multipliers and dividers. These designs utilize optimized versions of the proposed MBM and REALM multipliers for mantissa multiplication and the proposed INZeD divider for mantissa division, and offer better design trade-offs than traditional precision scaling. The existing approximate integer dividers, as well as the proposed INZeD, suffer from unreasonably high worst-case error. This thesis presents WEID, a novel lightweight method for reducing worst-case error in approximate dividers. Finally, the thesis presents a methodology for the selection of approximate arithmetic units for a given application. The methodology is based on a novel selection algorithm and utilizes subrange error characterization of approximate arithmetic units, which performs error characterization independently in different segments of the input range.
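
    For background, here is a minimal C sketch of the classical logarithm-based (Mitchell) approximate multiplier that the MBM and REALM designs build on; the error-reduction terms that distinguish the proposed designs are deliberately not reproduced, and the 16-bit fraction width is an arbitrary choice. Writing x = 2^k (1 + f) with 0 <= f < 1, Mitchell approximates log2(x) by k + f, adds the two approximate logs, and converts back with the same linear approximation.

        #include <stdint.h>

        static int ilog2(uint32_t x) {          /* position of the leading one */
            int k = 0;
            while (x >>= 1) k++;
            return k;
        }

        static uint64_t mitchell_mul(uint32_t a, uint32_t b) {
            if (a == 0 || b == 0) return 0;
            int ka = ilog2(a), kb = ilog2(b);
            /* fractional parts (x - 2^k) / 2^k in Q16, truncated */
            uint32_t fa = (uint32_t)(((uint64_t)(a - (1u << ka)) << 16) >> ka);
            uint32_t fb = (uint32_t)(((uint64_t)(b - (1u << kb)) << 16) >> kb);
            uint32_t fs = fa + fb;              /* may reach [1.0, 2.0) in Q16 */
            int k = ka + kb + (int)(fs >> 16);  /* a carry bumps the exponent */
            uint64_t mant = (1ULL << 16) + (fs & 0xFFFF);  /* 1 + frac, Q16 */
            return (k >= 16) ? mant << (k - 16) : mant >> (16 - k);
        }

    For example, mitchell_mul(3, 3) yields 8 rather than 9, showing the one-sided underestimation bias that the proposed error-reduction mechanisms target.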

    Reduction of Line Edge Roughness (LER) in Interference-Like Large Field Lithography

    Line edge roughness (LER) is seen as one of the most crucial challenges to be addressed in advanced technology nodes. In order to alleviate it, several options were explored in this work for interference-like lithography imaging conditions. The most straightforward option was to scale interference lithography (IL) to large-field integrated circuit (IC) applications. IL not only serves as a simple method to create high-resolution periodic patterns, but it also provides the highest theoretical contrast achievable compared to other optical lithography systems. Higher contrast yields a smaller transition region between the low- and high-intensity parts of the image and therefore inherently lowers LER. Two of the challenges that would prohibit scaling IL to large-field IC applications were addressed in this work: (1) field size limitations, and (2) magnification correction (i.e., pitch fine-tuning) ability. Experimental results showed less than 0.5 nm pitch adjustment capability using fused silica wedges mounted on rotational stages at a 300 nm pitch pattern. A detailed discussion of the maximum practical IL field size was outlined by considering the subsequent trim exposures and the optical path difference effects between the interfering diffraction orders. The practical limit on the IL field size was assessed to be 10 mm for the conditions specified in this work. One of the contributors to LER is the mask absorber roughness. To mitigate it, two methods were explored that are also applicable to scanners working under interference-like conditions: (1) aerial image averaging via directional translation, and (2) pupil plane filtering. Experiments on the pupil plane filtering approach were performed at Imec in Leuven, Belgium, on the ASML NXT:1950i scanner equipped with the FlexWAVE wavefront manipulator. Utilizing an optimized phase filter at the pupil plane and a programmed roughness mask, the transfer of the 200 nm roughness period to the wafer plane was eliminated. This mitigation effect was found to be strongly dependent on focus.
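
    As context for the contrast claim, the standard two-beam interference image can be written as follows (generic notation, equal beam intensities assumed; this is a textbook background result, not specific to the thesis):

        % Two equal beams of intensity I_0 interfering at half-angle theta in a
        % medium of index n produce a sinusoidal image of pitch p with unit
        % visibility V (the theoretical maximum contrast):
        I(x) = 2 I_0 \left[ 1 + \cos\!\left( \frac{2\pi x}{p} \right) \right],
        \qquad p = \frac{\lambda}{2 n \sin\theta},
        \qquad V = \frac{I_{\max} - I_{\min}}{I_{\max} + I_{\min}} = 1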

    A Parallel Processor System for Nuclear Shell-Model Calculations

    This thesis describes the design and implementation of a dedicated parallel processor system for nuclear shell-model calculations. The purpose of these calculations is to determine nuclear energy eigenvalues by the tridiagonalisation of the nuclear Hamiltonian matrix using the Lanczos method. The Theoretical Nuclear Structure group at Glasgow University's Physics Department would normally perform this type of calculation on a high-performance mainframe computer. However, these machines have limitations which restrict the number and scope of the calculations that can be performed. The Shell Model Processor system consists of a Multiple Microprocessor Unit (MMPU) driven by a highly pipelined dedicated front-end processor. The MMPU has a modular, moderately coupled, MIMD architecture based on autonomous processing modules. The elements within the system communicate via three shared buses. The front-end is responsible for determining the positions of non-zero elements within the Hamiltonian matrix. Once the position of an element has been found, it is passed to one of the free processing modules within the MMPU. The processing module then determines the value of the matrix element and performs the appropriate arithmetic to accumulate the resultant Lanczos vector. Two such processing modules have been developed. The most recently developed module is based on two MC68000 16/32-bit microprocessors. In addition, there are two supervisory processor modules, one of which controls the front-end and also assists it in its function. The other module has privileged system capabilities and is responsible for supervising the system as a whole. The system has been successfully tested, and performance figures are presented. The future expansion of the system to allow it to perform larger calculations is also discussed.
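
    The accumulation the processing modules perform corresponds to the standard Lanczos three-term recurrence (generic notation, not the thesis's own symbols; H is the Hamiltonian matrix and v_j the Lanczos vectors):

        % One Lanczos step: alpha_j and beta_j become the diagonal and
        % off-diagonal entries of the tridiagonal matrix whose eigenvalues
        % approximate the energy eigenvalues of H.
        \alpha_j = v_j^{\mathsf{T}} H v_j, \qquad
        w_j = H v_j - \alpha_j v_j - \beta_j v_{j-1}, \qquad
        \beta_{j+1} = \lVert w_j \rVert, \qquad
        v_{j+1} = w_j / \beta_{j+1}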

    The Development of a High Performance Digital RF Transmitter for NMR

    This Master’s thesis consists of the development of a Nuclear Magnetic Resonance (NMR) Radio Frequency (RF) transmitter, which is a core electronic subsystem of an NMR system. The main purpose of this research is to contribute to the application of NMR, which is a new sensing technology that has yet to be fully implemented in the everyday world. One of the barriers to adopting this technology is its complexity. However, the advent of high-speed digital FPGAs (Field Programmable Gate Arrays) such as the Spartan series has made it easier to develop high-performance NMR systems in recent years. The major contribution of this research is the development of faster digital signal processing hardware and methodologies implemented on a single chip. This has reduced the size and cost of the electronic subsystem and contributed towards the evolution of NMR as a general tool. This thesis introduces the concept of implementing a high-speed NMR RF multi-frequency transmitter by using multiple Direct Digital Synthesis (DDS) cores to generate sine waves ranging from 100 kHz to 750 MHz. The research was achieved in three stages: conceptual design of the high-speed transmitter using MATLAB-Simulink; RTL-level (Register-Transfer Level) simulation; and hardware implementation, including testing on a prototype board. This Master’s research seeks a solution for building a multi-core DDS module in an FPGA device; in other words, the work focuses on finding an alternative way to construct a DDS system. The project involves developing VHSIC Hardware Description Language (VHDL) code that works around the hardware limitations of the FPGA device. Hence, the final solution does not consider any noise impact due to the structure of the developed system.
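
    A minimal sketch of the phase-accumulator principle that each DDS core implements follows; the accumulator width, table size, and function names are illustrative assumptions, not the thesis design.

        #include <stdint.h>
        #include <math.h>

        #ifndef M_PI
        #define M_PI 3.14159265358979323846
        #endif

        #define LUT_BITS 10
        #define LUT_SIZE (1u << LUT_BITS)

        static int16_t sine_lut[LUT_SIZE];

        static void init_lut(void) {            /* full-wave table; real designs
                                                   often exploit quarter-wave
                                                   symmetry to shrink it */
            for (uint32_t i = 0; i < LUT_SIZE; i++)
                sine_lut[i] = (int16_t)(32767.0 * sin(2.0 * M_PI * i / LUT_SIZE));
        }

        /* One DDS channel: call once per clock tick. The 32-bit phase
         * accumulator wraps naturally, and its top bits index the table. */
        static int16_t dds_step(uint32_t *phase, uint32_t ftw) {
            *phase += ftw;
            return sine_lut[*phase >> (32 - LUT_BITS)];
        }

    The output frequency is f_out = ftw * f_clk / 2^32, so a multi-frequency transmitter instantiates several such accumulators with different tuning words and combines their outputs.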