464 research outputs found

    Profile-directed specialisation of custom floating-point hardware

    No full text
    We present a methodology for generating floating-point arithmetic hardware designs which are, for suitable applications, much reduced in size, while still retaining performance and IEEE-754 compliance. Our system uses three key parts: a profiling tool, a set of customisable floating-point units and a selection of system integration methods. We use a profiling tool for floating-point behaviour to identify arithmetic operations where fundamental elements of IEEE-754 floating-point may be compromised, without generating erroneous results in the common case. In the uncommon case, we use simple detection logic to determine when operands lie outside the range of capabilities of the optimised hardware. Out-of-range operations are handled by a separate, fully capable, floatingpoint implementation, either on-chip or by returning calculations to a host processor. We present methods of system integration to achieve this errorcorrection. Thus the system suffers no compromise in IEEE-754 compliance, even when the synthesised hardware would generate erroneous results. In particular, we identify from input operands the shift amounts required for input operand alignment and post-operation normalisation. For operations where these are small, we synthesise hardware with reduced-size barrel-shifters. We also propose optimisations to take advantage of other profile-exposed behaviours, including removing the hardware required to swap operands in a floating-point adder or subtractor, and reducing the exponent range to fit observed values. We present profiling results for a range of applications, including a selection of computational science programs, Spec FP 95 benchmarks and the FFMPEG media processing tool, indicating which would be amenable to our method. Selected applications which demonstrate potential for optimisation are then taken through to a hardware implementation. We show up to a 45% decrease in hardware size for a floating-point datapath, with a correctable error-rate of less then 3%, even with non-profiled datasets

    DDC-PIM: Efficient Algorithm/Architecture Co-design for Doubling Data Capacity of SRAM-based Processing-In-Memory

    Full text link
    Processing-in-memory (PIM), as a novel computing paradigm, provides significant performance benefits from the aspect of effective data movement reduction. SRAM-based PIM has been demonstrated as one of the most promising candidates due to its endurance and compatibility. However, the integration density of SRAM-based PIM is much lower than other non-volatile memory-based ones, due to its inherent 6T structure for storing a single bit. Within comparable area constraints, SRAM-based PIM exhibits notably lower capacity. Thus, aiming to unleash its capacity potential, we propose DDC-PIM, an efficient algorithm/architecture co-design methodology that effectively doubles the equivalent data capacity. At the algorithmic level, we propose a filter-wise complementary correlation (FCC) algorithm to obtain a bitwise complementary pair. At the architecture level, we exploit the intrinsic cross-coupled structure of 6T SRAM to store the bitwise complementary pair in their complementary states (Q/QQ/\overline{Q}), thereby maximizing the data capacity of each SRAM cell. The dual-broadcast input structure and reconfigurable unit support both depthwise and pointwise convolution, adhering to the requirements of various neural networks. Evaluation results show that DDC-PIM yields about 2.84×2.84\times speedup on MobileNetV2 and 2.69×2.69\times on EfficientNet-B0 with negligible accuracy loss compared with PIM baseline implementation. Compared with state-of-the-art SRAM-based PIM macros, DDC-PIM achieves up to 8.41×8.41\times and 2.75×2.75\times improvement in weight density and area efficiency, respectively.Comment: 14 pages, to be published in IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD

    Serial-data computation in VLSI

    Get PDF

    Characterization and Design of High-Level VHDL I/Q Frequency Downconverter via Special Sampling Scheme

    Get PDF
    This study explores the characterization and implementation of a Special Sampling Scheme (SSS) for In-Phase and Quad-Phase (I/Q) down conversion utilizing top-level, portable design strategies. The SSS is an under-developed signal sampling methodology that can be used with military and industry receiver systems, specifically, United States Air Force (USAF) video receiver systems. The SSS processes a digital input signal-stream sampled at a specified sampling frequency, and down converts it into In-Phase (I) and Quad-Phase (Q) output signal-streams. Using the theory and application of the SSS, there are three main objectives that will be accomplished: characterization of the effects of input, output, and filter coefficient parameters on the I/Q imbalances using the SSS; development and verification of abstract, top-level VHDL code of the I/Q SSS for hardware implementation; and finally, development, verification, and analysis of variation between synthesizable pipelined and sequential VHDL implementations of the SSS for Field Programmable Gate Arrays (FPGA) and Application Specific Integrated Circuits (ASIC)

    KAVUAKA: a low-power application-specific processor architecture for digital hearing aids

    Get PDF
    The power consumption of digital hearing aids is very restricted due to their small physical size and the available hardware resources for signal processing are limited. However, there is a demand for more processing performance to make future hearing aids more useful and smarter. Future hearing aids should be able to detect, localize, and recognize target speakers in complex acoustic environments to further improve the speech intelligibility of the individual hearing aid user. Computationally intensive algorithms are required for this task. To maintain acceptable battery life, the hearing aid processing architecture must be highly optimized for extremely low-power consumption and high processing performance.The integration of application-specific instruction-set processors (ASIPs) into hearing aids enables a wide range of architectural customizations to meet the stringent power consumption and performance requirements. In this thesis, the application-specific hearing aid processor KAVUAKA is presented, which is customized and optimized with state-of-the-art hearing aid algorithms such as speaker localization, noise reduction, beamforming algorithms, and speech recognition. Specialized and application-specific instructions are designed and added to the baseline instruction set architecture (ISA). Among the major contributions are a multiply-accumulate (MAC) unit for real- and complex-valued numbers, architectures for power reduction during register accesses, co-processors and a low-latency audio interface. With the proposed MAC architecture, the KAVUAKA processor requires 16 % less cycles for the computation of a 128-point fast Fourier transform (FFT) compared to related programmable digital signal processors. The power consumption during register file accesses is decreased by 6 %to 17 % with isolation and by-pass techniques. The hardware-induced audio latency is 34 %lower compared to related audio interfaces for frame size of 64 samples.The final hearing aid system-on-chip (SoC) with four KAVUAKA processor cores and ten co-processors is integrated as an application-specific integrated circuit (ASIC) using a 40 nm low-power technology. The die size is 3.6 mm2. Each of the processors and co-processors contains individual customizations and hardware features with a varying datapath width between 24-bit to 64-bit. The core area of the 64-bit processor configuration is 0.134 mm2. The processors are organized in two clusters that share memory, an audio interface, co-processors and serial interfaces. The average power consumption at a clock speed of 10 MHz is 2.4 mW for SoC and 0.6 mW for the 64-bit processor.Case studies with four reference hearing aid algorithms are used to present and evaluate the proposed hardware architectures and optimizations. The program code for each processor and co-processor is generated and optimized with evolutionary algorithms for operation merging,instruction scheduling and register allocation. The KAVUAKA processor architecture is com-pared to related processor architectures in terms of processing performance, average power consumption, and silicon area requirements

    Study of spaceborne multiprocessing, phase 1

    Get PDF
    Multiprocessing computer organizations and their application to future space mission

    Design of Resistive Synaptic Devices and Array Architectures for Neuromorphic Computing

    Get PDF
    abstract: Over the past few decades, the silicon complementary-metal-oxide-semiconductor (CMOS) technology has been greatly scaled down to achieve higher performance, density and lower power consumption. As the device dimension is approaching its fundamental physical limit, there is an increasing demand for exploration of emerging devices with distinct operating principles from conventional CMOS. In recent years, many efforts have been devoted in the research of next-generation emerging non-volatile memory (eNVM) technologies, such as resistive random access memory (RRAM) and phase change memory (PCM), to replace conventional digital memories (e.g. SRAM) for implementation of synapses in large-scale neuromorphic computing systems. Essentially being compact and “analog”, these eNVM devices in a crossbar array can compute vector-matrix multiplication in parallel, significantly speeding up the machine/deep learning algorithms. However, non-ideal eNVM device and array properties may hamper the learning accuracy. To quantify their impact, the sparse coding algorithm was used as a starting point, where the strategies to remedy the accuracy loss were proposed, and the circuit-level design trade-offs were also analyzed. At architecture level, the parallel “pseudo-crossbar” array to prevent the write disturbance issue was presented. The peripheral circuits to support various parallel array architectures were also designed. One key component is the read circuit that employs the principle of integrate-and-fire neuron model to convert the analog column current to digital output. However, the read circuit is not area-efficient, which was proposed to be replaced with a compact two-terminal oscillation neuron device that exhibits metal-insulator-transition phenomenon. To facilitate the design exploration, a circuit-level macro simulator “NeuroSim” was developed in C++ to estimate the area, latency, energy and leakage power of various neuromorphic architectures. NeuroSim provides a wide variety of design options at the circuit/device level. NeuroSim can be used alone or as a supporting module to provide circuit-level performance estimation in neural network algorithms. A 2-layer multilayer perceptron (MLP) simulator with integration of NeuroSim was demonstrated to evaluate both the learning accuracy and circuit-level performance metrics for the online learning and offline classification, as well as to study the impact of eNVM reliability issues such as data retention and write endurance on the learning performance.Dissertation/ThesisDoctoral Dissertation Electrical Engineering 201

    Research in the effective implementation of guidance computers with large scale arrays Interim report

    Get PDF
    Functional logic character implementation in breadboard design of NASA modular compute
    corecore