190 research outputs found
Measuring Improvement when Using HUB Formats to Implement Floating-Point Systems under Round-to-Nearest
MEC under TIN2013-42253-P. This paper analyzes the benefits of using HUB
formats to implement floating-point arithmetic under round-to-nearest
mode from a quantitative point of view. Using HUB
formats to represent numbers allows the removal of the rounding
logic of arithmetic units, including sticky-bit computation. This
is shown for floating-point adders, multipliers, and converters.
Experimental analysis demonstrates that HUB formats and the
corresponding arithmetic units maintain the same accuracy as
conventional ones. On the other hand, the implementation of
these units, based on basic architectures, shows that HUB formats
simultaneously improve area, speed, and power consumption.
Specifically, based on data obtained from the synthesis, a HUB
single-precision adder is about 14% faster and consumes 38% less
area and 26% less power than the conventional adder. Similarly, a
HUB single-precision multiplier is 17% faster, uses 22% less area,
and consumes slightly less power than the conventional multiplier. At
the same speed, the adder and multiplier achieve area and power
reductions of up to 50% and 40%, respectively.
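As a rough illustration of why rounding logic can be dropped, the sketch below models a HUB (Half-Unit-Biased) value as a truncated fixed-point number with an implicit least-significant bit fixed at 1. Under that assumption, plain truncation already lands on the nearest HUB value, so no sticky bit or increment is needed. The function names and the fixed-point framing are illustrative assumptions, not the paper's hardware design.

```python
import math


def hub_round(x: float, p: int) -> float:
    """Nearest HUB value to x, with p explicit fractional bits.

    Assumption: HUB values have the form (k + 0.5) * 2**-p, i.e. an
    implicit extra bit that is always 1 sits below the p stored bits.
    """
    k = math.floor(x * 2**p)   # truncation: drop every bit below 2**-p
    return (k + 0.5) / 2**p    # the implicit LSB adds half a unit


def conventional_round(x: float, p: int) -> float:
    """Round-to-nearest on a conventional p-bit grid (needs an add)."""
    return math.floor(x * 2**p + 0.5) / 2**p


if __name__ == "__main__":
    p = 3
    for x in (0.3, 0.7, 1.99):
        print(x, hub_round(x, p), conventional_round(x, p))
```

In both cases the error is bounded by half a unit in the last stored place, 2**-(p+1), but the HUB path needs only truncation.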
Implementations of high performance architecture for IEEE 754 compliant floating-point adders
This thesis presents a direct iteration on, and implementation of, a high-performance architecture for IEEE 754 floating-point addition. It improves on the previous architecture's implementation in a variety of sub-operations required for IEEE 754 floating-point addition, focusing on directly improving critical-path delay. A key element of this thesis is the introduction of a flagged-prefix adder within the main carry-propagation path of an end-around-carry adder. The thesis also provides detailed documentation for the design of IEEE 754 compliant floating-point adders. This is particularly emphasized for uncommon operations and control logic used throughout floating-point addition, including denormalized numbers and multi-precision logic. The full design for this architecture has support for binary16, binary32, and binary64 operations. The full extended range provided by denormalized IEEE 754 values is supported. It also has conversion support between IEEE 754 and two's complement integer values in either binary16, binary32, or binary64 precision. The performance comparisons shown are synthesis results in cmos32soi 32nm GF technology and ARM-based standard cells.
Readiness of Quantum Optimization Machines for Industrial Applications
There have been multiple attempts to demonstrate that quantum annealing and,
in particular, quantum annealing on quantum annealing machines, has the
potential to outperform current classical optimization algorithms implemented
on CMOS technologies. The benchmarking of these devices has been controversial.
Initially, random spin-glass problems were used; however, these were quickly
shown to be not well suited to detect any quantum speedup. Subsequently,
benchmarking shifted to carefully crafted synthetic problems designed to
highlight the quantum nature of the hardware while (often) ensuring that
classical optimization techniques do not perform well on them. Even worse, to
date a true sign of improved scaling with the number of problem variables
remains elusive when compared to classical optimization techniques. Here, we
analyze the readiness of quantum annealing machines for real-world application
problems. These are typically not random and have an underlying structure that
is hard to capture in synthetic benchmarks, thus posing unexpected challenges
for optimization techniques, both classical and quantum alike. We present a
comprehensive computational scaling analysis of fault diagnosis in digital
circuits, considering architectures beyond D-Wave quantum annealers. We find
that the instances generated from real data in multiplier circuits are harder
than other representative random spin-glass benchmarks with a comparable number
of variables. Although our results show that transverse-field quantum annealing
is outperformed by state-of-the-art classical optimization algorithms, these
benchmark instances are hard and small in the size of the input, therefore
representing the first industrial application ideally suited for testing
near-term quantum annealers and other quantum algorithmic strategies for
optimization problems. Comment: 22 pages, 12 figures. Content updated according to Phys. Rev. Applied
version
High Performance Digital Circuit Techniques
Achieving high performance is one of the most difficult challenges in designing digital circuits. Flip-flops and adders are key blocks in most digital systems and must therefore be designed to yield the highest performance. In this thesis, a new high-performance serial adder is developed while keeping power consumption low. Also, a statistical framework for the design of flip-flops is introduced that ensures that such sequential circuits meet timing yield under performance criteria.
Firstly, a high-performance serial adder is developed. The new adder is based on the idea of having a constant delay for the addition of two operands. While conventional adders exhibit logarithmic delay, the proposed adder works at a constant delay order. In addition, the new adder's hardware complexity is linear in the word length, so it exhibits less area and power consumption than conventional high-performance adders. The thesis presents the underlying algorithm used for the new adder, followed by simulation results.
Secondly, this thesis presents a statistical framework for the design of flip-flops under process variations in order to maximize their timing yield. In nanometer CMOS technologies, process variations significantly impact the timing performance of sequential circuits, which may eventually cause them to malfunction. Therefore, developing a framework for designing such circuits is essential. Our framework generates the values of the nominal design parameters, i.e., the sizes of the gates and transmission gates of the flip-flop, such that maximum timing yield is achieved. While previous works focused on improving the yield of flip-flops, less research was done to improve the timing yield in the presence of process variations.
ARITHMETIC LOGIC UNIT ARCHITECTURES WITH DYNAMICALLY DEFINED PRECISION
Modern central processing units (CPUs) employ arithmetic logic units (ALUs) that support statically defined precisions, often adhering to industry standards. Although CPU manufacturers highly optimize their ALUs, industry standard precisions embody accuracy and performance compromises for general purpose deployment. Hence, optimizing ALU precision holds great potential for improving speed and energy efficiency. Previous research on multiple precision ALUs focused on predefined, static precisions. Little previous work addressed ALU architectures with customized, dynamically defined precision. This dissertation presents approaches for developing dynamic precision ALU architectures for both fixed-point and floating-point to enable better performance, energy efficiency, and numeric accuracy. These new architectures enable dynamically defined precision, including support for vectorization. The new architectures also prevent performance and energy loss due to applying unnecessarily high precision on computations, which often happens with statically defined standard precisions. The new ALU architectures support different precisions through the use of configurable sub-blocks, with this dissertation including demonstration implementations for floating-point adder, multiplier, and fused multiply-add (FMA) circuits with 4-bit sub-blocks. For these circuits, the dynamic precision ALU speed is nearly the same as traditional ALU approaches, although the dynamic precision ALU is nearly twice as large.
Approximate Calculation of DCT for HEVC and JPEG Hardware Encoders
Thesis (Master's) -- Seoul National University Graduate School: Department of Electrical and Computer Engineering, August 2015. Discrete Cosine Transform (DCT) is widely used for various image and video compression applications because of its excellent energy compaction property. DCT is computationally intensive and the calculations are parallelizable. Therefore, it is often implemented in hardware to speed up the calculation. However, due to the large size of the DCT or the multiple DCT modules required for some applications, the hardware area taken up by DCT in image or video encoders becomes significant. The DCT required in most applications does not need to be exact. Taking advantage of this fact, a novel approach is provided here to reduce the hardware area cost of the DCT module. The DCT hardware module consists of combinational logic and memory. Both components are reduced and the complete implementation is described. The targeted applications are HEVC and JPEG; however, the idea is applicable to any DCT hardware implementation. Finally, the degradation caused to the encoded image and video in terms of BDBR is discussed and the gate count results from synthesis are provided.
Chapter 1 Introduction
1.1 2D DCT Hardware Module
1.1.1 Pipelining the process
1.2 Approximate DCT
Chapter 2 Related Works
Chapter 3 The Moving Window Idea for Bit-Width Reduction
3.1 ML Recovery for Moving Window
Chapter 4 Approximate DCT for HEVC
4.1 HEVC Overview
4.2 HEVC Encoder
4.3 DCT in HEVC Encoder
4.4 Approximate DCT in HEVC
4.4.1 The three components of the DCT module
4.4.2 Optimizing Partial Butterfly Adder/Subtractors
4.4.3 Optimizing the multiplication module
4.4.3.1 Multiple Constant Multiplication (MCM)
4.4.3.2 Approximate MCM
4.4.4 Optimizing the transpose memory
Chapter 5 Approximate DCT for JPEG
5.1 JPEG Overview
5.2 Approximate DCT
5.3 Application of Moving Window to DCT transpose memory
5.3.1 Ideal implementation
5.3.2 Window position based on first row
5.3.2.1 Cases of failure
5.3.3 Position based on first column
5.3.3.1 Cases of failure
5.4 Hybrid implementation
Chapter 6 Experimental Results
6.1 HEVC Experiments and Results
6.2 JPEG Experiments and Results
Chapter 7 Conclusion
Profile-directed specialisation of custom floating-point hardware
We present a methodology for generating floating-point arithmetic hardware
designs which are, for suitable applications, much reduced in size, while still
retaining performance and IEEE-754 compliance. Our system uses three
key parts: a profiling tool, a set of customisable floating-point units and a
selection of system integration methods.
We use a profiling tool for floating-point behaviour to identify arithmetic
operations where fundamental elements of IEEE-754 floating-point may be
compromised, without generating erroneous results in the common case.
In the uncommon case, we use simple detection logic to determine when
operands lie outside the range of capabilities of the optimised hardware.
Out-of-range operations are handled by a separate, fully capable,
floating-point implementation, either on-chip or by returning calculations to
a host processor. We present methods of system integration to achieve this
error correction. Thus the system suffers no compromise in IEEE-754 compliance,
even when the synthesised hardware would generate erroneous results.
In particular, we identify from input operands the shift amounts required
for input operand alignment and post-operation normalisation. For operations
where these are small, we synthesise hardware with reduced-size
barrel-shifters. We also propose optimisations to take advantage of other
profile-exposed behaviours, including removing the hardware required to
swap operands in a floating-point adder or subtractor, and reducing the
exponent range to fit observed values.
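The kind of profile described above can be sketched in a few lines: for each pair of adder operands, the alignment shift is the difference of the two binary exponents, and the profile reports how often that shift fits a reduced barrel shifter. This is only a minimal sketch under stated assumptions; `alignment_shift`, `fraction_within`, and the `max_shift` threshold are hypothetical names, not the authors' tool, and zero, infinite, and NaN operands are ignored.

```python
import math


def alignment_shift(a: float, b: float) -> int:
    """Right-shift needed to align the smaller operand before addition.

    Assumption: operands are finite and non-zero, so math.frexp gives
    a meaningful binary exponent for each of them.
    """
    _, exp_a = math.frexp(a)
    _, exp_b = math.frexp(b)
    return abs(exp_a - exp_b)


def fraction_within(pairs, max_shift: int) -> float:
    """Share of operand pairs whose alignment shift is <= max_shift."""
    shifts = [alignment_shift(a, b) for a, b in pairs]
    return sum(s <= max_shift for s in shifts) / len(shifts)


if __name__ == "__main__":
    trace = [(1.0, 8.0), (1.0, 1.5), (1.0, 1024.0)]
    # Pairs outside the threshold would go to the fully capable fallback unit.
    print(fraction_within(trace, max_shift=4))
```

Pairs that exceed the threshold correspond to the out-of-range operations that the abstract routes to a separate, fully capable implementation.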
We present profiling results for a range of applications, including a selection
of computational science programs, Spec FP 95 benchmarks and the
FFMPEG media processing tool, indicating which would be amenable to
our method. Selected applications which demonstrate potential for optimisation
are then taken through to a hardware implementation. We show up
to a 45% decrease in hardware size for a floating-point datapath, with a
correctable error-rate of less than 3%, even with non-profiled datasets.
An instruction systolic array architecture for multiple neural network types
Modern electronic systems, especially sensor and imaging systems, are beginning to
incorporate their own neural network subsystems. In order for these neural systems to learn in
real-time they must be implemented using VLSI technology, with as much of the learning
processes incorporated on-chip as is possible. The majority of current VLSI implementations
literally implement a series of neural processing cells, which can be connected together in an
arbitrary fashion. Many do not perform the entire neural learning process on-chip, instead
relying on other external systems to carry out part of the computation requirements of the
algorithm.
The work presented here utilises two-dimensional instruction systolic arrays in an attempt to
define a general neural architecture which is closer to the biological basis of neural networks - it
is the synapses themselves, rather than the neurons, that have dedicated processing units. A
unified architecture is described which can be programmed at the microcode level in order to
facilitate the processing of multiple neural network types.
An essential part of neural network processing is the neuron activation function, which can
range from a sequential algorithm to a discrete mathematical expression. The architecture
presented can easily carry out the sequential functions, and introduces a fast method of
mathematical approximation for the more complex functions. This can be evaluated on-chip,
thus implementing the entire neural process within a single system.
VHDL circuit descriptions for the chip have been generated, and the systolic processing
algorithms and associated microcode instruction set for three different neural paradigms have
been designed. A software simulator of the architecture has been written, giving results for
several common applications in the field.
Low Power Memory/Memristor Devices and Systems
This reprint focusses on achieving low-power computation using memristive devices. The topic was designed as a convenient reference point: it contains a mix of techniques starting from the fundamental manufacturing of memristive devices all the way to applications such as physically unclonable functions, and also covers perspectives on, e.g., in-memory computing, which is inextricably linked with emerging memory devices such as memristors. Finally, the reprint contains a few articles representing how other communities (from typical CMOS design to photonics) are fighting on their own fronts in the quest towards low-power computation, as a comparison with the memristor literature. We hope that readers will enjoy discovering the articles within.
- …