IEEE Compliant Double-Precision FPU and 64-bit ALU with Variable Latency Integer Divider
Together the arithmetic logic unit (ALU) and floating-point unit (FPU) perform all of the mathematical and logic operations of computer processors. Because they are used so prominently, they fall in the critical path of the central processing unit - often becoming the bottleneck, or limiting factor, for performance. As such, the design of a high-speed ALU and FPU is vital to creating a processor capable of performing up to the demanding standards of today's computer users.
In this paper, both a 64-bit ALU and a 64-bit FPU are designed based on the reduced instruction set computer architecture. The ALU performs the four basic mathematical operations - addition, subtraction, multiplication and division - in both unsigned and two's complement format, basic logic operations and shifting. The division algorithm is a novel approach, using a comparison multiples based SRT divider to create a variable latency integer divider. The floating-point unit performs the double-precision floating-point operations add, subtract, multiply and divide, in accordance with the IEEE 754 standard for number representation and rounding.
The ALU and FPU were implemented in VHDL, simulated in ModelSim, and constrained and synthesized using Synopsys Design Compiler (2006.06). They were synthesized using TSMC 0.13 µm CMOS technology. The timing, power and area synthesis results were recorded and, where applicable, compared to those of the corresponding DesignWare components. The ALU synthesis reported an area of 122,215 gates, a power of 384 mW, and a delay of 2.89 ns - a frequency of 346 MHz. The FPU synthesis reported an area of 84,440 gates, a delay of 2.82 ns and an operating frequency of 355 MHz, with a maximum dynamic power of 153.9 mW
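The comparison-multiples selection logic is the thesis's own contribution and is not reproduced in the abstract. As a rough behavioural sketch only, an SRT-style divider retires one signed quotient digit per iteration by inspecting the partial remainder; the minimal radix-2 version below (which reduces to restoring behaviour for the non-negative operands shown) illustrates the digit-recurrence idea, not the thesis's design:

```python
def srt_divide(dividend, divisor, n_bits=16):
    """Radix-2 SRT-style integer division sketch.

    Each iteration selects a quotient digit from {-1, 0, +1} by comparing
    the partial remainder against the shifted divisor, then subtracts the
    chosen multiple. The signed digits sum directly into the quotient.
    """
    assert divisor > 0 and dividend >= 0
    rem = dividend
    quotient = 0
    for i in range(n_bits - 1, -1, -1):
        shifted = divisor << i
        # Digit selection: compare the partial remainder with +/-(divisor << i).
        if rem >= shifted:
            digit = 1
        elif rem <= -shifted:
            digit = -1
        else:
            digit = 0
        rem -= digit * shifted
        quotient += digit << i
    return quotient, rem
```

In hardware, the comparison step is replaced by a fast, approximate inspection of only the top bits of the remainder, which is where variable-latency and comparison-multiples schemes differ.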
Progress of analog-hybrid computation
Review of fast analog/hybrid computer systems, integrated operational amplifiers, electronic mode-control switches, digital attenuators, and packaging technique
A high-performance inner-product processor for real and complex numbers.
A novel, high-performance fixed-point inner-product processor based on a redundant binary number system is investigated in this dissertation. This scheme decreases the number of partial products to 50%, while achieving better speed and area performance, as well as providing pipeline extension opportunities. When modified Booth coding is used, partial products are reduced by almost 75%, thereby significantly reducing the multiplier addition depth. The design is applicable for digital signal and image processing applications that require real and/or complex numbers inner-product arithmetic, such as digital filters, correlation and convolution. This design is well suited for VLSI implementation and can also be embedded as an inner-product core inside a general purpose or DSP FPGA-based processor. Dynamic control of the computing structure permits different computations, such as a variety of inner-product real and complex number computations, parallel multiplication for real and complex numbers, and real and complex number division. The same structure can also be controlled to accept redundant binary number inputs for multiplication and inner-product computations. An improved 2's-complement to redundant binary converter is also presented
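The dissertation's redundant binary scheme itself is not detailed in the abstract; as an illustration of the related partial-product reduction it cites, the sketch below shows radix-4 modified Booth recoding, which roughly halves the number of partial products by recoding the multiplier into digits in {-2, -1, 0, +1, +2} (function names are ours, not the dissertation's):

```python
def booth_radix4_digits(multiplier, n_bits=8):
    """Recode a two's-complement multiplier into radix-4 Booth digits."""
    if multiplier < 0:
        multiplier += 1 << n_bits  # two's-complement bit pattern
    bits = [(multiplier >> i) & 1 for i in range(n_bits)]
    bits = [0] + bits + [bits[-1]]  # y_{-1} = 0 on the right, sign extension on the left
    digits = []
    for i in range(0, n_bits, 2):
        # Each digit examines the overlapping triple (y_{2k+1}, y_{2k}, y_{2k-1}).
        digits.append(bits[i] + bits[i + 1] - 2 * bits[i + 2])
    return digits  # multiplier value = sum(d * 4**k)

def booth_multiply(a, b, n_bits=8):
    """Multiply via the Booth digits: one shifted partial product per digit."""
    return sum(d * a * (4 ** k)
               for k, d in enumerate(booth_radix4_digits(b, n_bits)))
```

An n-bit multiplier yields only about n/2 partial products, which is why the abstract reports a significantly reduced multiplier addition depth when Booth coding is combined with the redundant binary representation.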
Design of ALU and Cache Memory for an 8 bit ALU
The design of an ALU and a cache memory for use in a high-performance processor was examined in this thesis. Advanced architectures employing increased parallelism were analyzed to minimize the number of execution cycles needed for 8-bit integer arithmetic operations. In addition to the arithmetic unit, an optimized SRAM memory cell was designed to be used as cache memory and as a fast look-up table. The ALU consists of stand-alone units for bit-parallel computation of basic integer arithmetic operations. Addition and subtraction were performed using Kogge-Stone parallel prefix hardware operating at 330 MHz. A high-performance multiplier was built using a Radix-4 Modified Booth Encoder (MBE) and a Wallace tree summation array. The multiplier requires a single clock cycle for 8-bit integer multiplication and operates at a maximum frequency of 100 MHz. Multiplicative division hardware was built for executing both integer division and square root; it computes 8-bit division and square root in 4 clock cycles. The multiplier forms the basic building block of all these functional units, making a high level of resource sharing feasible with this architecture. The optimal operating frequency for the arithmetic unit is 70 MHz. A 6T CMOS SRAM cell measuring 90 µm² was designed using minimum-size transistors. The layout allows for horizontal overlap, resulting in an effective area of 76 µm² for an 8x8 array. By substituting the equivalent bit-line capacitance of the P4 L1 cache, the memory was simulated to have a read time of 3.27 ns. An optimized set of test vectors was identified to enable high fault coverage without the need for any additional test circuitry. Sixteen test cases were identified that would toggle all the nodes and provide all possible inputs to the sub-units of the multiplier. A correlation-based semi-automatic method was investigated to facilitate test case identification for large multipliers.
This method of testability eliminates the performance and area overhead associated with conventional testability hardware. A bottom-up design methodology was employed. The performance and area metrics are presented along with estimated power consumption. A set of Monte Carlo analyses was carried out to ensure the dependability of the design under process variations as well as fluctuations in operating conditions. The arithmetic unit was found to require a total die area of approximately 2 mm² in a 0.35 micron process
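The Kogge-Stone adder mentioned in the abstract computes all carries in O(log n) prefix levels instead of rippling them bit by bit. A behavioural sketch (our own illustration, not the thesis's RTL) of the generate/propagate prefix tree:

```python
def kogge_stone_add(a, b, width=8):
    """Kogge-Stone parallel-prefix addition sketch.

    Generate (g) and propagate (p) pairs are merged over log2(width)
    levels, doubling the combine distance each level, so every carry is
    ready after O(log n) stages rather than n.
    """
    g = [((a >> i) & 1) & ((b >> i) & 1) for i in range(width)]
    p = [((a >> i) & 1) ^ ((b >> i) & 1) for i in range(width)]
    dist = 1
    while dist < width:
        new_g, new_p = g[:], p[:]
        for i in range(dist, width):
            # Combine with the node `dist` positions to the right.
            new_g[i] = g[i] | (p[i] & g[i - dist])
            new_p[i] = p[i] & p[i - dist]
        g, p = new_g, new_p
        dist *= 2
    # Carry into bit i is the group-generate of bits 0..i-1 (carry-in = 0).
    carries = [0] + g[:width - 1]
    s = 0
    for i in range(width):
        s |= ((((a >> i) & 1) ^ ((b >> i) & 1)) ^ carries[i]) << i
    return s  # sum modulo 2**width
```

The inner loop at each level maps directly onto a column of identical prefix cells in hardware, which is what gives the adder its regular, fast layout.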
An Arithmetic Library for Fully Homomorphic Encryption
As critical government, industry, and consumer applications move online, techniques to maintain data security and privacy must be updated. Traditional encryption methods leave data vulnerable when it is searched or modified. Homomorphic encryption fills the gap, enabling such operations on encrypted data and eliminating the vulnerable decryption step. We worked with the CUDA-accelerated Fully Homomorphic Encryption (cuFHE) Library, the fastest of its kind in the public domain, to create efficient arithmetic functions. We built upon and modified the existing gate primitives, arranging them to create functions which are hardware and application agnostic. The result is a fast platform upon which homomorphic applications can be built: applications which protect user privacy and data integrity
High sample-rate Givens rotations for recursive least squares
The design of an application-specific integrated circuit of a parallel array processor is considered
for recursive least squares by QR decomposition using Givens rotations, applicable
in adaptive filtering and beamforming applications. Emphasis is on high sample-rate operation,
which, for this recursive algorithm, means that the time to perform arithmetic operations
is critical. The algorithm, architecture and arithmetic are considered in a single
integrated design procedure to achieve optimum results.
A realisation approach using standard arithmetic operators, add, multiply and divide is
adopted. The design of high-throughput operators with low delay is addressed for fixed- and
floating-point number formats, and the application of redundant arithmetic considered. New
redundant multiplier architectures are presented enabling reductions in area of up to 25%,
whilst maintaining low delay. A technique is presented enabling the use of a conventional
tree multiplier in recursive applications, allowing savings in area and delay. Two new divider
architectures are presented showing benefits compared with the radix-2 modified SRT algorithm.
Givens rotation algorithms are examined to determine their suitability for VLSI implementation.
A novel algorithm, based on the Squared Givens Rotation (SGR) algorithm, is developed
enabling the sample-rate to be increased by a factor of approximately 6 and offering
area reductions up to a factor of 2 over previous approaches. An estimated sample-rate of
136 MHz could be achieved using a standard cell approach and 0.35 µm CMOS technology.
The enhanced SGR algorithm has been compared with a CORDIC approach and shown to
benefit by a factor of 3 in area and over 11 in sample-rate. When compared with a recent implementation
on a parallel array of general purpose (GP) DSP chips, it is estimated that a single
application specific chip could offer up to 1,500 times the computation obtained from a
single GP DSP chip
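The SGR variant developed in the thesis is not spelled out in the abstract; as a baseline illustration of the underlying operation, a standard Givens rotation annihilates one matrix element at a time during QR decomposition (names and the list-of-rows representation here are ours):

```python
import math

def givens(a, b):
    """Return (c, s) such that [[c, s], [-s, c]] applied to the column
    vector [a, b] zeroes the second entry."""
    if b == 0:
        return 1.0, 0.0
    r = math.hypot(a, b)
    return a / r, b / r

def apply_givens(rows, c, s):
    """Rotate a pair of matrix rows in place, preserving column norms."""
    top, bot = rows
    for j in range(len(top)):
        t = c * top[j] + s * bot[j]
        bot[j] = -s * top[j] + c * bot[j]
        top[j] = t
    return rows
```

Each rotation involves a square root and a division, which is why the abstract stresses high-throughput, low-delay arithmetic operators; squared-Givens formulations such as SGR rearrange the recurrence to avoid the square root entirely.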
Efficient tree-listing algorithm
Electronics Letters. The article of record may be found at http://dx.doi.org/10.1049/el:19700192. An algorithm, based on the T-triangle method, is given for the generation of all the trees in a nonoriented connected graph. The efficiency of the algorithm is verified by computer results
A programmable integrated power supply for the electrostatic-drive micromotor
A 6-phase bipolarized, high-voltage power supply with rectangular pulse shape has been designed to study the special operational characteristics of various electrostatic-drive micromotors. In particular the design powers the variable-capacitance side-drive micromotor. This power supply provides variable frequency, variable voltage, and variable duty-cycle control. Simulation has been used extensively in the design and design verification.
The bipolarization (dual voltage polarity) of each pair of the phases reduces physical clamping of the rotor to the electrical shield beneath it. Thus, bipolarization of the voltage supplied to the stator nodes reduces charge build-up on the rotor.
An output frequency range of 1 Hz to 40 kHz has been achieved. This supply frequency range corresponds to a motor rotational speed range of 5 rpm to 200 krpm, for a micromotor with 12 stator poles and 8 rotor poles (3:2 architecture). The voltage amplitudes of all six phases can be varied from 20 to 200 volts.
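The quoted endpoints are consistent with twelve drive cycles per mechanical revolution for this 12-stator-pole geometry (an inference from the figures, not a statement in the abstract), which gives a simple frequency-to-speed conversion:

```python
def rotor_speed_rpm(drive_hz, cycles_per_rev=12):
    """Rotor speed implied by the drive frequency, assuming (as the
    abstract's figures suggest) 12 drive cycles per mechanical
    revolution for the 12-stator-pole / 8-rotor-pole micromotor."""
    return drive_hz * 60 / cycles_per_rev

# 1 Hz corresponds to 5 rpm and 40 kHz to 200,000 rpm,
# matching the quoted 5 rpm to 200 krpm range.
```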
The duty cycle of each phase can be changed by means of a parallel register. The output with variable duty cycle has been obtained, changing from 50% non-overlapping to 33% overlapping.
The power supply with 6-phase bipolarized output, variable frequency, and variable voltage output has been constructed with prototyping wire wrap boards, and assembled in a card cage. The power supply is shown to meet the design specification
Hardware acceleration of photon mapping
PhD Thesis. The quest for realism in computer-generated graphics has yielded a range of algorithmic
techniques, the most advanced of which are capable of rendering images at close to photorealistic
quality. Due to the realism available, it is now commonplace that computer graphics are used in
the creation of movie sequences, architectural renderings, medical imagery and product
visualisations.
This work concentrates on the photon mapping algorithm [1, 2], a physically based global
illumination rendering algorithm. Photon mapping excels in producing highly realistic, physically
accurate images.
A drawback to photon mapping however is its rendering times, which can be significantly longer
than other, albeit less realistic, algorithms. Not surprisingly, this increase in execution time is
associated with a high computational cost. This computation is usually performed using the
general purpose central processing unit (CPU) of a personal computer (PC), with the algorithm
implemented as a software routine. Other options available for processing these algorithms
include desktop PC graphics processing units (GPUs) and custom designed acceleration hardware
devices.
GPUs tend to be efficient when dealing with less realistic rendering solutions such as rasterisation,
however with their recent drive towards increased programmability they can also be used to
process more realistic algorithms. A drawback to the use of GPUs is that these algorithms often
have to be reworked to make optimal use of the limited resources available.
There are very few custom hardware devices available for acceleration of the photon mapping
algorithm. Ray-tracing is the predecessor to photon mapping, and although not capable of
producing the same physical accuracy and therefore realism, there are similarities between the
algorithms. There have been several hardware prototypes, and at least one commercial offering,
created with the goal of accelerating ray-trace rendering [3]. However, properties making many of
these proposals suitable for the acceleration of ray-tracing are not shared by photon mapping.
There are even fewer proposals for acceleration of the additional functions found only in photon
mapping.
All of these approaches to algorithm acceleration offer limited scalability. GPUs are inherently
difficult to scale, while many of the custom hardware devices available thus far make use of large
processing elements and complex acceleration data structures.
In this work we make use of three novel approaches in the design of highly scalable specialised
hardware structures for the acceleration of the photon mapping algorithm. Increased scalability is
gained through:
• The use of a brute-force approach in place of the commonly used smart approach, thus
eliminating much data pre-processing, complex data structures and large processing units
often required.
• The use of Logarithmic Number System (LNS) arithmetic computation, which facilitates a
reduction in processing area requirement.
• A novel redesign of the photon inclusion test, used within the photon search method of
the photon mapping algorithm. This allows an intelligent memory structure to be used for
the search.
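The area saving claimed for LNS arithmetic comes from representing each value by its base-2 logarithm, so multiplication and division collapse into fixed-point addition and subtraction; only addition needs a nonlinear correction, typically a small lookup table in hardware. A floating-point sketch of the idea (our illustration; the thesis's fixed-point encoding is not given in the abstract):

```python
import math

def to_lns(x):
    """Encode a nonzero value as (sign, log2|x|); real LNS hardware
    handles zero with a special flag."""
    return (1 if x >= 0 else -1, math.log2(abs(x)))

def lns_mul(a, b):
    """LNS multiplication: add the exponents, multiply the signs."""
    (sa, la), (sb, lb) = a, b
    return (sa * sb, la + lb)

def lns_add_same_sign(a, b):
    """LNS addition of same-sign values via the correction function
    log2(1 + 2**-d), which hardware evaluates by table lookup."""
    (s, la), (_, lb) = a, b
    hi, lo = max(la, lb), min(la, lb)
    return (s, hi + math.log2(1 + 2 ** (lo - hi)))

def from_lns(v):
    s, l = v
    return s * 2 ** l
```

Since the photon search is dominated by multiply-heavy distance computations, trading multipliers for adders plus a shared lookup table is what shrinks each processing element.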
The design uses two hardware structures, both of which accelerate one core rendering function.
Renderings produced using field programmable gate array (FPGA) based prototypes are presented,
along with details of 90 nm synthesised versions of the designs which show that close to an order-of-magnitude speedup over a software implementation is possible. Due to the scalable nature of
the design, it is likely that any advantage can be maintained in the face of improving processor
speeds.
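The brute-force photon inclusion test amounts to checking every stored photon against the query point, with no kd-tree or other acceleration structure; a minimal sketch of that search (function name and data layout are ours, not the thesis's):

```python
import heapq

def nearest_photons(photons, point, k, max_radius):
    """Brute-force k-nearest photon search.

    Every photon is tested against the query point; the k closest within
    max_radius are kept in a bounded max-heap. The absence of any
    pre-built spatial data structure is what makes this loop easy to
    replicate across many identical parallel hardware units.
    """
    r2 = max_radius * max_radius
    heap = []  # max-heap via negated squared distance
    for p in photons:
        d2 = sum((pc - qc) ** 2 for pc, qc in zip(p, point))
        if d2 <= r2:
            heapq.heappush(heap, (-d2, p))
            if len(heap) > k:
                heapq.heappop(heap)  # evict the current farthest
    return [p for _, p in sorted(heap, reverse=True)]
```

In software this O(n) scan loses badly to a kd-tree, but in hardware the uniform, branch-light loop maps onto simple replicated units and an intelligent memory structure, which is the scalability argument made above.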
Significantly, due to the brute-force approach adopted, it is possible to eliminate an often-used
software acceleration method. This means that the device can interface almost directly to a front-end
modelling package, minimising much of the pre-processing required by most other proposals
- …