1,520 research outputs found

    Design of a high-speed digital processing element for parallel simulation

    Get PDF
    A prototype of a custom designed computer to be used as a processing element in a multiprocessor based jet engine simulator is described. The purpose of the custom design was to give the computer the speed and versatility required to simulate a jet engine in real time. Real time simulations are needed for closed loop testing of digital electronic engine controls. The prototype computer has a microcycle time of 133 nanoseconds. This speed was achieved by: prefetching the next instruction while the current one is executing, transporting data using high speed data busses, and using state of the art components such as a very large scale integration (VLSI) multiplier. Included are discussions of processing element requirements, design philosophy, the architecture of the custom designed processing element, the comprehensive instruction set, the diagnostic support software, and the development status of the custom design

    Using FPGAs to prototype a self-timed floating point co-processor

    Get PDF
    Journal ArticleSelf- timed circuits offer advantages over their synchronously clocked counterparts in a number of situations. However, self-timed design techniques are not widely used at present for a variety of reasons. One reason for the lack of experimentation with self-timed systems is the lack of commercially available parts to support this style of design. Field programmable gate arrays (FPGAs) offer an excellent alternative for the rapid development of novel system designs provided suitable circuit structures can be implemented. This paper describes a self-timed floating point co-processor built using a combination of Actel Field Programmable Gate Arrays (FPGAs) and semi-custom CMOS chips. This co-processor implements IEEE standard single precision floating point operations on 32-bit values. The control is completely self-timed. Data moves between parts of the circuit according to local constraints only: there is no global clock or global control circuit

    High sample-rate Givens rotations for recursive least squares

    Get PDF
    The design of an application-specific integrated circuit of a parallel array processor is considered for recursive least squares by QR decomposition using Givens rotations, applicable in adaptive filtering and beamforming applications. Emphasis is on high sample-rate operation, which, for this recursive algorithm, means that the time to perform arithmetic operations is critical. The algorithm, architecture and arithmetic are considered in a single integrated design procedure to achieve optimum results. A realisation approach using standard arithmetic operators, add, multiply and divide is adopted. The design of high-throughput operators with low delay is addressed for fixed- and floating-point number formats, and the application of redundant arithmetic considered. New redundant multiplier architectures are presented enabling reductions in area of up to 25%, whilst maintaining low delay. A technique is presented enabling the use of a conventional tree multiplier in recursive applications, allowing savings in area and delay. Two new divider architectures are presented showing benefits compared with the radix-2 modified SRT algorithm. Givens rotation algorithms are examined to determine their suitability for VLSI implementation. A novel algorithm, based on the Squared Givens Rotation (SGR) algorithm, is developed enabling the sample-rate to be increased by a factor of approximately 6 and offering area reductions up to a factor of 2 over previous approaches. An estimated sample-rate of 136 MHz could be achieved using a standard cell approach and O.35pm CMOS technology. The enhanced SGR algorithm has been compared with a CORDIC approach and shown to benefit by a factor of 3 in area and over 11 in sample-rate. When compared with a recent implementation on a parallel array of general purpose (GP) DSP chips, it is estimated that a single application specific chip could offer up to 1,500 times the computation obtained from a single OP DSP chip

    Power and Thermal Management of System-on-Chip

    Get PDF

    THE DESIGN OF AN IC HALF PRECISION FLOATING POINT ARITHMETIC LOGIC UNIT

    Get PDF
    A 16 bit floating point (FP) Arithmetic Logic Unit (ALU) was designed and implemented in 0.35µm CMOS technology. Typical uses of the 16 bit FP ALU include graphics processors and embedded multimedia applications. The ALU of the modern microprocessors use a fused multiply add (FMA) design technique. An advantage of the FMA is to remove the need for a comparator which is required for a normal FP adder. The FMA consists of a multiplier, shifters, adders and rounding circuit. A fast multiplier based on the Wallace tree configuration was designed. The number of partial products was greatly reduced by the use of the modified booth encoder. The Wallace tree was chosen to reduce the number of reduction layers of partial products. The multiplier also involved the design of a pass transistor based 4:2 compressor. The average delay of the pass transistor based compressor was 55ps and was found to be 7 times faster than the full adder based 4:2 compressor. The shifters consist of separate left and right shifters using multiplexers. The shift amount is calculated using the exponents of the three operands. The addition operation is implemented using a carry skip adder (CSK). The average delay of the CSK was 1.05ns and was slower than the carry look ahead adder by about 400ps. The advantages of the CSK are reduced power, gate count and area when compared to the similar sized carry look ahead adder. The adder computes the addition of the multiplier result and the shifted value of the addend. In most modern computers, division is performed using software thereby eliminating the need for a separate hardware unit. FMA hardware unit was utilized to perform FP division. The FP divider uses the Newton Raphson algorithm to solve division by iteration. The initial approximated value with five bit accuracy was assumed to be pre-stored in cache memory and a separate clock cycle for cache read was assumed before the start of the FP division operation. In order to significantly reduce the area of the design, only one multiplier was used. Rounding to nearest technique was implemented using an 11 bit variable CSK adder. This is the best rounding technique when compared to other rounding techniques. In both the FMA and division, rounding was performed after the computation of the final result during the last clock cycle of operation. Testability analysis is performed for the multiplier which is the most complex and critical part of the FP ALU. The specific aim of testability was to ensure the correct operation of the multiplier and thus guarantee the correctness of the FMA circuit at the layout stage. The multiplier\u27s output was tested by identifying the minimal number of input vectors which toggle the inputs of the 4:2 compressors of the multiplier. The test vectors were identified in a semi automated manner using Perl scripting language. The multiplier was tested with a test set of thirty one vectors. The fault coverage of the multiplier was found to be 90.09%. The layout was implemented using IC station of Mentor Graphics CAD tool and resulted in a chip area of 1.96mm2. The specifications for basic arithmetic operations were met successfully. FP Division operation was completed within six clock cycles. The other arithmetic operations like FMA, FP addition, FP subtraction and FP multiplication were completed within three clock cycles

    VLSI Architecture and Design

    Get PDF
    Integrated circuit technology is rapidly approaching a state where feature sizes of one micron or less are tractable. Chip sizes are increasing slowly. These two developments result in considerably increased complexity in chip design. The physical characteristics of integrated circuit technology are also changing. The cost of communication will be dominating making new architectures and algorithms both feasible and desirable. A large number of processors on a single chip will be possible. The cost of communication will make designs enforcing locality superior to other types of designs. Scaling down feature sizes results in increase of the delay that wires introduce. The delay even of metal wires will become significant. Time tends to be a local property which will make the design of globally synchronous systems more difficult. Self-timed systems will eventually become a necessity. With the chip complexity measured in terms of logic devices increasing by more than an order of magnitude over the next few years the importance of efficient design methodologies and tools become crucial. Hierarchical and structured design are ways of dealing with the complexity of chip design. Structered design focuses on the information flow and enforces a high degree of regularity. Both hierarchical and structured design encourage the use of cell libraries. The geometry of the cells in such libraries should be parameterized so that for instance cells can adjust there size to neighboring cells and make the proper interconnection. Cells with this quality can be used as a basis for "Silicon Compilers"
    corecore