
    Pipelined Implementation of a Fixed-Point Square Root Core Using Non-Restoring and Restoring Algorithm

    Arithmetic square root is one of the most complex yet widely used operations in modern computing. The complexity stems from the irrational square roots of non-perfect squares and from the iterative nature of the computation. A typical RISC software implementation of square root computation can take anywhere from 200 to 300 cycles; when the operation is used heavily, this run-time cost justifies a direct hardware implementation that achieves the same result in as few as 20 clock cycles. In addition, the implementation is pipelined to achieve even greater throughput than an instruction-based implementation. The paper therefore presents an efficient, pipelined square root core that implements a non-restoring algorithm. The iteration count of the algorithm depends on the maximum size of the input and the desired resolution. A specific case of a 16-bit integer square root calculator with an output resolution of 0.001 is considered, which requires a total of 18 iterations of the algorithm. In the implementation, each iteration is pipelined as a stage, resulting in an 18-stage pipelined square root computation core. The algorithm uses only standard arithmetic operations (addition, subtraction, shifts) and basic control statements to determine the output of each stage. The core is verified using a SystemVerilog test-bench. The test-bench generates unconstrained random input stimulus and determines the expected value for the device under test (DUT) by evaluating a Simulink-generated model with the same stimulus. Functional coverage, implemented in the test-bench, determines the reliability of the system and consequently the duration of the test execution.
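    For illustration, here is a minimal software sketch of the digit-by-digit square root recurrence the paper builds on; it is not the paper's RTL, and the restoring-style conditional subtract is used for clarity (the non-restoring form defers the restore step). The radicand is pre-scaled so that the 18 result bits split into 8 integer and 10 fractional bits, matching the 0.001 (about 2^-10) resolution and the 18 iterations quoted above; one loop iteration corresponds to one pipeline stage in the core.

```c
#include <stdint.h>

/* Digit-by-digit square root of a 16-bit integer, returning a Q8.10 result.
 * Pre-scaling the radicand by 2^20 gives the root 10 fractional bits. */
uint32_t sqrt_q8_10(uint16_t x)
{
    uint64_t rad  = (uint64_t)x << 20;  /* radicand with 20 fractional bits */
    uint64_t rem  = 0;                  /* partial remainder */
    uint32_t root = 0;                  /* partial root */
    for (int i = 17; i >= 0; i--) {     /* 18 iterations = 18 pipeline stages */
        rem   = (rem << 2) | ((rad >> (2 * i)) & 3);  /* bring down next bit pair */
        root <<= 1;
        uint64_t trial = ((uint64_t)root << 1) | 1;   /* 2*root + 1 */
        if (rem >= trial) {                           /* conditional subtract */
            rem -= trial;
            root |= 1;
        }
    }
    return root;                        /* sqrt(x) in Q8.10 fixed point */
}
```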

    An improved FPGA implementation of direct torque control for induction machines

    This paper presents a novel direct torque control (DTC) approach for induction machines, based on an improved torque and stator flux estimator and its implementation on a Field Programmable Gate Array (FPGA). DTC performance is significantly improved by the use of the FPGA, which can execute the DTC algorithm at a higher sampling frequency; this reduces the torque ripple and improves the flux and torque estimates. The main achievements are: i) computing the discrete integration of the stator flux using a backward Euler approach, ii) modifying the so-called non-restoring method for computing the square root operation in the stator flux estimator, iii) introducing a new flux sector determination method, iv) increasing the sampling frequency to 200 kHz so that the digital computation behaves similarly to analog operation, and v) using a two's complement fixed-point format to minimize calculation errors and hardware resource usage in all operations. The design was implemented in VHDL, based on a Matlab/Simulink simulation model. The Hardware-In-the-Loop (HIL) method is used to verify the functionality of the FPGA estimator, and the simulation results are validated experimentally. It is thus demonstrated that an FPGA implementation of DTC drives can achieve excellent performance at a high sampling frequency.
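    A hedged sketch of the estimator structure described above, with backward Euler integration of the stator flux followed by a magnitude computation; variable names are illustrative and floating point is used for readability, whereas the paper uses two's complement fixed-point arithmetic and a modified non-restoring square root in hardware.

```c
#include <math.h>

typedef struct { float psi_a, psi_b; } flux_t;   /* alpha/beta stator flux state */

void flux_update(flux_t *f, float v_a, float v_b, float i_a, float i_b,
                 float Rs, float Ts, float *flux_mag)
{
    /* Backward Euler integration: psi(k) = psi(k-1) + Ts * (v(k) - Rs * i(k)) */
    f->psi_a += Ts * (v_a - Rs * i_a);
    f->psi_b += Ts * (v_b - Rs * i_b);

    /* Flux magnitude; replaced by the modified non-restoring square root in the FPGA */
    *flux_mag = sqrtf(f->psi_a * f->psi_a + f->psi_b * f->psi_b);
}
```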

    Comparison of logarithmic and floating-point number systems implemented on Xilinx Virtex-II field-programmable gate arrays

    The aim of this thesis is to compare the implementation of parameterisable LNS (logarithmic number system) and floating-point high dynamic range number systems on FPGA. The Virtex/Virtex-II range of FPGAs from Xilinx, which are the most popular FPGA technology, are used to implement the designs. The study focuses on using the low-level primitives of the technology in an efficient way, and so initially the design issues in implementing fixed-point operators are considered. The four basic operations of addition, multiplication, division and square root are considered. Carry-free adders, ripple-carry adders, parallel multipliers and digit-recurrence division and square root are discussed. The floating-point operators use the word format and exceptions as described by the IEEE Std 754. A dual-path adder implementation is described in detail, as are floating-point multiplier, divider and square root components. Results and comparisons with other works are given. The efficient implementation of function evaluation methods is considered next. An overview of current FPGA methods is given and a new piecewise polynomial implementation using the Taylor series is presented and compared with other designs in the literature. In the next section the LNS word format, accuracy and exceptions are described and two new LNS addition/subtraction function approximations are described. The algorithms for performing multiplication, division and powering in the LNS domain are also described and are compared with other designs in the open literature. Parameterisable conversion algorithms to convert to/from the fixed-point domain from/to the LNS and floating-point domain are described and implementation results given. In the next chapter MATLAB bit-true software models are given that have exactly the same functionality as the hardware models. The interfaces of the models are given and a serial communication system to perform low-speed system tests is described. A comparison of the LNS and floating-point number systems in terms of area and delay is given. Different functions implemented in LNS and floating-point arithmetic are also compared and conclusions are drawn. The results show that when the LNS is implemented with a characteristic of 6 bits or fewer it is superior to floating-point. However, for larger characteristic lengths the floating-point system is more efficient due to the delay and exponential area increase of the LNS addition operator. For characteristics larger than 6 bits, the LNS is beneficial only for specialist applications that require a high proportion of division, multiplication, square root and powering operations and few additions.
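    As a brief illustration of why the comparison hinges on the addition operator: in the LNS a positive value x is stored as log2(x), so multiplication, division and square root collapse to adders and shifts, while addition requires evaluating the nonlinear function log2(1 + 2^d), which the thesis approximates with piecewise polynomials. The sketch below is a software model of these identities, not the thesis hardware.

```c
#include <math.h>

/* LNS operands are base-2 logarithms of positive values (sign handling omitted). */
static double lns_mul (double lx, double ly) { return lx + ly;  }   /* x*y     */
static double lns_div (double lx, double ly) { return lx - ly;  }   /* x/y     */
static double lns_sqrt(double lx)            { return lx * 0.5; }   /* sqrt(x) */

static double lns_add(double lx, double ly)  /* log2(x + y) */
{
    double d = ly - lx;
    /* log2(1 + 2^d) is the function evaluated by table/polynomial hardware. */
    return lx + log2(1.0 + pow(2.0, d));
}
```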

    Implementation of High-Resolution Square Root Computation on a Field Programmable Gate Array (FPGA)

    Square root computation is required in several control processes, such as Direct Torque Control (DTC) of motor drive systems, which demands a very fast calculation. A Field Programmable Gate Array (FPGA) is one of the devices suited to computation requiring high speed and high precision. The square root implementation on the FPGA uses the digit-by-digit non-restoring method with some modifications to obtain results with a small error. The system is implemented with a 32-bit input and a 16-bit output. The calculation process involves a Finite State Machine (FSM) to minimize the required resources. System verification is carried out in two stages: functional verification using ModelSim-Altera and hardware verification using a Cyclone IV EP4CE6E228N FPGA module. The verification shows that the square root results have a resolution of up to 0.0039. In addition, the system requires 157 Logic Elements and 120 registers, with a maximum clock speed of 205 MHz for the 32-bit input.
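    A hedged sketch of the sequential, FSM-style recurrence described above: one digit iteration per state, 32-bit radicand in, 16-bit root out. Interpreting the 32-bit input as a Q16.16 value makes the 16-bit result Q8.8, i.e. a resolution of 2^-8 (about 0.0039) as reported; this interpretation is an assumption, since the abstract does not state the fixed-point format explicitly.

```c
#include <stdint.h>

uint16_t fsm_sqrt32(uint32_t rad)
{
    uint32_t rem = 0, root = 0;
    for (int i = 15; i >= 0; i--) {              /* one iteration per FSM state */
        rem   = (rem << 2) | ((rad >> (2 * i)) & 3u);
        root <<= 1;
        if (rem >= 2u * root + 1u) {             /* conditional subtract */
            rem  -= 2u * root + 1u;
            root |= 1u;
        }
    }
    return (uint16_t)root;                       /* Q8.8 if the input is Q16.16 */
}
```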

    Developing an efficient IEEE754-compliant FPU in Verilog

    A floating-point unit (FPU), colloquially a math coprocessor, is a part of a computer system specially designed to carry out operations on floating-point numbers [1]. Typical operations handled by an FPU are addition, subtraction, multiplication and division. The aim was to build an efficient FPU that performs basic as well as transcendental functions with reduced complexity of the logic used, time bounds reduced or at least comparable to those of the x87 family at a similar clock speed, and memory requirements reduced as far as possible. The functions performed are handling of floating-point data, converting data to IEEE754 format, performing the arithmetic operations of addition, subtraction, multiplication, division and shifting, and the transcendental operations of square root, sine and cosine. All of the above algorithms have been clocked and evaluated in the Spartan 3E synthesis environment. All functions are built with the most efficient algorithms possible, with several changes incorporated at our end as far as the scope permitted. Consequently, all of the unit's functions are unique in certain aspects, and given the right environment (in terms of memory, clock speed or data width better than the Spartan 3E FPGA synthesis environment) these functions will show comparable efficiency and speed and, if pipelined, higher throughput.
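    As a small illustration of the "converting data to IEEE754 format" step, the sketch below packs a normalised single-precision word from its sign, unbiased exponent and 23-bit fraction; rounding, zeros, subnormals, infinities and NaNs are omitted, and the function name is illustrative rather than taken from the FPU described above.

```c
#include <stdint.h>

/* IEEE 754 single precision: 1 sign bit, 8 exponent bits (bias 127), 23 fraction bits. */
uint32_t pack_ieee754_single(uint32_t sign, int32_t exponent, uint32_t fraction)
{
    uint32_t biased = (uint32_t)(exponent + 127) & 0xFFu;   /* biased exponent field */
    return (sign << 31) | (biased << 23) | (fraction & 0x7FFFFFu);
}
```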

    Algorithms and architectures for decimal transcendental function computation

    Nowadays, there are many commercial demands for decimal floating-point (DFP) arithmetic operations, such as financial analysis, tax calculation, currency conversion, Internet-based applications, and e-commerce. This trend gives rise to further development of DFP arithmetic units which can perform accurate computations with exact decimal operands. Due to the significance of DFP arithmetic, the IEEE 754-2008 standard for floating-point arithmetic includes it in its specifications. The basic decimal arithmetic units, such as the decimal adder, subtracter, multiplier, divider and square-root unit, as main parts of a decimal microprocessor, are attracting more and more researchers' attention. Recently, decimal-encoded formats and DFP arithmetic units have been implemented in IBM's System z900, POWER6, and z10 microprocessors. Increasing chip densities and transistor counts provide more room for designers to add more essential functions for their application domains into upcoming microprocessors. Decimal transcendental functions, such as the DFP logarithm, antilogarithm, exponential, reciprocal and trigonometric functions, being useful arithmetic operations in many areas of science and engineering, have been specified as recommended arithmetic in the IEEE 754-2008 standard. Thus, virtually all computing systems that are compliant with IEEE 754-2008 could include a DFP mathematical library providing transcendental function computation. Building on the development of basic decimal arithmetic units, more complex DFP transcendental arithmetic will be the next building block in microprocessors. In this dissertation, we researched and developed several new decimal algorithms and architectures for DFP transcendental function computation. These designs are based on several different methods: 1) decimal transcendental function computation based on the table-based first-order polynomial approximation method; 2) DFP logarithmic and antilogarithmic converters based on the decimal digit-recurrence algorithm with selection by rounding; 3) a decimal reciprocal unit using an efficient table look-up based on Newton-Raphson iterations; and 4) a first radix-100 division unit based on the non-restoring algorithm with a pre-scaling method. Most of the decimal algorithms and architectures developed in this dissertation are the first attempts to analyze and implement DFP transcendental arithmetic in order to achieve faithful results for DFP operands, as specified in IEEE 754-2008. To help researchers evaluate the hardware performance of DFP transcendental arithmetic units, the proposed architectures based on the different methods are modeled, verified and synthesized using FPGAs or with CMOS standard cell libraries in ASIC. Some of the implementation results are compared with those of binary radix-16 logarithmic and exponential converters, a recently developed high-performance decimal CORDIC-based architecture, and Intel's DFP transcendental function computation software library. The comparison results show that the proposed architectures achieve significant speed-ups over the above designs in terms of latency. The algorithms and architectures developed in this dissertation provide a useful starting point for future hardware-oriented DFP transcendental function computation research.
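    For context on item 3), the Newton-Raphson reciprocal recurrence is x_{n+1} = x_n(2 - d*x_n), which converges quadratically to 1/d when the table-provided seed is close enough. The sketch below uses binary doubles purely for illustration; the dissertation performs the iteration in decimal hardware with an efficient look-up table for the seed.

```c
/* Newton-Raphson reciprocal: the error roughly squares on every iteration. */
double nr_reciprocal(double d, double seed, int iterations)
{
    double x = seed;                    /* initial approximation, e.g. from a table */
    for (int i = 0; i < iterations; i++)
        x = x * (2.0 - d * x);
    return x;                           /* approximates 1/d */
}
```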

    High sample-rate Givens rotations for recursive least squares

    The design of an application-specific integrated circuit for a parallel array processor is considered for recursive least squares by QR decomposition using Givens rotations, applicable in adaptive filtering and beamforming applications. The emphasis is on high sample-rate operation, which, for this recursive algorithm, means that the time to perform the arithmetic operations is critical. The algorithm, architecture and arithmetic are considered in a single integrated design procedure to achieve optimum results. A realisation approach using the standard arithmetic operators add, multiply and divide is adopted. The design of high-throughput operators with low delay is addressed for fixed- and floating-point number formats, and the application of redundant arithmetic is considered. New redundant multiplier architectures are presented, enabling reductions in area of up to 25% whilst maintaining low delay. A technique is presented enabling the use of a conventional tree multiplier in recursive applications, allowing savings in area and delay. Two new divider architectures are presented showing benefits compared with the radix-2 modified SRT algorithm. Givens rotation algorithms are examined to determine their suitability for VLSI implementation. A novel algorithm, based on the Squared Givens Rotation (SGR) algorithm, is developed, enabling the sample rate to be increased by a factor of approximately 6 and offering area reductions of up to a factor of 2 over previous approaches. An estimated sample rate of 136 MHz could be achieved using a standard cell approach and 0.35 µm CMOS technology. The enhanced SGR algorithm has been compared with a CORDIC approach and shown to benefit by a factor of 3 in area and over 11 in sample rate. When compared with a recent implementation on a parallel array of general purpose (GP) DSP chips, it is estimated that a single application-specific chip could offer up to 1,500 times the computation obtained from a single GP DSP chip.
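    For reference, the textbook Givens rotation used in QR-decomposition RLS computes c and s so that the rotation annihilates one element, as sketched below; the Squared Givens Rotation (SGR) variant developed in the thesis reformulates this recurrence to avoid the square root, so this sketch shows only the conventional form.

```c
#include <math.h>

/* Compute c, s, r such that [ c  s; -s  c ] * [a; b] = [r; 0]. */
void givens(double a, double b, double *c, double *s, double *r)
{
    if (b == 0.0) { *c = 1.0; *s = 0.0; *r = a; return; }
    *r = hypot(a, b);        /* sqrt(a*a + b*b), computed robustly */
    *c = a / *r;
    *s = b / *r;
}
```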

    Compiling dataflow graphs into hardware

    Conventional computers are programmed by supplying a sequence of instructions that perform the desired task. A reconfigurable processor is "programmed" by specifying the interconnections between hardware components, thereby creating a "hardwired" system to do the particular task. For some applications, such as image processing, reconfigurable processors can produce dramatic execution speedups. However, programming a reconfigurable processor is essentially a hardware design discipline, making programming difficult for application programmers who are only familiar with software design techniques. To bridge this gap, a programming language called SA-C (Single Assignment C, pronounced "sassy") has been designed for programming reconfigurable processors. The process involves two main steps: first, the SA-C compiler analyzes the input source code and produces a hardware-independent intermediate representation of the program, called a dataflow graph (DFG); second, this DFG is combined with hardware-specific information to create the final configuration. This dissertation describes the design and implementation of a system that performs the DFG-to-hardware translation. The DFG is broken up into three sections: the data generators, the inner loop body, and the data collectors. The second of these, the inner loop body, is used to create a computational structure that is unique for each program. The other two sections are implemented using prebuilt modules, parameterized for the particular problem. Finally, a "glue module" is created to connect the various pieces into a complete interconnection specification. The dissertation also explores optimizations that can be applied while processing the DFG to improve performance. A technique for pipelining the inner loop body is described that uses an estimation tool for the propagation delay of the nodes within the dataflow graph. A scheme is also described that identifies subgraphs within the dataflow graph that can be replaced with lookup tables; the lookup tables provide a faster implementation than random logic in some instances.
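    A minimal sketch of the lookup-table optimisation mentioned above: a pure subgraph with a narrow input is evaluated once for every possible input value and replaced by a table, trading logic for memory. The 8-bit pixel transform used here is a hypothetical example, not one taken from the dissertation.

```c
#include <stdint.h>

static uint8_t lut[256];                        /* one entry per possible input */

static uint8_t subgraph(uint8_t x)              /* the original combinational subgraph */
{
    return (uint8_t)((x * x) >> 8);             /* hypothetical per-pixel function */
}

void build_lut(void)
{
    for (int x = 0; x < 256; x++)
        lut[x] = subgraph((uint8_t)x);          /* precompute every output */
}

static inline uint8_t subgraph_lut(uint8_t x)   /* drop-in replacement for subgraph() */
{
    return lut[x];
}
```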

    Novel load identification techniques and a steady state self-tuning prototype for switching mode power supplies

    Control of Switched Mode Power Supplies (SMPS) has traditionally been achieved through analog means with dedicated integrated circuits (ICs). However, as power systems become increasingly complex, the classical concept of control has gradually evolved into the more general problem of power management, demanding functionalities that are hardly achievable in analog controllers. The high flexibility offered by digital controllers and their capability to implement sophisticated control strategies, together with the programmability of controller parameters, make digital control very attractive as an option for improving the features of dc-dc converters. On the other side, the major weak point of digital controllers is the achievable dynamic performance of the closed-loop system. Indeed, analog-to-digital conversion times, computational delays and sampling-related delays strongly limit the small-signal closed-loop bandwidth of a digitally controlled SMPS, and quantization effects set other severe constraints not known to analog solutions. For these reasons, intensive research activity is addressing the problem of making digital compensators stronger competitors to their analog counterparts in terms of achievable performance. A wide range of applications requires dc-dc converters with high efficiency over their whole load range. Integrated digital controllers for Switching Mode Power Supplies are gaining growing interest, since the feasibility of digital controller ICs specifically developed for high-frequency switching converters has been shown. One very interesting potential benefit is the use of auto-tuning of controller parameters (on-line controllers), so that the dynamic response can be set at the software level, independently of output capacitor filters, component variations and ageing. These kinds of algorithms are able to identify the output filter configuration (system identification) and then automatically compute the best compensator gains to adjust the system margins and bandwidth. To be an interesting solution, however, the self-tuning should satisfy two important requirements: it should not heavily affect converter operation under nominal conditions, and it should be based on a simple and robust algorithm whose complexity does not require a significant increase of the silicon area of the IC controller. The first issue is avoided by performing the system identification (SI) with the system in open-loop configuration, where perturbations can be induced in the system before start-up. Satisfying this requirement during steady-state operation is much more challenging, since perturbations on the output voltage are limited by the regular operation of the converter. The main advantage of steady-state SI methods is the detection of possible non-idealities occurring during converter operation; the system dynamics can then be adjusted by tuning the compensator parameters. The resource-saving issue requires the development of ad-hoc self-tuning techniques specifically tailored for integrated digitally controlled converters. Considering the flexibility of digital control, self-tuning algorithms can be studied and easily integrated at hardware level into closed-loop SMPS, reducing development time and R&D costs. The work of this dissertation finds its origin in this context. Smart power management is accomplished by tuning the controller parameters according to the identified converter configuration.
The main difficulty for self-tuning techniques is the identification of the converter output filter configuration. Two novel system identification techniques have been validated in this dissertation: the open-loop SI method is based on the system step response, while the steady-state SI method exploits dithering amplification effects. The open-loop method can be used as an auto-tuning approach during or before system start-up; a step-evolving reference voltage is used as the system perturbation, and the output filter information is obtained through a Power Spectral Density (PSD) computation of the system step response. The use of ΔΣ modulators in digital control feedback is increasing. During steady state, the finite resolution introduces quantization effects on the signal path, causing low-frequency contributions in the digital control word. Through the oversampling-dithering capabilities of ΔΣ modulators, resolution improvements are obtained. The presented steady-state identification technique demonstrates that, by amplifying the dithering effects on the signal path, the output filter information can be obtained on the digital side by processing the perturbed output voltage with the PSD computation. The amount of noise added to the output voltage does not affect converter operation; mathematical considerations have been addressed and then justified both with a Matlab/Simulink fixed-point model and with an FPGA-based closed-loop system. The output filter identification of both algorithms is performed in the frequency domain: when the respective perturbation occurs, the system response is observed on the digital side and processed with the PSD computation. The extracted parameters are the resonant frequency and the possible ESR (Effective Series Resistance) contribution, which can be detected as maxima in the PSD output. The SI methods have been validated for different buck converter configurations on a fixed-point closed-loop model; however, they can easily be applied to further converter configurations. The steady-state method has been successfully integrated into an FPGA-based prototype for digitally controlled buck converters, which integrates the PSD computer needed for load parameter identification. For this purpose, a novel VHDL-coded, fully scalable hybrid processor for Constant Geometry FFT (CG-FFT) computation has been designed and integrated into the PSD computation system. The processor is based on a variation of the conventional FFT algorithm, the Constant-Geometry FFT (CG-FFT). Hybrid CORDIC-LUT scalable architectures have been introduced as an alternative approach for computing the twiddle factors (phase factors) needed during FFT execution. The shared-core architecture uses a single phase rotator to satisfy all twiddle factor requests; it achieves improved logic saving by trading off computational speed. The pipelined architecture is composed of a number of stages equal to the number of processing elements and achieves the highest possible throughput, at the expense of more hardware usage.
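    A hedged sketch of the frequency-domain identification step: the PSD of the perturbed output voltage is computed and the dominant non-DC bin is taken as the output filter resonance (for a buck LC filter, roughly 1/(2*pi*sqrt(LC))). A direct DFT is used here for brevity, whereas the prototype uses the CG-FFT core; function and parameter names are illustrative.

```c
#include <math.h>
#include <stddef.h>

/* Return the frequency (Hz) of the strongest non-DC PSD bin of v_out[0..n-1],
 * sampled at fs Hz. */
double find_resonant_freq(const double *v_out, size_t n, double fs)
{
    double best_pow = 0.0;
    size_t best_k   = 1;
    for (size_t k = 1; k < n / 2; k++) {          /* skip the DC bin */
        double re = 0.0, im = 0.0;
        for (size_t i = 0; i < n; i++) {
            double ang = -2.0 * M_PI * (double)k * (double)i / (double)n;
            re += v_out[i] * cos(ang);
            im += v_out[i] * sin(ang);
        }
        double p = re * re + im * im;             /* PSD up to a scale factor */
        if (p > best_pow) { best_pow = p; best_k = k; }
    }
    return (double)best_k * fs / (double)n;       /* estimated resonant frequency */
}
```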