58 research outputs found

    Implementation of the Trigonometric LMS Algorithm using Original Cordic Rotation

    Full text link
    The LMS algorithm is one of the most successful adaptive filtering algorithms. It uses the instantaneous value of the square of the error signal as an estimate of the mean-square error (MSE). The LMS algorithm changes (adapts) the filter tap weights so that the error signal is minimized in the mean square sense. In Trigonometric LMS (TLMS) and Hyperbolic LMS (HLMS), two new versions of LMS algorithms, same formulations are performed as in the LMS algorithm with the exception that filter tap weights are now expressed using trigonometric and hyperbolic formulations, in cases for TLMS and HLMS respectively. Hence appears the CORDIC algorithm as it can efficiently perform trigonometric, hyperbolic, linear and logarithmic functions. While hardware-efficient algorithms often exist, the dominance of the software systems has kept those algorithms out of the spotlight. Among these hardware- efficient algorithms, CORDIC is an iterative solution for trigonometric and other transcendental functions. Former researches worked on CORDIC algorithm to observe the convergence behavior of Trigonometric LMS (TLMS) algorithm and obtained a satisfactory result in the context of convergence performance of TLMS algorithm. But revious researches directly used the CORDIC block output in their simulation ignoring the internal step-by-step rotations of the CORDIC processor. This gives rise to a need for verification of the convergence performance of the TLMS algorithm to investigate if it actually performs satisfactorily if implemented with step-by-step CORDIC rotation. This research work has done this job. It focuses on the internal operations of the CORDIC hardware, implements the Trigonometric LMS (TLMS) and Hyperbolic LMS (HLMS) algorithms using actual CORDIC rotations. The obtained simulation results are highly satisfactory and also it shows that convergence behavior of HLMS is much better than TLMS.Comment: 12 pages, 5 figures, 1 table. Published in IJCNC; http://airccse.org/journal/cnc/0710ijcnc08.pdf, http://airccse.org/journal/ijc2010.htm

    Protein Alignment Systolic Array Throughput Optimization

    Get PDF
    Protein comparison is gaining importance year after year since it has been demonstrated that biologists can find cor- relation between different species, or genetic mutations that can lead to cancer and genetic diseases. Protein sequence alignment is the most computational intensive task when performing protein comparison. In order to speed-up alignment, dedicated processors that can perform different computations in parallel have been designed. Among them, the best performance have been achieved using Systolic Arrays. However, when the Processing Elements of the Systolic Array have an internal loop, performance could be highly reduced. In this work we present an architectural strategy to address this problem applying pipeline interleaving; this strategy is applied to a Systolic Array for Smith Waterman algorithm that we designed. Results encourage the adoption of pipeline interleaving for parallel circuits with loop based Processing Elements. We demonstrate that important benefits in terms of higher operating frequency can be derived without so relevant costs as increased complexity, area and power required

    Interleaving in Systolic-Arrays: a Throughput Breakthrough

    Get PDF
    In past years the most common way to improve computers performance was to increase the clock frequency. In recent years this approach suffered the limits of technology scaling, therefore computers architectures are shifting toward the direction of parallel computing to further improve circuits performance. Not only GPU based architectures are spreading in consideration, but also Systolic Arrays are particularly suited for certain classes of algorithms. An important point in favor of Systolic Arrays is that, due to the regularity of their circuit layout, they are appealing when applied to many emerging and very promising technologies, like Quantum-dot Cellular Automata and nanoarrays based on Silicon NanoWire or on Carbon nanotube Field Effect Transistors. In this work we present a systematic method to improve Systolic Arrays performance exploiting Pipelining and Input Data Interleaving. We tackle the problem from a theoretical point of view first, and then we apply it to both CMOS technology and emerging technologies. On CMOS we demonstrate that it is possible to vastly improve the overall throughput of the circuit. By applying this technique to emerging technologies we show that it is possible to overcome some of their limitations greatly improving the throughput, making a considerable step forward toward the post-CMOS era

    Architectural Solutions for NanoMagnet Logic

    Get PDF
    The successful era of CMOS technology is coming to an end. The limit on minimum fabrication dimensions of transistors and the increasing leakage power hinder the technological scaling that has characterized the last decades. In several different ways, this problem has been addressed changing the architectures implemented in CMOS, adopting parallel processors and thus increasing the throughput at the same operating frequency. However, architectural alternatives cannot be the definitive answer to a continuous increase in performance dictated by Moore’s law. This problem must be addressed from a technological point of view. Several alternative technologies that could substitute CMOS in next years are currently under study. Among them, magnetic technologies such as NanoMagnet Logic (NML) are interesting because they do not dissipate any leakage power. More- over, magnets have memory capability, so it is possible to merge logic and memory in the same device. However, magnetic circuits, and NML in this specific research, have also some important drawbacks that need to be addressed: first, the circuit clock frequency is limited to 100 MHz, to avoid errors in data propagation; second, there is a connection between circuit layout and timing, and in particular, longer wires will have longer latency. These drawbacks are intrinsic to the technology and for this reason they cannot be avoided. The only chance is to limit their impact from an architectural point of view. The first step followed in the research path of this thesis is indeed the choice and optimization of architectures able to deal with the problems of NML. Systolic Ar- rays are identified as an ideal solution for this technology, because they are regular structures with local interconnections that limit the long latency of wires; more- over they are composed of several Processing Elements that work in parallel, thus exploit parallelization to increase throughput (limiting the impact of the low clock frequency). Through the analysis of Systolic Arrays for NML, several possible im- provements have been identified and addressed: 1) it has been defined a rigorous way to increase throughput with interleaving, providing equations that allow to esti- mate the number of operations to be interleaved and the rules to provide inputs; 2) a latency insensitive circuit has been designed, that exploits a data communication protocol between processing elements to avoid data synchronization problems. This feature has been exploited to design a latency insensitive Systolic Array that is able to execute the Floyd-Steinberg dithering algorithm. All the improvements presented in this framework apply to Systolic Arrays implemented in any technology. So, they can also be exploited to increase performance of today’s CMOS parallel circuits. This research path is presented in Chapter 3. While Systolic Arrays are an interesting solution for NML, their usage could be quite limited because they are normally application-specific. The second re- search path addresses this problem. A Reconfigurable Systolic Array is presented, that can be programmed to execute several algorithms. This architecture has been tested implementing many algorithms, including FIR and IIR filters, Discrete Cosine Transform and Matrix Multiplication. This research path is presented in Chapter 4. In common Von Neumann architectures, the logic part of the circuit and the memory one are separated. Today bus communication between logic and memory represents the bottleneck of the system. This problem is addressed presenting Logic- In-Memory (LIM), an architecture where memory elements are merged in logic ones. This research path aims at defining a real LIM architectures. This has been done in two steps. The first step is represented by an architecture composed of three layers: memory, routing and logic. In the second step instead the routing plane is no more present, and its features are inherited by the memory plane. In this solution, a pyramidal memory model is used, where memories near logic elements contain the most probably used data, and other memory layers contain the remaining data and instruction set. This circuit has been tested with odd-even sort algorithms and it has been benchmarked against GPUs and ASIC. This research path is presented in Chapter 5. MagnetoElastic NML (ME-NML) is a technological improvement of the NML principle, proposed by researchers of Politecnico di Torino, where the clock system is based on the induced stretch of a piezoelectric substrate when a voltage is ap- plied to its boundaries. The main advantage of this solution is that it consumes much less power than the classic clock implementation. This technology has not yet been investigated from an architectural point of view and considering complex circuits. In this research field, a standard methodology for the design of ME-NML circuits has been proposed. It is based on a Standard Cell Library and an enhanced VHDL model. The effectiveness of this methodology has been proved designing a Galois Field Multiplier. Moreover the serial-parallel trade-off in ME-NML has been investigated, designing three different solutions for the Multiply and Accumulate structure. This research path is presented in Chapter 6. While ME-NML is an extremely interesting technology, it needs to be combined with other faster technologies to have a real competitive system. Signal interfaces between NML and other technologies (mainly CMOS) have been rarely presented in literature. A mixed-technology multiplexer is designed and presented as the basis for a CMOS to NML interface. The reverse interface (from ME-NML to CMOS) is instead based on a sensing circuit for the Faraday effect: a change in the polarization of a magnet induces an electric field that can be used to generate an input signal for a CMOS circuit. This research path is presented in Chapter 7. The research work presented in this thesis represents a fundamental milestone in the path towards nanotechnologies. The most important achievement is the de- sign and simulation of complex circuits with NML, benchmarking this technology with real application examples. The characterization of a technology considering complex functions is a major step to be performed and that has not yet been ad- dressed in literature for NML. Indeed, only in this way it is possible to intercept in advance any weakness of NanoMagnet Logic that cannot be discovered consid- ering only small circuits. Moreover, the architectural improvements introduced in this thesis, although technology-driven, can be actually applied to any technology. We have demonstrated the advantages that can derive applying them to CMOS cir- cuits. This thesis represents therefore a major step in two directions: the first is the enhancement of NML technology; the second is a general improvement of parallel architectures and the development of the new Logic-In-Memory paradigm

    Serial-data computation in VLSI

    Get PDF

    Algorithm Libraries for Multi-Core Processors

    Get PDF
    By providing parallelized versions of established algorithm libraries, we ease the exploitation of the multiple cores on modern processors for the programmer. The Multi-Core STL provides basic algorithms for internal memory, while the parallelized STXXL enables multi-core acceleration for algorithms on large data sets stored on disk. Some parallelized geometric algorithms are introduced into CGAL. Further, we design and implement sorting algorithms for huge data in distributed external memory

    Accelerating Halide on an FPGA by using CIRCT and Calyx as an intermediate step to go from a high-level and software-centric IRs down to RTL

    Get PDF
    Image processing and, more generally, array processing play an essential role in modern life: from applying filters to the images that we upload to social media to running object detection algorithms on self-driving cars. Optimizing these algorithms can be complex and often results in non-portable code. The Halide language provides a simple way to write image and array processing algorithms by separating the algorithm definition (what needs to be executed) from its execution schedule (how it is executed), delivering state-of-the-art performance that exceeds hand-tuned parallel and vectorized code. Due to the inherent parallel nature of these algorithms, FPGAs present an attractive acceleration platform. While previous work has added an RTL code generator to Halide, and utilized other heterogeneous computing languages as an intermediate step, these projects are no longer maintained. MLIR is an attractive solution, allowing the generation of code that can target multiple devices, such as parallelized and vectorized CPU code, OpenMP, and CUDA. CIRCT builds on top of MLIR to convert generic MLIR code to register transfer level (RTL) languages by using Calyx, a new intermediate language (IL) for compiling high-level programs into hardware designs. This thesis presents a novel flow that implements an MLIR code generator for Halide that generates RTL code, adding the necessary wrappers to execute that code on Xilinx FPGA devices. Additionally, it implements a Halide runtime using the Xilinx Runtime (XRT), enabling seamless execution of the generated Halide RTL kernels. While this thesis provides initial support for running Halide kernels and not all features and optimizations are supported, it also details the future work needed to improve the performance of the generated RTL kernels. The proposed flow serves as a foundation for further research and development in the field of hardware acceleration for image and array processing applications using Halide

    Custom optimization algorithms for efficient hardware implementation

    No full text
    The focus is on real-time optimal decision making with application in advanced control systems. These computationally intensive schemes, which involve the repeated solution of (convex) optimization problems within a sampling interval, require more efficient computational methods than currently available for extending their application to highly dynamical systems and setups with resource-constrained embedded computing platforms. A range of techniques are proposed to exploit synergies between digital hardware, numerical analysis and algorithm design. These techniques build on top of parameterisable hardware code generation tools that generate VHDL code describing custom computing architectures for interior-point methods and a range of first-order constrained optimization methods. Since memory limitations are often important in embedded implementations we develop a custom storage scheme for KKT matrices arising in interior-point methods for control, which reduces memory requirements significantly and prevents I/O bandwidth limitations from affecting the performance in our implementations. To take advantage of the trend towards parallel computing architectures and to exploit the special characteristics of our custom architectures we propose several high-level parallel optimal control schemes that can reduce computation time. A novel optimization formulation was devised for reducing the computational effort in solving certain problems independent of the computing platform used. In order to be able to solve optimization problems in fixed-point arithmetic, which is significantly more resource-efficient than floating-point, tailored linear algebra algorithms were developed for solving the linear systems that form the computational bottleneck in many optimization methods. These methods come with guarantees for reliable operation. We also provide finite-precision error analysis for fixed-point implementations of first-order methods that can be used to minimize the use of resources while meeting accuracy specifications. The suggested techniques are demonstrated on several practical examples, including a hardware-in-the-loop setup for optimization-based control of a large airliner.Open Acces

    Toatie : functional hardware description with dependent types

    Get PDF
    Describing correct circuits remains a tall order, despite four decades of evolution in Hardware Description Languages (HDLs). Many enticing circuit architectures require recursive structures or complex compile-time computation — two patterns that prove difficult to capture in traditional HDLs. In a signal processing context, the Fast FIR Algorithm (FFA) structure for efficient parallel filtering proves to be naturally recursive, and most Multiple Constant Multiplication (MCM) blocks decompose multiplications into graphs of simple shifts and adds using demanding compile time computation. Generalised versions of both remain mostly in academic folklore. The implementations which do exist are often ad hoc circuit generators, written in software languages. These pose challenges for verification and are resistant to composition. Embedded functional HDLs, that represent circuits as data, allow for these descriptions at the cost of forcing the designer to work at the gate-level. A promising alternative is to use a stand-alone compiler, representing circuits as plain functions, exemplified by the CλaSH HDL. This, however, raises new challenges in capturing a circuit’s staging — which expressions in the single language should be reduced during compile-time elaboration, and which should remain in the circuit’s run-time? To better reflect the physical separation between circuit phases, this work proposes a new functional HDL (representing circuits as functions) with first-class staging constructs. Orthogonal to this, there are also long-standing challenges in the verification of parameterised circuit families. Industry surveys have consistently reported that only a slim minority of FPGA projects reach production without non-trivial bugs. While a healthy growth in the adoption of automatic formal methods is also reported, the majority of testing remains dynamic — presenting difficulties for testing entire circuit families at once. This research offers an alternative verification methodology via the combination of dependent types and automatic synthesis of user-defined data types. Given precise enough types for synthesisable data, this environment can be used to develop circuit families with full functional verification in a correct-by-construction fashion. This approach allows for verification of entire circuit families (not just one concrete member) and side-steps the state-space explosion of model checking methods. Beyond the existing work, this research offers synthesis of combinatorial circuits — not just a software model of their behaviour. This additional step requires careful consideration of staging, erasure & irrelevance, deriving bit representations of user-defined data types, and a new synthesis scheme. This thesis contributes steps towards HDLs with sufficient expressivity for awkward, combinatorial signal processing structures, allowing for a correct-by-construction approach, and a prototype compiler for netlist synthesis.Describing correct circuits remains a tall order, despite four decades of evolution in Hardware Description Languages (HDLs). Many enticing circuit architectures require recursive structures or complex compile-time computation — two patterns that prove difficult to capture in traditional HDLs. In a signal processing context, the Fast FIR Algorithm (FFA) structure for efficient parallel filtering proves to be naturally recursive, and most Multiple Constant Multiplication (MCM) blocks decompose multiplications into graphs of simple shifts and adds using demanding compile time computation. Generalised versions of both remain mostly in academic folklore. The implementations which do exist are often ad hoc circuit generators, written in software languages. These pose challenges for verification and are resistant to composition. Embedded functional HDLs, that represent circuits as data, allow for these descriptions at the cost of forcing the designer to work at the gate-level. A promising alternative is to use a stand-alone compiler, representing circuits as plain functions, exemplified by the CλaSH HDL. This, however, raises new challenges in capturing a circuit’s staging — which expressions in the single language should be reduced during compile-time elaboration, and which should remain in the circuit’s run-time? To better reflect the physical separation between circuit phases, this work proposes a new functional HDL (representing circuits as functions) with first-class staging constructs. Orthogonal to this, there are also long-standing challenges in the verification of parameterised circuit families. Industry surveys have consistently reported that only a slim minority of FPGA projects reach production without non-trivial bugs. While a healthy growth in the adoption of automatic formal methods is also reported, the majority of testing remains dynamic — presenting difficulties for testing entire circuit families at once. This research offers an alternative verification methodology via the combination of dependent types and automatic synthesis of user-defined data types. Given precise enough types for synthesisable data, this environment can be used to develop circuit families with full functional verification in a correct-by-construction fashion. This approach allows for verification of entire circuit families (not just one concrete member) and side-steps the state-space explosion of model checking methods. Beyond the existing work, this research offers synthesis of combinatorial circuits — not just a software model of their behaviour. This additional step requires careful consideration of staging, erasure & irrelevance, deriving bit representations of user-defined data types, and a new synthesis scheme. This thesis contributes steps towards HDLs with sufficient expressivity for awkward, combinatorial signal processing structures, allowing for a correct-by-construction approach, and a prototype compiler for netlist synthesis

    Finite worldlength effects in fixed-point implementations of linear systems

    Get PDF
    Thesis (M.Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1998.Includes bibliographical references (p. 173-194).by Vinay Mohta.M.Eng
    • …
    corecore