730 research outputs found

    High-level synthesis optimization for blocked floating-point matrix multiplication

    Get PDF
    In the last decade floating-point matrix multiplication on FPGAs has been studied extensively and efficient architectures as well as detailed performance models have been developed. By design these IP cores take a fixed footprint which not necessarily optimizes the use of all available resources. Moreover, the low-level architectures are not easily amenable to a parameterized synthesis. In this paper high-level synthesis is used to fine-tune the configuration parameters in order to achieve the highest performance with maximal resource utilization. An\ exploration strategy is presented to optimize the use of critical resources (DSPs, memory) for any given FPGA. To account for the limited memory size on the FPGA, a block-oriented matrix multiplication is organized such that the block summation is done on the CPU while the block multiplication occurs on the logic fabric simultaneously. The communication overhead between the CPU and the FPGA is minimized by streaming the blocks in a Gray code ordering scheme which maximizes the data reuse for consecutive block matrix product calculations. Using high-level synthesis optimization, the programmable logic operates at 93% of the theoretical peak performance and the combined CPU-FPGA design achieves 76% of the available hardware processing speed for the floating-point multiplication of 2K by 2K matrices

    Design of testbed and emulation tools

    Get PDF
    The research summarized was concerned with the design of testbed and emulation tools suitable to assist in projecting, with reasonable accuracy, the expected performance of highly concurrent computing systems on large, complete applications. Such testbed and emulation tools are intended for the eventual use of those exploring new concurrent system architectures and organizations, either as users or as designers of such systems. While a range of alternatives was considered, a software based set of hierarchical tools was chosen to provide maximum flexibility, to ease in moving to new computers as technology improves and to take advantage of the inherent reliability and availability of commercially available computing systems

    SYNAPSE-1: A High-Speed General Purpose Parallel Neurocomputer System

    Full text link
    This paper describes the general purpose neurocomputer SYNAPSE-1 which has been developed in cooperation between Siemens Munich and the University of Mannheim. This system contains one of the most powerful processors available for neural algorithms, the neuro signal processor MA16. The prototype system executes a test algorithm 8,000 times as fast as a Sparc-2 workstation. This processing speed has been achieved by using a system architecture which is optimally adapted to the general structure of neural algorithms. It is a systolic array of MA16 processors embedded in a multiprocessor system of general purpose microprocessors

    Shared versus distributed memory multiprocessors

    Get PDF
    The question of whether multiprocessors should have shared or distributed memory has attracted a great deal of attention. Some researchers argue strongly for building distributed memory machines, while others argue just as strongly for programming shared memory multiprocessors. A great deal of research is underway on both types of parallel systems. Special emphasis is placed on systems with a very large number of processors for computation intensive tasks and considers research and implementation trends. It appears that the two types of systems will likely converge to a common form for large scale multiprocessors

    Empowering parallel computing with field programmable gate arrays

    Get PDF
    After more than 30 years, reconfigurable computing has grown from a concept to a mature field of science and technology. The cornerstone of this evolution is the field programmable gate array, a building block enabling the configuration of a custom hardware architecture. The departure from static von Neumannlike architectures opens the way to eliminate the instruction overhead and to optimize the execution speed and power consumption. FPGAs now live in a growing ecosystem of development tools, enabling software programmers to map algorithms directly onto hardware. Applications abound in many directions, including data centers, IoT, AI, image processing and space exploration. The increasing success of FPGAs is largely due to an improved toolchain with solid high-level synthesis support as well as a better integration with processor and memory systems. On the other hand, long compile times and complex design exploration remain areas for improvement. In this paper we address the evolution of FPGAs towards advanced multi-functional accelerators, discuss different programming models and their HLS language implementations, as well as high-performance tuning of FPGAs integrated into a heterogeneous platform. We pinpoint fallacies and pitfalls, and identify opportunities for language enhancements and architectural refinements

    Readout system test benches

    Get PDF
    We propose to develop and exploit versatile multi-purpose Personal Computer-based Test Benches to support the evaluation and design of the basic elements required for digital front-end readout and data transmission systems for an LHC experiment. These test benches will have modular hardware facilities for the operation of new readout system components under realistic conditions, and will implement advanced modern software engineering concepts. They will support components such as fast ADCs, hybrid fibre-optic transceivers, and the prototype VLSI systolic array and data-flow processors currently being developed in national research laboratories and by the emerging European HDTV industry. These efforts would also lay the foundations for projects involving the development of custom-designed VLSI circuits

    The Design of a Processing Element for the Systolic Array Implementation of a Kalman Filter

    Get PDF
    The Kalman filter is an important component of optimal estimation theory. It has applications in a wide range of high performance control systems including navigational, fire control, and targeting systems. The Kalman filter, however, has not been utilized to its full potential due to the limitations of its inherent computational intensiveness which requires off-line processing or allows only low bandwidth real-time applications. The recent advances in VLSI circuit technology have created the opportunity to design algorithms and data structures for direct implementation in integrated circuits. A systolic architecture is a concept which allows the construction of massively parallel systems in integrated circuits and has been utilized as a means of achieving high data rates. A systolic system consists of a set of interconnected processing elements, each capable of performing some simple operation. The design of a processing element in an orthogonal systolic architecture will be investigated using the state of the art in VLSI technology. The goal is to create a high speed, high precision processing element which is adaptive to a highly configurable systolic architecture. In order to achieve the necessary high computational throughput, the arithmetic unit of the processing element will be implemented using the Logarithmic Number System. The Systolic architecture approach will be used in an attempt to implement a Kalman filtering system with both a high sampling rate and a small package size. The design of such a Kalman filter would enable this filtering technology to be applied to the areas of process control, computer vision, and robotics
    corecore