42 research outputs found

    A Convolve-And-MErge Approach for Exact Computations on High-Performance Reconfigurable Computers

    This work presents an approach for accelerating arbitrary-precision arithmetic on high-performance reconfigurable computers (HPRCs). Although faster and smaller, fixed-precision arithmetic has inherent rounding and overflow problems that can cause errors in scientific or engineering applications. This recurring phenomenon is usually referred to as numerical nonrobustness. There is therefore increasing interest in the paradigm of exact computation, based on arbitrary-precision arithmetic. A number of libraries and languages support this paradigm, for example the GNU Multiple Precision (GMP) library. However, the performance of such computations is significantly lower than that of fixed-precision arithmetic. To reduce this performance gap, this paper investigates the acceleration of arbitrary-precision arithmetic on HPRCs. A Convolve-And-MErge approach is proposed that implements virtual convolution schedules derived from the formal representation of the arbitrary-precision multiplication problem. Additionally, dynamic (nonlinear) pipeline techniques are exploited to achieve speedups ranging from 5x (addition) to 9x (multiplication), while keeping resource usage of the reconfigurable device low, ranging from 11% to 19%.
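The convolve-then-merge split the abstract describes can be sketched in plain Python. This is a naive schoolbook illustration, not the paper's FPGA schedule: the raw column sums of the operands' limbs form a convolution, and a separate "merge" pass propagates carries back into valid limbs.

```python
def convolve_and_merge(a, b, base=2**32):
    """Multiply two arbitrary-precision numbers given as little-endian
    limb lists. 'Convolve' forms the raw digit products per column;
    'merge' then propagates carries so every limb is < base."""
    # Convolve: column sums of all partial products, no carries yet.
    raw = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            raw[i + j] += ai * bj
    # Merge: normalize the columns into limbs by carry propagation.
    out, carry = [], 0
    for col in raw:
        carry += col
        out.append(carry % base)
        carry //= base
    while carry:
        out.append(carry % base)
        carry //= base
    return out

# With base 10 this is ordinary long multiplication:
# 123 * 456 = 56088, i.e. limbs [8, 8, 0, 6, 5] little-endian.
product = convolve_and_merge([3, 2, 1], [6, 5, 4], base=10)
```

The separation matters on hardware: the convolution stage is a regular, pipelinable multiply-accumulate pattern, while carry propagation is sequential and can be deferred to a single cheap pass.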

    Approaches for MATLAB Applications Acceleration Using High Performance Reconfigurable Computers

    A lot of raw computing power is needed in many scientific computing applications and simulations. MATLAB®† is one of the popular choices as a language for technical computing. Presented here are approaches for accelerating MATLAB-based applications using High Performance Reconfigurable Computing (HPRC) machines. Typically, these are clusters of Von Neumann architecture based systems with zero or more FPGA reconfigurable boards. As a case study, an image correlation algorithm has been ported to this architecture platform. As a second case study, the recursive training process of an Artificial Neural Network (ANN), used to realize an optimum network, has been accelerated by porting it to HPC systems. The approaches taken are analyzed with respect to target scenarios, the end user's perspective, programming efficiency, and performance. Disclaimer: Some material in this text has been used and reproduced with appropriate references and permissions where required. † MATLAB® is a registered trademark of The MathWorks, Inc. ©1994-2003

    High level compilation for gate reconfigurable architectures

    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2001. Includes bibliographical references (p. 205-215). A continuing exponential increase in the number of programmable elements is turning the management of gate-reconfigurable architectures as "glue logic" into an intractable problem; it is past time to raise this abstraction level. The physical hardware in gate-reconfigurable architectures is all low level: individual wires, bit-level functions, and single-bit registers. Hence one should look to the fetch-decode-execute machinery of traditional computers for higher-level abstractions. Ordinary computers have machine-level architectural mechanisms that interpret instructions generated by a high-level compiler. Efficiently moving up to the next abstraction level requires leveraging these mechanisms without introducing the overhead of machine-level interpretation. In this dissertation, I solve this fundamental problem by specializing architectural mechanisms with respect to input programs. This solution is the key to efficient compilation of high-level programs to gate-reconfigurable architectures. My approach to specialization includes several novel techniques. I develop, with others, extensive bitwidth analyses that apply to registers, pointers, and arrays. I use pointer analysis and memory disambiguation to target devices with blocks of embedded memory. My approach to memory parallelization generates a spatial hierarchy that enables easier-to-synthesize logic state machines with smaller circuits and no long wires. My space-time scheduling approach integrates the techniques of high-level synthesis with the static routing concepts developed for single-chip multiprocessors. Using DeepC, a prototype compiler demonstrating my thesis, I compile a new benchmark suite to Xilinx Virtex FPGAs. Resulting performance is comparable to a custom MIPS processor, with smaller area (40 percent on average), higher evaluation speeds (2.4x), and lower energy (18x) and energy-delay (45x). Specialization of advanced mechanisms results in additional speedup, scaling with hardware area, at the expense of power. For comparison, I also target IBM's standard-cell SA-27E process and the RAW microprocessor. Results include a sensitivity analysis of the different mechanisms specialized and a grand comparison between the alternate targets. by Jonathan William Babb. Ph.D.
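The bitwidth analysis mentioned in the abstract can be illustrated with a toy interval propagation. The helper names below are hypothetical, not DeepC's actual implementation: the point is that if the compiler can bound a value's range, it can size the synthesized datapath to exactly the bits needed.

```python
def bits_needed(lo, hi):
    """Bitwidth analysis in miniature: bits required to hold every
    value of an inferred unsigned range [lo, hi]."""
    assert 0 <= lo <= hi, "sketch handles unsigned ranges only"
    return max(1, hi.bit_length())

def add_range(a, b):
    """Propagate ranges through an addition: the result range is the
    interval sum, so the result's width is known before synthesis."""
    return (a[0] + b[0], a[1] + b[1])

# A counter known to stay in [0, 9] added to a flag in [0, 1]:
r = add_range((0, 9), (0, 1))   # result range (0, 10)
# bits_needed(*r) == 4, so a 4-bit adder suffices instead of 32 bits.
```

Per-operation width bounds like this are what let a high-level compiler emit narrow, bit-level hardware rather than defaulting every variable to a machine word.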

    Energy-Efficient, Flexible and Fast Architectures for Deep Convolutional Neural Network Acceleration

    Deep learning-based methods, and specifically Convolutional Neural Networks (CNNs), have revolutionized the field of computer vision. While until 2012 the most accurate traditional image processing methods could reach 26% error in recognizing images on the standardized and well-known ImageNet benchmark, a CNN-based method dramatically reduced the error to 16%. By evolving CNN structures, current CNN-based methods now routinely achieve error rates below 3%, often outperforming human-level accuracy. CNNs consist of many convolutional layers, each performing high-dimensional, complex convolution operations. To achieve high image recognition accuracy, modern CNNs stack many convolutional layers, which dramatically increases computation-pattern diversity across layers. This high level of complexity in CNNs implies massive numbers of parameters and computations. Since mobile processors are not designed to perform massive computations, deploying CNNs on portable and mobile devices is challenging.
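The core operation a convolutional layer repeats across many channels and filters can be shown with a minimal, dependency-free sketch (this is the cross-correlation form most CNN frameworks use, with "valid" padding; illustrative only):

```python
def conv2d(image, kernel):
    """Naive 2D convolution of a single-channel image with one filter.
    Real CNN layers repeat this over input channels, output filters,
    and batch elements, which is where the massive compute comes from."""
    H, W = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for y in range(H - kh + 1):          # slide the window vertically
        row = []
        for x in range(W - kw + 1):      # ...and horizontally
            acc = 0
            for dy in range(kh):
                for dx in range(kw):
                    acc += image[y + dy][x + dx] * kernel[dy][dx]
            row.append(acc)
        out.append(row)
    return out

# A 2x2 all-ones filter over a 3x3 all-ones image sums each window:
result = conv2d([[1, 1, 1], [1, 1, 1], [1, 1, 1]],
                [[1, 1], [1, 1]])
```

Counting the loop nest makes the abstract's cost claim concrete: one layer already costs on the order of H·W·kh·kw multiply-accumulates per channel pair, and deep networks stack dozens of such layers.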

    Embedded electronic systems driven by run-time reconfigurable hardware

    Abstract: This doctoral thesis addresses the design of embedded electronic systems based on run-time reconfigurable hardware technology (available through SRAM-based FPGA/SoC devices) aimed at contributing to enhancing the quality of life of human beings. This work researches the conception of the system architecture and the reconfiguration engine that provides the FPGA with the capability of dynamic partial reconfiguration, in order to synthesize, by means of hardware/software co-design, a given application partitioned into processing tasks which are multiplexed in time and space, thus optimizing its physical implementation (silicon area, processing time, complexity, flexibility, functional density, cost and power consumption) in comparison with other alternatives based on static hardware (MCU, DSP, GPU, ASSP, ASIC, etc.). The design flow of this technology is evaluated through the prototyping of several engineering applications (control systems, mathematical coprocessors, complex image processors, etc.), showing a level of maturity high enough for its exploitation in industry.

    Algorithms and programming tools for image processing on the MPP, part 2

    A number of algorithms were developed for image warping and pyramid image filtering. Techniques were investigated for the parallel processing of a large number of independent, irregularly shaped regions on the MPP. In addition, some utilities for dealing with very long vectors and for sorting were developed. Documentation pages for the algorithms which are available for distribution are given. The performance of the MPP for a number of basic data manipulations was determined; from these results it is possible to predict the efficiency of the MPP for a number of algorithms and applications. The Parallel Pascal development system, a portable programming environment for the MPP, was improved, and better documentation, including a tutorial, was written. This environment allows programs for the MPP to be developed on any conventional computer system; it consists of a set of system programs and a library of general-purpose Parallel Pascal functions. The algorithms were tested on the MPP, and a presentation on the development system was made to the MPP users group. The UNIX version of the Parallel Pascal system was distributed to a number of new sites.

    Methodology for complex dataflow application development

    This thesis addresses problems inherent to the development of complex applications for reconfigurable systems. Many projects fail to complete, or take much longer than originally estimated, by relying on traditional iterative software development processes typically used with conventional computers. Even though designer productivity can be increased by abstract programming and execution models, e.g., dataflow, development methodologies considering the specific properties of reconfigurable systems do not exist. The first contribution of this thesis is a design methodology to facilitate systematic development of complex applications using reconfigurable hardware in the context of High-Performance Computing (HPC). The proposed methodology is built upon a careful analysis of the original application, a software model of the intended hardware system, an analytical prediction of performance and on-chip area usage, and an iterative architectural refinement to resolve identified bottlenecks before writing a single line of code targeting the reconfigurable hardware. It is successfully validated using two real applications, both of which achieve state-of-the-art performance. The second contribution extends this methodology to provide portability between devices in two steps. First, additional tool support for contemporary multi-die Field-Programmable Gate Arrays (FPGAs) is developed. An algorithm is proposed to automatically map logical memories to heterogeneous physical memories with special attention to die boundaries. As a result, only the proposed algorithm managed to successfully place and route all designs used in the evaluation, while the second-best algorithm failed on one third of all large applications. Second, best practices for performance portability between different FPGA devices are collected and evaluated on a financial use case, showing efficient resource usage on five different platforms. The third contribution applies the extended methodology to a real, highly demanding emerging application from the radiotherapy domain. A Monte Carlo based simulation of dose accumulation in human tissue is accelerated using the proposed methodology to meet the real-time requirements of adaptive radiotherapy.
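An analytical performance prediction of the kind this methodology applies before any hardware code is written can be illustrated with a toy bound-and-bottleneck (roofline-style) model. The function name and every number below are hypothetical, not taken from the thesis: the design is limited by whichever of compute throughput or memory traffic dominates.

```python
def predict_runtime_s(ops, bytes_moved, peak_ops_per_s, mem_bw_bytes_per_s):
    """Lower-bound runtime estimate for a kernel: it can finish no
    faster than its compute time or its memory-transfer time, so the
    larger of the two is the predicted bottleneck."""
    compute_time = ops / peak_ops_per_s
    memory_time = bytes_moved / mem_bw_bytes_per_s
    return max(compute_time, memory_time)

# Hypothetical kernel: 1e9 operations moving 8e9 bytes, on a device
# with 100 Gop/s of compute and 10 GB/s of memory bandwidth.
t = predict_runtime_s(1e9, 8e9, 100e9, 10e9)
# Memory time (0.8 s) dominates compute time (0.01 s), so this design
# is memory-bound: adding compute parallelism would not help until
# the data traffic is reduced.
```

Even a model this crude lets an architectural bottleneck be identified and refined on paper, which is the point of predicting performance before targeting the FPGA.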