226 research outputs found

    Acceleration Methodology for the Implementation of Scientific Applications on Reconfigurable Hardware

    Get PDF
    The role of heterogeneous multi-core architectures in the industrial and scientific computing community is expanding. For researchers to increase the performance of complex applications, a multifaceted approach is needed to utilize emerging reconfigurable computing (RC) architectures. First, the method for accelerating applications must provide flexible solutions for fully utilizing key architecture traits across platforms. Secondly, the approach needs to be readily accessible to application scientists. A recent trend toward emerging disruptive architectures is an important signal that fundamental limitations in traditional high performance computing (HPC) are limiting break through research. To respond to these challenges, scientists are under pressure to identify new programming methodologies and elements in platform architectures that will translate into enhanced program efficacy. Reconfigurable computing (RC) allows the implementation of almost any computer architecture trait, but identifying which traits work best for numerous scientific problem domains is difficult. However, by leveraging the existing underlying framework available in field programmable gate arrays (FPGAs), it is possible to build a method for utilizing RC traits for accelerating scientific applications. By contrasting both hardware and software changes, RC platforms afford developers the ability to examine various architecture characteristics to find those best suited for production-level scientific applications. The flexibility afforded by FPGAs allow these characteristics to then be extrapolated to heterogeneous, multi-core and general-purpose computing on graphics processing units (GP-GPU) HPC platforms. Additionally by coupling high-level languages (HLL) with reconfigurable hardware, relevance to a wider industrial and scientific population is achieved. To provide these advancements to the scientific community we examine the acceleration of a scientific application on a RC platform. By leveraging the flexibility provided by FPGAs we develop a methodology that removes computational loads from host systems and internalizes portions of communication with the aim of reducing fiscal costs through the reduction of physical compute nodes required to achieve the same runtime performance. Using this methodology an improvement in application performance is shown to be possible without requiring hand implementation of HLL code in a hardware description language (HDL) A review of recent literature demonstrates the challenge of developing a platform-independent flexible solution that allows access to cutting edge RC hardware for application scientists. To address this challenge we propose a structured methodology that begins with examination of the application\u27s profile, computations, and communications and utilizes tools to assist the developer in making partitioning and optimization decisions. Through experimental results, we will analyze the computational requirements, describe the simulated and actual accelerated application implementation, and finally describe problems encountered during development. Using this proposed method, a 3x speedup is possible over the entire accelerated target application. Lastly we discuss possible future work including further potential optimizations of the application to improve this process and project the anticipated benefits

    Acceleration of Biomolecular Simulations using FPGA-based Reconfigurable Computing

    Get PDF
    A paradigm shift is occurring in the way compute-intensive scientific applications are developed. Thanks to advancements in commercially viable hybrid architectures for High-Performance Computing (HPC), the focus has shifted from improving performance by merely scaling algorithms on von Neumann computing nodes to fully exploiting additional computational capabilities provided by accelerators such as FPGAs (Field Programmable Gate Arrays) and GPGPUs (General Purpose Graphical Processing Units). Computational chemists use Molecular Dynamics (MD) simulations like LAMMPS (Large Scale Atomic Molecular Massively Parallel Systems) and NAMD (NAnoscale Molecular Dynamics) to simulate biomolecular behaviour such as protein folding and small molecule docking to proteins. MD simulations are computationally complex n-body problems, which are time consuming to simulate in biologically relevant scales. Executing such simulations in best available HPC environments is critical for scientific advancements in the field. Thus, as HPC technology evolves, there is a need to update classical biomolecular simulation applications like LAMMPS to better suit the architecture. In this work, we modify LAMMPS (a classical molecular dynamics simulation program developed for CPU-only clusters) to execute on a reconfigurable computer system, SRC-7 H MAP. The SRC-7 H MAP consists of two Altera FPGA logic chips interfaced to a dual-core Intel Xeon processor. Users can benefit by offloading most compute-intensive tasks of the application to the FPGA logic. This work explores the challenges involved in effectively adapting a production level application code optimized for von Neumann architecture, to an FPGA-based hybrid architecture. We have successfully accelerated the non-bonded force computations, the most compute-intensive module in LAMMPS for biomolecular simulations, by 5.0x over a single 3.0 GHz Xeon processor. This performance includes the data transfer overheads and function calling overheads. Further, using the accelerated non-bonded force computations function, we achieve an overall application speed-up of 2.0x to 2.4

    FPGA acceleration of sequence analysis tools in bioinformatics

    Full text link
    Thesis (Ph.D.)--Boston UniversityWith advances in biotechnology and computing power, biological data are being produced at an exceptional rate. The purpose of this study is to analyze the application of FPGAs to accelerate high impact production biosequence analysis tools. Compared with other alternatives, FPGAs offer huge compute power, lower power consumption, and reasonable flexibility. BLAST has become the de facto standard in bioinformatic approximate string matching and so its acceleration is of fundamental importance. It is a complex highly-optimized system, consisting of tens of thousands of lines of code and a large number of heuristics. Our idea is to emulate the main phases of its algorithm on FPGA. Utilizing our FPGA engine, we quickly reduce the size of the database to a small fraction, and then use the original code to process the query. Using a standard FPGA-based system, we achieved 12x speedup over a highly optimized multithread reference code. Multiple Sequence Alignment (MSA)--the extension of pairwise Sequence Alignment to multiple Sequences--is critical to solve many biological problems. Previous attempts to accelerate Clustal-W, the most commonly used MSA code, have directly mapped a portion of the code to the FPGA. We use a new approach: we apply prefiltering of the kind commonly used in BLAST to perform the initial all-pairs alignments. This results in a speedup of from 8Ox to 190x over the CPU code (8 cores). The quality is comparable to the original according to a commonly used benchmark suite evaluated with respect to multiple distance metrics. The challenge in FPGA-based acceleration is finding a suitable application mapping. Unfortunately many software heuristics do not fall into this category and so other methods must be applied. One is restructuring: an entirely new algorithm is applied. Another is to analyze application utilization and develop accuracy/performance tradeoffs. Using our prefiltering approach and novel FPGA programming models we have achieved significant speedup over reference programs. We have applied approximation, seeding, and filtering to this end. The bulk of this study is to introduce the pros and cons of these acceleration models for biosequence analysis tools

    Accelerated long range electrostatics computations on single and multiple FPGAs

    Get PDF
    Classical Molecular Dynamics simulation (MD) models the interactions of thousands to millions of particles through the iterative application of basic Physics. MD is one of the core methods in High Performance Computing (HPC). While MD is critical to many high-profile applications, e.g. drug discovery and design, it suffers from the strong scaling problem, that is, while large computer systems can efficiently model large ensembles of particles, it is extremely challenging for {\it any} computer system to increase the timescale, even for small ensembles. This strong scaling problem can be mitigated with low-latency, direct communication. Of all Commercial Off the Shelf (COTS) Integrated Circuits (ICs), Field Programmable Gate Arrays (FPGAs) are the computational component uniquely applicable here: they have unmatched parallel communication capability both within the chip and externally to couple clusters of FPGAs. This thesis focuses on the acceleration of the long range (LR) force, the part of MD most difficult to scale, by using FPGAs. This thesis first optimizes LR acceleration on a single-FPGA to eliminate the amount of on-chip communication required to complete a single LR computation iteration while maintaining as much parallelism as possible. This is achieved by designing around application specific memory architectures. Doing so introduces data movement issues overcome by pipelined, toroidal-shift multiplexing (MUXing) and pipelined staggering of memory access subsets. This design is then evaluated comprehensively and comparatively, deriving equations for performance and resource consumption and drawing metrics from previously developed LR hardware designs. Using this single-FPGA LR architecture as a base, FPGA network strategies to compute the LR portion of larger sized MD problems are then theorized and analyzed
    • 

    corecore