10 research outputs found

    Accelerated long range electrostatics computations on single and multiple FPGAs

    Get PDF
    Classical Molecular Dynamics simulation (MD) models the interactions of thousands to millions of particles through the iterative application of basic Physics. MD is one of the core methods in High Performance Computing (HPC). While MD is critical to many high-profile applications, e.g. drug discovery and design, it suffers from the strong scaling problem, that is, while large computer systems can efficiently model large ensembles of particles, it is extremely challenging for {\it any} computer system to increase the timescale, even for small ensembles. This strong scaling problem can be mitigated with low-latency, direct communication. Of all Commercial Off the Shelf (COTS) Integrated Circuits (ICs), Field Programmable Gate Arrays (FPGAs) are the computational component uniquely applicable here: they have unmatched parallel communication capability both within the chip and externally to couple clusters of FPGAs. This thesis focuses on the acceleration of the long range (LR) force, the part of MD most difficult to scale, by using FPGAs. This thesis first optimizes LR acceleration on a single-FPGA to eliminate the amount of on-chip communication required to complete a single LR computation iteration while maintaining as much parallelism as possible. This is achieved by designing around application specific memory architectures. Doing so introduces data movement issues overcome by pipelined, toroidal-shift multiplexing (MUXing) and pipelined staggering of memory access subsets. This design is then evaluated comprehensively and comparatively, deriving equations for performance and resource consumption and drawing metrics from previously developed LR hardware designs. Using this single-FPGA LR architecture as a base, FPGA network strategies to compute the LR portion of larger sized MD problems are then theorized and analyzed

    Using offline routing to implement a low latenc 3D FFT in a multinode FPGA system

    Full text link
    Thesis (M.S.)--Boston UniversityApplications that require highly parallel computing along with low latency communication due to strong scaling, such as a calculating a 3D FFT for Molecular Dynamics simulations, can be problematic for traditional high performance computing (HPC) clusters. A multinode FPGA array is a good solution for these types of problems due to the direct high speed connections and flexible internal fabric inherent in FPGAs. Offline routing uses precomputed routing information to direct packets and can avoid much of the switching and congestion communication overhead. Two architectures are explored here which show the feasibility ofusing offline routing techniques to reduce communication latencies in FPGA systems. The first architecture targets a single FPGA that was built for initial exploration and to show how the powerful and flexible a single FPGA can be. It attained a maximum clock frequency of 102MHz and latencies of 64us and 250us for 3D FFT calculations of 32^3 and 64^3 data points respectively. The second architecture targets an FPGA that is intended to be the model for each node in the array. The best multinode version is based on a multilevel switching architecture. It has a maximum clock frequency of 134MHz. When scaled to a cluster, latencies project to 2.4us and 5.5us for 3D FFT calculations of 32^3 and 64^3 data points respectively. The two designs show the potential for using a single FPGA and multi-FPGA arrays for HPC applications where communication latency is critical to the application

    High performance communication on reconfigurable clusters

    Get PDF
    High Performance Computing (HPC) has matured to where it is an essential third pillar, along with theory and experiment, in most domains of science and engineering. Communication latency is a key factor that is limiting the performance of HPC, but can be addressed by integrating communication into accelerators. This integration allows accelerators to communicate with each other without CPU interactions, and even bypassing the network stack. Field Programmable Gate Arrays (FPGAs) are the accelerators that currently best integrate communication with computation. The large number of Multi-gigabit Transceivers (MGTs) on most high-end FPGAs can provide high-bandwidth and low-latency inter-FPGA connections. Additionally, the reconfigurable FPGA fabric enables tight coupling between computation kernel and network interface. Our thesis is that an application-aware communication infrastructure for a multi-FPGA system makes substantial progress in solving the HPC communication bottleneck. This dissertation aims to provide an application-aware solution for communication infrastructure for FPGA-centric clusters. Specifically, our solution demonstrates application-awareness across multiple levels in the network stack, including low-level link protocols, router microarchitectures, routing algorithms, and applications. We start by investigating the low-level link protocol and the impact of its latency variance on performance. Our results demonstrate that, although some link jitter is always present, we can still assume near-synchronous communication on an FPGA-cluster. This provides the necessary condition for statically-scheduled routing. We then propose two novel router microarchitectures for two different kinds of workloads: a wormhole Virtual Channel (VC)-based router for workloads with dynamic communication, and a statically-scheduled Virtual Output Queueing (VOQ)-based router for workloads with static communication. For the first (VC-based) router, we propose a framework that generates application-aware router configurations. Our results show that, by adding application-awareness into router configuration, the network performance of FPGA clusters can be substantially improved. For the second (VOQ-based) router, we propose a novel offline collective routing algorithm. This shows a significant advantage over a state-of-the-art collective routing algorithm. We apply our communication infrastructure to a critical strong-scaling HPC kernel, the 3D FFT. The experimental results demonstrate that the performance of our design is faster than that on CPUs and GPUs by at least one order of magnitude (achieving strong scaling for the target applications). Surprisingly, the FPGA cluster performance is similar to that of an ASIC-cluster. We also implement the 3D FFT on another multi-FPGA platform: the Microsoft Catapult II cloud. Its performance is also comparable or superior to CPU and GPU HPC clusters. The second application we investigate is Molecular Dynamics Simulation (MD). We model MD on both FPGA clouds and clusters. We find that combining processing and general communication in the same device leads to extremely promising performance and the prospect of MD simulations well into the us/day range with a commodity cloud

    FPGA-based range-limited molecular dynamics acceleration

    Full text link
    Molecular Dynamics (MD) is a computer simulation technique that executes iteratively over discrete, infinitesimal time intervals. It has been a widely utilized application in the fields of material sciences and computer-aided drug design for many years, serving as a crucial benchmark in high-performance computing (HPC). Numerous MD packages have been developed and effectively accelerated using GPUs. However, as the limits of Moore's Law are reached, the performance of an individual computing node has reached its bottleneck, while the performance of multiple nodes is primarily hindered by scalability issues, particularly when dealing with small datasets. In this thesis, the acceleration with respect to small datasets is the main focus. With the recent COVID-19 pandemic, drug discovery has gained significant attention, and Molecular Dynamics (MD) has emerged as a crucial tool in this process. Particularly, in the critical domain of drug discovery, small simulations involving approximately ~50K particles are frequently employed. However, it is important to note that small simulations do not necessarily translate to faster results, as long-term simulations comprising billions of MD iterations and more are essential in this context. In addition to dataset size, the problem of interest is further constrained. Referred to as the most computationally demanding aspect of MD, the evaluation of range-limited (RL) forces not only accounts for 90% of the MD computation workload but also involves irregular mapping patterns of 3-D data onto 2-D processor networks. To emphasize, this thesis centers around the acceleration of RL MD specifically for small datasets. In order to address the single-node bottleneck and multi-node scaling challenges, the thesis is organized into two progressive stages of investigation. The first stage delves extensively into enhancing single-node efficiency by examining various factors such as workload mapping from 3-D to 2-D, data routing, and data locality. The second stage focuses on studying multi-node scalability, with a particular emphasis on strong scaling, bandwidth demands, and the synchronization mechanisms between nodes. Through our study, the results show our design on a Xilinx U280 FPGA achieves 51.72x and 4.17x speedups with respect to an Intel Xeon Gold 6226R CPU, and a Quadro RTX 8000 GPU. Our research towards strong scaling also demonstrates that 8 Xilinx U280 FPGAs connected to a switch achieves 4.67x speedup compared to an Nvidia V100 GP

    A 32x32x32, spatially distributed 3D FFT in four microseconds on Anton

    No full text

    Design of hFGF1 Variant(s) with Increased Stability and Enhanced Bioactivity

    Get PDF
    Fibroblast growth factors (FGFs) are involved in various cellular processes such as cell growth,proliferation, differentiation, migration, angiogenesis, wound healing and embryonic development. Human acidic fibroblast growth factor (hFGF1) binds non-selectively to all the four FGF-receptors and is therefore considered as a powerful mitogen with broadest specificity. However, pharmacological applications of hFGF1 are restricted due to the low thermal stability of the growth factor. hFGF1 has low thermodynamic stability under physiological temperatures which leads to impairment of cellular signaling process thereby preventing its potential mitogenic properties. hFGF1 has a heparin binding pocket at the C-terminus which comprises of positively charges residues. The interaction between the positively charged amino acids lead to electrostatic repulsions, thereby rendering instability. To overcome this instability, hFGF1 binds to the glycosaminoglycan, heparin which decreases the repulsion (s) between the positively charged residues. However, binding of heparin poses a challenge for the use of hFGF1 in wound healing. Thrombin converts fibrinogen to fibrin and works as first line of defense by blocking the loss of blood. Intriguingly, thrombin also binds to heparin. Studies on wtFGF1 have demonstrated the presence of secondary thrombin cleavage site in hFGF1. Thus, thrombin is known to cleave hFGF1 at Arg 136 and render it biologically inactive. Usually, it is considered that dependency of hFGF1 to heparin increases the plausibility of thrombin-induced degradation of the growth factor. To tackle these downfalls, I have designed and constructed several point mutations in hFGF1 to improve the thermal stability and cell proliferation ability and to subside the heparin binding affinity of the growth factor. In this dissertation, I studied single, double, triple, quadruple, and penta variants of Q54P, S61L, H107S, K126N, and R136E and examined the thermal stability, bioactivity, and heparin dependency of the protein. These studies indicate that site - directed mutagenesis in hFGF1 can impact the inherent stability of the growth factor and role of heparin in hFGF1-FGFR receptor interaction and activation
    corecore