933 research outputs found

    Mapping applications onto FPGA-centric clusters

    Full text link
    High Performance Computing (HPC) is becoming increasingly important throughout science and engineering as ever more complex problems must be solved through computational simulations. In these large computational applications, the latency of communication between processing nodes is often the key factor that limits performance. An emerging alternative computer architecture that addresses the latency problem is the FPGA-centric cluster (FCC); in these systems, the devices (FPGAs) are directly interconnected and thus many layers of hardware and software are avoided. The result can be scalability not currently achievable with other technologies. In FCCs, FPGAs serve multiple functions: accelerator, network interface card (NIC), and router. Moreover, because FPGAs are configurable, there is substantial opportunity to tailor the router hardware to the application; previous work has demonstrated that such application-aware configuration can effect a substantial improvement in hardware efficiency. One constraint of FCCs is that it is convenient for their interconnect to be static, direct, and have a two or three dimensional mesh topology. Thus, applications that are naturally of a different dimensionality (have a different logical topology) from that of the FCC must be remapped to obtain optimal performance. In this thesis we study various aspects of the mapping problem for FCCs. There are two major research thrusts. The first is finding the optimal mapping of logical to physical topology. This problem has received substantial attention by both the theory community, where topology mapping is referred to as graph embedding, and by the High Performance Computing (HPC) community, where it is a question of process placement. We explore the implications of the different mapping strategies on communication behavior in FCCs, especially on resulting load imbalance. The second major research thrust is built around the hypothesis that applications that need to be remapped (due to differing logical and physical topologies) will have different optimal router configurations from those applications that do not. For example, due to remapping, some virtual or physical communication links may have little occupancy; therefore fewer resources should be allocated to them. Critical here is the creation of a new set of parameterized hardware features that can be configured to best handle load imbalances caused by remapping. These two thrusts form a codesign loop: certain mapping algorithms may be differentially optimal due to application-aware router reconfiguration that accounts for this mapping. This thesis has four parts. The first part introduces the background and previous work related to communication in general and, in particular, how it is implemented in FCCs. We build on previous work on application-aware router configuration. The second part introduces topology mapping mechanisms including those derived from graph embeddings and a greedy algorithm commonly used in HPC. In the third part, topology mappings are evaluated for performance and imbalance; we note that different mapping strategies lead to different imbalances both in the overall network and in each node. The final part introduces reconfigure router design that allocates resources based on different imbalance situations caused by different mapping behaviors

    Efficient parallel processing with optical interconnections

    Get PDF
    With the advances in VLSI technology, it is now possible to build chips which can each contain thousands of processors. The efficiency of such chips in executing parallel algorithms heavily depends on the interconnection topology of the processors. It is not possible to build a fully interconnected network of processors with constant fan-in/fan-out using electrical interconnections. Free space optics is a remedy to this limitation. Qualities exclusive to the optical medium are its ability to be directed for propagation in free space and the property that optical channels can cross in space without any interference. In this thesis, we present an electro-optical interconnected architecture named Optical Reconfigurable Mesh (ORM). It is based on an existing optical model of computation. There are two layers in the architecture. The processing layer is a reconfigurable mesh and the deflecting layer contains optical devices to deflect light beams. ORM provides three types of communication mechanisms. The first is for arbitrary planar connections among sets of locally connected processors using the reconfigurable mesh. The second is for arbitrary connections among N of the processors using the electrical buses on the processing layer and N2 fixed passive deflecting units on the deflection layer. The third is for arbitrary connections among any of the N2 processors using the N2 mechanically reconfigurable deflectors in the deflection layer. The third type of communication mechanisms is significantly slower than the other two. Therefore, it is desirable to avoid reconfiguring this type of communication during the execution of the algorithms. Instead, the optical reconfiguration can be done before the execution of each algorithm begins. Determining a right configuration that would be suitable for the entire configuration of a task execution is studied in this thesis. The basic data movements for each of the mechanisms are studied. Finally, to show the power of ORM, we use all three types of communication mechanisms in the first O(logN) time algorithm for finding the convex hulls of all figures in an N x N binary image presented in this thesis

    A Finite Domain Constraint Approach for Placement and Routing of Coarse-Grained Reconfigurable Architectures

    Get PDF
    Scheduling, placement, and routing are important steps in Very Large Scale Integration (VLSI) design. Researchers have developed numerous techniques to solve placement and routing problems. As the complexity of Application Specific Integrated Circuits (ASICs) increased over the past decades, so did the demand for improved place and route techniques. The primary objective of these place and route approaches has typically been wirelength minimization due to its impact on signal delay and design performance. With the advent of Field Programmable Gate Arrays (FPGAs), the same place and route techniques were applied to FPGA-based design. However, traditional place and route techniques may not work for Coarse-Grained Reconfigurable Architectures (CGRAs), which are reconfigurable devices offering wider path widths than FPGAs and more flexibility than ASICs, due to the differences in architecture and routing network. Further, the routing network of several types of CGRAs, including the Field Programmable Object Array (FPOA), has deterministic timing as compared to the routing fabric of most ASICs and FPGAs reported in the literature. This necessitates a fresh look at alternative approaches to place and route designs. This dissertation presents a finite domain constraint-based, delay-aware placement and routing methodology targeting an FPOA. The proposed methodology takes advantage of the deterministic routing network of CGRAs to perform a delay aware placement

    RMESH Algorithms for Parallel String Matching

    Get PDF
    String matching problem received much attention over the years due to its importance in various applications such as text/file comparison, DNA sequencing, search engines, and spelling correction. Especially with the introduction of search engines dealing with tremendous amount of textual information presented on the world wide web and the research on DNA sequencing, this problem deserves special attention and any algorithmic or hardware improvements to speed up the process will benefit these important applications. In this paper, we present three algorithms for string matching on reconfigurable mesh architectures. Given a text T of length n and a pattern P of length m, the first algorithm finds the exact matching between T and P in O(1) time on a 2-dimensional RMESH of size (n-m+1) * m. The second algorithm finds the approximate matching between T and P in O(k) time on a 2D RMESH, where k is the maximum edit distance between T and P. The third algorithm allows only the replacement operation in the calculation of the edit distance and finds an approximate matching between T and P in constant-time on a 3D RMESH

    Spectral element methods: Algorithms and architectures

    Get PDF
    Spectral element methods are high-order weighted residual techniques for partial differential equations that combine the geometric flexibility of finite element methods with the rapid convergence of spectral techniques. Spectral element methods are described for the simulation of incompressible fluid flows, with special emphasis on implementation of spectral element techniques on medium-grained parallel processors. Two parallel architectures are considered: the first, a commercially available message-passing hypercube system; the second, a developmental reconfigurable architecture based on Geometry-Defining Processors. High parallel efficiency is obtained in hypercube spectral element computations, indicating that load balancing and communication issues can be successfully addressed by a high-order technique/medium-grained processor algorithm-architecture coupling

    Efficient Molecular Dynamics Simulation on Reconfigurable Models with MultiGrid Method

    Get PDF
    In the field of biology, MD simulations are continuously used to investigate biological studies. A Molecular Dynamics (MD) system is defined by the position and momentum of particles and their interactions. The dynamics of a system can be evaluated by an N-body problem and the simulation is continued until the energy reaches equilibrium. Thus, solving the dynamics numerically and evaluating the interaction is computationally expensive even for a small number of particles in the system. We are focusing on long-ranged interactions, since the calculation time is O(N^2) for an N particle system. In this dissertation, we are proposing two research directions for the MD simulation. First, we design a new variation of Multigrid (MG) algorithm called Multi-level charge assignment (MCA) that requires O(N) time for accurate and efficient calculation of the electrostatic forces. We apply MCA and back interpolation based on the structure of molecules to enhance the accuracy of the simulation. Our second research utilizes reconfigurable models to achieve fast calculation time. We have been working on exploiting two reconfigurable models. We design FPGA-based MD simulator implementing MCA method for Xilinx Virtex-IV. It performs about 10 to 100 times faster than software implementation depending on the simulation accuracy desired. We also design fast and scalable Reconfigurable mesh (R-Mesh) algorithms for MD simulations. This work demonstrates that the large scale biological studies can be simulated in close to real time. The R-Mesh algorithms we design highlight the feasibility of these models to evaluate potentials with faster calculation times. Specifically, we develop R-Mesh algorithms for both Direct method and Multigrid method. The Direct method evaluates exact potentials and forces, but requires O(N^2) calculation time for evaluating electrostatic forces on a general purpose processor. The MG method adopts an interpolation technique to reduce calculation time to O(N) for a given accuracy. However, our R-Mesh algorithms require only O(N) or O(logN) time complexity for the Direct method on N linear R-Mesh and N¡¿N R-Mesh, respectively and O(r)+O(logM) time complexity for the Multigrid method on an X¡¿Y¡¿Z R-Mesh. r is N/M and M = X¡¿Y¡¿Z is the number of finest grid points

    High performance communication on reconfigurable clusters

    Get PDF
    High Performance Computing (HPC) has matured to where it is an essential third pillar, along with theory and experiment, in most domains of science and engineering. Communication latency is a key factor that is limiting the performance of HPC, but can be addressed by integrating communication into accelerators. This integration allows accelerators to communicate with each other without CPU interactions, and even bypassing the network stack. Field Programmable Gate Arrays (FPGAs) are the accelerators that currently best integrate communication with computation. The large number of Multi-gigabit Transceivers (MGTs) on most high-end FPGAs can provide high-bandwidth and low-latency inter-FPGA connections. Additionally, the reconfigurable FPGA fabric enables tight coupling between computation kernel and network interface. Our thesis is that an application-aware communication infrastructure for a multi-FPGA system makes substantial progress in solving the HPC communication bottleneck. This dissertation aims to provide an application-aware solution for communication infrastructure for FPGA-centric clusters. Specifically, our solution demonstrates application-awareness across multiple levels in the network stack, including low-level link protocols, router microarchitectures, routing algorithms, and applications. We start by investigating the low-level link protocol and the impact of its latency variance on performance. Our results demonstrate that, although some link jitter is always present, we can still assume near-synchronous communication on an FPGA-cluster. This provides the necessary condition for statically-scheduled routing. We then propose two novel router microarchitectures for two different kinds of workloads: a wormhole Virtual Channel (VC)-based router for workloads with dynamic communication, and a statically-scheduled Virtual Output Queueing (VOQ)-based router for workloads with static communication. For the first (VC-based) router, we propose a framework that generates application-aware router configurations. Our results show that, by adding application-awareness into router configuration, the network performance of FPGA clusters can be substantially improved. For the second (VOQ-based) router, we propose a novel offline collective routing algorithm. This shows a significant advantage over a state-of-the-art collective routing algorithm. We apply our communication infrastructure to a critical strong-scaling HPC kernel, the 3D FFT. The experimental results demonstrate that the performance of our design is faster than that on CPUs and GPUs by at least one order of magnitude (achieving strong scaling for the target applications). Surprisingly, the FPGA cluster performance is similar to that of an ASIC-cluster. We also implement the 3D FFT on another multi-FPGA platform: the Microsoft Catapult II cloud. Its performance is also comparable or superior to CPU and GPU HPC clusters. The second application we investigate is Molecular Dynamics Simulation (MD). We model MD on both FPGA clouds and clusters. We find that combining processing and general communication in the same device leads to extremely promising performance and the prospect of MD simulations well into the us/day range with a commodity cloud

    Solution of partial differential equations on vector and parallel computers

    Get PDF
    The present status of numerical methods for partial differential equations on vector and parallel computers was reviewed. The relevant aspects of these computers are discussed and a brief review of their development is included, with particular attention paid to those characteristics that influence algorithm selection. Both direct and iterative methods are given for elliptic equations as well as explicit and implicit methods for initial boundary value problems. The intent is to point out attractive methods as well as areas where this class of computer architecture cannot be fully utilized because of either hardware restrictions or the lack of adequate algorithms. Application areas utilizing these computers are briefly discussed

    Coarse-grained reconfigurable array architectures

    Get PDF
    Coarse-Grained Reconfigurable Array (CGRA) architectures accelerate the same inner loops that benefit from the high ILP support in VLIW architectures. By executing non-loop code on other cores, however, CGRAs can focus on such loops to execute them more efficiently. This chapter discusses the basic principles of CGRAs, and the wide range of design options available to a CGRA designer, covering a large number of existing CGRA designs. The impact of different options on flexibility, performance, and power-efficiency is discussed, as well as the need for compiler support. The ADRES CGRA design template is studied in more detail as a use case to illustrate the need for design space exploration, for compiler support and for the manual fine-tuning of source code
    corecore