235,673 research outputs found

    Automatic Creation of High-Bandwidth Memory Architectures from Domain-Specific Languages: The Case of Computational Fluid Dynamics

    Numerical simulations can help solve complex problems. Most of these algorithms are massively parallel and thus good candidates for FPGA acceleration thanks to spatial parallelism. Modern FPGA devices can leverage high-bandwidth memory technologies, but when applications are memory-bound designers must craft advanced communication and memory architectures for efficient data movement and on-chip storage. This development process requires hardware design skills that are uncommon among domain experts. In this paper, we propose an automated tool flow from a domain-specific language (DSL) for tensor expressions to generate massively parallel accelerators on HBM-equipped FPGAs. Designers can use this flow to integrate and evaluate various compiler or hardware optimizations. We use computational fluid dynamics (CFD) as a paradigmatic example. Our flow starts from the high-level specification of tensor operations and combines an MLIR-based compiler with an in-house hardware generation flow to generate systems with parallel accelerators and a specialized memory architecture that moves data efficiently, aiming at fully exploiting the available CPU-FPGA bandwidth. We simulated applications with millions of elements, achieving up to 103 GFLOPS with one compute unit and custom precision when targeting a Xilinx Alveo U280. Our FPGA implementation is up to 25x more energy efficient than expert-crafted Intel CPU implementations.

    Parallel Algorithms for the Maximum Flow

    The problem of finding the maximal flow through a given network has been intensively studied over the years. The classic algorithm for this problem given by Ford and Fulkerson has been developed and improved by a number of authors including Edmonds and Karp. With the advent of parallel computers, it is of great interest to see whether more efficient algorithms can be designed and implemented. The networks which we will consider will be both capacitated and bounded. Compared with a capacitated network, the problem of finding a flow through a bounded network is much more complicated in that a transformation into an auxiliary network is required before a feasible flow can be found. In this thesis, we review the algorithms of Ford and Fulkerson and Edmonds and Karp and implement them in a standard sequential way. We also implement the transformation required to handle the case of a bounded network. We then develop two parallel algorithms, the first being a parallel version of the Edmonds and Karp algorithm while the second applies the Breadth-First search technique to extract as much parallelism as possible from the problem. Both these algorithms have been written in the Occam programming language and implemented on a transputer system consisting of an IBM PC host, a B004 single transputer board and a network of four transputers contained on a B003 board supplied by Inmos Ltd. This is an example of a multiprocessor machine with independent memory. The relative efficiency of the algorithms has been studied and we present tables of the execution times taken over a variety of test networks. The transformation of the original network into an auxiliary network has also been implemented using parallel techniques and the problems encountered in the development of the algorithm are described. We have also investigated in detail one of the few parallel algorithms for this problem described in the literature due to Shiloach and Vishkin. 
This algorithm is described in the thesis. It could not be implemented, because it is designed specifically for a shared-memory multiprocessor, whereas the transputer system used here has independent memory.
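As a point of reference for the sequential baseline mentioned in this abstract, the following is a minimal Python sketch of the Edmonds and Karp method: Ford and Fulkerson augmentation in which each augmenting path is found by breadth-first search. It is not the thesis's Occam code and does not cover the bounded-network transformation. The function name `edmonds_karp` and the dict-of-dicts input format are assumptions made for illustration.

```python
from collections import deque


def edmonds_karp(capacity, source, sink):
    """Return the maximum flow value from source to sink.

    capacity is a dict of dicts: capacity[u][v] is the capacity of edge u -> v.
    (Hypothetical input format chosen for this sketch.)
    """
    if source == sink:
        return 0
    # Build the residual graph, adding a zero-capacity reverse edge per edge.
    residual = {source: {}, sink: {}}
    for u, neighbours in capacity.items():
        residual.setdefault(u, {})
        for v, c in neighbours.items():
            residual.setdefault(v, {})
            residual[u][v] = residual[u].get(v, 0) + c
            residual[v].setdefault(u, 0)

    flow = 0
    while True:
        # Breadth-first search for a shortest augmenting path.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, c in residual[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return flow  # no augmenting path left: the flow is maximal

        # Find the bottleneck capacity along the path.
        bottleneck = float("inf")
        v = sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]

        # Augment: reduce forward residuals, increase reverse residuals.
        v = sink
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        flow += bottleneck
```

For the small network `{"s": {"a": 3, "b": 2}, "a": {"b": 1, "t": 2}, "b": {"t": 3}}`, `edmonds_karp(network, "s", "t")` returns 5.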

    Efficient Implementation of a Synchronous Parallel Push-Relabel Algorithm

    Motivated by the observation that FIFO-based push-relabel algorithms are able to outperform highest label-based variants on modern, large maximum flow problem instances, we introduce an efficient implementation of the algorithm that uses coarse-grained parallelism to avoid the problems of existing parallel approaches. We demonstrate good relative and absolute speedups of our algorithm on a set of large graph instances taken from real-world applications. On a modern 40-core machine, our parallel implementation outperforms existing sequential implementations by up to a factor of 12 and other parallel implementations by factors of up to 3.
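For readers unfamiliar with the FIFO rule this abstract refers to, here is a minimal sequential Python sketch of textbook FIFO push-relabel. It is not the paper's synchronous parallel implementation and omits the heuristics (such as global relabeling) that practical codes usually rely on. The function name `fifo_push_relabel`, the dense residual matrix, and the edge-list input are assumptions chosen for brevity.

```python
from collections import deque


def fifo_push_relabel(n, edges, source, sink):
    """Max flow value on vertices 0..n-1; edges is a list of (u, v, capacity)."""
    if source == sink:
        return 0
    # Dense residual capacity matrix (simple, but O(n^2) memory).
    cap = [[0] * n for _ in range(n)]
    for u, v, c in edges:
        if u != v:  # self-loops never carry useful flow
            cap[u][v] += c

    height = [0] * n
    excess = [0] * n
    height[source] = n
    active = deque()  # FIFO queue of vertices with positive excess

    # Initial preflow: saturate every edge leaving the source.
    for v in range(n):
        c = cap[source][v]
        if c > 0:
            cap[source][v] = 0
            cap[v][source] += c
            excess[v] += c
            if v != sink:
                active.append(v)

    while active:
        u = active.popleft()
        # Push along every admissible edge (height drops by exactly one).
        for v in range(n):
            if excess[u] == 0:
                break
            if cap[u][v] > 0 and height[u] == height[v] + 1:
                delta = min(excess[u], cap[u][v])
                cap[u][v] -= delta
                cap[v][u] += delta
                excess[u] -= delta
                if excess[v] == 0 and v != source and v != sink:
                    active.append(v)  # v has just become active
                excess[v] += delta
        if excess[u] > 0:
            # No admissible edge is left: relabel u and requeue it at the back.
            height[u] = 1 + min(height[v] for v in range(n) if cap[u][v] > 0)
            active.append(u)

    return excess[sink]
```

With edges `[(0, 1, 3), (0, 2, 2), (1, 2, 1), (1, 3, 2), (2, 3, 3)]`, `fifo_push_relabel(4, edges, 0, 3)` returns 5.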

    A computational evaluation of constructive and improvement heuristics for the blocking flow shop to minimize total flowtime

    This paper focuses on the blocking flow shop scheduling problem with the objective of total flowtime minimisation. This problem assumes that there are no buffers between machines and, owing to its application in many manufacturing sectors, it has received growing attention from researchers in recent years. Since the problem is NP-hard, a large number of heuristics have been proposed to provide good solutions within reasonable computational times. In this paper, we conduct a comprehensive evaluation of the available heuristics for the problem and for related problems, resulting in the implementation and testing of a total of 35 heuristics. Furthermore, we propose an efficient constructive heuristic which successfully combines a pool of partial sequences in parallel, using a beam-search-based approach. The computational experiments show the excellent performance of the proposed heuristic as compared to the best-so-far algorithms for the problem, both in terms of quality of the solutions and of computational requirements. In fact, despite being a relatively fast constructive heuristic, it has found new best upper bounds for more than 27% of Taillard’s instances.
    Funding: Ministerio de Ciencia e Innovación, DPI2013-44461-P/DP
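The no-buffer assumption in this abstract can be made concrete with the standard departure-time recurrence for blocking flow shops: a job may leave a machine only once the next machine has been vacated by its predecessor. The Python sketch below evaluates a given job sequence. It is not one of the heuristics from the paper, and the names `blocking_completion_times`, `proc`, and `seq` are hypothetical.

```python
def blocking_completion_times(proc, seq):
    """Completion time of each job in seq on the last machine.

    proc[j][i] is the processing time of job j on machine i (0-based).
    There are no buffers: a finished job stays on its machine, blocking it,
    until the next machine has been vacated by the preceding job.
    """
    m = len(proc[0])
    # prev[i] is the time the previous job left machine i (1-based);
    # prev[0] is the time it started on the first machine.
    prev = [0] * (m + 1)
    completions = []
    for job in seq:
        cur = [0] * (m + 1)
        cur[0] = prev[1]  # machine 1 must have been vacated first
        for i in range(1, m + 1):
            finish = cur[i - 1] + proc[job][i - 1]
            if i < m:
                # Wait on machine i until the previous job has left machine i+1.
                cur[i] = max(finish, prev[i + 1])
            else:
                cur[i] = finish  # the last machine is never blocked
        completions.append(cur[m])
        prev = cur
    return completions


def total_flowtime(proc, seq):
    """Sum of completion times, the objective named in the abstract."""
    return sum(blocking_completion_times(proc, seq))
```

For two jobs on two machines with `proc = [[2, 3], [1, 2]]`, the sequence `[0, 1]` gives completion times `[5, 7]` (total flowtime 12), while `[1, 0]` gives `[3, 6]` (total flowtime 9). The last completion time is the makespan of the sequence.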

    Efficient heuristics for the parallel blocking flow shop scheduling problem

    We consider the NP-hard problem of scheduling n jobs in F identical parallel flow shops, each consisting of a series of m machines, and doing so with a blocking constraint. The applied criterion is to minimize the makespan, i.e., the maximum completion time of all the jobs in F flow shops (lines). The Parallel Flow Shop Scheduling Problem (PFSP) is conceptually similar to another problem known in the literature as the Distributed Permutation Flow Shop Scheduling Problem (DPFSP), which allows modeling the scheduling process in companies with more than one factory, each factory with a flow shop configuration. Therefore, the proposed methods can solve the scheduling problem under the blocking constraint in both situations, which, to the best of our knowledge, has not been studied previously. In this paper, we propose a mathematical model along with some constructive and improvement heuristics to solve the parallel blocking flow shop problem (PBFSP) and thus minimize the maximum completion time among lines. The proposed constructive procedures use two approaches that are totally different from those proposed in the literature. These methods are used as initial solution procedures of an iterated local search (ILS) and an iterated greedy algorithm (IGA), both of which are combined with a variable neighborhood search (VNS). The proposed constructive procedure and the improved methods take into account the characteristics of the problem. The computational evaluation demonstrates that both of them, especially the IGA, perform considerably better than those algorithms adapted from the DPFSP literature.
    Peer Reviewed. Postprint (author's final draft)

    Optimizing for a Many-Core Architecture without Compromising Ease-of-Programming

    Faced with nearly stagnant clock speed advances, chip manufacturers have turned to parallelism as the source for continuing performance improvements. But even though numerous parallel architectures have already been brought to market, a universally accepted methodology for programming them for general purpose applications has yet to emerge. Existing solutions tend to be hardware-specific, rendering them difficult to use for the majority of application programmers and domain experts, and not providing scalability guarantees for future generations of the hardware. This dissertation advances the validation of the following thesis: it is possible to develop efficient general-purpose programs for a many-core platform using a model recognized for its simplicity. To prove this thesis, we refer to the eXplicit Multi-Threading (XMT) architecture designed and built at the University of Maryland. XMT is an attempt at re-inventing parallel computing with a solid theoretical foundation and an aggressive scalable design. Algorithmically, XMT is inspired by the PRAM (Parallel Random Access Machine) model and the architecture design is focused on reducing inter-task communication and synchronization overheads and providing an easy-to-program parallel model. This thesis builds upon the existing XMT infrastructure to improve support for efficient execution with a focus on ease-of-programming. Our contributions aim at reducing the programmer's effort in developing XMT applications and improving the overall performance. 
More concretely, we: (1) present a work-flow guiding programmers to produce efficient parallel solutions starting from a high-level problem; (2) introduce an analytical performance model for XMT programs and provide a methodology to project running time from an implementation; (3) propose and evaluate RAP, an improved resource-aware compiler loop prefetching algorithm targeted at fine-grained many-core architectures; we demonstrate performance improvements of up to 34.79% on average over the GCC loop prefetching implementation and up to 24.61% on average over a simple hardware prefetching scheme; and (4) implement a number of parallel benchmarks and evaluate the overall performance of XMT relative to existing serial and parallel solutions, showing speedups of up to 13.89x vs. a serial processor and 8.10x vs. parallel code optimized for an existing many-core (GPU). We also discuss the implementation and optimization of the Max-Flow algorithm on XMT, a problem which is among the more advanced in terms of complexity, benchmarking and research interest in the parallel algorithms community. We demonstrate better speed-ups compared to a best serial solution than previous attempts on other parallel platforms.