404 research outputs found

    Doctor of Philosophy

    Get PDF
    dissertationPartial differential equations (PDEs) are widely used in science and engineering to model phenomena such as sound, heat, and electrostatics. In many practical science and engineering applications, the solutions of PDEs require the tessellation of computational domains into unstructured meshes and entail computationally expensive and time-consuming processes. Therefore, efficient and fast PDE solving techniques on unstructured meshes are important in these applications. Relative to CPUs, the faster growth curves in the speed and greater power efficiency of the SIMD streaming processors, such as GPUs, have gained them an increasingly important role in the high-performance computing area. Combining suitable parallel algorithms and these streaming processors, we can develop very efficient numerical solvers of PDEs. The contributions of this dissertation are twofold: proposal of two general strategies to design efficient PDE solvers on GPUs and the specific applications of these strategies to solve different types of PDEs. Specifically, this dissertation consists of four parts. First, we describe the general strategies, the domain decomposition strategy and the hybrid gathering strategy. Next, we introduce a parallel algorithm for solving the eikonal equation on fully unstructured meshes efficiently. Third, we present the algorithms and data structures necessary to move the entire FEM pipeline to the GPU. Fourth, we propose a parallel algorithm for solving the levelset equation on fully unstructured 2D or 3D meshes or manifolds. This algorithm combines a narrowband scheme with domain decomposition for efficient levelset equation solving

    A GPU-accelerated Branch-and-Bound Algorithm for the Flow-Shop Scheduling Problem

    Get PDF
    Branch-and-Bound (B&B) algorithms are time intensive tree-based exploration methods for solving to optimality combinatorial optimization problems. In this paper, we investigate the use of GPU computing as a major complementary way to speed up those methods. The focus is put on the bounding mechanism of B&B algorithms, which is the most time consuming part of their exploration process. We propose a parallel B&B algorithm based on a GPU-accelerated bounding model. The proposed approach concentrate on optimizing data access management to further improve the performance of the bounding mechanism which uses large and intermediate data sets that do not completely fit in GPU memory. Extensive experiments of the contribution have been carried out on well known FSP benchmarks using an Nvidia Tesla C2050 GPU card. We compared the obtained performances to a single and a multithreaded CPU-based execution. Accelerations up to x100 are achieved for large problem instances

    Advanced semantics for accelerated graph processing

    Get PDF
    Large-scale graph applications are of great national, commercial, and societal importance, with direct use in fields such as counter-intelligence, proteomics, and data mining. Unfortunately, graph-based problems exhibit certain basic characteristics that make them a poor match for conventional computing systems in terms of structure, scale, and semantics. Graph processing kernels emphasize sparse data structures and computations with irregular memory access patterns that destroy the temporal and spatial locality upon which modern processors rely for performance. Furthermore, applications in this area utilize large data sets, and have been shown to be more data intensive than typical floating-point applications, two properties that lead to inefficient utilization of the hierarchical memory system. Current approaches to processing large graph data sets leverage traditional HPC systems and programming models, for shared memory and message-passing computation, and are thus limited in efficiency, scalability, and programmability. The research presented in this thesis investigates the potential of a new model of execution that is hypothesized as a promising alternative for graph-based applications to conventional practices. A new approach to graph processing is developed and presented in this thesis. The application of the experimental ParalleX execution model to graph processing balances continuation-migration style fine-grain concurrency with constraint-based synchronization through embedded futures. A collection of parallel graph application kernels provide experiment control drivers for analysis and evaluation of this innovative strategy. Finally, an experimental software library for scalable graph processing, the ParalleX Graph Library, is defined using the HPX runtime system, providing an implementation of the key concepts and a framework for development of ParalleX-based graph applications

    Code Generation and Global Optimization Techniques for a Reconfigurable PRAM-NUMA Multicore Architecture

    Full text link

    Exploiting a Parametrized Task Graph model for the parallelization of a sparse direct multifrontal solver

    Get PDF
    International audienceThe advent of multicore processors requires to reconsider the design of high performance computing libraries to embrace portable and effective techniques of parallel software engineering. One of the most promising approaches consists in abstracting an application as a directed acyclic graph (DAG) of tasks. While this approach has been popularized for shared memory environments by the OpenMP 4.0 standard where dependencies between tasks are automatically inferred, we investigate an alternative approach, capable of describing the DAG of task in a distributed setting, where task dependencies are explicitly encoded. So far this approach has been mostly used in the case of algorithms with a regular data access pattern and we show in this study that it can be efficiently applied to a higly irregular numerical algorithm such as a sparse multifrontal QR method. We present the resulting implementation and discuss the potential and limits of this approach in terms of productivity and effectiveness in comparison with more common parallelization techniques. Although at an early stage of development, preliminary results show the potential of the parallel programming model that we investigate in this work

    SPARTA: High-Level Synthesis of Parallel Multi-Threaded Accelerators

    Get PDF
    This paper presents a methodology for the Synthesis of PARallel multi-Threaded Accelerators (SPARTA) from OpenMP annotated C/C++ specifications. SPARTA extends an open-source HLS tool, enabling the generation of accelerators that provide latency tolerance for irregular memory accesses through multithreading, support fine-grained memory-level parallelism through a hot-potato deflection-based network-on-chip (NoC), support synchronization constructs, and can instantiate memory-side caches. Our approach is based on a custom runtime OpenMP library, providing flexibility and extensibility. Experimental results show high scalability when synthesizing irregular graph kernels. The accelerators generated with our approach are, on average, 2.29x faster than state-of-the-art HLS methodologies
    corecore