229 research outputs found

    Efficient Generating And Processing Of Large-Scale Unstructured Meshes

    Unstructured meshes are used in a variety of disciplines to represent simulations and experimental data. Scientists who want to increase the accuracy of simulations by increasing resolution must also increase the size of the resulting dataset. However, generating and processing extremely large unstructured meshes remains a barrier. Researchers have published many parallel Delaunay triangulation (DT) algorithms, often focusing on partitioning the initial mesh domain so that each rectangular partition can be triangulated in parallel. However, the problems with this method are how to merge all triangulated partitions into a single domain-wide mesh and the significant cost of communication across the sub-region borders. We devised a novel algorithm, Triangulation of Independent Partitions in Parallel (TIPP), to deal with very large DT problems without requiring inter-processor communication while still guaranteeing the Delaunay criteria. The core of the algorithm is to find a set of independent partitions such that the circumcircles of triangles in one partition do not enclose any vertex in other partitions. For this reason, this set of independent partitions can be triangulated in parallel without affecting each other. The result of mesh generation is a large unstructured mesh, stored as vertex index and vertex coordinate files, which introduces a new challenge: locality. Partitioning unstructured meshes to improve locality is a key part of our approach. Elements that were widely scattered in the original dataset are grouped together, speeding data access. To further improve unstructured mesh partitioning, we also describe our new approach, Direct Load, which mitigates the challenges of unstructured meshes by maximizing the proportion of useful data retrieved during each read from disk, which in turn reduces the total number of read operations, boosting performance
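
The independence condition described above (no circumcircle of a triangle in one partition encloses a vertex of another partition) can be illustrated with a small sketch. This is not the TIPP implementation; the function names `in_circumcircle` and `is_independent` are hypothetical, and the test uses plain floating-point arithmetic rather than a robust geometric predicate.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Triangle = Tuple[Point, Point, Point]


def in_circumcircle(tri: Triangle, p: Point) -> bool:
    """Return True if p lies strictly inside the circumcircle of tri."""
    (ax, ay), (bx, by), (cx, cy) = tri
    # Sign of the triangle's orientation; multiplying by it makes the
    # test independent of whether the vertices are listed CW or CCW.
    orient = (bx - ax) * (cy - ay) - (by - ay) * (cx - ax)
    adx, ady = ax - p[0], ay - p[1]
    bdx, bdy = bx - p[0], by - p[1]
    cdx, cdy = cx - p[0], cy - p[1]
    # Standard 3x3 in-circle determinant, expanded along the last column.
    det = (
        (adx * adx + ady * ady) * (bdx * cdy - cdx * bdy)
        - (bdx * bdx + bdy * bdy) * (adx * cdy - cdx * ady)
        + (cdx * cdx + cdy * cdy) * (adx * bdy - bdx * ady)
    )
    return det * orient > 0


def is_independent(triangles: List[Triangle], other_vertices: List[Point]) -> bool:
    """True if no circumcircle of `triangles` encloses a vertex of another partition."""
    return not any(in_circumcircle(t, v) for t in triangles for v in other_vertices)
```

For example, `is_independent([((0, 0), (1, 0), (0, 1))], [(2, 2)])` holds because the point (2, 2) lies outside the triangle's circumcircle, so the two partitions could be triangulated without affecting each other.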

    Doctor of Philosophy

    Memory access irregularities are a major bottleneck for bandwidth limited problems on Graphics Processing Unit (GPU) architectures. GPU memory systems are designed to allow consecutive memory accesses to be coalesced into a single memory access. Noncontiguous accesses within a parallel group of threads working in lock step may cause serialized memory transfers. Irregular algorithms may have data-dependent control flow and memory access, which requires runtime information to be evaluated. Compile time methods for evaluating parallelism, such as static dependence graphs, are not capable of evaluating irregular algorithms. The goals of this dissertation are to study irregularities within the context of unstructured mesh and sparse matrix problems, analyze the impact of vectorization widths on irregularities, and present data-centric methods that improve control flow and memory access irregularity within those contexts. Reordering associative operations has often been exploited for performance gains in parallel algorithms. This dissertation presents a method for associative reordering of stencil computations over unstructured meshes that increases data reuse through caching. This novel parallelization scheme offers considerable speedups over standard methods. Vectorization widths can have significant impact on performance in vectorized computations. Although the hardware vector width is generally fixed, the logical vector width used within a computation can range from one up to the width of the computation. Significant performance differences can occur due to thread scheduling and resource limitations. This dissertation analyzes the impact of vectorization widths on dense numerical computations such as 3D dG postprocessing. It is difficult to efficiently perform dynamic updates on traditional sparse matrix formats. Explicitly controlling memory segmentation allows for in-place dynamic updates in sparse matrices. 
Dynamically updating the matrix without rebuilding or sorting greatly improves processing time and overall throughput. This dissertation presents a new sparse matrix format, dynamic compressed sparse row (DCSR), which allows for dynamic streaming updates to a sparse matrix. A new method for parallel sparse matrix-matrix multiplication (SpMM) that uses dynamic updates is also presented
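
The idea of in-place updates through explicitly managed row segments can be sketched as follows. This is a deliberately simplified illustration and not the DCSR format itself: the class name `SlackCSR`, its fixed per-row capacity, and the overflow behaviour are all assumptions made for the example.

```python
from typing import List


class SlackCSR:
    """Row-segmented sparse matrix: each row owns a fixed slice with spare slots."""

    def __init__(self, n_rows: int, row_capacity: int) -> None:
        self.n_rows = n_rows
        self.cap = row_capacity
        # One contiguous buffer; row r owns the slice [r * cap, (r + 1) * cap).
        self.cols: List[int] = [-1] * (n_rows * row_capacity)
        self.vals: List[float] = [0.0] * (n_rows * row_capacity)
        self.row_len: List[int] = [0] * n_rows

    def insert(self, row: int, col: int, val: float) -> None:
        """Add val at (row, col) in place, without rebuilding or sorting."""
        start = row * self.cap
        for k in range(start, start + self.row_len[row]):
            if self.cols[k] == col:  # entry already present: accumulate
                self.vals[k] += val
                return
        if self.row_len[row] == self.cap:
            # Assumption: a real format would allocate a further segment here.
            raise OverflowError("row segment is full")
        k = start + self.row_len[row]
        self.cols[k], self.vals[k] = col, val
        self.row_len[row] += 1

    def matvec(self, x: List[float]) -> List[float]:
        """Compute y = A x, visiting only the occupied part of each row segment."""
        y = [0.0] * self.n_rows
        for r in range(self.n_rows):
            start = r * self.cap
            for k in range(start, start + self.row_len[r]):
                y[r] += self.vals[k] * x[self.cols[k]]
        return y
```

Leaving slack in each row trades memory for update speed: an insertion touches only its own row segment, whereas a packed CSR array would have to shift every later entry.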

    DynaProg for Scala

    Dynamic programming is an algorithmic technique to solve problems that follow Bellman's principle: optimal solutions depend on optimal sub-problem solutions. The core idea behind dynamic programming is to memoize intermediate results into matrices to avoid repeated computations. Solving a dynamic programming problem consists of two phases: filling one or more matrices with intermediate solutions for sub-problems, and reconstructing how the final result was obtained (backtracking). In textbooks, problems are usually described in terms of recurrence relations between matrix elements. Expressing dynamic programming problems in terms of recursive formulae involving matrix indices can be difficult and is often error prone, and the notation does not capture the essence of the underlying problem (for example, aligning two sequences). Moreover, writing a correct and efficient parallel implementation requires different competencies and often a significant amount of time. In this project, we present DynaProg, a domain-specific language (DSL) embedded in Scala that addresses dynamic programming problems on heterogeneous platforms. DynaProg allows the programmer to write concise programs based on ADP [1], using a pair of parsing grammar and algebra; these programs can then be executed either on CPU or on GPU. We evaluate the performance of our implementation against existing work and our own hand-optimized baseline implementations for both the CPU and GPU versions. Experimental results show that plain Scala has a large overhead and is recommended only for small sequences (≤1024), whereas the generated GPU version is comparable with existing implementations: matrix chain multiplication has the same performance as our hand-optimized version (142% of the execution time of [2]) for a sequence of 4096 matrices, Smith-Waterman is two times slower than [3] on a pair of sequences of 6144 elements, and RNA folding is on par with [4] (95% running time) for sequences of 4096 elements. 
[1] Robert Giegerich and Carsten Meyer. Algebraic Dynamic Programming.
[2] Chao-Chin Wu, Jenn-Yang Ke, Heshan Lin and Wu Chun Feng. Optimizing dynamic programming on graphics processing units via adaptive thread-level parallelism.
[3] Edans Flavius de O. Sandes and Alba Cristina M. A. de Melo. Smith-Waterman alignment of huge sequences with GPU in linear space.
[4] Guillaume Rizk and Dominique Lavenier. GPU accelerated RNA folding algorithm.
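
The two phases mentioned in the abstract, matrix filling and backtracking, can be seen in a plain hand-written version of matrix chain multiplication. The sketch below is ordinary Python, not DynaProg or ADP syntax; the function name `matrix_chain` and the `A1, A2, ...` labels are chosen here for illustration only.

```python
from typing import List, Tuple


def matrix_chain(dims: List[int]) -> Tuple[int, str]:
    """Cheapest multiplication order for matrices A1..An, where Ai is dims[i-1] x dims[i].

    Returns the minimal number of scalar multiplications and a parenthesization.
    """
    n = len(dims) - 1
    cost = [[0] * n for _ in range(n)]   # cost[i][j]: best cost for A(i+1)..A(j+1)
    split = [[0] * n for _ in range(n)]  # split[i][j]: split point achieving it

    # Phase 1: fill the matrices, shortest chains first.
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            best_cost, best_k = min(
                (cost[i][k] + cost[k + 1][j] + dims[i] * dims[k + 1] * dims[j + 1], k)
                for k in range(i, j)
            )
            cost[i][j], split[i][j] = best_cost, best_k

    # Phase 2: backtrack through the recorded split points.
    def paren(i: int, j: int) -> str:
        if i == j:
            return f"A{i + 1}"
        k = split[i][j]
        return f"({paren(i, k)} {paren(k + 1, j)})"

    return cost[0][n - 1], paren(0, n - 1)
```

Calling `matrix_chain([10, 30, 5, 60])` describes three matrices of sizes 10x30, 30x5 and 5x60 and returns `(4500, "((A1 A2) A3)")`. The index bookkeeping in the fill loop is the kind of detail the abstract describes as error prone.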

    Tiling Optimization For Nested Loops On Gpus

    Optimizing nested loops is considered an important topic and has been widely studied in parallel programming. With the development of GPU architectures, the performance of these computations can be significantly boosted by massively parallel hardware. General matrix-matrix multiplication is a typical example where executing such an algorithm on GPUs outperforms what can be obtained on multicore CPUs. However, achieving ideal performance on GPUs usually requires a lot of human effort to manage the massively parallel computation resources. Therefore, the efficient implementation of nested loops on GPUs has become a popular topic in recent years. We present our work based on the tiling strategy in this dissertation to address three kinds of popular problems. Different kinds of computations bring in different latency issues: dependencies in the computation may result in insufficient parallelism, and the performance of computations without dependencies may be degraded by intensive memory accesses. In this thesis, we tackle the challenges for each kind of problem and believe that other computations performed in nested loops can also benefit from the presented techniques. We improve a parallel approximation algorithm for the problem of scheduling jobs on parallel identical machines to minimize makespan with a high-dimensional tiling method. The algorithm is designed and optimized for solving this kind of problem efficiently on GPUs. Because the algorithm is based on a higher-dimensional dynamic programming approach, where dimensionality refers to the number of variables in the dynamic programming equation characterizing the problem, the existing implementation suffers from the pain of dimensionality and cannot fully utilize GPU resources. We design a novel data-partitioning technique to accelerate the higher-dimensional dynamic programming component of the algorithm. 
Both the load imbalance and exceeding memory capacity issues are addressed in our GPU solution. We present performance results to demonstrate how our proposed design improves the GPU utilization and makes it possible to solve large higher-dimensional dynamic programming problems within the limited GPU memory. Experimental results show that the GPU implementation achieves up to 25X speedup compared to the best existing OpenMP implementation. In addition, we focus on optimizing wavefront parallelism on GPUs. Wavefront parallelism is a well-known technique for exploiting the concurrency of applications that execute nested loops with uniform data dependencies. Recent research on such applications, which range from sequence alignment tools to partial differential equation solvers, has used GPUs to benefit from the massively parallel computing resources. Wavefront parallelism faces the load imbalance issue because the parallelism is passing along the diagonal. The tiling method has been introduced as a popular solution to address this issue. However, the use of hyperplane tiles increases the cost of synchronization and leads to poor data locality. In this paper, we present a highly optimized implementation of the wavefront parallelism technique that harnesses the GPU architecture. A balanced workload and maximum resource utilization are achieved with an extremely low synchronization overhead. We design the kernel configuration to significantly reduce the minimum number of synchronizations required and also introduce an inter-block lock to minimize the overhead of each synchronization. We evaluate the performance of our proposed technique for four different applications: Sequence Alignment, Edit Distance, Summed-Area Table, and 2DSOR. The performance results demonstrate that our method achieves speedups of up to six times compared to the previous best-known hyperplane tiling-based GPU implementation. 
Finally, we extend the hyperplane tiling to high order 2D stencil computations. Unlike wavefront parallelism that has dependence in the spatial dimension, dependence remains only across two adjacent time steps along the temporal dimension in stencil computations. Even if the no-dependence property significantly increases the parallelism obtained in the spatial dimensions, full parallelism may not be efficient on GPUs. Due to the limited cache capacity owned by each streaming multiprocessor, full parallelism can be obtained on global memory only, which has high latency to access. Therefore, the tiling technique can be applied to improve the memory efficiency by caching the small tiled blocks. Because the widely studied tiling methods, like overlapped tiling and split tiling, have considerable computation overhead caused by load imbalance or extra operations, we propose a time skewed tiling method, which is designed upon the GPU architecture. We work around the serialized computation issue and coordinate the intra-tile parallelism and inter-tile parallelism to minimize the load imbalance caused by pipelined processing. Moreover, we address the high-order stencil computations in our development, which has not been comprehensively studied. The proposed method achieves up to 3.5X performance improvement when the stencil computation is performed on a Moore neighborhood pattern
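
Wavefront parallelism, as described above for loops with uniform data dependencies, can be illustrated with Edit Distance, one of the four applications named in the abstract. The sketch below is sequential Python and is not the thesis's GPU kernel; it only shows the anti-diagonal traversal order, and the function name `edit_distance_wavefront` is an assumption of this example.

```python
def edit_distance_wavefront(a: str, b: str) -> int:
    """Levenshtein distance computed one anti-diagonal (wavefront) at a time."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j

    # Cells with i + j == s form one wavefront. Each depends only on the two
    # previous wavefronts, so the inner loop's iterations are independent of
    # one another and could be mapped to parallel threads.
    for s in range(2, m + n + 1):
        for i in range(max(1, s - n), min(m, s - 1) + 1):
            j = s - i
            substitution = d[i - 1][j - 1] + (a[i - 1] != b[j - 1])
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, substitution)
    return d[m][n]
```

The number of cells per wavefront grows from one to min(m, n) and then shrinks again, which is the load imbalance along the diagonal that the abstract mentions as the motivation for tiling.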

    Modern Optimization Algorithms and Applications: Architectural Layout Generation and Parallel Linear Programming

    This thesis examines two topics from the field of computational optimization: architectural layout generation and parallel linear programming. The first topic, a modern problem in heuristic optimization, focuses on deriving a general form of the optimization problem and solving it with the proposed Evolutionary Treemap algorithm. Tests of the algorithm's implementation within a highly scalable web application developed with Scala and the web service framework Play reveal that the algorithm is effective at generating layouts in multiple styles. The second topic, a classical problem in operations research, focuses on methodologies for implementing the Simplex Algorithm on a parallel computer for solving large-scale linear programming problems. Implementations of the algorithm's data-parallel and task-parallel forms illuminate the ideal method for accelerating a solver. The proposed Multi-Path Simplex Algorithm shows an average speedup of over two times that of a popular open-source solver, showing it is an effective methodology for solving linear programming problems

    Acceleration of Computational Geometry Algorithms for High Performance Computing Based Geo-Spatial Big Data Analysis

    Geo-Spatial computing and data analysis is the branch of computer science that deals with real world location-based data. Computational geometry algorithms, which process geometry and shapes, are one of the pillars of geo-spatial computing. Real world map and location-based data can be huge in size, and the data structures used to process them extremely big, leading to huge computational costs. Furthermore, Geo-Spatial datasets are growing on all V's (Volume, Variety, Value, etc.) and are becoming larger and more complex to process, in turn demanding more computational resources. High Performance Computing is a way to break down the problem so that it can run in parallel on big computers with massive processing power and hence reduce the computing time, delivering the same results but much faster. This dissertation explores different techniques to accelerate the processing of computational geometry algorithms and geo-spatial computing, such as Many-core Graphics Processing Units (GPU), Multi-core Central Processing Units (CPU), Multi-node setup with Message Passing Interface (MPI), Cache optimizations, Memory and Communication optimizations, load balancing, Algorithmic Modifications, Directive based parallelization with OpenMP or OpenACC, and Vectorization with compiler intrinsics (AVX). This dissertation applies at least one of the mentioned techniques to each of the following problems. A novel method to parallelize plane sweep based geometric intersection for GPU with directives is presented. Parallelization of plane sweep based Voronoi construction, parallelization of Segment tree construction, Segment tree queries and Segment tree-based operations are presented. Spatial autocorrelation and computation of Getis-Ord hotspots are also presented. Acceleration performance and speedup results are presented in each corresponding chapter

    Field D* pathfinding in weighted simplicial complexes

    The development of algorithms to efficiently determine an optimal path through a complex environment is a continuing area of research within Computer Science. When such environments can be represented as a graph, established graph search algorithms, such as Dijkstra's shortest path and A*, can be used. However, many environments are constructed from a set of regions that do not conform to a discrete graph. The Weighted Region Problem was proposed to address the problem of finding the shortest path through a set of such regions, weighted with values representing the cost of traversing the region. Robust solutions to this problem are computationally expensive since finding shortest paths across a region requires expensive minimisation. Sampling approaches construct graphs by introducing extra points on region edges and connecting them with edges criss-crossing the region. Dijkstra or A* are then applied to compute shortest paths. The connectivity of these graphs is high and such techniques are thus not particularly well suited to environments where the weights and representation frequently change. The Field D* algorithm, by contrast, computes the shortest path across a grid of weighted square cells and has replanning capabilities that cater for environmental changes. However, representing an environment as a weighted grid (an image) is not space-efficient since high resolution is required to produce accurate paths through areas containing features sensitive to noise. In this work, we extend Field D* to weighted simplicial complexes, specifically triangulations in 2D and tetrahedral meshes in 3D

    Accurate geometry reconstruction of vascular structures using implicit splines

    3-D visualization of blood vessels from standard medical datasets (e.g. CT or MRI) plays an important role in many clinical situations, including the diagnosis of vessel stenosis, virtual angioscopy, vascular surgery planning and computer aided vascular surgery. However, unlike other human organs, the vasculature system is a very complex network of vessels, which makes its 3-D visualization a very challenging task. Conventional techniques of medical volume data visualization are in general not well-suited for the above-mentioned tasks. This problem can be solved by reconstructing vascular geometry. Although various methods have been proposed for reconstructing vascular structures, most of these approaches are model-based, and are usually too ideal to correctly represent the actual variation presented by the cross-sections of a vascular structure. In addition, the underlying shape is usually expressed as polygonal meshes or in parametric forms, which is very inconvenient for implementing ramification of branching. As a result, the reconstructed geometries are not suitable for computer aided diagnosis and computer guided minimally invasive vascular surgery. In this research, we develop a set of techniques associated with the geometry reconstruction of vasculatures, including segmentation, modelling, reconstruction, exploration and rendering of vascular structures. The reconstructed geometry can not only help to greatly enhance the visual quality of 3-D vascular structures, but also provide an actual geometric representation of vasculatures, which can provide various benefits. The key findings of this research are as follows: 1. A localized hybrid level-set method of segmentation has been developed to extract the vascular structures from 3-D medical datasets. 2. 
A skeleton-based implicit modelling technique has been proposed and applied to the reconstruction of vasculatures, which can achieve an accurate geometric reconstruction of the vascular structures as implicit surfaces in an analytical form. 3. An accelerating technique using modern GPU (Graphics Processing Unit) is devised and applied to rendering the implicitly represented vasculatures. 4. The implicitly modelled vasculature is investigated for the application of virtual angioscopy

    Haptic Interaction with 3D oriented point clouds on the GPU

    Real-time point-based rendering and interaction with virtual objects is gaining popularity and importance as different haptic devices and technologies increasingly provide the basis for realistic interaction. Haptic Interaction is being used for a wide range of applications such as medical training, remote robot operators, tactile displays and video games. Virtual object visualization and interaction using haptic devices is the main focus; this process involves several steps such as Data Acquisition, Graphic Rendering, Haptic Interaction and Data Modification. This work presents a framework for Haptic Interaction using the GPU as a hardware accelerator, and includes an approach for enabling the modification of data during interaction. The results demonstrate the limits and capabilities of these techniques in the context of volume rendering for haptic applications. Also, the use of dynamic parallelism as a technique to scale the number of threads needed from the accelerator according to the interaction requirements is studied, allowing the editing of data sets of up to one million points at interactive haptic frame rates

    Graph Processing on GPU

    Ph.D., Doctor of Philosophy