
    Facilitating High Performance Code Parallelization

    With the surge of social media on the one hand, and the ease of obtaining information through cheap sensing devices and open source APIs on the other, the amount of data available for processing is growing rapidly. In addition, computing has recently been shifting towards massively parallel distributed systems, driven by the increasing importance of transforming data into knowledge in today’s data-driven world. At the core of data analysis for all sorts of applications lies pattern matching. Therefore, pattern matching algorithms must be parallelized efficiently in order to cater to this ever-increasing abundance of data. We propose a method that automatically detects a user’s single-threaded function call to search for a pattern using Java’s standard regular expression library, and replaces it with our own data-parallel implementation using Java bytecode injection. Our approach facilitates parallel processing on different platforms consisting of shared memory systems (using multithreading and NVIDIA GPUs) and distributed systems (using MPI and Hadoop). The major contributions of our implementation are a reduction in execution time and transparency to the user. In the same spirit of facilitating high performance code parallelization, we also present a tool that automatically generates Spark Java code from minimal user-supplied inputs. Spark has emerged as the tool of choice for efficient big data analysis. However, users still have to learn the complicated Spark API in order to write even a simple application. Our tool is easy to use, interactive and offers Spark’s native Java API performance. To the best of our knowledge, at the time of this writing no such tool has yet been implemented.
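The data-parallel decomposition behind such a pattern search can be illustrated with a minimal sketch. It is written in Python rather than Java, it is not the authors' implementation, and it rests on a loudly stated assumption: the pattern never matches across a line break, so each line can be searched independently. The names `count_in_lines` and `parallel_count` are hypothetical. Python threads are used only to show the split-search-combine structure; because of the interpreter lock they are not expected to give a real speedup here.

```python
import re
from concurrent.futures import ThreadPoolExecutor


def count_in_lines(pattern, lines):
    """Count non-overlapping matches in one chunk of lines."""
    return sum(1 for line in lines for _ in pattern.finditer(line))


def parallel_count(regex, text, workers=4):
    """Split the input by lines, search the chunks concurrently, add the counts.

    Assumption (not from the source): no match spans a line break.
    """
    pattern = re.compile(regex)
    lines = text.splitlines()
    # Contiguous chunks of roughly equal size, at most one per worker.
    size = max(1, -(-len(lines) // workers))  # ceiling division
    chunks = [lines[i:i + size] for i in range(0, len(lines), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(lambda chunk: count_in_lines(pattern, chunk), chunks))
```

Under the stated assumption, `parallel_count(regex, text)` returns the same count as a single sequential scan of `text`, which is the property a transparent replacement of the sequential call would need to preserve.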

    Generating and auto-tuning parallel stencil codes

    In this thesis, we present a software framework, Patus, which generates high performance stencil codes for different types of hardware platforms, including current multicore CPU and graphics processing unit architectures. The ultimate goals of the framework are productivity, portability (of both the code and performance), and high performance on the target platform. A stencil computation updates every grid point in a structured grid based on the values of its neighboring points. This class of computations occurs frequently in scientific and general purpose computing (e.g., in partial differential equation solvers or in image processing), justifying the focus on this kind of computation. The proposed key ingredients to achieve the goals of productivity, portability, and performance are domain specific languages (DSLs) and the auto-tuning methodology. The Patus stencil specification DSL allows the programmer to express a stencil computation concisely and independently of hardware architecture-specific details. Thus, it increases programmer productivity by relieving her or him of low level programming model issues and of manually applying hardware platform-specific code optimization techniques. The use of domain specific languages also implies code reusability: once implemented, the same stencil specification can be reused on different hardware platforms, i.e., the specification code is portable across hardware architectures. Constructing the language to be geared towards a special purpose makes it amenable to more aggressive optimizations and therefore to potentially higher performance. Auto-tuning provides performance and performance portability by automated adaptation of implementation-specific parameters to the characteristics of the hardware on which the code will run. By automating the process of parameter tuning (which essentially amounts to solving an integer programming problem in which the objective function is the number representing the code's performance as a function of the parameter configuration), the system can also be used more productively than if the programmer had to fine-tune the code manually. We show performance results for a variety of stencils, for which Patus was used to generate the corresponding implementations. The selection includes stencils taken from two real-world applications: a simulation of the temperature within the human body during hyperthermia cancer treatment and a seismic application. These examples demonstrate the framework's flexibility and ability to produce high performance code.
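To make the two central ideas concrete, the following minimal sketch shows a stencil computation and a toy auto-tuner. It is not Patus code and does not use the Patus DSL: the 5-point averaging stencil, the tile-size parameter, and the names `jacobi_step` and `autotune` are all assumptions chosen for illustration. The tuner uses plain exhaustive search over a small candidate list, a simpler stand-in for the integer programming search described above.

```python
import time


def jacobi_step(grid, block):
    """One 5-point averaging sweep over the interior, visited in block x block tiles."""
    n, m = len(grid), len(grid[0])
    out = [row[:] for row in grid]  # boundary values are copied unchanged
    for bi in range(1, n - 1, block):
        for bj in range(1, m - 1, block):
            for i in range(bi, min(bi + block, n - 1)):
                for j in range(bj, min(bj + block, m - 1)):
                    # Each interior point becomes the mean of its four neighbors.
                    out[i][j] = 0.25 * (grid[i - 1][j] + grid[i + 1][j]
                                        + grid[i][j - 1] + grid[i][j + 1])
    return out


def autotune(grid, candidates, repeats=3):
    """Time every candidate tile size and return the fastest one measured."""
    best_block, best_time = None, float("inf")
    for block in candidates:
        start = time.perf_counter()
        for _ in range(repeats):
            jacobi_step(grid, block)
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best_block, best_time = block, elapsed
    return best_block
```

The tile size changes only the traversal order, not the result, which is what makes it a safe tuning parameter: `jacobi_step(grid, 2)` and `jacobi_step(grid, 64)` produce identical grids, and `autotune(grid, [1, 2, 4, 8])` simply reports whichever was fastest on the machine at hand.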

    Proceedings of the Third International Workshop on Sustainable Ultrascale Computing Systems (NESUS 2016) Sofia, Bulgaria

    Proceedings of: Third International Workshop on Sustainable Ultrascale Computing Systems (NESUS 2016). Sofia (Bulgaria), October 6-7, 2016

    Geometrical Calibration and Filter Optimization for Cone-Beam Computed Tomography

    This thesis will discuss the requirements of a software library for tomography and will derive a framework which can be used to realize various applications in cone-beam computed tomography (CBCT). The presented framework is self-contained and is realized using the MATLAB environment in combination with native low-level technologies (C/C++ and CUDA) to improve its computational performance, while providing accessibility and extensibility through the use of a scripting language environment. On top of this framework, the realization of Katsevich’s algorithm on multicore hardware will be explained and the resulting implementation will be compared to the Feldkamp, Davis and Kress (FDK) algorithm. It will also be shown that this helical reconstruction method has the potential to reduce the measurement uncertainty. However, misalignment artifacts appear more severe in the helical reconstructions from real data than in the circular ones. Especially for helical CBCT (H-CBCT), this fact suggests that a precise calibration of the computed tomography (CT) system is essential. As a consequence, a self-calibration method will be designed that is able to estimate the misalignment parameters from the cone-beam projection data without the need for any additional measurements. The presented method employs a multi-resolution 2D-3D registration technique and a novel volume update scheme in combination with a stochastic reprojection strategy to achieve a reasonable runtime performance. The presented results will show that this method reaches sub-voxel accuracy and can compete with current state-of-the-art online and offline calibration approaches. Additionally, for the construction of filters in the area of limited-angle tomography, a general scheme will be derived which uses the Approximate Inverse (AI) to compute an optimized set of 2D angle-dependent projection filters. Optimal sets of filters are then precomputed for two angular range setups and will be reused to perform various evaluations on multiple datasets with a filtered backprojection (FBP)-type method. This approach will be compared to the standard FDK algorithm and to the simultaneous iterative reconstruction technique (SIRT). The results of the study show that the introduced filter optimization produces results comparable to those of SIRT with respect to the reduction of reconstruction artifacts, while its runtime is comparable to that of the FDK algorithm.

    Auto-generation of Parallel Finite-Differencing Code for MPI, TBB and CUDA
