
    Density Estimations for Approximate Query Processing on SIMD Architectures

    Approximate query processing (AQP) is an interesting alternative to exact query processing. It is a tool for dealing with huge data volumes where response time is more important than perfect accuracy (typically the case during the initial phase of data exploration). There are many techniques for AQP; one of them is based on probability density functions (PDFs). PDFs are typically calculated using nonparametric, data-driven methods. One of the most popular nonparametric methods is the kernel density estimator (KDE). However, a very serious drawback of using KDEs is the large number of calculations required to compute them. The shape of the final density function is very sensitive to an entity called the bandwidth, or smoothing parameter. Calculating its optimal value is not a trivial task and is in general very time consuming. In this paper we investigate the possibility of utilizing two SIMD architectures, SSE CPU extensions and NVIDIA's CUDA architecture, to accelerate the bandwidth search. Our experiments show orders-of-magnitude improvements over a simple sequential implementation of the classical algorithms used for that task.
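
    As a rough illustration of why bandwidth selection is costly (a sketch, not the paper's code), the dominant O(n^2) term of a Gaussian-kernel least-squares cross-validation score maps naturally onto one GPU thread per sample; all names below are illustrative:

        // Cross terms of an LSCV-style score for candidate bandwidth h; the host
        // reduces partial[] and adds the score's constant and normalization terms.
        __global__ void lscvCrossTerms(const float *x, int n, float h, float *partial)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i >= n) return;
            float s = 0.0f;
            for (int j = 0; j < n; ++j) {
                if (j == i) continue;
                float u = (x[i] - x[j]) / h;
                // Gaussian kernel convolved with itself (variance 2) minus twice the kernel
                s += 0.70710678f * expf(-0.25f * u * u) - 2.0f * expf(-0.5f * u * u);
            }
            partial[i] = s;
        }

    A host-side loop would evaluate this score over a grid of candidate h values; the quadratic cost per candidate is what both the SSE and CUDA paths attack.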

    Data-Parallel Hashing Techniques for GPU Architectures

    Hash tables are one of the most fundamental data structures for effectively storing and accessing sparse data, with widespread usage in domains ranging from computer graphics to machine learning. This study surveys the state-of-the-art research on data-parallel hashing techniques for emerging massively-parallel, many-core GPU architectures. Key factors affecting the performance of different hashing schemes are identified and used to suggest best practices and pinpoint areas for further research.
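
    A minimal sketch of one widely used data-parallel scheme such surveys cover, open addressing with lock-free slot claims via atomicCAS (illustrative code, not taken from any surveyed paper):

        #define EMPTY_KEY 0xFFFFFFFFu

        __global__ void insertKeys(unsigned int *tableKeys, unsigned int *tableVals,
                                   unsigned int capacity,
                                   const unsigned int *keys, const unsigned int *vals, int n)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i >= n) return;
            unsigned int k = keys[i];
            unsigned int slot = (k * 2654435761u) % capacity;   // simple multiplicative hash
            for (unsigned int probe = 0; probe < capacity; ++probe) {
                unsigned int prev = atomicCAS(&tableKeys[slot], EMPTY_KEY, k);
                if (prev == EMPTY_KEY || prev == k) {           // claimed the slot, or duplicate key
                    tableVals[slot] = vals[i];                  // visible to queries after a sync
                    return;
                }
                slot = (slot + 1) % capacity;                   // linear probing
            }
        }

    The probing strategy, hash function, and load factor are exactly the kind of performance knobs that such best-practice recommendations address.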

    Dynamic autotuning of adaptive fast multipole methods on hybrid multicore CPU & GPU systems

    We discuss an implementation of adaptive fast multipole methods targeting hybrid multicore CPU and GPU systems. Based on previous experience with the computational profile of our version of the fast multipole algorithm, suitable parts are off-loaded to the GPU, while the remaining parts are threaded and executed concurrently by the CPU. The parameters defining the algorithm affect its performance, and by measuring this effect we are able to dynamically balance the algorithm towards optimal performance. Our setup exploits the dynamic nature of the computations and is therefore of a general character.
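
    The balancing idea can be sketched as a simple measurement-driven feedback loop (an illustration of the general approach only; the actual parameters tuned in the paper are not reproduced here):

        // Re-balance the fraction of work off-loaded to the GPU from measured
        // per-step CPU and GPU times; damped to avoid oscillation.
        double retuneGpuFraction(double gpuFraction, double cpuSeconds, double gpuSeconds)
        {
            double imbalance = (cpuSeconds - gpuSeconds) / (cpuSeconds + gpuSeconds);
            gpuFraction += 0.5 * imbalance * gpuFraction;   // CPU slower -> off-load more
            if (gpuFraction < 0.05) gpuFraction = 0.05;     // keep both resources busy
            if (gpuFraction > 0.95) gpuFraction = 0.95;
            return gpuFraction;
        }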

    Data Protection: Combining Fragmentation, Encryption, and Dispersion, a final report

    Hardening data protection using multiple methods rather than 'just' encryption is of paramount importance when considering continuous and powerful attacks aiming to observe, steal, alter, or even destroy private and confidential information. Our purpose is to look at cost-effective data protection by combining fragmentation, encryption, and dispersion over several physical machines. This involves deriving general schemes to protect data everywhere throughout a network of machines where they are processed, transmitted, and stored during their entire life cycle. This is enabled by a number of parallel and distributed architectures using various sets of cores or machines, ranging from general-purpose GPUs to multiple clouds. In this report, we first present a general and conceptual description of what a fragmentation, encryption, and dispersion system (FEDS) should be, including a number of high-level requirements such systems ought to meet. Then, we focus on two kinds of fragmentation. The first is a selective separation of information into two fragments, a public one and a private one. We describe a family of processes and address not only the question of performance but also the questions of memory occupation and of the integrity or quality of the restitution of the information, and we conclude with an analysis of the level of security provided by our algorithms. Second, we analyze works on general dispersion systems that operate bitwise without regard to data structure, and on fragmentation of information defined along an object-oriented data structure or a record structure stored in a relational database.
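
    The shape of the two-fragment separation can be sketched as follows; the byte-interleaving criterion is a hypothetical stand-in for illustration only, not one of the report's actual separation processes, and the encryption and dispersion stages are deliberately left out:

        #include <cstddef>
        #include <cstdint>
        #include <vector>

        struct Fragments {
            std::vector<uint8_t> publicPart;    // may be stored with weaker protection
            std::vector<uint8_t> privatePart;   // to be encrypted before dispersion
        };

        // Route every k-th byte to the private fragment and the rest to the public
        // one; restitution interleaves the two streams back in their original order.
        Fragments fragmentEveryKth(const std::vector<uint8_t> &data, size_t k)
        {
            Fragments f;
            for (size_t i = 0; i < data.size(); ++i)
                (i % k == 0 ? f.privatePart : f.publicPart).push_back(data[i]);
            return f;
        }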

    Performance report and optimized implementations of Weather & Climate dwarfs on multi-node systems

    This document is one of the deliverable reports created for the ESCAPE project. ESCAPE stands for Energy-efficient Scalable Algorithms for Weather Prediction at Exascale. The project develops world-class, extreme-scale computing capabilities for European operational numerical weather prediction and future climate models. This is done by identifying Weather & Climate dwarfs, which are key patterns in terms of computation and communication (in the spirit of the Berkeley dwarfs). These dwarfs are then optimised for different hardware architectures (single and multi-node) and alternative algorithms are explored. Performance portability is addressed through the use of domain-specific languages. Here we summarize the work performed on optimizing the dwarfs, focusing on CPU multi-node systems and multi-GPUs. We limit ourselves to a subset of the dwarf configurations chosen by the consortium. Intra-node optimizations of the dwarfs and energy-specific optimizations are described in Deliverable D3.3. To cover the important algorithmic motifs, we picked dwarfs related to the dynamical core as well as column physics. Specifically, we focused on the formulation relevant to spectral codes like ECMWF's IFS code. The main findings of this report are: (a) up to 30% performance gain on CPU-based multi-node systems compared to the optimized versions of the dwarfs from Task 3.3 (see D3.3); (b) up to 10x performance gain on multiple GPUs from optimizations that keep data resident on the GPU and enable fast inter-GPU communication mechanisms; and (c) multi-GPU systems featuring a high-bandwidth all-to-all interconnect topology with NVLink/NVSwitch hardware are particularly well suited to these algorithms.
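
    Finding (b) rests on a standard CUDA mechanism that can be sketched in a few lines (an illustration, not the project's code): once peer access between two GPUs is enabled, exchanges can move directly over NVLink instead of staging through host memory.

        #include <cuda_runtime.h>

        // Copy a halo buffer from GPU 0 to GPU 1 directly, if the hardware allows it.
        void exchangeHalo(float *dstOnGpu1, const float *srcOnGpu0, size_t bytes,
                          cudaStream_t stream)
        {
            int canAccess = 0;
            cudaDeviceCanAccessPeer(&canAccess, 1, 0);
            if (canAccess) {
                cudaSetDevice(1);
                cudaDeviceEnablePeerAccess(0, 0);  // returns an ignorable error if already enabled
                cudaMemcpyPeerAsync(dstOnGpu1, 1, srcOnGpu0, 0, bytes, stream);
            }
        }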

    GHOST: Building blocks for high performance sparse linear algebra on heterogeneous systems

    While many of the architectural details of future exascale-class high performance computer systems are still a matter of intense research, there appears to be a general consensus that they will be strongly heterogeneous, featuring "standard" as well as "accelerated" resources. Today, such resources are available as multicore processors, graphics processing units (GPUs), and other accelerators such as the Intel Xeon Phi. Any software infrastructure that claims usefulness for such environments must be able to meet their inherent challenges: massive multi-level parallelism, topology, asynchronicity, and abstraction. The "General, Hybrid, and Optimized Sparse Toolkit" (GHOST) is a collection of building blocks that targets algorithms dealing with sparse matrix representations on current and future large-scale systems. It implements the "MPI+X" paradigm, has a pure C interface, and provides hybrid-parallel numerical kernels, intelligent resource management, and truly heterogeneous parallelism for multicore CPUs, Nvidia GPUs, and the Intel Xeon Phi. We describe the details of its design with respect to the challenges posed by modern heterogeneous supercomputers and recent algorithmic developments. Implementation details which are indispensable for achieving high efficiency are pointed out and their necessity is justified by performance measurements or predictions based on performance models. The library code and several applications are available as open source. We also provide instructions on how to make use of GHOST in existing software packages, together with a case study which demonstrates the applicability and performance of GHOST as a component within a larger software stack.
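
    The "MPI+X" paradigm GHOST implements can be sketched generically (an illustration of the pattern only, not GHOST's actual C interface): each rank owns a block of rows, and X (here OpenMP) threads the local sparse matrix-vector product.

        // Local part of a distributed CRS SpMV; remote entries of x would be
        // gathered with nonblocking MPI communication overlapped with this loop.
        void localSpmvCrs(int nLocalRows, const int *rowPtr, const int *colIdx,
                          const double *val, const double *x, double *y)
        {
            #pragma omp parallel for schedule(static)
            for (int r = 0; r < nLocalRows; ++r) {
                double sum = 0.0;
                for (int k = rowPtr[r]; k < rowPtr[r + 1]; ++k)
                    sum += val[k] * x[colIdx[k]];
                y[r] = sum;
            }
        }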

    Acceleration of Computational Geometry Algorithms for High Performance Computing Based Geo-Spatial Big Data Analysis

    Geo-spatial computing and data analysis is the branch of computer science that deals with real-world location-based data. Computational geometry algorithms, which process geometry and shapes, are one of the pillars of geo-spatial computing. Real-world map and location-based data can be huge in size, and the data structures used to process them extremely large, leading to high computational costs. Furthermore, geo-spatial datasets are growing on all V's (Volume, Variety, Value, etc.) and are becoming larger and more complex to process, in turn demanding more computational resources. High performance computing is a way to break the problem down so that it can run in parallel on big computers with massive processing power, reducing the computing time while delivering the same results much faster. This dissertation explores different techniques to accelerate computational geometry algorithms and geo-spatial computing, including many-core graphics processing units (GPUs), multi-core central processing units (CPUs), multi-node setups with the Message Passing Interface (MPI), cache optimizations, memory and communication optimizations, load balancing, algorithmic modifications, directive-based parallelization with OpenMP or OpenACC, and vectorization with compiler intrinsics (AVX). At least one of these techniques is applied to each of the following problems: a novel method to parallelize plane-sweep-based geometric intersection on GPUs with directives; parallelization of plane-sweep-based Voronoi construction; parallelization of segment tree construction, segment tree queries, and segment-tree-based operations; and spatial autocorrelation and the computation of Getis-Ord hotspots. Acceleration performance and speedup results are presented in each corresponding chapter.
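
    As one concrete example of the directive-based approach (an illustrative sketch; the dissertation's own kernels are not reproduced), the O(n^2) neighbor sums behind a Getis-Ord style hotspot score parallelize directly with OpenMP:

        #include <omp.h>

        // Per-point sums over neighbors within a distance band (binary weights);
        // these sums feed the Gi* hotspot statistic.
        void neighborSums(const float *xs, const float *ys, const float *vals, int n,
                          float radius, float *sums, int *counts)
        {
            float r2 = radius * radius;
            #pragma omp parallel for schedule(dynamic, 64)
            for (int i = 0; i < n; ++i) {
                float s = 0.0f; int c = 0;
                for (int j = 0; j < n; ++j) {
                    float dx = xs[i] - xs[j], dy = ys[i] - ys[j];
                    if (dx * dx + dy * dy <= r2) { s += vals[j]; ++c; }
                }
                sums[i] = s; counts[i] = c;
            }
        }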

    CUDACLAW: A high-performance programmable GPU framework for the solution of hyperbolic PDEs

    We present cudaclaw, a CUDA-based high-performance data-parallel framework for the solution of multidimensional hyperbolic partial differential equation (PDE) systems, i.e., equations describing wave motion. cudaclaw allows computational scientists to solve such systems on GPUs without being burdened by the need to write CUDA code or worry about thread and block details, data layout, and data movement between the different levels of the memory hierarchy. The user defines the set of PDEs to be solved via a CUDA-independent serial Riemann solver, and the framework takes care of orchestrating the computations and data transfers to maximize arithmetic throughput. cudaclaw treats the different spatial dimensions separately to allow suitable block sizes and dimensions to be used in the different directions, and includes a number of optimizations to minimize access to global memory.
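
    The division of labor is the key point: the user writes only a pointwise solver and the framework maps it over cell interfaces. A sketch for scalar advection (the signature is illustrative, not cudaclaw's actual interface):

        // Riemann solver for q_t + u q_x = 0: one wave carrying the jump in q,
        // split into left- and right-going fluctuations by the sign of the speed.
        __host__ __device__ void advectionRiemann(float qLeft, float qRight, float speed,
                                                  float *wave, float *amdq, float *apdq)
        {
            *wave = qRight - qLeft;
            *amdq = speed < 0.0f ? speed * (*wave) : 0.0f;  // left-going fluctuation
            *apdq = speed > 0.0f ? speed * (*wave) : 0.0f;  // right-going fluctuation
        }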

    Analysis of heterogeneous computing approaches to simulating heat transfer in heterogeneous material

    The simulation of heat flow through heterogeneous material is important for the design of structural and electronic components. Classical analytical solutions to the heat equation PDE are not known for many such domains, even those having simple geometries. The finite element method can provide approximations to a weak-form continuum solution, with increasing accuracy as the number of degrees of freedom in the model increases. This comes at a cost of increased memory usage and computation time, even when taking advantage of sparse matrix techniques for the finite element system matrix. We summarize recent approaches to solving problems in structural mechanics and steady-state heat conduction which do not require the explicit assembly of any system matrices, and adapt them to a method for solving the time-dependent flow of heat. These approaches are highly parallelizable and can be performed on graphics processing units (GPUs). Furthermore, they lend themselves to the simulation of heterogeneous material with a minimum of added complexity. We present the mathematical framework of assembly-free FEM approaches, through which we summarize the benefits of GPU computation. We discuss our implementation using the OpenCL computing framework, and show how it is further adapted for use on multiple GPUs. We compare the performance of single- and dual-GPU implementations of our method with previous GPU computing strategies from the literature and with a CPU sparse matrix approach. The utility of the novel method is demonstrated through the solution of a real-world coefficient inverse problem that requires thousands of transient heat flow simulations, each of which involves solving a one-million-degree-of-freedom linear system over hundreds of time steps.
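
    The core of an assembly-free approach is a matrix-free operator application inside an iterative solver. A sketch of the element-by-element pattern (CUDA-style code for brevity, though the implementation described here uses OpenCL; 8-node scalar elements are an illustrative assumption):

        // y += K*x without ever forming the global matrix K: each thread applies
        // one element's 8x8 local stiffness matrix and scatter-adds the result.
        __global__ void matrixFreeApply(const float *localK,   // nElem x 8 x 8
                                        const int *elemNodes,  // nElem x 8 global node ids
                                        const float *x, float *y, int nElem)
        {
            int e = blockIdx.x * blockDim.x + threadIdx.x;
            if (e >= nElem) return;
            const float *Ke = localK + e * 64;
            const int   *nd = elemNodes + e * 8;
            for (int a = 0; a < 8; ++a) {
                float s = 0.0f;
                for (int b = 0; b < 8; ++b)
                    s += Ke[a * 8 + b] * x[nd[b]];
                atomicAdd(&y[nd[a]], s);   // y must be zeroed before the launch
            }
        }

    Wrapped in an iterative solver such as conjugate gradients, this trades the memory footprint of an assembled sparse matrix for recomputation, which suits GPU hardware.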

    A GPGPU based program to solve the TDSE in intense laser fields through the finite difference approach

    We present a general-purpose computing on graphics processing units (GPGPU) based computational program and framework for the electronic dynamics of atomic systems under intense laser fields. We present our results using the case of hydrogen; however, the code is trivially extensible to tackle problems within the single-active-electron (SAE) approximation. Building on our previous work, we introduce CLTDSE, the first available GPGPU-based implementation of the Taylor, Runge-Kutta, and Lanczos based methods created with strong-field ab initio simulations specifically in mind. The code makes use of finite difference methods and the OpenCL framework for GPU acceleration. The specific example system used is the classic test case of hydrogen. After introducing the standard theory and the specific quantities which are calculated, the code, including installation and usage, is discussed in depth. This is followed by some examples and a short benchmark between an 8-hardware-thread (i.e., logical core) Intel Xeon CPU and an AMD 6970 GPU, where the parallel algorithm runs 10 times faster on the GPU than on the CPU.
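
    The inner operation shared by Taylor, Runge-Kutta, and Lanczos propagators is applying the Hamiltonian on the grid. A sketch of the 1D finite-difference case (CUDA-style code for illustration, whereas CLTDSE uses OpenCL; atomic units and a 3-point stencil are assumed):

        #include <cuComplex.h>

        // H psi with H = -1/2 d^2/dx^2 + V(x); the operator is real, so it acts
        // on the real and imaginary parts of psi independently.
        __global__ void applyHamiltonian(const cuFloatComplex *psi, const float *V,
                                         cuFloatComplex *hpsi, int n, float invDx2)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i <= 0 || i >= n - 1) return;  // psi pinned to zero at the box edges
            float re = -0.5f * invDx2 * (cuCrealf(psi[i-1]) - 2.0f * cuCrealf(psi[i]) + cuCrealf(psi[i+1]))
                       + V[i] * cuCrealf(psi[i]);
            float im = -0.5f * invDx2 * (cuCimagf(psi[i-1]) - 2.0f * cuCimagf(psi[i]) + cuCimagf(psi[i+1]))
                       + V[i] * cuCimagf(psi[i]);
            hpsi[i] = make_cuFloatComplex(re, im);
        }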