34 research outputs found

    OpenCL Actors - Adding Data Parallelism to Actor-based Programming with CAF

    The actor model of computation was designed for seamless support of concurrency and distribution. However, it remains unspecific about data-parallel program flows, while the available processing power of modern many-core hardware such as graphics processing units (GPUs) or coprocessors increases the relevance of data parallelism for general-purpose computation. In this work, we introduce OpenCL-enabled actors to the C++ Actor Framework (CAF). This offers a high-level interface for accessing any OpenCL device without leaving the actor paradigm. The new type of actor is integrated into the runtime environment of CAF and gives rise to transparent message passing in distributed systems on heterogeneous hardware. Following the actor logic in CAF, OpenCL kernels can be composed while encapsulated in C++ actors, and hence operate in a multi-stage fashion on data resident at the GPU. Developers are thus enabled to build complex data-parallel programs from primitives without leaving the actor paradigm or sacrificing performance. Our evaluations on commodity GPUs, an Nvidia TESLA, and an Intel PHI reveal the expected linear scaling behavior when offloading larger workloads. For sub-second duties, the efficiency of offloading was found to differ substantially between devices. Moreover, our findings indicate a negligible overhead over programming with the native OpenCL API. Comment: 28 pages

    Fusion OLAP : Fusing the Pros of MOLAP and ROLAP Together for In-memory OLAP

    OLAP models can be categorized into two types: MOLAP (multidimensional OLAP) and ROLAP (relational OLAP). In particular, MOLAP is efficient in multidimensional computing at the cost of cube maintenance, while ROLAP reduces the data storage size at the cost of expensive multidimensional join operations. In this paper, we propose a novel Fusion OLAP model that fuses the multidimensional computing model and the relational storage model to combine the best aspects of both the MOLAP and ROLAP worlds. This is achieved by mapping the relational tables into a virtual multidimensional model and binding the multidimensional operations to a set of vector indexes to enable multidimensional computing on relational tables. The Fusion OLAP model can be integrated into state-of-the-art in-memory databases with additional surrogate key indexes and vector indexes. We compared the Fusion OLAP implementations with three leading analytical in-memory databases. Our comprehensive experimental results show that the Fusion OLAP implementations can achieve up to 35, 365, and 169 percent performance improvements based on the Hyper, Vectorwise, and MonetDB databases, respectively, for the Star Schema Benchmark (SSB) with scale factor 100. Peer reviewed

    Efficient computation of the matrix square root in heterogeneous platforms

    Master's dissertation in Informatics Engineering. Matrix algorithms often deal with large amounts of data at a time, which impairs efficient cache memory usage. Recent collaborative work between the Numerical Algorithms Group and the University of Minho led to a blocked approach to the matrix square root algorithm with significant efficiency improvements, particularly in a multicore shared memory environment. Distributed memory architectures were left unexplored. In these systems data is distributed across multiple memory spaces, including those associated with specialized accelerator devices, such as GPUs. Systems with these devices are known as heterogeneous platforms. This dissertation focuses on studying the blocked matrix square root algorithm, first in a multicore environment, and then in heterogeneous platforms. Two types of hardware accelerators are explored: Intel Xeon Phi coprocessors and NVIDIA CUDA-enabled GPUs. The initial implementation confirmed the advantages of the blocked method and showed excellent scalability in a multicore environment. The same implementation was also used on the Intel Xeon Phi, but the performance obtained lagged behind both the expected behaviour and the CPU-only alternative. Several optimization techniques were applied to the common implementation, which managed to reduce the gap between the two environments. The implementation for CUDA-enabled devices followed a different programming model and was not able to benefit from any of the previous solutions. It also required the implementation of BLAS and LAPACK routines, since no existing package fits the requirements of this application. The measured performance also showed that the CPU-only implementation is still the fastest. Fundação para a Ciência e a Tecnologia (FCT) - Program UT Austin | Portugal

    Interactive High Performance Volume Rendering

    This thesis is about Direct Volume Rendering on high performance computing systems. As direct rendering methods do not create a lower-dimensional geometric representation, the whole scientific dataset must be kept in memory. Thus, this family of algorithms has a tremendous resource demand. Direct Volume Rendering algorithms in general are well suited to implementation on dedicated graphics hardware. Nevertheless, high performance computing systems often do not provide resources for hardware-accelerated rendering, so the visualization algorithm must be implemented for the available general-purpose hardware. Ever-growing datasets that imply copying large amounts of data from the compute system to the workstation of the scientist, and the need to review intermediate simulation results, make porting Direct Volume Rendering to high performance computing systems highly relevant. The contribution of this thesis is twofold. As part of the first contribution, after devising a software architecture for general implementations of Direct Volume Rendering on highly parallel platforms, parallelization issues and implementation details for various modern architectures are discussed. The contribution results in a highly parallel implementation that targets several platforms. The second contribution is concerned with the display phase of the “Distributed Volume Rendering Pipeline”. Rendering on a high performance computing system typically implies displaying the rendered result at a remote location. This thesis presents a remote rendering technique that is capable of hiding latency and can thus be used in an interactive environment.

    High-performance generalized tensor operations: A compiler-oriented approach

    The efficiency of tensor contraction is of great importance. Compilers cannot optimize it well enough to come close to the performance of expert-tuned implementations. All existing approaches that provide competitive performance require optimized external code. We introduce a compiler optimization that reaches the performance of optimized BLAS libraries without the need for an external implementation or automatic tuning. Our approach provides competitive performance across hardware architectures and can be generalized to deliver the same benefits for algebraic path problems. By making fast linear algebra kernels available to everyone, we expect productivity increases when optimized libraries are not available. © 2018 Association for Computing Machinery