OpenCL Actors - Adding Data Parallelism to Actor-based Programming with CAF
The actor model of computation has been designed for a seamless support of
concurrency and distribution. However, it remains unspecific about data
parallel program flows, while available processing power of modern many core
hardware such as graphics processing units (GPUs) or coprocessors increases the
relevance of data parallelism for general-purpose computation.
In this work, we introduce OpenCL-enabled actors to the C++ Actor Framework
(CAF). This offers a high level interface for accessing any OpenCL device
without leaving the actor paradigm. The new type of actor is integrated into
the runtime environment of CAF and gives rise to transparent message passing in
distributed systems on heterogeneous hardware. Following the actor logic in
CAF, OpenCL kernels can be composed while encapsulated in C++ actors, hence
operate in a multi-stage fashion on data resident at the GPU. Developers are
thus enabled to build complex data parallel programs from primitives without
leaving the actor paradigm, nor sacrificing performance. Our evaluations on
commodity GPUs, an Nvidia TESLA, and an Intel PHI reveal the expected linear
scaling behavior when offloading larger workloads. For sub-second duties, the
efficiency of offloading was found to largely differ between devices. Moreover,
our findings indicate a negligible overhead over programming with the native
OpenCL API.
Comment: 28 pages
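The multi-stage composition described in this abstract can be illustrated with a small sketch. It is not CAF or OpenCL code: `Device`, `KernelActor`, and the integer buffer handles are hypothetical stand-ins, and messages are delivered synchronously for brevity, whereas a real actor runtime delivers them asynchronously. The point it tries to show is that stages exchange only a handle, so the data stays in (simulated) device memory between kernels.

```python
class Device:
    """Hypothetical stand-in for device memory addressed by integer handles."""

    def __init__(self):
        self._buffers = {}
        self._next_handle = 0

    def upload(self, data):
        handle = self._next_handle
        self._next_handle += 1
        self._buffers[handle] = list(data)
        return handle

    def download(self, handle):
        return list(self._buffers[handle])

    def run(self, kernel, handle):
        # Apply an element-wise "kernel" in place; nothing is copied to the host.
        buf = self._buffers[handle]
        for i, x in enumerate(buf):
            buf[i] = kernel(x)


class KernelActor:
    """Hypothetical actor wrapping one kernel; forwards the handle to the next stage."""

    def __init__(self, device, kernel, next_actor=None):
        self.device = device
        self.kernel = kernel
        self.next_actor = next_actor

    def send(self, handle):
        self.device.run(self.kernel, handle)
        if self.next_actor is not None:
            return self.next_actor.send(handle)
        return handle


device = Device()
square = KernelActor(device, lambda x: x * x)
add_one = KernelActor(device, lambda x: x + 1, next_actor=square)

handle = device.upload([1.0, 2.0, 3.0])
add_one.send(handle)              # stage 1: add one, stage 2: square
result = device.download(handle)  # single copy back to the host at the end
```

Only `upload` and `download` move data across the simulated host/device boundary; the two stages in between pass the handle alone.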
Fusion OLAP : Fusing the Pros of MOLAP and ROLAP Together for In-memory OLAP
OLAP models fall into two categories: MOLAP (multidimensional OLAP) and ROLAP (relational OLAP). MOLAP is efficient at multidimensional computing at the cost of cube maintenance, while ROLAP reduces data storage size at the cost of expensive multidimensional join operations. In this paper, we propose a novel Fusion OLAP model that fuses the multidimensional computing model with the relational storage model to combine the best aspects of MOLAP and ROLAP. This is achieved by mapping the relational tables into a virtual multidimensional model and binding the multidimensional operations to a set of vector indexes, which enables multidimensional computing on relational tables. The Fusion OLAP model can be integrated into state-of-the-art in-memory databases by adding surrogate key indexes and vector indexes. We compared Fusion OLAP implementations with three leading analytical in-memory databases. Our comprehensive experimental results show that the Fusion OLAP implementations achieve performance improvements of up to 35, 365, and 169 percent for implementations based on the Hyper, Vectorwise, and MonetDB databases, respectively, on the Star Schema Benchmark (SSB) with scale factor 100.
Peer reviewed
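One way to picture the vector-index idea from this abstract is the sketch below. The abstract does not define the index layout, so the design here is an assumption: each dimension table gets a list, addressed by surrogate key, that holds either a group coordinate or -1 for rows filtered out. A single scan of the fact table then aggregates straight into a cube without a relational join. All table contents and function names are made up for illustration.

```python
def build_vector_index(dim_rows, predicate, group_key):
    """Map each surrogate key (row position) to a cube coordinate, or -1 if filtered out."""
    groups = {}  # group value -> coordinate, in first-seen order
    index = []
    for row in dim_rows:
        if predicate(row):
            index.append(groups.setdefault(group_key(row), len(groups)))
        else:
            index.append(-1)
    return index, list(groups)


def aggregate(fact_rows, indexes, measure):
    """Scan the fact table once, using foreign keys as positions into the vector indexes."""
    cube = {}
    for row in fact_rows:
        coords = []
        for fk_column, index in indexes.items():
            coord = index[row[fk_column]]
            if coord < 0:
                break  # this fact row fails one of the dimension filters
            coords.append(coord)
        else:
            key = tuple(coords)
            cube[key] = cube.get(key, 0) + row[measure]
    return cube


# Illustrative data: row position doubles as the surrogate key.
customers = [
    {"region": "ASIA", "nation": "CHINA"},
    {"region": "EUROPE", "nation": "FRANCE"},
    {"region": "ASIA", "nation": "JAPAN"},
]
dates = [{"year": 1997}, {"year": 1998}]
facts = [
    {"customer": 0, "date": 0, "revenue": 10},
    {"customer": 1, "date": 0, "revenue": 20},
    {"customer": 2, "date": 1, "revenue": 30},
    {"customer": 0, "date": 1, "revenue": 40},
    {"customer": 1, "date": 1, "revenue": 5},
]

# Revenue of ASIA customers, grouped by nation and year.
cust_index, nations = build_vector_index(
    customers, lambda r: r["region"] == "ASIA", lambda r: r["nation"])
date_index, years = build_vector_index(dates, lambda r: True, lambda r: r["year"])
cube = aggregate(facts, {"customer": cust_index, "date": date_index}, "revenue")
```

The cube keys are coordinate pairs; `nations` and `years` translate them back into labels, for example `(1, 1)` is JAPAN in 1998.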
Efficient computation of the matrix square root in heterogeneous platforms
Master's dissertation in Informatics Engineering
Matrix algorithms often deal with large amounts of data at a time, which impairs efficient
cache memory usage. Recent collaborative work between the Numerical Algorithms
Group and the University of Minho led to a blocked approach to the matrix square root algorithm
with significant efficiency improvements, particularly in a multicore shared memory
environment.
Distributed memory architectures were left unexplored. In these systems data is distributed
across multiple memory spaces, including those associated with specialized accelerator
devices, such as GPUs. Systems with these devices are known as heterogeneous
platforms.
This dissertation focuses on studying the blocked matrix square root algorithm, first
in a multicore environment, and then in heterogeneous platforms. Two types of hardware
accelerators are explored: Intel Xeon Phi coprocessors and NVIDIA CUDA-enabled GPUs.
The initial implementation confirmed the advantages of the blocked method and showed
excellent scalability in a multicore environment. The same implementation was also used in
the Intel Xeon Phi, but the obtained performance results lagged behind the expected behaviour
and the CPU-only alternative. Several optimization techniques were applied to the
common implementation, which managed to reduce the gap between the two environments.
The implementation for CUDA-enabled devices followed a different programming model
and was not able to benefit from any of the previous solutions. It also required the implementation
of BLAS and LAPACK routines, since no existing package fits the requirements of
this application. The measured performance also showed that the CPU-only implementation
is still the fastest.
Fundação para a Ciência e a Tecnologia (FCT) - Program UT Austin | Portugal
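The abstract above does not spell out the square root algorithm itself. As an assumption, the sketch below uses the standard column-by-column recurrence for the square root of an upper triangular matrix with a positive diagonal; a blocked variant would apply the same dependency pattern to sub-matrices instead of scalars. It is plain Python for illustration only and is not the dissertation's implementation.

```python
import math


def sqrtm_upper_triangular(T):
    """Return upper triangular U with U @ U == T, for upper triangular T with positive diagonal."""
    n = len(T)
    U = [[0.0] * n for _ in range(n)]
    for j in range(n):
        U[j][j] = math.sqrt(T[j][j])
        # Walk up column j; each entry needs entries to its left and below it.
        for i in range(j - 1, -1, -1):
            s = sum(U[i][k] * U[k][j] for k in range(i + 1, j))
            U[i][j] = (T[i][j] - s) / (U[i][i] + U[j][j])
    return U


def matmul(A, B):
    """Naive dense product, used only to check the result."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


T = [[4.0, 1.0, 2.0],
     [0.0, 9.0, 3.0],
     [0.0, 0.0, 16.0]]
U = sqrtm_upper_triangular(T)
check = matmul(U, U)  # should reproduce T up to rounding
```

Each entry of a column depends on entries computed earlier in the same column and in previous columns, which is the data dependency that makes cache-friendly blocking and parallelization non-trivial.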
Interactive High Performance Volume Rendering
This thesis is about Direct Volume Rendering on high performance computing systems. As direct rendering methods do not create a lower-dimensional geometric representation, the whole scientific dataset must be kept in memory. Thus, this family of algorithms has a tremendous resource demand. Direct Volume Rendering algorithms in general are well suited to be implemented for dedicated graphics hardware. Nevertheless, high performance computing systems often do not provide resources for hardware accelerated rendering, so that the visualization algorithm must be implemented for the available general-purpose hardware.
Ever growing datasets that imply copying large amounts of data from the compute system to the workstation of the scientist, and the need to review intermediate simulation results, make porting Direct Volume Rendering to high performance computing systems highly relevant. The contribution of this thesis is twofold.
As part of the first contribution, after devising a software architecture for general implementations of Direct Volume Rendering on highly parallel platforms, parallelization issues and implementation details for various modern architectures are discussed. The contribution results in a highly parallel implementation that tackles several platforms.
The second contribution is concerned with the display phase of the “Distributed Volume Rendering Pipeline”. Rendering on a high performance computing system typically implies displaying the rendered result at a remote location. This thesis presents a remote rendering technique that is capable of hiding latency and can thus be used in an interactive environment.
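To make concrete why direct methods read the volume data itself rather than an intermediate geometry, here is a minimal sketch of front-to-back compositing along a single ray, a common formulation of direct volume rendering. The thesis abstract does not state which compositing scheme it uses, so the scheme, the function names, and the toy transfer function are assumptions.

```python
def composite_ray(samples, transfer, opacity_cutoff=0.99):
    """Accumulate grayscale color and opacity front to back along one ray."""
    color, alpha = 0.0, 0.0
    for value in samples:
        sample_color, sample_alpha = transfer(value)
        # Light from this sample is attenuated by everything in front of it.
        color += (1.0 - alpha) * sample_alpha * sample_color
        alpha += (1.0 - alpha) * sample_alpha
        if alpha >= opacity_cutoff:
            break  # early ray termination: the rest is hidden
    return color, alpha


def toy_transfer(value):
    """Hypothetical transfer function: brightness equals the value, fixed opacity."""
    return value, 0.5


# Scalar values sampled along one ray through the dataset.
pixel_color, pixel_alpha = composite_ray([0.0, 1.0], toy_transfer)
```

Every pixel runs such a loop over samples taken from the full dataset, which is where the memory and compute demand mentioned in the abstract comes from.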
High-performance generalized tensor operations: A compiler-oriented approach
The efficiency of tensor contraction is of great importance. Compilers cannot optimize it well enough to come close to the performance of expert-tuned implementations. All existing approaches that provide competitive performance require optimized external code. We introduce a compiler optimization that reaches the performance of optimized BLAS libraries without the need for an external implementation or automatic tuning. Our approach provides competitive performance across hardware architectures and can be generalized to deliver the same benefits for algebraic path problems. By making fast linear algebra kernels available to everyone, we expect productivity increases when optimized libraries are not available. © 2018 Association for Computing Machinery
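The link between matrix multiplication and algebraic path problems mentioned in this abstract can be shown with a short sketch: both share one loop nest and differ only in the pair of operators used. This is a plain-Python illustration of that shared structure, not the compiler optimization the paper describes; the function name and the example matrices are made up.

```python
import operator


def generalized_matmul(A, B, add, mul, zero):
    """Matrix product over a semiring given by (add, mul) with additive identity zero."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[zero] * p for _ in range(n)]
    for i in range(n):
        for k in range(m):
            a = A[i][k]
            for j in range(p):
                C[i][j] = add(C[i][j], mul(a, B[k][j]))
    return C


# Ordinary arithmetic: the usual matrix product.
A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
product = generalized_matmul(A, B, operator.add, operator.mul, 0)

# (min, +) semiring: one step of an all-pairs shortest-path computation.
INF = float("inf")
D = [[0, 3, INF],
     [INF, 0, 1],
     [2, INF, 0]]
two_hop = generalized_matmul(D, D, min, operator.add, INF)
```

With `min` and `+`, entry `two_hop[i][j]` is the cheapest path from `i` to `j` using at most two edges, so a kernel that is fast for one operator pair can in principle serve the other.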