A Distributed Algebra System for Time Integration on Parallel Computers
We present a distributed algebra system for efficient and compact
implementation of numerical time integration schemes on parallel computers and
graphics processing units (GPUs). The software implementation combines the time
integration library Odeint from Boost with the OpenFPM framework for scalable
scientific computing. Implementing multi-stage, multi-step, or adaptive time
integration methods in distributed-memory parallel codes or on GPUs is
challenging. The present algebra system addresses this by making the time
integration methods from Odeint available in a concise template-expression
language for numerical simulations distributed and parallelized using OpenFPM.
This allows using state-of-the-art time integration schemes, or switching
between schemes, by changing a single line of code while maintaining parallel
scalability. The result is scalable time integration with compact code that
facilitates rapid rewriting and deployment of simulation algorithms. We
benchmark the present software for exponential and sigmoidal dynamics and
present an application example, the 3D Gray-Scott reaction-diffusion problem,
solved on both CPUs and GPUs in only 60 lines of code.
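The one-line scheme switch described above can be illustrated in a plain-Python sketch (this is not the Odeint/OpenFPM API, only the underlying idea): all steppers share one call signature, so the integration method becomes a single swappable argument.

```python
import math

# Hypothetical sketch, not the Odeint/OpenFPM interface: two steppers with
# the same signature, so that switching schemes is a one-line change.

def euler_step(f, t, y, dt):
    """Explicit Euler: first-order accurate."""
    return y + dt * f(t, y)

def rk4_step(f, t, y, dt):
    """Classical Runge-Kutta: fourth-order accurate."""
    k1 = f(t, y)
    k2 = f(t + dt / 2, y + dt / 2 * k1)
    k3 = f(t + dt / 2, y + dt / 2 * k2)
    k4 = f(t + dt, y + dt * k3)
    return y + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)

def integrate(f, y0, t0, t1, n, step=rk4_step):  # change `step=` to swap schemes
    dt = (t1 - t0) / n
    t, y = t0, y0
    for _ in range(n):
        y = step(f, t, y, dt)
        t += dt
    return y

# Exponential decay dy/dt = -y; the exact solution at t = 1 is exp(-1).
def decay(t, y):
    return -y

y_rk4 = integrate(decay, 1.0, 0.0, 1.0, 100)                 # RK4
y_euler = integrate(decay, 1.0, 0.0, 1.0, 100, euler_step)   # one-line switch
```

In Odeint's C++ template-expression setting the same effect is achieved at compile time; the sketch only shows why a uniform stepper interface makes switching schemes a one-line change.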
Integrating Bayesian Optimization and Machine Learning for the Optimal Configuration of Cloud Systems
Bayesian Optimization (BO) is an efficient method for finding optimal cloud configurations for several types of applications. Machine Learning (ML), on the other hand, can provide helpful knowledge about the application at hand thanks to its predictive capabilities. This work proposes a general approach based on BO, which integrates elements of ML techniques in multiple ways, to find an optimal configuration for recurring jobs running in public and private cloud environments, possibly subject to black-box constraints, e.g., on application execution time or accuracy. We test our approach on several use cases, including edge computing, scientific computing, and Big Data applications. Results show that our solution outperforms other state-of-the-art black-box techniques, including classical autotuning and BO- and ML-based algorithms, reducing the number of infeasible executions and the corresponding costs by up to 2–4 times.
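The configure-evaluate-update loop underlying such an approach can be sketched as follows. Everything here is a toy stand-in: the cost function, the nearest-neighbour surrogate, and the distance-based exploration bonus are illustrative assumptions, not the paper's method, which would use a proper probabilistic surrogate such as a Gaussian process.

```python
import random

# Toy sketch of a BO-style loop for picking a cloud configuration
# (number of VMs, cores per VM). All names are hypothetical.

def run_job(cfg):
    """Stand-in for executing the recurring job: returns a cost to minimise."""
    vms, cores = cfg
    return 100.0 / (vms * cores) + 2.0 * vms  # runtime/price trade-off (made up)

def surrogate(cfg, history):
    """Predict cost at cfg from the nearest observed configuration, with an
    exploration bonus that grows with distance (crude UCB-like heuristic)."""
    dist, cost = min(
        (abs(cfg[0] - c[0]) + abs(cfg[1] - c[1]), y) for c, y in history
    )
    return cost - 1.0 * dist  # lower value = more attractive to try next

random.seed(0)
space = [(v, c) for v in range(1, 9) for c in (1, 2, 4, 8)]
history = [(cfg, run_job(cfg)) for cfg in random.sample(space, 3)]  # warm-up

for _ in range(10):  # BO loop: pick the most promising config, evaluate, update
    seen = {c for c, _ in history}
    cfg = min((c for c in space if c not in seen),
              key=lambda c: surrogate(c, history))
    history.append((cfg, run_job(cfg)))

best_cfg, best_cost = min(history, key=lambda h: h[1])
```

Thirteen evaluations replace an exhaustive sweep of the whole configuration space; the ML elements of the paper would additionally seed or shape the surrogate with predicted application behaviour.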
Block Fusion on Dynamically Adaptive Spacetree Grids for Shallow Water Waves
Spacetrees are a popular formalism to describe dynamically adaptive Cartesian grids. Even though they directly yield a mesh, it is often computationally reasonable to embed regular Cartesian blocks into their leaves, as this promotes stencils working on homogeneous data chunks. The choice of a proper block size is delicate: while large block sizes foster loop parallelism and vectorisation, they restrict the adaptivity's granularity and hence increase the memory footprint and lower the numerical accuracy per byte. In the present paper, we therefore use a multiscale spacetree-block coupling that admits blocks on all spacetree nodes. We propose to identify sets of blocks on the finest scale throughout the simulation and to replace them by fused big blocks. Such a replacement strategy can pick up hardware characteristics, i.e. which block size yields the highest throughput, while the dynamic adaptivity of the fine grid mesh is not constrained: applications can continue to work with fine-granular blocks. We study the fusion with a state-of-the-art shallow water solver on an Intel Sandy Bridge and a Xeon Phi processor and examine how these processors respond to the selected block optimisation and vectorisation.
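The replacement strategy can be sketched on a 1D spacetree (a bintree) in a few lines of Python. The throughput table and all identifiers below are hypothetical; they only illustrate how a measured block-size preference drives fusion of sibling blocks onto the parent node.

```python
# Illustrative sketch, not the paper's code: leaves of a 1D bintree are
# (level, index) cells, each carrying a regular block of `base` grid cells.
# Two siblings are fused into one double-size block on their parent when a
# (made-up) throughput table favours the larger size.

base = 8
throughput = {8: 1.0, 16: 1.7, 32: 1.5}  # GB/s per block size (hypothetical)

# Finest-scale leaves: (level, index) -> block size.
leaves = {(3, i): base for i in range(8)}

def fuse(leaves):
    """Replace sibling block pairs by one fused block on the parent node
    whenever the fused size has higher measured throughput."""
    fused, done = {}, set()
    for (lvl, idx), size in sorted(leaves.items()):
        if (lvl, idx) in done:
            continue
        sibling = (lvl, idx ^ 1)  # bintree siblings differ in the last index bit
        if (sibling in leaves and leaves[sibling] == size
                and throughput.get(2 * size, 0) > throughput[size]):
            fused[(lvl - 1, idx // 2)] = 2 * size  # big block on the parent
            done.add(sibling)
        else:
            fused[(lvl, idx)] = size
    return fused

coarse = fuse(leaves)    # eight blocks of 8 become four blocks of 16
coarser = fuse(coarse)   # 16 -> 32 is rejected: measured throughput drops
```

The fixed point of repeated fusion is exactly the block size the hardware table prefers, while the fine grid underneath stays untouched.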
Temporal blocking of finite-difference stencil operators with sparse "off-the-grid" sources
Stencil kernels dominate a range of scientific applications, including seismic and medical imaging, image processing, and neural networks. Temporal blocking is a performance optimization that reduces the required memory bandwidth of stencil computations by re-using data from the cache across multiple time steps, and it has already been shown to be beneficial for this class of algorithms. However, applying temporal blocking to the stencils of practical applications remains challenging: these computations often contain sparsely located operators not aligned with the computational grid ("off-the-grid"). Our work is motivated by modelling problems in which source injections produce wavefields that must then be measured at receivers by interpolation from the gridded wavefield. The resulting data dependencies make the adoption of temporal blocking much more challenging. We propose a methodology to inspect these data dependencies and reorder the computation, leading to performance gains in stencil codes where temporal blocking was previously not applicable. We implement this novel scheme in the Devito domain-specific compiler toolchain. Devito implements a domain-specific language embedded in Python that generates optimized partial differential equation solvers using the finite-difference method from high-level symbolic problem definitions. We evaluate our scheme using isotropic acoustic, anisotropic acoustic, and isotropic elastic wave propagators of industrial significance. After auto-tuning, performance evaluation shows that our scheme enables substantial improvements through temporal blocking, of up to 1.6x over highly optimized, vectorized, spatially blocked code.
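The core idea of temporal blocking can be sketched for a plain three-point stencil in 1D, without the off-the-grid sources that make the paper's setting hard: each tile is loaded once together with a halo whose width equals the number of time steps, advanced locally, and only the provably valid interior is written back. This is an illustrative overlapped-tiling sketch, not Devito's generated code.

```python
# Temporal blocking for a 1D three-point averaging stencil (illustrative).
# A halo of width `t` on each side of a tile keeps the written-back
# interior exact despite stale tile edges: after k local steps, only the
# k outermost halo points on each side are contaminated.

def step(u):
    """One sweep of the averaging stencil; boundary values stay fixed."""
    return ([u[0]]
            + [(u[i - 1] + u[i] + u[i + 1]) / 3 for i in range(1, len(u) - 1)]
            + [u[-1]])

def naive(u, t):
    """Reference: t full-grid sweeps."""
    for _ in range(t):
        u = step(u)
    return u

def temporally_blocked(u, t, tile=8):
    """Advance t steps tile by tile from local (cache-sized) data."""
    n = len(u)
    out = list(u)
    for s in range(0, n, tile):
        e = min(s + tile, n)
        lo, hi = max(0, s - t), min(n, e + t)  # tile plus halo of width t
        local = u[lo:hi]
        for _ in range(t):
            local = step(local)
        out[s:e] = local[s - lo:e - lo]        # write back only the valid part
    return out

u0 = [float(i % 5) for i in range(40)]
blocked = temporally_blocked(u0, 4)
reference = naive(u0, 4)
```

Because every valid point is computed from the same operands in the same order, the blocked result is bit-identical to the naive sweep; what the paper adds is making this reordering legal in the presence of sparse source and receiver operators.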
A CUDA backend for Marrow and its Optimisation via Machine Learning
Nowadays, industries ranging from business to science collect, process, and
store massive amounts of data. Conventional CPUs, which are optimised for
sequential performance, struggle to keep up with processing so much data;
GPUs, designed for parallel computation, are more than up to the task.
Using GPUs for general-purpose processing has become more popular in recent
years due to the need for fast parallel processing, but developing programs
that execute on the GPU can be difficult and time-consuming. Various
high-level APIs that compile into GPU programs exist; however, due to their
abstraction of lower-level concepts and the lack of algorithm-specific
optimisations, it may not be possible to reach peak performance. Optimisation
in particular is an interesting problem: optimisation patterns can rarely be
applied uniformly to different algorithms, and manually tuning individual
programs is extremely time-consuming.
Machine-learning compilation is a concept that has gained attention in recent
years, with good reason. The idea is to train a model using a machine-learning
algorithm and have it estimate how to optimise an input program. Predicting
the best optimisations for a program with such a model is much faster than
finding them manually, and work using this technique has shown that it can
also yield even better optimisations.
In this thesis, we work with the Marrow framework and develop a CUDA-based
backend for it, so that low-level GPU code may be generated. Additionally, we
train a machine-learning model and use it to automatically optimise the CUDA
code generated from Marrow programs.
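The machine-learning-compilation idea can be made concrete with a deliberately tiny sketch: a nearest-neighbour model, trained on made-up measurements, maps simple code features to a predicted best GPU block size. None of this reflects Marrow's actual features or model; it only illustrates the predict-instead-of-search workflow.

```python
# Toy illustration of machine-learning compilation (not Marrow's model):
# a 1-nearest-neighbour "model" trained on measured programs predicts the
# best GPU block size for a new program from simple code features.

# Features: (arithmetic intensity, data size in MiB); label: best block size.
# All measurements below are made up for illustration.
training = [
    ((0.5, 64),  64),    # memory-bound, small working set
    ((0.8, 512), 128),   # memory-bound, large working set
    ((4.0, 64),  256),   # compute-bound
    ((6.0, 512), 256),   # compute-bound, large working set
]

def predict_block_size(features):
    """Return the label of the nearest training program (scaled distance)."""
    def dist(a, b):
        # Scale each feature by its rough range so neither dominates.
        return ((a[0] - b[0]) / 6.0) ** 2 + ((a[1] - b[1]) / 512.0) ** 2
    return min(training, key=lambda ex: dist(ex[0], features))[1]

choice = predict_block_size((5.0, 256))  # new compute-bound program
```

A single model lookup replaces repeated compile-and-measure cycles, which is exactly the speed advantage the abstract attributes to machine-learning compilation; a real system would use richer features and a stronger learner.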
- …