Acceleration of image processing algorithms for single particle analysis with electron microscopy
Unpublished doctoral thesis under joint supervision (cotutelle) of Masaryk University (Czech Republic) and the Universidad Autónoma de Madrid, Escuela Politécnica Superior, Departamento de Ingeniería Informática. Date of defense: 24-10-2022
Cryogenic Electron Microscopy (Cryo-EM) is a vital field in current structural biology. Unlike X-ray
crystallography and Nuclear Magnetic Resonance, it can be used to analyze membrane proteins and
other samples with overlapping spectral peaks. However, one of the significant limitations of Cryo-EM
is the computational complexity. Modern electron microscopes can produce terabytes of data per single
session, from which hundreds of thousands of particles must be extracted and processed to obtain a
near-atomic resolution of the original sample. Many existing software solutions use High-Performance
Computing (HPC) techniques to bring these computations into the realm of practical usability. The
common approach to acceleration is parallelization of the processing, but in practice we face many
complications, such as problem decomposition, data distribution, load scheduling, balancing, and
synchronization. Utilization of various accelerators further complicates the situation, as heterogeneous
hardware brings additional caveats, for example, limited portability, under-utilization due to synchronization,
and sub-optimal code performance due to missing specialization.
This dissertation, structured as a compendium of articles, aims to improve the algorithms used
in Cryo-EM, especially in Single Particle Analysis (SPA). We focus on single-node performance
optimizations, using techniques either available or developed in the HPC field, such as heterogeneous
computing or autotuning, which may require the formulation of novel algorithms. The
secondary goal of the dissertation is to identify the limitations of state-of-the-art HPC techniques. Since
the Cryo-EM pipeline consists of multiple distinct steps targeting different types of data, there is no
single bottleneck to be solved. As such, the presented articles show a holistic approach to performance
optimization.
First, we give details on the GPU acceleration of the specific programs. The achieved speedup is
due to the higher performance of the GPU, adjustments of the original algorithm to it, and the application
of novel algorithms. More specifically, we provide implementation details of programs for movie
alignment, 2D classification, and 3D reconstruction that have been sped up by an order of magnitude
compared to their original multi-CPU implementation, or sufficiently to be used on the fly. In addition
to these three programs, multiple other programs from an actively used, open-source software package
XMIPP have been accelerated and improved.
Second, we discuss our contribution to HPC in the form of autotuning. Autotuning is the ability of
software to adapt to a changing environment, i.e., input or executing hardware. Towards that goal, we
present cuFFTAdvisor, a tool that proposes and, through autotuning, finds the best configuration of the
cuFFT library for given constraints of input size and plan settings. We also introduce a benchmark set
of ten autotunable kernels for important computational problems implemented in OpenCL or CUDA,
together with the introduction of complex dynamic autotuning to the KTT tool.
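The definition of autotuning given above can be illustrated with a minimal sketch. It does not show how KTT or cuFFTAdvisor work internally: the function names (`autotune`, `toy_cost`), the parameters `block_size` and `unroll`, the constraint, and the cost model are all hypothetical stand-ins for timing a real kernel on real hardware.

```python
from itertools import product


def autotune(measure, space, is_valid=lambda cfg: True):
    """Exhaustively search `space` and return (best_config, best_cost).

    `measure` maps a configuration dict to a cost (lower is better);
    in a real tuner it would run and time the kernel.
    """
    names = list(space)
    best_cfg, best_cost = None, float("inf")
    for values in product(*(space[name] for name in names)):
        cfg = dict(zip(names, values))
        if not is_valid(cfg):
            continue  # skip configurations that violate a constraint
        cost = measure(cfg)
        if cost < best_cost:
            best_cfg, best_cost = cfg, cost
    return best_cfg, best_cost


def toy_cost(cfg):
    # Hypothetical cost model: cheapest at block_size=128, unroll=4.
    return abs(cfg["block_size"] - 128) / 32 + abs(cfg["unroll"] - 4)


# Hypothetical tuning space and constraint, chosen only for illustration.
space = {"block_size": [32, 64, 128, 256, 512], "unroll": [1, 2, 4, 8]}
best_cfg, best_cost = autotune(
    toy_cost,
    space,
    is_valid=lambda cfg: cfg["block_size"] * cfg["unroll"] <= 1024,
)
```

Exhaustive search is only practical for small spaces; swapping `measure` for a different input or device is what lets the same code adapt to a changed environment.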
Third, we propose an image processing framework Umpalumpa, which combines a task-based
runtime system, data-centric architecture, and dynamic autotuning. The proposed framework allows for
writing complex workflows that automatically use the available hardware resources and adapt to different hardware
and data while remaining easy to maintain.
The project that gave rise to these results received the support of a fellowship from the “la Caixa”
Foundation (ID 100010434). The fellowship code is LCF/BQ/DI18/11660021.
This project has received funding from the European Union’s Horizon 2020 research and innovation
programme under the Marie Skłodowska-Curie grant agreement No. 71367
Portable performance on heterogeneous architectures
Trends in both consumer and high performance computing are bringing not only more cores, but also increased heterogeneity among the computational resources within a single machine. In many machines, one of the greatest computational resources is now their graphics coprocessors (GPUs), not just their primary CPUs. But GPU programming and memory models differ dramatically from conventional CPUs, and the relative performance characteristics of the different processors vary widely between machines. Different processors within a system often perform best with different algorithms and memory usage patterns, and achieving the best overall performance may require mapping portions of programs across all types of resources in the machine.
To address the problem of efficiently programming machines with increasingly heterogeneous computational resources, we propose a programming model in which the best mapping of programs to processors and memories is determined empirically. Programs define choices in how their individual algorithms may work, and the compiler generates further choices in how they can map to CPU and GPU processors and memory systems. These choices are given to an empirical autotuning framework that allows the space of possible implementations to be searched at installation time. The rich choice space allows the autotuner to construct poly-algorithms that combine many different algorithmic techniques, using both the CPU and the GPU, to obtain better performance than any one technique alone. Experimental results show that algorithmic changes, and the varied use of both CPUs and GPUs, are necessary to obtain up to a 16.5x speedup over using a single program configuration for all architectures.
United States. Dept. of Energy (Award DE-SC0005288); United States. Defense Advanced Research Projects Agency (Award HR0011-10-9-0009); National Science Foundation (U.S.) (Award CCF-0632997)
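The idea of a poly-algorithm whose choices are fixed empirically at installation time can be sketched on the CPU alone. The example below is an illustration under stated assumptions, not the compiler or autotuning framework of the paper: `hybrid_sort`, `tune_cutoff`, and the candidate cutoffs are hypothetical names and values, and wall-clock timing on a sample input stands in for a real tuning run.

```python
import random
import time


def insertion_sort(a):
    """Simple quadratic sort; often fast on very short inputs."""
    a = list(a)
    for i in range(1, len(a)):
        x, j = a[i], i - 1
        while j >= 0 and a[j] > x:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = x
    return a


def hybrid_sort(a, cutoff):
    """Merge sort that switches to insertion sort at or below `cutoff`."""
    if len(a) <= max(cutoff, 1):
        return insertion_sort(a)
    mid = len(a) // 2
    left = hybrid_sort(a[:mid], cutoff)
    right = hybrid_sort(a[mid:], cutoff)
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i])
            i += 1
        else:
            out.append(right[j])
            j += 1
    out.extend(left[i:])
    out.extend(right[j:])
    return out


def tune_cutoff(candidates, sample, repeats=3):
    """Time each candidate cutoff on `sample` and return the fastest."""
    best_cutoff, best_time = None, float("inf")
    for cutoff in candidates:
        start = time.perf_counter()
        for _ in range(repeats):
            hybrid_sort(sample, cutoff)
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best_cutoff, best_time = cutoff, elapsed
    return best_cutoff


random.seed(0)
sample = [random.randint(0, 1000) for _ in range(500)]
candidates = [1, 8, 32, 128]  # hypothetical choice space
tuned_cutoff = tune_cutoff(candidates, sample)
```

The selected cutoff depends on the machine running the tuner, which is the point: the same program ends up with a different algorithm mix on different hardware.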
Benchmarking optimization algorithms for auto-tuning GPU kernels
Recent years have witnessed phenomenal growth in the application and
capabilities of Graphical Processing Units (GPUs) due to their high parallel
computation power at relatively low cost. However, writing a computationally
efficient GPU program (kernel) is challenging, and generally only certain
specific kernel configurations lead to significant increases in performance.
Auto-tuning is the process of automatically optimizing software for
highly-efficient execution on a target hardware platform. Auto-tuning is
particularly useful for GPU programming, as a single kernel requires re-tuning
after code changes, for different input data, and for different architectures.
However, the discrete and non-convex nature of the search space creates a
challenging optimization problem. In this work, we investigate which algorithm
produces the fastest kernels if the time-budget for the tuning task is varied.
We conduct a survey by performing experiments on 26 different kernel spaces,
from 9 different GPUs, for 16 different evolutionary black-box optimization
algorithms. We then analyze these results and introduce a novel metric based on
the PageRank centrality concept as a tool for gaining insight into the
difficulty of the optimization problem. We demonstrate that our metric
correlates strongly with observed tuning performance.
Comment: in IEEE Transactions on Evolutionary Computation, 202
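The abstract does not define the PageRank-based metric, so the sketch below shows only one plausible construction and should be read as an assumption. Configurations become graph nodes, a directed edge points from a configuration to each neighbour with a lower runtime, PageRank is computed on that graph, and the score is the share of centrality held by local minima whose runtime is within a tolerance of the best one. The 1-D neighbourhood, the function names, the tolerance, and the runtimes are all hypothetical.

```python
def fitness_flow_graph(runtimes):
    """Edges i -> j for adjacent configurations j with a lower runtime."""
    n = len(runtimes)
    edges = {i: [] for i in range(n)}
    for i in range(n):
        for j in (i - 1, i + 1):  # assumed 1-D neighbourhood
            if 0 <= j < n and runtimes[j] < runtimes[i]:
                edges[i].append(j)
    return edges


def pagerank(edges, damping=0.85, iterations=200):
    """Plain power-iteration PageRank; dangling nodes spread uniformly."""
    n = len(edges)
    rank = [1.0 / n] * n
    for _ in range(iterations):
        new = [(1.0 - damping) / n] * n
        for i, outs in edges.items():
            targets = outs if outs else range(n)
            share = damping * rank[i] / len(targets)
            for j in targets:
                new[j] += share
        rank = new
    return rank


def minima_centrality_share(runtimes, tolerance=0.1):
    """Centrality of near-optimal local minima over that of all minima."""
    edges = fitness_flow_graph(runtimes)
    rank = pagerank(edges)
    minima = [i for i, outs in edges.items() if not outs]
    best = min(runtimes)
    good = [i for i in minima if runtimes[i] <= (1.0 + tolerance) * best]
    return sum(rank[i] for i in good) / sum(rank[i] for i in minima)


# Hypothetical kernel runtimes (ms) for eight neighbouring configurations.
runtimes = [5.0, 3.0, 4.0, 1.0, 2.0, 6.0, 2.5, 7.0]
share = minima_centrality_share(runtimes)
```

Under this reading, a share close to 1 would suggest that a local search tends to end in a near-optimal configuration, while a low share would suggest that it is likely to get stuck in a poor local minimum.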
Enhancing productivity and performance portability of OpenCL applications on heterogeneous systems using runtime optimizations
Initially driven by a strong need for increased computational performance in science and
engineering, heterogeneous systems have become ubiquitous and they are getting increasingly
complex. The single processor era has been replaced with multi-core processors,
which have quickly been surrounded by satellite devices aiming to increase the throughput
of the entire system. These auxiliary devices, such as Graphics Processing Units, Field Programmable
Gate Arrays or other specialized processors have very different architectures.
This puts an enormous strain on programming models and software developers to take full
advantage of the computing power at hand. Because of this diversity and the unachievable
flexibility and portability necessary to optimize for each target individually, heterogeneous
systems typically remain vastly under-utilized.
In this thesis, we explore two distinct ways to tackle this problem. Providing automated,
non-intrusive methods in the form of compiler tools and implementing efficient abstractions
to automatically tune parameters for a restricted domain are two complementary
approaches investigated to better utilize compute resources in heterogeneous systems.
First, we explore a fully automated compiler-based approach, where a runtime system
analyzes the computation flow of an OpenCL application and optimizes it across multiple
compute kernels. This method can be deployed on any existing application transparently
and replaces the significant software engineering effort spent tuning an application for a particular
system. We show that this technique achieves speedups of up to 3x over unoptimized
code and an average of 1.4x over manually optimized code for highly dynamic applications.
Second, a library-based approach is designed to provide a high-level abstraction for
complex problems in a specific domain, stencil computation. Using domain-specific techniques,
the underlying framework optimizes the code aggressively. We show that even in
a restricted domain, automatic tuning mechanisms and robust architectural abstraction are
necessary to improve performance. Using the abstraction layer, we demonstrate strong scaling
of various applications to multiple GPUs with a speedup of up to 1.9x on two GPUs
and 3.6x on four
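The strong-scaling figures quoted above can be turned into parallel efficiency, the fraction of ideal linear speedup that is achieved. The short calculation below is a sketch that assumes the truncated sentence refers to four GPUs and that both speedups are measured against a single-GPU baseline; neither assumption is confirmed by the text, and the function name is hypothetical.

```python
def parallel_efficiency(speedup, devices):
    """Return speedup divided by device count (1.0 means ideal scaling)."""
    if devices <= 0:
        raise ValueError("devices must be positive")
    return speedup / devices


# Speedups quoted in the abstract, keyed by assumed GPU count.
reported = {2: 1.9, 4: 3.6}
efficiency = {n: parallel_efficiency(s, n) for n, s in reported.items()}
```

Under these assumptions the efficiency is about 95% on two GPUs and about 90% on four.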