452 research outputs found
Proceedings of the First PhD Symposium on Sustainable Ultrascale Computing Systems (NESUS PhD 2016)
Proceedings of the First PhD Symposium on Sustainable Ultrascale Computing Systems (NESUS PhD 2016) Timisoara, Romania. February 8-11, 2016.The PhD Symposium was a very good opportunity for the young researchers to share information and knowledge, to
present their current research, and to discuss topics with other students in order to look for synergies and common research
topics. The idea was very successful and the assessment made by the PhD Student was very good. It also helped to
achieve one of the major goals of the NESUS Action: to establish an open European research network targeting sustainable
solutions for ultrascale computing aiming at cross fertilization among HPC, large scale distributed systems, and big
data management, training, contributing to glue disparate researchers working across different areas and provide a meeting
ground for researchers in these separate areas to exchange ideas, to identify synergies, and to pursue common activities in
research topics such as sustainable software solutions (applications and system software stack), data management, energy
efficiency, and resilience.European Cooperation in Science and Technology. COS
Doctor of Philosophy
dissertationMemory access irregularities are a major bottleneck for bandwidth limited problems on Graphics Processing Unit (GPU) architectures. GPU memory systems are designed to allow consecutive memory accesses to be coalesced into a single memory access. Noncontiguous accesses within a parallel group of threads working in lock step may cause serialized memory transfers. Irregular algorithms may have data-dependent control flow and memory access, which requires runtime information to be evaluated. Compile time methods for evaluating parallelism, such as static dependence graphs, are not capable of evaluating irregular algorithms. The goals of this dissertation are to study irregularities within the context of unstructured mesh and sparse matrix problems, analyze the impact of vectorization widths on irregularities, and present data-centric methods that improve control flow and memory access irregularity within those contexts. Reordering associative operations has often been exploited for performance gains in parallel algorithms. This dissertation presents a method for associative reordering of stencil computations over unstructured meshes that increases data reuse through caching. This novel parallelization scheme offers considerable speedups over standard methods. Vectorization widths can have significant impact on performance in vectorized computations. Although the hardware vector width is generally fixed, the logical vector width used within a computation can range from one up to the width of the computation. Significant performance differences can occur due to thread scheduling and resource limitations. This dissertation analyzes the impact of vectorization widths on dense numerical computations such as 3D dG postprocessing. It is difficult to efficiently perform dynamic updates on traditional sparse matrix formats. Explicitly controlling memory segmentation allows for in-place dynamic updates in sparse matrices. Dynamically updating the matrix without rebuilding or sorting greatly improves processing time and overall throughput. This dissertation presents a new sparse matrix format, dynamic compressed sparse row (DCSR), which allows for dynamic streaming updates to a sparse matrix. A new method for parallel sparse matrix-matrix multiplication (SpMM) that uses dynamic updates is also presented
Generalized database index structures on massively parallel processor architectures
Height-balanced search trees are ubiquitous in database management systems as well as in other applications that require efficient access methods in order to identify entries in large data volumes. They can be configured with various strategies for structuring the search space for a given data set and for pruning it when different kinds of search queries are answered. In order to facilitate the development of application-specific tree variants, index frameworks, such as GiST, exist that provide a reusable library of commonly shared tree management functionality. By specializing internal data organization strategies, the framework can be customized to create an index that is efficient for an application's data access characteristics. Because the majority of the framework's code can be reused development and testing efforts are significantly lower, compared to an implementation from scratch. However, none of the existing frameworks supports the execution of index operations on massively parallel processor architectures, such as GPUs. Enabling the use of such processors for generalized index frameworks is the goal of this thesis. By compiling state-of-the-art techniques from a wide range of CPU- and GPU-optimized indexes, a GiST extension is developed that abstracts the physical execution aspect of generic, tree-based search queries. Tree traversals are broken-down into vectorized processing primitives that can be scheduled to one of the available (co-)processors for execution. Further, a CPU-based implementation is provided as well as a new GPU-based algorithm that, unlike prior art in this area, does not require that the index is fully stored inside a GPU's main memory buffer. The applicability of the extended framework is assessed for image rendering engines and, based on microbenchmarks, the parallelized algorithm performance is compared for different CPU and GPU generations. It will be shown that cases exist, where the GPU clearly outperforms the CPU and vice versa. In order to leverage the strengths of each processor type, an adaptive scheduler is presented that can be calibrated to schedule index operations to the best-fitting device in a hybrid system. With the help of a tree traversal simulation different scheduling strategies are evaluated and it will be shown that the adaptive scheduler can be used to make near-optimal decisions.Suchbäume sind allgegenwärtig in Datenbanksystemen und anderen Anwendungen, die eine effiziente Möglichkeit benötigen um in großen Datensätzen nach Einträgen zu suchen, die bestimmte Suchkriterien erfüllen. Sie können mit verschiedenen Strategien konfiguriert werden um den Suchraum zu strukturieren und die für ein Suchergebnis irrelevante Bereiche von der Bearbeitung auszuschließen. Die Entwicklung von anwendungsspezifischen Indexen wird durch Frameworks wie GiST unterstützt. Jedoch unterstützt keines der heute bereits existierenden Frameworks die Verwendung von hochgradig parallelen Prozessorarchitekturen wie GPUs. Solche Prozessoren für generische Index Frameworks nutzbar zu machen, ist Ziel dieser Arbeit. Dazu werden Techniken aus verschiedensten CPU- und GPU-optimierten Indexen analysiert und für die Entwicklung einer GiST-Erweiterung verwendet, welche die für eine Suche in Suchbäumen nötigen Berechnungen abstrahiert. Traversierungsoperationen werden dabei auf vektorisierte Primitive abgebildet, die auf parallelen Prozessoren implementiert werden können. Die Verwendung dieser Erweiterung wird beispielhaft an einem CPU Algorithmus demonstriert. Weiterhin wird ein neuer GPU-basierter Algorithmus vorgestellt, der im Vergleich zu bisherigen Verfahren, ein dynamisches Nachladen der Index Daten in den Hauptspeicher der GPU unterstützt. Die Praktikabilität des erweiterten Frameworks wird am Beispiel von Anwendungen aus der Computergrafik untersucht und die Performanz der verwendeten Algorithmen mit Hilfe eines Benchmarks auf verschiedenen CPU- und GPU-Modellen analysiert. Dabei wird gezeigt, unter welchen Bedingungen die parallele GPU-basierte Ausführung schneller ist als die CPU-basierte Variante - und umgekehrt. Um die Stärken beider Prozessortypen in einem hybriden System ausnutzen zu können, wird ein Scheduler entwickelt, der nach einer Kalibrierungsphase für eine gegebene Operation den geeignetsten Prozessor wählen kann. Mit Hilfe eines Simulators für Baumtraversierungen werden verschiedenste Scheduling Strategien verglichen. Dabei wird gezeigt, dass die Entscheidungen des Schedulers kaum vom Optimum abweichen und, abhängig von der simulierten Last, die erzielbaren Durchsätze für die parallele Ausführung mehrerer Suchoperationen durch hybrides Scheduling um eine Größenordnung und mehr erhöht werden können
AceleraciĂłn de algoritmos de procesamiento de imágenes para el análisis de partĂculas individuales con microscopia electrĂłnica
Tesis Doctoral inĂ©dita cotutelada por la Masaryk University (RepĂşblica Checa) y la Universidad AutĂłnoma de Madrid, Escuela PolitĂ©cnica Superior, Departamento de IngenierĂa Informática. Fecha de Lectura: 24-10-2022Cryogenic Electron Microscopy (Cryo-EM) is a vital field in current structural biology. Unlike X-ray
crystallography and Nuclear Magnetic Resonance, it can be used to analyze membrane proteins and
other samples with overlapping spectral peaks. However, one of the significant limitations of Cryo-EM
is the computational complexity. Modern electron microscopes can produce terabytes of data per single
session, from which hundreds of thousands of particles must be extracted and processed to obtain a
near-atomic resolution of the original sample. Many existing software solutions use high-Performance
Computing (HPC) techniques to bring these computations to the realm of practical usability. The
common approach to acceleration is parallelization of the processing, but in praxis, we face many
complications, such as problem decomposition, data distribution, load scheduling, balancing, and
synchronization. Utilization of various accelerators further complicates the situation, as heterogeneous
hardware brings additional caveats, for example, limited portability, under-utilization due to synchronization,
and sub-optimal code performance due to missing specialization.
This dissertation, structured as a compendium of articles, aims to improve the algorithms used
in Cryo-EM, esp. the SPA (Single Particle Analysis). We focus on the single-node performance
optimizations, using the techniques either available or developed in the HPC field, such as heterogeneous
computing or autotuning, which potentially needs the formulation of novel algorithms. The
secondary goal of the dissertation is to identify the limitations of state-of-the-art HPC techniques. Since
the Cryo-EM pipeline consists of multiple distinct steps targetting different types of data, there is no
single bottleneck to be solved. As such, the presented articles show a holistic approach to performance
optimization.
First, we give details on the GPU acceleration of the specific programs. The achieved speedup is
due to the higher performance of the GPU, adjustments of the original algorithm to it, and application
of the novel algorithms. More specifically, we provide implementation details of programs for movie
alignment, 2D classification, and 3D reconstruction that have been sped up by order of magnitude
compared to their original multi-CPU implementation or sufficiently the be used on-the-fly. In addition
to these three programs, multiple other programs from an actively used, open-source software package
XMIPP have been accelerated and improved.
Second, we discuss our contribution to HPC in the form of autotuning. Autotuning is the ability of
software to adapt to a changing environment, i.e., input or executing hardware. Towards that goal, we
present cuFFTAdvisor, a tool that proposes and, through autotuning, finds the best configuration of the
cuFFT library for given constraints of input size and plan settings. We also introduce a benchmark set
of ten autotunable kernels for important computational problems implemented in OpenCL or CUDA,
together with the introduction of complex dynamic autotuning to the KTT tool.
Third, we propose an image processing framework Umpalumpa, which combines a task-based
runtime system, data-centric architecture, and dynamic autotuning. The proposed framework allows for
writing complex workflows which automatically use available HW resources and adjust to different HW
and data but at the same time are easy to maintainThe project that gave rise to these results received the support of a fellowship from the “la Caixa”
Foundation (ID 100010434). The fellowship code is LCF/BQ/DI18/11660021.
This project has received funding from the European Union’s Horizon 2020 research and innovation
programme under the Marie Skłodowska-Curie grant agreement No. 71367
- …