Tools for efficient Deep Learning
In the era of Deep Learning (DL), there is a fast-growing demand for building and deploying Deep Neural Networks (DNNs) on various platforms. This thesis proposes five tools to address the challenges of designing DNNs that are efficient in time, resources and power consumption.
We first present Aegis and SPGC to address the challenges of improving the memory efficiency of DL training and inference. Aegis makes mixed precision training (MPT) more stable through layer-wise gradient scaling. Experiments show that Aegis can improve MPT accuracy by up to 4%. SPGC focuses on structured pruning: replacing standard convolution with group convolution (GConv) to avoid irregular sparsity. SPGC formulates GConv pruning as a channel permutation problem and proposes a novel heuristic polynomial-time algorithm. Common DNNs pruned by SPGC achieve up to 1% higher accuracy than prior work.
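To illustrate the first of these ideas, here is a minimal Python sketch of dynamic, per-layer gradient scaling. The scale-update policy (halve a layer's scale on overflow, grow it after a run of clean steps) is the standard dynamic loss-scaling heuristic adapted to individual layers, not Aegis's published algorithm; the class and parameter names are illustrative only.

import numpy as np

# Illustrative sketch: per-layer dynamic gradient scaling in the spirit of
# layer-wise scaled mixed precision training. Not Aegis's actual algorithm.
class LayerWiseGradScaler:
    def __init__(self, layer_names, init_scale=2.0 ** 15, growth_interval=2000):
        self.scales = {name: init_scale for name in layer_names}
        self.clean_steps = {name: 0 for name in layer_names}
        self.growth_interval = growth_interval

    def scale_grad(self, name, grad):
        # Scale up before casting to fp16 so small gradients do not flush to zero.
        return (grad * self.scales[name]).astype(np.float16)

    def unscale(self, name, scaled_grad):
        # On overflow (inf/nan in fp16), shrink this layer's scale and skip
        # its weight update for this step; otherwise unscale back in fp32.
        if not np.all(np.isfinite(scaled_grad)):
            self.scales[name] /= 2.0
            self.clean_steps[name] = 0
            return None
        self.clean_steps[name] += 1
        if self.clean_steps[name] >= self.growth_interval:
            self.scales[name] *= 2.0  # probe a larger scale again
            self.clean_steps[name] = 0
        return scaled_grad.astype(np.float32) / self.scales[name]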
This thesis also addresses the gap between DNN descriptions and executables, with Polygeist for software and POLSCA for hardware. Novel techniques, e.g. statement splitting and memory partitioning, are explored and used to extend polyhedral optimisation. Polygeist speeds up sequential and parallel software execution by 2.53x and 9.47x respectively on Polybench/C. POLSCA achieves a 1.5x speedup over hardware designs generated directly from high-level synthesis on Polybench/C.
Moreover, this thesis presents Deacon, a framework that generates FPGA-based DNN accelerators with streaming architectures and advanced pipelining techniques to address the challenges posed by heterogeneous convolutions and residual connections. Deacon provides fine-grained pipelining, graph-level optimisation, and heuristic exploration by graph colouring. Compared with prior designs, Deacon improves resource/power efficiency by 1.2x/3.5x for MobileNets and 1.0x/2.8x for SqueezeNets.
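The abstract does not detail the colouring formulation, so the following is only the textbook greedy heuristic that such an explorer could build on; the function and the toy conflict graph are illustrative assumptions.

# Generic greedy graph colouring (highest-degree-first ordering). Nodes that
# conflict (e.g. compete for the same hardware resource) receive different
# colours; the number of colours used bounds the resources needed.
def greedy_colouring(adjacency):
    """adjacency: dict mapping each node to the set of its neighbours."""
    order = sorted(adjacency, key=lambda n: len(adjacency[n]), reverse=True)
    colour = {}
    for node in order:
        used = {colour[nb] for nb in adjacency[node] if nb in colour}
        colour[node] = next(c for c in range(len(adjacency)) if c not in used)
    return colour

g = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"}}
print(greedy_colouring(g))  # e.g. {'a': 0, 'b': 1, 'c': 2, 'd': 1}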
All these tools are open source, and some have already gained public engagement. We believe they can make efficient deep learning applications easier to build and deploy.
Automated cache optimisations of stencil computations for partial differential equations
This thesis focuses on numerical methods that solve partial differential equations. Our focal point is the finite difference method, which solves partial differential equations by approximating derivatives with explicit finite differences. These partial differential equation solvers consist of stencil computations on structured grids. Stencils used in real-world practical applications are often characterised by many memory accesses and non-trivial arithmetic expressions, leading to high computational costs compared with the simple stencils used in much prior proof-of-concept work. In addition, the loop nests that express stencils on structured grids are often complicated.
This work is motivated by a specific domain of stencil computations in which one of the challenges is operations that are not aligned to the structured grid ("off-the-grid" operations). These operations update neighbouring grid points through scatter and gather operations via non-affine memory accesses such as A[B[i]]. In addition to this challenge, these practical stencils often involve many computation fields (requiring multiple copies of the grid to be stored), complex data dependencies and imperfect loop nests.
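A minimal NumPy sketch of this access pattern (the grid, index and value arrays are illustrative):

import numpy as np

# "Off-the-grid" sources/receivers live at arbitrary grid points listed in an
# index array, so updates scatter into the grid and reads gather from it via
# non-affine (indirect) accesses of the form A[B[i]].
grid = np.zeros(100)
points = np.array([3, 17, 17, 42])      # arbitrary, possibly repeated indices
values = np.array([1.0, 0.5, 0.5, 2.0])

# Scatter: accumulate source contributions. np.add.at handles repeated
# indices correctly, unlike the unbuffered grid[points] += values.
np.add.at(grid, points, values)

# Gather: sample the field back at the receiver locations.
samples = grid[points]
print(grid[17], samples)  # 1.0 [1. 1. 1. 2.]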
In this work, we aim to increase the performance of stencil kernel execution.
We study automated cache-memory-dependent optimisations for stencil computations.
This work consists of two core parts, each with its respective contributions.

The first part of our work aims to reduce the data movement in stencil computations of practical interest.
Data movement is a dominant factor affecting the performance of high-performance computing applications.
It has long been a target of optimisations due to its impact on execution time and energy consumption.
This thesis addresses this cost by applying temporal blocking optimisations, also known as time-tiling, to stencil computations.
Temporal blocking is a well-known technique to enhance data reuse in stencil computations.
However, it is rarely applied in practical applications; it appears mostly in proof-of-concept examples that demonstrate its efficacy.
Applying temporal blocking to scientific simulations is more complex.
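To fix ideas before turning to those complications, here is a minimal sketch of overlapped temporal blocking for a 1D three-point stencil with held boundary values; production schemes use smarter tile shapes (trapezoids, diamonds), and the tile width, stencil weights and fused step count below are illustrative assumptions:

import numpy as np

def step(u):
    # One explicit update of a three-point stencil; boundary cells held fixed.
    v = u.copy()
    v[1:-1] = 0.25 * u[:-2] + 0.5 * u[1:-1] + 0.25 * u[2:]
    return v

def naive(u, nt):
    # Baseline: sweep the whole grid once per time step.
    for _ in range(nt):
        u = step(u)
    return u

def time_tiled(u, nt, tile=32):
    # Overlapped tiling: widen each tile by nt halo cells per side, advance
    # nt steps locally (recomputing the halo redundantly), and write back
    # only the tile interior. The tile stays in cache across the nt steps.
    n, out = len(u), u.copy()
    for lo in range(0, n, tile):
        hi = min(lo + tile, n)
        wlo, whi = max(lo - nt, 0), min(hi + nt, n)
        local = u[wlo:whi].copy()
        for _ in range(nt):
            local = step(local)
        out[lo:hi] = local[lo - wlo:hi - wlo]
    return out

u0 = np.random.rand(256)
assert np.allclose(naive(u0, 4), time_tiled(u0, 4))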
More specifically, in this work, we focus on the application context of seismic and medical imaging.
In this area, we often encounter scatter and gather operations due to signal sources and receivers at arbitrary locations in the computational domain.
These operations make the application of temporal blocking challenging.
We present an approach that overcomes this challenge and successfully applies temporal blocking.

In the second part of our work, we extend the first part into an automated approach targeting a wide range of simulations modelled with partial differential equations.
Since temporal blocking is error-prone, tedious to apply by hand and highly complex to assimilate theoretically and practically, we are motivated to automate its application and automatically generate code that benefits from it.
We discuss algorithmic approaches and present a generalised compiler pipeline to automate the application of temporal blocking.
These compiler passes are implemented in the Devito compiler and are used to accelerate the computation of stencil kernels in areas such as seismic and medical imaging, computational fluid dynamics and machine learning.
Devito (www.devitoproject.org) is a Python package for implementing optimised stencil computations (e.g., finite differences, image processing, machine learning) from high-level symbolic problem definitions.
Devito builds on SymPy (www.sympy.org) and employs automated code generation and just-in-time compilation to execute optimised computational kernels on several computer platforms, including CPUs, GPUs, and clusters thereof.
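A minimal usage sketch, in the style of Devito's public tutorials (a 2D diffusion problem; the grid size, diffusivity and time step are illustrative choices that satisfy the explicit-scheme stability limit):

from devito import Grid, TimeFunction, Eq, Operator, solve

grid = Grid(shape=(200, 200), extent=(1.0, 1.0))
u = TimeFunction(name="u", grid=grid, space_order=2, time_order=1)
u.data[0, 95:105, 95:105] = 1.0  # initial hot spot

# Symbolic PDE u_t = alpha * laplace(u); Devito rearranges it into an
# explicit update, then generates and JIT-compiles optimised C code.
alpha = 0.5
pde = Eq(u.dt, alpha * u.laplace)
update = Eq(u.forward, solve(pde, u.forward))
Operator(update).apply(dt=1e-5, time_M=200)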
We show how we automate temporal blocking code generation without user intervention and often achieve better time-to-solution.
We enable domain-specific optimisation through compiler passes and offer temporal blocking gains from a high-level symbolic abstraction.
These automated optimisations benefit various computational kernels for solving real-world application problems.
The inherent overlapping in the parallel calculation of the Laplacian
A new approach for the parallel computation of the Laplacian in the Fourier domain is presented. This numerical problem inherits the intrinsic sequencing involved in the calculation of any multidimensional Fast Fourier Transform (FFT), where blocking communications ensure that its computation is strictly carried out dimension by dimension. Such data dependency vanishes when one considers the Laplacian as the sum of n independent one-dimensional kernels, so that computation and communication can be naturally overlapped with nonblocking communications. Overlapping is demonstrated to be responsible for the speedup figures we obtain when our approach is compared to state-of-the-art parallel multidimensional FFTs. Funded by the Junta de Castilla y León (grant number VA296P18).
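The decomposition that removes the data dependency can be shown in a few lines of NumPy (serial here; in the parallel setting each one-dimensional term is computed independently and its communication overlapped with the others' computation via nonblocking operations). The test field and domain sizes are illustrative:

import numpy as np

# Spectral Laplacian of a periodic field as the sum of n independent 1D
# kernels, one per axis: each term needs only 1D FFTs along its own axis.
def laplacian_fourier(u, lengths):
    lap = np.zeros_like(u)
    for axis, L in enumerate(lengths):
        k = 2.0 * np.pi * np.fft.fftfreq(u.shape[axis], d=L / u.shape[axis])
        shape = [1] * u.ndim
        shape[axis] = u.shape[axis]
        u_hat = np.fft.fft(u, axis=axis)
        lap += np.fft.ifft(-(k.reshape(shape) ** 2) * u_hat, axis=axis).real
    return lap

# Analytic check: laplace(sin(x)cos(2y)) = -5 sin(x)cos(2y) on [0, 2*pi)^2.
n = 64
x = np.linspace(0, 2 * np.pi, n, endpoint=False)
X, Y = np.meshgrid(x, x, indexing="ij")
u = np.sin(X) * np.cos(2 * Y)
assert np.allclose(laplacian_fourier(u, [2 * np.pi] * 2), -5 * u)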
Task-based Runtime Optimizations Towards High Performance Computing Applications
The last decades have witnessed a rapid improvement of computational capabilities in high-performance computing (HPC) platforms thanks to hardware technology scaling. HPC architectures benefit from mainstream advances in hardware with many-core systems, deep hierarchical memory subsystems, non-uniform memory access, and an ever-increasing gap between computational power and memory bandwidth. This has necessitated continuous adaptations across the software stack to maintain high hardware utilization. In this HPC landscape of potentially million-way parallelism, task-based programming models associated with dynamic runtime systems are becoming more popular; they foster developer productivity at extreme scale by abstracting the underlying hardware complexity.
In this context, this dissertation highlights how a software bundle powered by a task-based programming model can address the heterogeneous workloads engendered by HPC applications, namely data redistribution, geostatistical modeling and 3D unstructured mesh deformation. Data redistribution aims to reshuffle data to optimize some objective for an algorithm, where the objective can be multi-dimensional, such as improving computational load balance or decreasing communication volume or cost, with the ultimate goal of increasing the efficiency and therefore reducing the time-to-solution of the algorithm. Geostatistical modeling, one of the prime motivating applications for exascale computing, is a technique for predicting desired quantities from geographically distributed data, based on statistical models and optimization of parameters. Meshing the deformable contour of moving 3D bodies is an expensive operation that poses substantial computational challenges in fluid-structure interaction (FSI) applications. Therefore, in this dissertation, Redistribute-PaRSEC, ExaGeoStat-PaRSEC and HiCMA-PaRSEC are proposed to tackle these three HPC applications efficiently at extreme scale, and they are evaluated on multiple HPC clusters, including AMD-, Intel- and Arm-based CPU systems and an IBM-based multi-GPU system. This multidisciplinary work emphasizes the need for runtime systems to go beyond their primary responsibility of task scheduling on massively parallel hardware systems in order to service next-generation scientific applications.
Advances in Grid Computing
This book approaches grid computing with a perspective on the latest achievements in the field, providing an insight into current research trends and advances, and presenting a large range of innovative research papers. The topics covered in this book include resource and data management, grid architectures and development, and grid-enabled applications. New ideas employing heuristic methods from swarm intelligence, genetic algorithms and quantum encryption are considered in order to address two main aspects of grid computing: resource management and data management. The book also covers aspects of grid computing concerning architecture and development, and includes a diverse range of applications, such as a possible human grid computing system, simulation of the fusion reaction, ubiquitous healthcare service provisioning and complex water systems.
GVSU Undergraduate and Graduate Catalog, 2021-2022
Grand Valley State University 2021-2022 undergraduate and graduate course catalog. Course catalogs are published annually to provide students with information and guidance for enrollment.
An MPI-based 2D data redistribution library for dense arrays
In HPC, data redistributions (reorganizations) are used in parallel applications to improve performance and/or provide data-locality compatibility with sequences of parallel operations. Data reorganization refers to changing the logical and physical arrangement of data (such as dense arrays or matrices distributed over peer processes in a parallel program). This operation can be achieved by applying transformations such as transpositions or rotations, or by changing how data is mapped across the P-by-Q process grid, all of which are accomplished either with message passing or distributed shared memory operations. In this project, we restrict ourselves to a distributed-memory model and message passing; we use neither a shared-memory model nor distributed shared memory APIs.

Our primary goal is to build a library capable of diverse data reorganizations. We aim to develop a high-level Application Programming Interface (API) that works directly with the Message Passing Interface (MPI) to accomplish data redistributions in data-parallel applications and libraries, such as the polymath library, a library of parallel dense matrix multiplication algorithms with flexible data layouts. Using reorganizations between process grid shapes with a constant total number of processes, we plan to observe the performance trends of the polymath library (which is related research) across different grid shapes, problem sizes and numbers of processes, and decide whether redistributing data is more efficient for achieving the highest performance. We will test other redistributions for some of the process shapes used with the polymath library to identify how redistribution impacts performance. These tests will provide the information needed to determine whether redistribution is worthwhile for non-optimal process layouts (as established with the polymath system). Besides changing the shape of the data in terms of its layout on a logical process topology, we also plan to study and demonstrate data transpose algorithms in this library, another useful redistribution mechanism for dense linear algebra in distributed-memory parallel computing.
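As a purely illustrative sketch of the planning step behind such a redistribution (a hypothetical helper, not this library's API): it computes each block's owner under a block-cyclic layout and derives the transfer list needed when the process grid changes shape; a message-passing layer (e.g. MPI_Alltoallv) would then execute the plan.

# Hypothetical planning helper for 2D block-cyclic redistribution.
def block_owner(bi, bj, P, Q):
    # Block (bi, bj) lives on process (bi % P, bj % Q), row-major linearised.
    return (bi % P) * Q + (bj % Q)

def redistribution_plan(n_block_rows, n_block_cols, old_grid, new_grid):
    plan = []  # (src_rank, dst_rank, block) triples; same-rank blocks stay put
    for bi in range(n_block_rows):
        for bj in range(n_block_cols):
            src = block_owner(bi, bj, *old_grid)
            dst = block_owner(bi, bj, *new_grid)
            if src != dst:
                plan.append((src, dst, (bi, bj)))
    return plan

# Reshape a 4x4 grid of blocks from a 4x1 process grid to a 2x2 one.
moves = redistribution_plan(4, 4, old_grid=(4, 1), new_grid=(2, 2))
print(len(moves), "blocks must move, e.g.", moves[0])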
Polyhedral+Dataflow Graphs
This research presents an intermediate compiler representation that is designed for optimization and emphasizes the temporary storage requirements and execution schedule of a given computation to guide optimization decisions. The representation is expressed as a dataflow graph that describes computational statements and data mappings within the polyhedral compilation model. The targeted applications include both regular and irregular scientific domains.
The intermediate representation can be integrated into existing compiler infrastructures. A specification language, implemented as a domain-specific language in C++, describes the graph components and the transformations that can be applied. The visual representation allows users to reason about optimizations. Graph variants can be translated into source code or other representations. The language, intermediate representation, and associated transformations have been applied to improve the performance of differential equation solvers, sparse matrix operations, tensor decomposition, and structured multigrid methods.
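To make the structure concrete, here is a small Python mock-up of such a graph; the actual work implements this as a C++ embedded DSL, so the class and field names below are illustrative assumptions only:

from dataclasses import dataclass, field

@dataclass
class Statement:
    name: str
    domain: str    # polyhedral iteration domain, e.g. "{ S0[i,j] : 0 <= i,j < N }"
    schedule: str  # execution order, e.g. "{ S0[i,j] -> [0,i,j] }"

@dataclass
class DataflowEdge:
    producer: str
    consumer: str
    mapping: str   # data mapping between statements, e.g. "{ S0[i,j] -> y[i] }"

@dataclass
class PolyhedralDataflowGraph:
    statements: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)

    def add(self, stmt):
        self.statements[stmt.name] = stmt

    def connect(self, producer, consumer, mapping):
        self.edges.append(DataflowEdge(producer, consumer, mapping))

# y = A*x (statement S0) feeding an elementwise scaling of y (statement S1).
g = PolyhedralDataflowGraph()
g.add(Statement("S0", "{ S0[i,j] : 0 <= i,j < N }", "{ S0[i,j] -> [0,i,j] }"))
g.add(Statement("S1", "{ S1[i] : 0 <= i < N }", "{ S1[i] -> [1,i] }"))
g.connect("S0", "S1", "{ S0[i,j] -> y[i] }")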