Heterogeneity-aware scheduling and data partitioning for system performance acceleration
Over the past decade, heterogeneous processors and accelerators have become increasingly prevalent in modern computing systems. Compared with previous homogeneous parallel machines, the hardware heterogeneity in modern systems provides new opportunities and challenges for performance acceleration. Classic operating systems optimisation problems such as task scheduling, and application-specific optimisation techniques such as the adaptive data partitioning of parallel algorithms, are both required to work together to address hardware heterogeneity.
Significant effort has been invested in this problem, but existing work either focuses on a specific type of heterogeneous system or algorithm, or offers a high-level framework without insight into how heterogeneity differs between types of system. A general software framework is required, one that not only adapts to multiple types of systems and workloads but is also equipped with techniques to address a variety of hardware heterogeneity.
This thesis presents approaches to design general heterogeneity-aware software frameworks for system performance acceleration. It covers a wide variety of systems, including an OS scheduler targeting on-chip asymmetric multi-core processors (AMPs) on mobile devices, a hierarchical many-core supercomputer and multi-FPGA systems for high performance computing (HPC) centers. Considering heterogeneity from on-chip AMPs, such as thread criticality, core sensitivity, and relative fairness, it suggests a collaboration-based approach to co-design the task selector and core allocator of the OS scheduler. Considering the typical sources of heterogeneity in HPC systems, such as the memory hierarchy, bandwidth limitations and asymmetric physical connection, it proposes an application-specific automatic data partitioning method for a modern supercomputer, and a schedule based on a topological-ranking heuristic for a multi-FPGA based reconfigurable cluster.
Experiments on both a full system simulator (GEM5) and real systems (Sunway Taihulight Supercomputer and Xilinx Multi-FPGA based clusters) demonstrate the significant advantages of the suggested approaches compared against the state-of-the-art on a variety of workloads.
"This work is supported by St Leonards 7th Century Scholarship and Computer Science PhD funding from University of St Andrews; by UK EPSRC grant Discovery: Pattern Discovery and Program Shaping for Manycore Systems (EP/P020631/1)." -- Acknowledgement
Low-power System-on-Chip Processors for Energy Efficient High Performance Computing: The Texas Instruments Keystone II
The High Performance Computing (HPC) community recognizes energy
consumption as a major problem. Extensive research is underway to
identify means to increase the energy efficiency of HPC systems,
including consideration of alternative building blocks for future
systems. This thesis considers one
such system, the Texas Instruments Keystone II, a heterogeneous
Low-Power System-on-Chip (LPSoC) processor that combines a quad
core ARM CPU with an octa-core Digital Signal Processor (DSP). It
was first released in 2012.
Four issues are considered: i) maximizing the Keystone II ARM CPU
performance; ii) implementation and extension of the OpenMP
programming model for the Keystone II; iii) simultaneous use of
ARM and DSP cores across multiple Keystone SoCs; and iv) an
energy model for applications running on LPSoCs like the Keystone
II and heterogeneous systems in general.
Maximizing the performance of the ARM CPU on the Keystone II
system is fundamental to the adoption of this system by the HPC
community, and of the ARM architecture more broadly. Key to
achieving good performance is exploitation of the ARM vector
instructions. This thesis presents the first detailed comparison
of the use of ARM compiler intrinsic functions with automatic
compiler vectorization across four generations of ARM processors.
Comparisons are also made with x86 based platforms and the use of
equivalent Intel vector instructions.
Implementation of the OpenMP programming model on the Keystone II
system presents both challenges and opportunities. Challenges in
that the OpenMP model was originally developed for a homogeneous
programming environment with a common instruction set
architecture, and in 2012 work had only just begun to consider
how OpenMP might work with accelerators. Opportunities in that
shared memory is accessible to all processing elements on the
LPSoC, offering performance advantages over what typically exists
with attached accelerators. This thesis presents an analysis of a
prototype version of OpenMP implemented as a bare-metal runtime
on the DSP of a Keystone I system. An implementation for the
Keystone II that maps OpenMP 4.0 accelerator directives to OpenCL
runtime library operations is presented and evaluated.
Exploitation of some of the underlying hardware features of the
Keystone II is also discussed.
Simultaneous use of the ARM and DSP cores across multiple
Keystone II boards is fundamental to the creation of commercially
viable HPC offerings based on Keystone technology. The nCore
BrownDwarf and HPE Moonshot systems represent two such systems.
This thesis presents a proof-of-concept implementation of matrix
multiplication (GEMM) for the BrownDwarf system. The BrownDwarf
utilizes both Keystone II and Keystone I SoCs through a
point-to-point interconnect called Hyperlink. Details of how a
novel message passing communication framework across Hyperlink
was implemented to support this complex environment are
provided.
An energy model that can be used to predict energy usage as a
function of what fraction of a particular computation is
performed on each of the available compute devices offers the
opportunity for making runtime decisions on how best to minimize
energy usage. This thesis presents a basic energy usage model
that considers rates of executions on each device and their
active and idle power usages. Using this model, it is shown that
only under certain conditions does there exist an energy-optimal
work partition that uses multiple compute devices. To validate
the model a high resolution energy measurement environment is
developed and used to gather energy measurements for a matrix
multiplication benchmark running on a variety of systems. Results
presented support the model.
Drawing on the four issues noted above and other developments
that have occurred since the Keystone II system was first
announced, the thesis concludes by making comments regarding the
future of LPSoCs as building blocks for HPC systems.
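The energy model summarised in this abstract can be illustrated with a small sketch. The abstract does not give the thesis's actual equations, so the form below is an assumption: two devices, each with a fixed execution rate, an active power and an idle power, where a device that finishes early idles until the other is done. The names `energy` and `best_partition` and all numbers are hypothetical.

```python
def energy(f, work, rate, p_active, p_idle):
    """Energy in joules when fraction f of `work` runs on device 0 and 1 - f on device 1.

    Assumed model (not taken from the thesis): rate[i] is in work units per second,
    p_active[i] and p_idle[i] are in watts, and both devices start together.
    """
    shares = (f, 1.0 - f)
    busy = [shares[i] * work / rate[i] for i in range(2)]  # busy time per device
    total = max(busy)                                      # run ends when the slower share ends
    return sum(p_active[i] * busy[i] + p_idle[i] * (total - busy[i]) for i in range(2))


def best_partition(work, rate, p_active, p_idle, steps=1000):
    """Scan f over [0, 1] on a uniform grid and return the lowest-energy fraction."""
    candidates = [i / steps for i in range(steps + 1)]
    return min(candidates, key=lambda f: energy(f, work, rate, p_active, p_idle))


# Two identical devices: splitting the work evenly minimises energy.
print(best_partition(100, (10, 10), (10, 10), (5, 5)))
# A slow, power-hungry second device: running everything on device 0 is best.
print(best_partition(100, (10, 1), (10, 20), (1, 1)))
```

Under this assumed model the energy is piecewise linear in f, so the minimum sits at f = 0, f = 1, or the point where both devices finish together. That is consistent with the abstract's statement that a multi-device optimum exists only under certain conditions.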
A Systematic Survey of General Sparse Matrix-Matrix Multiplication
SpGEMM (General Sparse Matrix-Matrix Multiplication) has attracted much
attention from researchers in the fields of multigrid methods and graph analysis.
Many optimization techniques have been developed for specific application fields
and computing architectures over the decades. The objective of this paper is to
provide a structured and comprehensive overview of the research on SpGEMM.
Existing optimization techniques have been grouped into different categories
based on their target problems and architectures. Covered topics include SpGEMM
applications, size prediction of result matrix, matrix partitioning and load
balancing, result accumulating, and target architecture-oriented optimization.
The rationales of different algorithms in each category are analyzed, and a
wide range of SpGEMM algorithms are summarized. This survey reviews the
progress and research status of SpGEMM optimization from 1977 to 2019.
More specifically, an experimental comparative study of existing
implementations on CPU and GPU is presented. Based on our findings, we
highlight future research directions and how future studies can leverage our
findings to encourage better design and implementation.
Comment: 19 pages, 11 figures, 2 tables, 4 algorithms
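Two of the topics listed in this abstract, result accumulation and size prediction of the result matrix, can be sketched with the classic row-wise (Gustavson-style) formulation. This is a minimal illustration, not any implementation evaluated in the survey. The dictionary-per-row storage and the function names `spgemm` and `upper_bound_nnz` are assumptions chosen for brevity.

```python
def spgemm(a_rows, b_rows):
    """Compute C = A * B row by row.

    Each matrix is a list of rows; each row is a dict {column: value}
    holding only the nonzero entries (an assumed, simplified storage format).
    """
    c_rows = []
    for a_row in a_rows:
        acc = {}  # hash-based accumulator for one row of C
        for k, a_val in a_row.items():
            # Scale row k of B by A[i][k] and merge it into the accumulator.
            for j, b_val in b_rows[k].items():
                acc[j] = acc.get(j, 0.0) + a_val * b_val
        c_rows.append(acc)
    return c_rows


def upper_bound_nnz(a_rows, b_rows):
    """Upper bound on the nonzeros in each row of C: the number of intermediate products."""
    return [sum(len(b_rows[k]) for k in a_row) for a_row in a_rows]


A = [{0: 1.0, 2: 2.0}, {1: 3.0}]
B = [{1: 4.0}, {0: 5.0, 1: 6.0}, {1: 7.0}]
print(spgemm(A, B))
print(upper_bound_nnz(A, B))
```

The bound counts every intermediate product, so it overestimates whenever several products land in the same column of C. This is why the size of the result has to be predicted or computed in a separate pass before memory is allocated.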
Enabling the use of embedded and mobile technologies for high-performance computing
In the late 1990s, powerful economic forces led to the adoption of commodity desktop processors in High-Performance Computing (HPC). This transformation has been so effective that the November 2016 TOP500 list is still dominated by the x86 architecture.
In 2016, the largest commodity market in computing is not PCs or servers, but mobile computing, comprising smartphones and tablets, most of which are built with ARM-based Systems on Chips (SoCs). This suggests that once mobile SoCs deliver sufficient performance, they can help reduce the cost of HPC.
This thesis addresses this question in detail. We analyze the trend in mobile SoC performance, comparing it with the similar trend in the 1990s. Through the development of real system prototypes and their performance analysis, we assess the feasibility of building an HPC system based on mobile SoCs. Through simulation of future mobile SoCs, we identify the missing features and suggest improvements that would enable the use of future mobile SoCs in HPC environments.
Thus, we present design guidelines for future generations of mobile SoCs, and HPC systems built around them, enabling a new class of cheap supercomputers.
Enabling high performance dynamic language programming for micro-core architectures
Micro-core architectures are intended to deliver high performance at a low overall power consumption by combining many simple central processing unit (CPU) cores, with an associated small amount of memory, onto a single chip. This technology is not only of great interest for embedded, Edge and IoT applications but also for High-Performance Computing (HPC) accelerators. However, micro-core architectures are difficult to program and exploit, not only because each technology is different, with its own idiosyncrasies, but also because they each present a different low-level interface to the programmer. Furthermore, micro-cores have very constrained amounts of on-chip, scratchpad memory (often around 32KB), further hampering programmer productivity by requiring the programmer to manually manage the regular loading and unloading of data from the host to the device during program execution. To help address these issues, dynamic languages such as Python have been ported to several micro-core architectures but these are often delivered as interpreters with the associated performance penalty over natively compiled languages, such as C.
The research questions for this thesis target four areas of concern for dynamic programming languages on micro-core architectures: (RQ1) how to manage the limited on-chip memory for data, (RQ2) how to manage the limited on-chip memory for code, (RQ3) how to address the low runtime performance of virtual machines and (RQ4) how to manage the idiosyncratic architectures of micro-core architectures. The focus of this work is to address these concerns whilst maintaining the programmer productivity benefits of dynamic programming languages, using ePython as the research vehicle. Therefore, key areas of design (such as abstractions for offload) and implementation (novel compiler and runtime techniques for these technologies) are considered, resulting in a number of approaches that are not only applicable to the compilation of Python codes but also more generally to other dynamic languages on micro-core architectures.
RQ1 was addressed by providing support for kernels with arbitrary data size through high-level programming abstractions that enable access to the memory hierarchies of micro-core devices, allowing the deployment of real-world applications, such as a machine learning code to detect cancer cells in full-sized scan images. A new abstract machine, Olympus, addressed RQ2 by supporting the compilation of dynamic languages, such as Python, to micro-core native code. Olympus enables ePython to close the kernel runtime performance gap with native C, matching C for the LINPACK and an iterative Fibonacci benchmark, and to provide, on average, around 75% of native C runtime performance across four benchmarks running on a set of eight CPU architectures. Olympus also addresses RQ3 by providing dynamic function loading, supporting kernel codes larger than the on-chip memory, whilst still retaining the runtime performance benefits of native code generation. Finally, RQ4 was addressed by the Eithne benchmarking framework which not only enabled a single benchmarking code to be deployed, unchanged, across different CPU architectures, but also provided the underlying communications framework for Olympus. The portability of end-user ePython codes and the underlying Olympus abstract machine were validated by running a set of four benchmarks on eight different CPU architectures, from a single codebase.
Optimizing Depthwise Separable Convolution Operations on GPUs
The depthwise separable convolution is widely used to reduce the computation overhead of multi-channel 2D convolutions. Existing implementations of depthwise separable convolutions target accelerating model training with large batch sizes, where a large number of samples are processed at once. Such approaches are inadequate for small-batch model training and for the typical model inference scenario, where the model takes in a few samples at once. This paper aims to bridge this gap by optimizing depthwise separable convolutions for the GPU architecture. We achieve this by designing two novel algorithms to improve the column and row reuse of convolution operations to reduce the number of memory operations. Our approach employs a dynamic tile size scheme to adaptively distribute the computational data across GPU threads to improve the GPU utilization and to hide the memory access latency. We apply our approach on two GPU platforms, NVIDIA RTX 2080Ti and NVIDIA Jetson AGX Xavier GPUs, and two data types, 32-bit floating point (FP32) and 8-bit integer (INT8). We compare our approach against cuDNN, which is heavily tuned for the NVIDIA GPU architecture. Experimental results show that our approach delivers over 2× (up to 3×) performance improvement over cuDNN.
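A short count of multiply-accumulate operations (MACs) shows why the depthwise separable convolution reduces computation compared with a standard multi-channel convolution. This sketch is not taken from the paper. It assumes an h × w output, a k × k filter, and the usual split into a per-channel depthwise stage followed by a 1 × 1 pointwise stage. The function names are hypothetical.

```python
def standard_conv_macs(h, w, c_in, c_out, k):
    """MACs for a standard convolution producing an h x w x c_out output."""
    return h * w * c_in * c_out * k * k


def depthwise_separable_macs(h, w, c_in, c_out, k):
    """MACs for a depthwise stage (one k x k filter per input channel)
    followed by a pointwise stage (1 x 1 convolution mixing channels)."""
    depthwise = h * w * c_in * k * k
    pointwise = h * w * c_in * c_out
    return depthwise + pointwise


# Example layer (illustrative sizes): 56 x 56 output, 128 -> 128 channels, 3 x 3 filter.
std = standard_conv_macs(56, 56, 128, 128, 3)
sep = depthwise_separable_macs(56, 56, 128, 128, 3)
print(sep / std)  # equals 1/c_out + 1/k**2
```

The ratio 1/c_out + 1/k² shows that the arithmetic shrinks by roughly a factor of k² for wide layers. The remaining cost is then dominated by memory traffic, which is the part the paper's column and row reuse algorithms target.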
Dense and sparse parallel linear algebra algorithms on graphics processing units
One line of development followed in the field of supercomputing is the use of special-purpose processors to speed up certain types of computations. In this thesis we study the use of graphics processing units as computing accelerators and apply it to the field of linear algebra. In particular, we work with the SLEPc library to solve large-scale eigenvalue problems, and to apply matrix functions in scientific applications. SLEPc is a parallel library based on the MPI standard and is developed with the premise of being scalable, i.e. of allowing larger problems to be solved by increasing the number of processing units.
We address the linear eigenvalue problem, Ax = λx in its standard form, using iterative techniques, in particular Krylov methods, with which we calculate a small portion of the eigenvalue spectrum. This type of algorithm is based on generating a subspace of reduced size (m) onto which the large problem (of dimension n) is projected, with m << n. Once the problem has been projected, it is solved by direct methods, which provide approximations of the eigenvalues of the initial problem. The operations used in the expansion of the subspace vary depending on whether the desired eigenvalues are in the exterior or the interior of the spectrum. When searching for exterior eigenvalues, the expansion is done by matrix-vector multiplications. We do this on the GPU, either by using libraries or by creating functions that take advantage of the structure of the matrix. For eigenvalues in the interior of the spectrum, the expansion requires solving linear systems of equations. In this thesis we implemented several algorithms that run on the GPU to solve linear systems of equations for the specific case of matrices with a block-tridiagonal structure.
In the computation of matrix functions we have to distinguish between the direct application of a matrix function, f(A), and the action of a matrix function on a vector, f(A)b. The first case involves a dense computation that limits the size of the problem. The second allows us to work with large sparse matrices, and to solve it we also make use of Krylov methods. The expansion of the subspace is done by matrix-vector multiplication, and we use GPUs in the same way as when solving eigenvalue problems. In this case the projected problem starts with size m, but it grows by m on each restart of the method. The projected problem is solved by directly applying a matrix function. We have implemented several algorithms to compute the square root and the exponential matrix functions, in which the use of GPUs allows us to speed up the computation.
Lamas Daviña, A. (2018). Dense and sparse parallel linear algebra algorithms on graphics processing units [Unpublished doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/112425
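The Krylov projection described in this abstract can be sketched with the Lanczos iteration, the simplest Krylov method for symmetric matrices. It is used here only as an illustration. The abstract does not say which Krylov variants the thesis or SLEPc use, and the sparse row format and function names below are assumptions. The subspace is expanded with one matrix-vector product per step, and the large problem of size n is reduced to a small m × m tridiagonal matrix.

```python
import math


def matvec(rows, x):
    """Sparse matrix-vector product; rows[i] is a list of (column, value) pairs."""
    return [sum(v * x[j] for j, v in row) for row in rows]


def lanczos(rows, b, m):
    """Run up to m Lanczos steps on a symmetric matrix.

    Returns the diagonal (alpha) and off-diagonal (beta) of the projected
    tridiagonal matrix. No reorthogonalization is done, so this is only
    suitable for small illustrative cases.
    """
    norm_b = math.sqrt(sum(x * x for x in b))
    q = [x / norm_b for x in b]          # current basis vector
    q_prev = [0.0] * len(b)              # previous basis vector
    alpha, beta = [], []
    beta_prev = 0.0
    for step in range(m):
        w = matvec(rows, q)              # subspace expansion
        a = sum(wi * qi for wi, qi in zip(w, q))
        alpha.append(a)
        # Orthogonalize against the two most recent basis vectors.
        w = [wi - a * qi - beta_prev * pi for wi, qi, pi in zip(w, q, q_prev)]
        b_next = math.sqrt(sum(wi * wi for wi in w))
        if step == m - 1 or b_next < 1e-12:
            break                        # subspace complete or invariant
        beta.append(b_next)
        q_prev, q = q, [wi / b_next for wi in w]
        beta_prev = b_next
    return alpha, beta


# Diagonal test matrix with eigenvalues 1, 2, 3 and 10 (illustrative data).
rows = [[(0, 1.0)], [(1, 2.0)], [(2, 3.0)], [(3, 10.0)]]
alpha, beta = lanczos(rows, [1.0, 1.0, 1.0, 1.0], 4)
print(alpha, beta)
```

The eigenvalues of the small tridiagonal matrix (the Ritz values) approximate eigenvalues of the original matrix, with the exterior ones converging first. In a GPU setting the `matvec` call is the step that would be offloaded.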