    Tools for efficient Deep Learning

    In the era of Deep Learning (DL), there is a fast-growing demand for building and deploying Deep Neural Networks (DNNs) on various platforms. This thesis proposes five tools to address the challenges of designing DNNs that are efficient in time, resources, and power consumption. We first present Aegis and SPGC to address the challenges of improving the memory efficiency of DL training and inference. Aegis makes mixed precision training (MPT) more stable through layer-wise gradient scaling. Empirical experiments show that Aegis can improve MPT accuracy by up to 4%. SPGC focuses on structured pruning: replacing standard convolution with group convolution (GConv) to avoid irregular sparsity. SPGC formulates GConv pruning as a channel permutation problem and proposes a novel heuristic polynomial-time algorithm. Common DNNs pruned by SPGC achieve up to 1% higher accuracy than prior work. This thesis also addresses the gap between DNN descriptions and executables, with Polygeist for software and POLSCA for hardware. Many novel techniques, e.g. statement splitting and memory partitioning, are explored and used to extend polyhedral optimisation. Polygeist speeds up sequential and parallel software execution by 2.53x and 9.47x on Polybench/C. POLSCA achieves a 1.5x speedup over hardware designs generated directly from high-level synthesis on Polybench/C. Moreover, this thesis presents Deacon, a framework that generates FPGA-based DNN accelerators with streaming architectures and advanced pipelining techniques to address the challenges posed by heterogeneous convolutions and residual connections. Deacon provides fine-grained pipelining, graph-level optimisation, and heuristic exploration by graph colouring. Compared with prior designs, Deacon improves resource/power efficiency by 1.2x/3.5x for MobileNets and 1.0x/2.8x for SqueezeNets. All these tools are open source, and some have already gained public engagement. We believe they can make efficient deep learning applications easier to build and deploy.
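    As a concrete illustration of the first idea, the sketch below shows why a single global loss scale can zero out the gradients of some layers in fp16 while a per-layer scale preserves them. It is a minimal NumPy toy, not the Aegis implementation; the layer names, gradient magnitudes, and scale-selection rule are all illustrative assumptions.

```python
import numpy as np

def roundtrip_fp16(grad, scale):
    """Scale, store as float16 (as MPT does), then unscale back in float32."""
    return (grad * scale).astype(np.float16).astype(np.float32) / scale

def layerwise_scale(grad, target=2.0**10):
    """Pick a power-of-two scale so this layer's largest gradient lands near
    `target`, keeping its small gradients out of fp16 underflow (toy rule)."""
    m = np.abs(grad).max()
    return 1.0 if m == 0.0 else 2.0 ** np.floor(np.log2(target / m))

rng = np.random.default_rng(0)
layers = {                      # synthetic per-layer gradients, not real data
    "early_conv": rng.normal(0.0, 1e-2,  10_000).astype(np.float32),
    "late_fc":    rng.normal(0.0, 1e-10, 10_000).astype(np.float32),
}
global_scale = 2.0**7           # one loss scale for the whole model

for name, g in layers.items():
    s = layerwise_scale(g)
    zeroed = lambda scale: np.mean(roundtrip_fp16(g, scale) == 0.0)
    print(f"{name:10s} zeroed grads: {zeroed(global_scale):6.1%} with global "
          f"scale, {zeroed(s):6.1%} with per-layer scale 2^{int(np.log2(s))}")
```

    Running the sketch shows the late layer's tiny gradients vanishing under the global scale but surviving under the per-layer one, which is the stability problem layer-wise scaling targets.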

    Automated cache optimisations of stencil computations for partial differential equations

    This thesis focuses on numerical methods that solve partial differential equations. Our focal point is the finite difference method, which solves partial differential equations by approximating derivatives with explicit finite differences. These partial differential equation solvers consist of stencil computations on structured grids. Stencils for real-world practical applications are often characterised by many memory accesses and non-trivial arithmetic expressions, leading to high computational costs compared with the simple stencils used in much prior proof-of-concept work. In addition, the loop nests expressing stencils on structured grids may often be complicated. This work is motivated by a specific domain of stencil computations where one of the challenges is operations that are not aligned to the structured grid ("off-the-grid"). These operations update neighbouring grid points through scatter and gather operations via non-affine memory accesses, such as A[B[i]]. In addition to this challenge, these practical stencils often include many computation fields (requiring multiple grid copies to be stored), complex data dependencies and imperfect loop nests. In this work, we aim to increase the performance of stencil kernel execution. We study automated cache-memory-dependent optimisations for stencil computations. This work consists of two core parts with their respective contributions.

    The first part of our work aims to reduce data movement in stencil computations of practical interest. Data movement is a dominant factor affecting the performance of high-performance computing applications. It has long been a target of optimisations due to its impact on execution time and energy consumption. This thesis tries to relieve this cost by applying temporal blocking optimisations, also known as time-tiling, to stencil computations. Temporal blocking is a well-known technique to enhance data reuse in stencil computations. However, it is rarely used in practical applications but rather in theoretical examples to prove its efficacy; applying temporal blocking to scientific simulations is more complex. More specifically, in this work we focus on the application context of seismic and medical imaging. In this area, we often encounter scatter and gather operations due to signal sources and receivers at arbitrary locations in the computational domain. These operations make the application of temporal blocking challenging. We present an approach to overcome this challenge and successfully apply temporal blocking.

    In the second part of our work, we extend the first part into an automated approach targeting a wide range of simulations modelled with partial differential equations. Since temporal blocking is error-prone, tedious to apply by hand, and highly complex to assimilate theoretically and practically, we are motivated to automate its application and automatically generate code that benefits from it. We discuss algorithmic approaches and present a generalised compiler pipeline to automate the application of temporal blocking. These passes are written in the Devito compiler and are used to accelerate the computation of stencil kernels in areas such as seismic and medical imaging, computational fluid dynamics and machine learning. Devito (www.devitoproject.org) is a Python package for implementing optimised stencil computations (e.g., finite differences, image processing, machine learning) from high-level symbolic problem definitions. Devito builds on SymPy (www.sympy.org) and employs automated code generation and just-in-time compilation to execute optimised computational kernels on several computer platforms, including CPUs, GPUs, and clusters thereof. We show how we automate temporal blocking code generation without user intervention and often achieve better time-to-solution. We enable domain-specific optimisation through compiler passes and offer temporal blocking gains from a high-level symbolic abstraction. These automated optimisations benefit various computational kernels for solving real-world application problems.
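    The sketch below illustrates the temporal blocking (time-tiling) idea on a 1-D three-point stencil in plain NumPy: each spatial tile is advanced several timesteps at once, with a redundantly recomputed halo so the result matches a naive timestep-by-timestep sweep. It is a hand-written toy for intuition, not the Devito-generated code or the thesis' compiler passes.

```python
import numpy as np

def step(u):
    """One 3-point Jacobi-style update; domain boundary values stay fixed."""
    v = u.copy()
    v[1:-1] = 0.25 * u[:-2] + 0.5 * u[1:-1] + 0.25 * u[2:]
    return v

def naive(u, nt):
    """Reference sweep: whole grid, one timestep at a time."""
    for _ in range(nt):
        u = step(u)
    return u

def time_tiled(u, nt, tile_t=4, tile_x=64):
    """Temporal blocking: advance each spatial tile several timesteps while it
    is still in cache. A halo of `tile_t` points per side is recomputed
    redundantly so the result matches the naive sweep exactly."""
    n, t = u.size, 0
    while t < nt:
        steps = min(tile_t, nt - t)
        new = np.empty_like(u)
        for x0 in range(0, n, tile_x):
            x1 = min(n, x0 + tile_x)
            lo, hi = max(0, x0 - steps), min(n, x1 + steps)
            block = u[lo:hi].copy()
            for _ in range(steps):          # several timesteps on one tile
                block = step(block)
            new[x0:x1] = block[x0 - lo:x0 - lo + (x1 - x0)]
        u = new
        t += steps
    return u

rng = np.random.default_rng(0)
u0 = rng.random(1000)
assert np.allclose(naive(u0, 20), time_tiled(u0, 20))
print("time-tiled result matches the naive sweep")
```

    The halo width equals the number of fused timesteps because, for a three-point stencil, the dependency footprint grows by one grid point per timestep; real compiler passes derive this extent automatically from the stencil radius.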

    The inherent overlapping in the parallel calculation of the Laplacian

    A new approach for the parallel computation of the Laplacian in the Fourier domain is presented. This numerical problem inherits the intrinsic sequencing involved in the calculation of any multidimensional Fast Fourier Transform (FFT), where blocking communications ensure that the computation is carried out strictly dimension by dimension. This data dependency vanishes when the Laplacian is considered as the sum of n independent one-dimensional kernels, so that computation and communication can be naturally overlapped using nonblocking communications. Overlapping is demonstrated to be responsible for the speedup figures we obtain when our approach is compared to state-of-the-art parallel multidimensional FFTs. Funding: Junta de Castilla y León (grant number VA296P18).
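    The decomposition the paper exploits can be checked in a few lines of NumPy: the spectral Laplacian obtained from one full multidimensional FFT equals the sum of independent one-dimensional second-derivative kernels, one per axis, which is what allows computation and communication to be overlapped in a distributed setting. The snippet below is a single-process illustration of that identity, not the paper's parallel implementation.

```python
import numpy as np

n, L = 64, 2 * np.pi
x = np.linspace(0, L, n, endpoint=False)
X, Y = np.meshgrid(x, x, indexing="ij")
u = np.sin(3 * X) * np.cos(2 * Y)            # smooth periodic test field

k = 2 * np.pi * np.fft.fftfreq(n, d=L / n)   # angular wavenumbers per axis

# (a) Laplacian via one full 2-D FFT (computed dimension by dimension)
KX, KY = np.meshgrid(k, k, indexing="ij")
lap_full = np.fft.ifftn(-(KX**2 + KY**2) * np.fft.fftn(u)).real

# (b) Laplacian as a sum of independent 1-D kernels, one per dimension;
#     each term needs only 1-D FFTs along its own axis, so in a distributed
#     setting the terms can be computed and communicated concurrently
d2x = np.fft.ifft(-(k[:, None] ** 2) * np.fft.fft(u, axis=0), axis=0).real
d2y = np.fft.ifft(-(k[None, :] ** 2) * np.fft.fft(u, axis=1), axis=1).real
lap_split = d2x + d2y

assert np.allclose(lap_full, lap_split)
print("max difference:", np.abs(lap_full - lap_split).max())
```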

    Procesamiento paralelo : Conceptos de arquitecturas y algoritmos

    Historically, the only way to tackle some processing problems has been through parallel computing. As computational complexity grows and real-time execution requirements become more demanding (artificial intelligence, neural networks, robotics, pattern recognition, scientific visualisation, finite element and fluid models, management of large databases, etc.), parallel processing becomes indispensable for obtaining acceptable response times. The first chapters of this book analyse architectural aspects (organisation of control, memory and communications) of parallel processing systems. The middle chapters take an algorithmic view of classes of parallel computing problems such as numerical computation, sorting, graphs and images. The topics of concurrency and synchronisation between processes executing in parallel are discussed. Performance parameters such as the speed-up factor and the degree of parallelism achievable in the examples presented are also analysed. The final chapters develop at length the problem of process migration in distributed parallel architectures. An experimental environment (design and implementation) for process migration is presented, together with experimental results for different message-handling policies. The intention of this book is not to cover every possible topic of what has traditionally been called parallel processing. What it does aim to preserve are the ideas of applying parallel processing and assessing its performance, studying cases and problems with a focus on optimising the design of both the architecture and the software. The underlying idea of applying parallel computing guides, to some extent, the presentation of the topics. The simplifications and assumptions made in the theoretical analyses are stated explicitly, and the examples are discussed with the aim of extracting the ideas they share with the class of problems to which they belong. Thesis digitised in SEDICI thanks to the Library of the Department of Mathematics, Faculty of Exact Sciences (UNLP).
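    The speed-up factor analysed in the book is commonly formalised by Amdahl's law; the short sketch below is not taken from the book and simply computes that bound, and the corresponding efficiency, for a workload whose parallel fraction is assumed to be 95%.

```python
def amdahl_speedup(parallel_fraction, processors):
    """Amdahl's law: upper bound on speed-up when part of the work is serial."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / processors)

def efficiency(parallel_fraction, processors):
    """Speed-up per processor: how well the added processors are used."""
    return amdahl_speedup(parallel_fraction, processors) / processors

for p in (2, 4, 8, 16, 64):
    print(f"{p:3d} processors: speed-up {amdahl_speedup(0.95, p):5.2f}, "
          f"efficiency {efficiency(0.95, p):.2f}")
```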

    Real-time analysis of MPI programs for NoC-based many-cores using time division multiplexing

    Worst-case execution time (WCET) analysis is crucial for designing hard real-time systems. While the WCET of tasks in a single-core system can be upper-bounded in isolation, tasks in a many-core system are subject to shared-memory interference, which leads to highly overestimated WCET bounds. However, massively parallel many-core applications will enter the area of real-time systems in the years ahead. Explicit message passing and a clear separation of computation and communication facilitate WCET analysis for such programs. A standard programming model for message-based communication is the Message Passing Interface (MPI). It provides an application-independent interface for standard communication operations (e.g. broadcast, gather) and uses efficient communication patterns with deterministic behaviour. By exploiting these known structures, we aim to provide a WCET analysis for communication that is reusable across applications as long as the communication is executed on the same underlying platform. Hence, the analysis needs to be performed only once per hardware platform and can be reused afterwards by adapting a few parameters, such as the number of nodes participating in the communication. Typically, the processing elements of many-core platforms are connected via a Network-on-Chip (NoC) and apply techniques such as time-division multiplexing (TDM) to provide guaranteed service for the network. Hence, the hardware and the applied guaranteed-service technique need to facilitate this reusability of the analysis as well. In this work we review different general-purpose TDM schedules that enable a WCET approximation independent of the placement of tasks on the processing elements of a many-core whose NoC has a torus topology. Furthermore, we provide two new schedules that perform similarly to the state-of-the-art schedules but additionally serve situations where the presented state-of-the-art schedules perform poorly. Based on these schedules, a procedure for the WCET analysis of the communication patterns used in MPI is proposed. Finally, we show how to apply the results of the analysis to calculate the WCET upper bound for a complete MPI program. Detailed insights into the performance of the applied TDM schedules are provided by comparing the schedules to each other in terms of timing. Additionally, we discuss the timing of the general-purpose schedules compared with a state-of-the-art application-specific TDM schedule to put the two types of schedules in relation. We apply the proposed procedure to several standard types of communication provided in MPI and compare different patterns that are used to implement a specific communication. Our evaluation investigates the building blocks of the communications' timing bounds and shows the tremendous impact of choosing the appropriate communication pattern. Finally, a case study demonstrates the application of the presented procedure to a complete MPI program. With the method proposed in this work it is possible to perform a reusable WCET timing analysis for the communication in a NoC that is independent of the placement of tasks on the chip. Moreover, as the applied schedules are not optimised for a specific application but can be used for all applications in the same way, there are only marginal changes in the timing of the communication when the software is adapted or updated. Thus, there is no need to perform the timing analysis from scratch in such cases.
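    To make the flavour of such an analysis concrete, the toy model below bounds a binomial-tree broadcast on a TDM NoC by composing a per-message worst case (wait for the owned slot, inject the flits, traverse the hops) with the number of tree rounds. All parameters and the per-message bound are illustrative assumptions for exposition; they are not the schedules or the analysis proposed in the paper.

```python
import math

def tdm_message_wcet(hops, payload_flits, tdm_period, slot_len, hop_latency):
    """Worst-case latency (cycles) of one message on a TDM NoC (toy model).

    Assumptions: the sender owns one slot of `slot_len` cycles per TDM period
    of `tdm_period` slots, injects one flit per owned slot, and each flit then
    needs `hop_latency` cycles per hop without contention (guaranteed service).
    """
    worst_wait = (tdm_period - 1) * slot_len                 # just missed our slot
    injection = (payload_flits - 1) * tdm_period * slot_len + slot_len
    network = hops * hop_latency
    return worst_wait + injection + network

def broadcast_wcet(nodes, max_hops, **link):
    # binomial-tree broadcast: ceil(log2(n)) sequential rounds, each bounded
    # by the worst single-message latency in the torus
    rounds = math.ceil(math.log2(nodes))
    return rounds * tdm_message_wcet(max_hops, **link)

print(broadcast_wcet(16, max_hops=4, payload_flits=8,
                     tdm_period=16, slot_len=3, hop_latency=3))
```

    The point of the sketch is the structure of the bound: it depends only on platform parameters and the number of participating nodes, which is what makes the analysis reusable across applications.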

    Network-on-Chip

    The limitations of bus-based interconnects in scalability, latency, bandwidth, and power consumption when supporting the huge number of on-chip resources result in a communication bottleneck. These challenges can be efficiently addressed by implementing a network-on-chip (NoC) system. This book gives a detailed analysis of various on-chip communication architectures and covers different areas of NoCs such as potentials, architecture, technical challenges, optimization, design exploration, and research directions. In addition, it discusses current and future trends that could make an impactful and meaningful contribution to the research and design of on-chip communications and NoC systems.

    Architectures for Multinode Superconducting Quantum Computers

    Many proposals to scale quantum technology rely on modular or distributed designs in which individual quantum processors, called nodes, are linked together to form one large multinode quantum computer (MNQC). One scalable method to construct an MNQC is to use superconducting quantum systems with optical interconnects. However, a limiting factor of these machines will be internode gates, which may be two to three orders of magnitude noisier and slower than local operations. Surmounting the limitations of internode gates will require a range of techniques, including improvements in entanglement generation, the use of entanglement distillation, and optimized software and compilers. It remains unclear how improvements to these components interact to affect overall system performance, what performance is required from each, or even how to quantify the performance of each. In this paper, we employ a 'co-design'-inspired approach to quantify overall MNQC performance in terms of hardware models of internode links, entanglement distillation, and local architecture. In the case of superconducting MNQCs with microwave-to-optical links, we uncover a tradeoff between entanglement generation and distillation that threatens to degrade performance. We show how to navigate this tradeoff, lay out how compilers should optimize between local and internode gates, and discuss when noisy quantum links have an advantage over purely classical links. Using these results, we introduce a roadmap for the realization of early MNQCs which illustrates potential improvements to the hardware and software of MNQCs and outlines criteria for evaluating the landscape, from progress in entanglement generation and quantum memory to dedicated algorithms such as distributed quantum phase estimation. While we focus on superconducting devices with optical interconnects, our approach is general across MNQC implementations. Comment: 23 pages, white paper
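    The generation/distillation tradeoff mentioned above can be illustrated with the textbook BBPSSW recurrence for Werner states (Bennett et al., 1996): each distillation round raises fidelity but consumes two pairs and only succeeds with some probability, so the usable pair rate drops. The rate model and the starting numbers below are simplifying assumptions for exposition, not the paper's hardware model.

```python
def bbpssw_round(F):
    """One round of BBPSSW distillation on two Werner-state pairs of fidelity F
    (standard textbook recurrence). Returns (output fidelity, success prob.)."""
    p_success = F**2 + 2 * F * (1 - F) / 3 + 5 * ((1 - F) / 3) ** 2
    F_out = (F**2 + ((1 - F) / 3) ** 2) / p_success
    return F_out, p_success

# Toy rate model: raw pairs arrive at `rate` with fidelity `F`; each round
# consumes two pairs and keeps the output only on success, so the usable
# pair rate shrinks while fidelity grows -- the tradeoff named in the paper.
rate, F = 1.0e5, 0.85            # illustrative numbers, not measured values
print(f"round 0: F = {F:.4f}, rate = {rate:10.1f} pairs/s")
for r in range(1, 4):
    F, p = bbpssw_round(F)
    rate = rate / 2 * p
    print(f"round {r}: F = {F:.4f}, rate = {rate:10.1f} pairs/s")
```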

    Parallel Algorithms and Generalized Frameworks for Learning Large-Scale Bayesian Networks

    Bayesian networks (BNs) are an important subclass of probabilistic graphical models that employ directed acyclic graphs to compactly represent exponential-sized joint probability distributions over a set of random variables. Since BNs enable probabilistic reasoning about interactions between the variables of interest, they have been successfully applied in a wide range of fields, including medical diagnosis, gene networks, cybersecurity, and epidemiology. Furthermore, the recent focus on the need for explainability in human-impact decisions made by machine learning (ML) models has led to a push for replacing the prevalent black-box models with inherently interpretable models like BNs for making high-stakes decisions in hitherto unexplored areas. Learning the exact structure of BNs from observational data is an NP-hard problem, and therefore a wide range of heuristic algorithms have been developed for this purpose. However, even the heuristic algorithms are computationally intensive. The existing software packages for BN structure learning that implement multiple algorithms are either completely sequential or support only limited parallelism, and they can take days to learn BNs with even a few thousand variables. Previous parallelization efforts have focused on one or two algorithms for specific applications and have not resulted in broadly applicable parallel software. This has prevented BNs from becoming a viable alternative to other ML models. In this dissertation, we develop efficient parallel versions of a variety of BN learning algorithms from two categories: six different constraint-based methods and a score-based method for constructing a specialization of BNs known as module networks. We also propose optimizations for the implementations of these parallel algorithms to achieve maximum performance in practice. Our proposed algorithms are scalable to thousands of cores and outperform the previous state-of-the-art by a large margin. We have made the implementations available as open-source software packages that can be used by ML and application-domain researchers for the expeditious learning of large-scale BNs.
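    The sketch below shows why constraint-based structure learning parallelises naturally: in the skeleton phase of a PC-style algorithm, the conditional-independence tests for different edges are mutually independent and can be farmed out to worker processes. It is a minimal illustration with a Fisher-z partial-correlation test, not one of the dissertation's six parallel algorithms.

```python
import itertools
from multiprocessing import Pool

import numpy as np

def ci_test(data, x, y, cond):
    """Fisher-z test of the partial correlation of x and y given `cond`.
    Returns True if x and y look conditionally independent (alpha ~ 0.05)."""
    sub = data[:, [x, y, *cond]]
    prec = np.linalg.pinv(np.corrcoef(sub, rowvar=False))
    r = -prec[0, 1] / np.sqrt(prec[0, 0] * prec[1, 1])
    z = 0.5 * np.log((1 + r) / (1 - r))
    return np.sqrt(len(data) - len(cond) - 3) * abs(z) < 1.96

def test_edge(args):
    """Return edge (x, y) if some small conditioning set separates it."""
    data, x, y, neighbours = args
    for size in (0, 1):      # a full PC run grows this and also uses adj(y)
        for cond in itertools.combinations(neighbours - {y}, size):
            if ci_test(data, x, y, cond):
                return (x, y)
    return None

def skeleton(data):
    d = data.shape[1]
    edges = {frozenset(e) for e in itertools.combinations(range(d), 2)}
    adj = {v: set(range(d)) - {v} for v in range(d)}
    jobs = [(data, x, y, adj[x]) for x, y in map(tuple, edges)]
    with Pool() as pool:                      # CI tests are independent,
        removed = pool.map(test_edge, jobs)   # so they parallelise trivially
    return edges - {frozenset(e) for e in removed if e}

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x0 = rng.normal(size=2000)
    x1 = 0.8 * x0 + rng.normal(size=2000)
    x2 = 0.8 * x1 + rng.normal(size=2000)
    # chain X0 -> X1 -> X2: expect the X0-X2 edge to be removed
    print(skeleton(np.column_stack([x0, x1, x2])))
```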

    Machine Learning Techniques for Performance Prediction and Diagnosis of VLSI Designs

    As the cost of scaling down the manufacturing process of integrated circuits grows and its performance gains become smaller, designs must grow in complexity to achieve the expected performance improvements. As this complexity grows, the development of automation tools for design, validation, and debug is critical. The number of machine learning-based techniques aiming to improve the available tools has grown rapidly in recent years, as machine learning has proven extraordinarily capable of extracting knowledge from data and handling complicated non-linear behaviors, making it the best option among mathematical and algorithmic approaches for mimicking a manual human process. The work presented in this dissertation evaluates the application of machine learning techniques in two different areas of the integrated circuit design process: pre-routing timing prediction and performance debugging of microprocessor cores. The strategy proposed for pre-route timing prediction is based on machine learning models that predict post-routing timing using only placed but unrouted circuit databases. This strategy prevents over-design due to pessimistic timing estimates and saves time by reducing the need for the multiple design iterations caused when inaccurate timing estimates guide circuit optimizations such as gate resizing, logic restructuring, or threshold voltage assignment, leading to design violations once routing is executed. The obtained results show that our models achieve prediction quality on par with a commercial sign-off static timing analysis tool, with a 3× speedup. For the performance debugging of microprocessor cores, we focus on bugs that affect the generation-over-generation performance improvement of new designs. This task is very challenging because, unlike its functional counterpart, it lacks an accurate golden performance model. In addition, there is limited visibility of performance at intermediate steps of the design and, overall, the debugging infrastructure is lacking, which makes this problem even more challenging. Currently this process is executed in a highly manual manner, requiring large amounts of time to fully characterize a bug. Therefore, automated techniques for performance debugging are essential to keep up with the performance gains delivered by new microarchitectural designs. In this dissertation, we focus on detecting the presence of a performance bug and localizing the microarchitectural unit in which the bug might reside; more detailed debugging is left for future work. Our proposed techniques achieve up to 91.5% bug-detection accuracy and up to 98% top-3 (out of 16 possible units) bug-localization accuracy on bugs with an average IPC impact greater than 1%.
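    A minimal sketch of the pre-routing timing-prediction strategy is shown below: a regression model is trained on placement-stage features to predict post-route timing. The features, the synthetic "post-route slack" target, and the random-forest choice are illustrative assumptions, not the dissertation's feature set or models.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 5000
# hypothetical placement-stage features per timing path (purely synthetic)
fanout   = rng.integers(1, 20, n)
est_wire = rng.uniform(10, 500, n)          # estimated wirelength, um
stages   = rng.integers(2, 30, n)           # logic depth
drive    = rng.uniform(0.5, 4.0, n)         # average drive strength

# synthetic "post-route slack": unknown at placement in practice, fabricated
# here so the example runs end to end
slack = (2.0 - 0.01 * est_wire / drive - 0.03 * stages
         - 0.005 * fanout + rng.normal(0, 0.05, n))

X = np.column_stack([fanout, est_wire, stages, drive])
X_tr, X_te, y_tr, y_te = train_test_split(X, slack, random_state=0)

# train on "routed" paths, then predict slack for unseen placed-only paths
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_tr, y_tr)
print("R^2 on held-out paths:", model.score(X_te, y_te))
```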

    Parallel and Distributed Computing

    The 14 chapters presented in this book cover a wide variety of representative works ranging from hardware design to application development. In particular, the topics addressed are programmable and reconfigurable devices and systems, dependability of GPUs (Graphics Processing Units), network topologies, cache coherence protocols, resource allocation, scheduling algorithms, peer-to-peer networks, large-scale network simulation, and parallel routines and algorithms. In this way, the articles included in this book constitute an excellent reference for engineers and researchers who have particular interests in each of these topics in parallel and distributed computing.