
    Iterative Schedule Optimization for Parallelization in the Polyhedron Model

    In high-performance computing, one primary objective is to exploit the performance that the given target hardware can deliver to the fullest. Compilers that can automatically optimize programs for a specific target hardware are highly useful in this context. Iterative (or search-based) compilation requires little or no prior knowledge and can adapt more easily to concrete programs and target hardware than static cost models and heuristics. Iterative compilation thereby helps in situations in which static heuristics do not reflect the combination of input program and target hardware well. Moreover, iterative compilation may enable the derivation of more accurate cost models and heuristics for optimizing compilers. In this context, the polyhedron model is of help, as it provides not only a mathematical representation of programs but, more importantly, a uniform representation of complex sequences of program transformations by schedule functions. The latter facilitates the systematic exploration of the set of legal transformations of a given program. Early approaches to purely iterative schedule optimization in the polyhedron model do not limit their search to schedules that preserve program semantics and thereby suffer from the need to explore large numbers of illegal schedules. More recent research ensures the legality of program transformations but presumes a sequential rather than a parallel execution of the transformed program. Other approaches do not perform a purely iterative optimization. We propose an approach to iterative schedule optimization for parallelization and tiling in the polyhedron model. Our approach targets loop programs that profit from data locality optimization and coarse-grained loop parallelization. The schedule search space can be explored either randomly or by means of a genetic algorithm. To determine a schedule's profitability, we rely primarily on measuring the transformed code's execution time. While benchmarking is accurate, it increases the time and resource consumption of program optimization tremendously and can even make it impractical. We address this limitation by learning surrogate models from schedules generated and evaluated in previous runs of the iterative optimization and by replacing benchmarking with performance prediction to the extent possible. Our evaluation on the PolyBench 4.1 benchmark set reveals that, in a given setting, iterative schedule optimization yields significantly higher speedups in the execution of the program to be optimized. Surrogate performance models learned from training data generated during previous iterative optimizations can reduce the benchmarking effort without strongly impairing the optimization result. A prerequisite for this approach is a sufficient similarity between the training programs and the program to be optimized.
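
    To make the search loop concrete, here is a minimal sketch of surrogate-assisted iterative optimization in Python. The helpers generate_legal_schedule(), benchmark(), and features() are hypothetical stand-ins for the thesis's schedule sampler, its code-generation-plus-timing harness, and its schedule feature extraction; this illustrates the idea, not the actual implementation.

```python
from sklearn.ensemble import RandomForestRegressor

def iterative_schedule_search(generate_legal_schedule, benchmark, features,
                              budget=100, train_after=30):
    """Random exploration of legal schedules; once enough measurements
    exist, a learned surrogate replaces expensive benchmarking."""
    history = []               # (feature vector, measured runtime) pairs
    model, best = None, None
    for _ in range(budget):
        schedule = generate_legal_schedule()  # semantics-preserving by construction
        feats = features(schedule)
        if model is None:
            runtime = benchmark(schedule)     # compile, run, and time the code
            history.append((feats, runtime))
            if len(history) >= train_after:   # enough data: fit the surrogate
                X, y = zip(*history)
                model = RandomForestRegressor().fit(list(X), list(y))
        else:
            runtime = model.predict([feats])[0]  # cheap performance prediction
        if best is None or runtime < best[1]:
            best = (schedule, runtime)
    return best
```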

    Optimizing work stealing algorithms with scheduling constraints

    The fork-join paradigm of concurrent expression has gained popularity in conjunction with work-stealing schedulers. Random work-stealing schedulers have been shown to perform dynamic load balancing effectively, yielding provably efficient schedules and space bounds on shared-memory architectures with uniform memory models. However, the advent of hierarchical, non-uniform multicore systems and large-scale distributed-memory architectures has reduced the efficacy of these scheduling policies. Furthermore, random work-stealing schedulers do not exploit persistence within iterative, scientific applications. In this thesis, we prove several properties of work-stealing schedulers that enable online tracing of tasks with very low overhead. We then describe new scheduling policies that use online schedule introspection to understand scheduler placement and thus improve performance on NUMA and distributed-memory architectures. Finally, by incorporating an inclusive data effect system into fork-join programs with schedule placement knowledge, we show how to transform a fork-join program to significantly improve locality.
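
    The invariant behind random work stealing is easy to state in code: each worker owns a double-ended queue, pushing and popping at one end, while idle workers steal from the opposite end of a randomly chosen victim. The Python sketch below is single-threaded and omits all synchronisation; it illustrates the classic scheme, not the scheduler developed in the thesis.

```python
import random
from collections import deque

class WorkerDeque:
    """Per-worker task deque: the owner pushes and pops at the bottom
    (newest tasks, good cache locality), while thieves steal from the
    top (oldest tasks, typically the largest unexplored subtrees)."""

    def __init__(self):
        self._tasks = deque()

    def push(self, task):      # owner: spawn a task
        self._tasks.append(task)

    def pop(self):             # owner: continue with the newest work
        return self._tasks.pop() if self._tasks else None

    def steal(self):           # thief: take the oldest work
        return self._tasks.popleft() if self._tasks else None

def steal_from_random_victim(deques, thief_id):
    """An idle worker picks a victim uniformly at random."""
    victims = [d for i, d in enumerate(deques) if i != thief_id]
    return random.choice(victims).steal() if victims else None
```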

    A method for the architectural design of distributed control systems for large, civil jet engines: a systems engineering approach

    The design of distributed control systems (DCSs) for large, civil gas turbine engines is a complex architectural challenge. To date, the majority of research into DCSs has focused on the contributing technologies and high-temperature electronics rather than on the architecture of the system itself. This thesis proposes a method for the architectural design of distributed systems that uses a genetic algorithm to generate, evaluate and refine designs. The proposed designs are analysed for their architectural quality, lifecycle value and commercial benefit. The method is presented along with results that prove the concept. Whilst the method described here is applied exclusively to DCSs for jet engines, the principles and methods could be adapted for a broad range of complex systems.
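
    The generate-evaluate-refine loop of such a genetic algorithm can be sketched as follows. The design representation, the variation operators, and a fitness that folds architectural quality, lifecycle value and commercial benefit into one score are all left abstract here, since the thesis's actual encodings are not given in this listing.

```python
import random

def genetic_design_search(random_design, crossover, mutate, fitness,
                          pop_size=50, generations=100, elite=5):
    """Generate, evaluate and refine candidate architectures.

    fitness(design) is assumed to combine architectural quality,
    lifecycle value and commercial benefit into one score to maximise.
    """
    population = [random_design() for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)   # evaluate and rank
        next_gen = population[:elite]                # keep the best designs
        while len(next_gen) < pop_size:
            a, b = random.sample(population[:pop_size // 2], 2)
            child = mutate(crossover(a, b))          # refine via variation
            next_gen.append(child)
        population = next_gen
    return max(population, key=fitness)
```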

    Application of map/reduce paradigm in supercomputing systems

    Ankara: Department of Computer Engineering and the Graduate School of Engineering and Science of Bilkent University, 2013. Thesis (Master's), Bilkent University, 2013. Includes bibliographical references (leaves 47-49). Map/Reduce is a framework first introduced by Google for rapidly developing big-data analytic applications on distributed computing systems. Even though the Map/Reduce paradigm has had a game-changing impact on certain fields of computer science, such as information retrieval and data mining, it has not yet had such an impact on the scientific computing domain. Current implementations of Map/Reduce are designed primarily for commodity PC clusters, where failures of compute nodes are common and inter-processor communication is slow. However, scientific computing applications are usually executed on high-performance computing (HPC) systems, which provide high communication bandwidth and low message latency, and on which processor failures are rare. On such systems, the Map/Reduce framework therefore incurs performance degradation and becomes less attractive for scientific computing. For these reasons, implementations of the Map/Reduce paradigm tailored to the scientific computing domain are needed. Among the existing implementations, we focus our attention on the MapReduce-MPI (MR-MPI) library developed at Sandia National Labs. In this thesis, we argue that, by utilizing the MR-MPI library, the Map/Reduce programming paradigm can be successfully applied to scientific computing applications that require scalability and performance. We tested the MR-MPI library on HPC systems with several fundamental algorithms that are frequently used in the scientific computing and data mining domains. The implemented algorithms include all-pair-similarity-search (APSS), all-pair-shortest-path (APSP), and page-rank (PR). Tests were performed on the well-known large-scale HPC systems IBM BlueGene/Q (Juqueen) and Cray XE6 (Hermit) to examine the scalability and speedup of these algorithms. (Demirci, Gündüz Vehbi, M.S.)
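
    MR-MPI itself is a C++/MPI library, so the pure-Python sketch below shows only the map-shuffle-reduce data flow of the paradigm, using the canonical word-count example; in MR-MPI the shuffle step is realised as message passing between MPI ranks rather than an in-memory grouping.

```python
from collections import defaultdict

def map_reduce(records, map_fn, reduce_fn):
    """Illustrates the map -> shuffle -> reduce data flow."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):   # map: emit (key, value) pairs
            groups[key].append(value)       # shuffle: group values by key
    return {k: reduce_fn(k, vs) for k, vs in groups.items()}   # reduce

# Word count, the canonical example:
lines = ["a b a", "b c"]
counts = map_reduce(lines,
                    map_fn=lambda line: [(w, 1) for w in line.split()],
                    reduce_fn=lambda key, values: sum(values))
# counts == {'a': 2, 'b': 2, 'c': 1}
```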

    Enhancing Productivity and Performance Portability of General-Purpose Parallel Programming

    This work focuses on compiler and run-time techniques for improving the productivity and the performance portability of general-purpose parallel programming. More specifically, we focus on shared-memory task-parallel languages, where the programmer explicitly exposes parallelism in the form of short tasks that may outnumber the cores by orders of magnitude. The compiler, the run-time, and the platform (henceforth the system) are responsible for harnessing this unpredictable amount of parallelism, which can vary from none to excessive, towards efficient execution. The challenge arises from the aspiration to support fine-grained irregular computations and nested parallelism. This work is even more ambitious in also aspiring to lay the foundations to efficiently support declarative code, where the programmer exposes all available parallelism using high-level language constructs such as parallel loops, reducers or futures. The appeal of declarative code is twofold for general-purpose programming: it is often easier for the programmer, who does not have to worry about the granularity of the exposed parallelism, and it achieves better performance portability by avoiding overfitting to the small range of platforms and inputs for which the programmer coarsens. Furthermore, PRAM algorithms, an important class of parallel algorithms, naturally lend themselves to declarative programming, so supporting it is a necessary condition for capitalizing on the wealth of PRAM theory. Unfortunately, declarative codes often expose such an overwhelming number of fine-grained tasks that existing systems fail to deliver performance. Our contributions can be partitioned into three components. First, we tackle the issue of coarsening, which declarative code leaves to the system. We identify two goals of coarsening and advocate tackling them separately, using static compiler transformations for one and dynamic run-time approaches for the other. Additionally, we present evidence that the current practice of burdening the programmer with coarsening leads either to codes with poor performance portability or to a significantly increased programming effort. This is a "show-stopper" for general-purpose programming. To compare the performance portability of the approaches, we define an experimental framework and two metrics, and we demonstrate that our approaches are preferable. We close the chapter on coarsening by presenting compiler transformations that automatically coarsen some types of very fine-grained codes. Second, we propose Lazy Scheduling, an innovative run-time scheduling technique that infers the platform load at run-time, using information the scheduler already maintains. Based on the inferred load, Lazy Scheduling adapts the amount of available parallelism it exposes for parallel execution and thus saves parallelism overheads that existing approaches pay. We implement Lazy Scheduling and present experimental results on four different platforms. The results show that Lazy Scheduling is vastly superior for declarative codes and competitive, if not better, for coarsened codes. Moreover, Lazy Scheduling is also superior in terms of performance portability, supporting our thesis that it is possible to achieve reasonable efficiency and performance portability with declarative codes. Finally, we also implement Lazy Scheduling on XMT, an experimental manycore platform developed at the University of Maryland, which was designed to support codes derived from PRAM algorithms. On XMT, we manage to harness the existing hardware support for scheduling flat parallelism and compose it with Lazy Scheduling, which supports nested parallelism. In the resulting hybrid scheduler, the hardware and software work in synergy to overcome each other's weaknesses. We show the performance composability of the hardware and software schedulers, both in an abstract cost model and experimentally, as the hybrid always performs better than the software scheduler alone. Furthermore, the cost model is validated by using it to predict whether it is preferable to execute a code sequentially, with outer parallelism, or with nested parallelism, depending on the input, the available hardware parallelism and the calling context of the parallel code.
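
    A minimal sketch of the idea behind Lazy Scheduling: materialise a real parallel task only when the inferred load suggests workers are idle, and otherwise run the work inline, avoiding task-creation overhead. The load probe below (the pending-task count of a Python thread pool, read via a CPython-internal attribute) and the threshold are illustrative stand-ins for the scheduler-internal information the thesis uses.

```python
import concurrent.futures as cf

pool = cf.ThreadPoolExecutor(max_workers=4)

def lazy_spawn(fn, *args, pending_low=4):
    """Expose a task for parallel execution only when workers look idle;
    otherwise execute it inline with no task-creation overhead."""
    # Illustrative load probe: the pool's pending-task count
    # (CPython-internal attribute, standing in for state a real
    # runtime already maintains).
    if pool._work_queue.qsize() < pending_low:
        return pool.submit(fn, *args)    # expose parallelism
    done = cf.Future()                   # run inline, wrap the result
    done.set_result(fn(*args))
    return done
```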

    Reachability analysis of uncertain max-plus systems (Análise de alcançabilidade em sistemas max plus incertos)

    Advisors: Rafael Santos Mendes, Laurent Hardouin, Mehdi Lhommeau. Thesis (doctorate), Universidade Estadual de Campinas, Faculdade de Engenharia Elétrica e de Computação. Abstract: Discrete Event Dynamic Systems (DEDS) are discrete-state systems whose dynamics are entirely driven by the occurrence of asynchronous events over time. Linear equations in the max-plus algebra can be used to describe DEDS subject to synchronization and time-delay phenomena. Reachability analysis concerns the computation of all states that can be reached by a dynamical system from an initial set of states. The reachability analysis problem for Max-Plus Linear (MPL) systems has been solved by characterizing MPL systems as a combination of Piece-Wise Affine (PWA) systems and then representing each component of the PWA system as a Difference-Bound Matrix (DBM). The main contribution of this thesis is a similar procedure for solving the reachability analysis problem for MPL systems subject to bounded noise, disturbances and/or modeling errors, called uncertain MPL (uMPL) systems. First, we present a procedure to partition the state space of a uMPL system into components that can be completely represented by DBMs. We then extend the reachability analysis of MPL systems to uMPL systems. Moreover, the results on reachability analysis of uMPL systems are used to solve the conditional reachability problem, which is closely related to the computation of the support of the probability density function involved in the stochastic filtering problem. (Doctorate in Electrical Engineering, Automation; CNPq grant 164765/2013-1; CAPES grant 99999.002340/2015-01)
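
    For readers unfamiliar with the max-plus algebra: it replaces addition with max and multiplication with +, so an MPL system x(k+1) = A ⊗ x(k) propagates earliest event times through synchronisation constraints. A small worked example in Python, with illustrative numbers:

```python
import numpy as np

def maxplus_matvec(A, x):
    """Max-plus product (A ⊗ x)_i = max_j (A[i, j] + x[j]):
    '+' plays the role of multiplication and 'max' the role of addition."""
    return np.max(A + x, axis=1)

# x(k+1) = A ⊗ x(k): entry A[i, j] is the delay from event j to event i,
# and x holds the earliest firing times of the events.
A = np.array([[2.0, 5.0],
              [3.0, 3.0]])
x0 = np.array([0.0, 0.0])
x1 = maxplus_matvec(A, x0)   # array([5., 3.])
x2 = maxplus_matvec(A, x1)   # array([8., 8.])
```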

    Mechanising an algebraic rely-guarantee refinement calculus

    PhD thesis. Despite rely-guarantee (RG) being a well-studied program logic established in the 1980s, it was not until recently that researchers realised that rely and guarantee conditions could be treated as independent programming constructs. This recent reformulation of RG paved the way to algebraic characterisations which have helped to better understand the difficulties that arise in the practical application of this development approach. The primary focus of this thesis is to provide automated tool support for a rely-guarantee refinement calculus proposed by Hayes et al., where rely and guarantee are defined as independent commands. Our motivation is to investigate the application of an algebraic approach to derive concrete examples using this calculus. In the course of this thesis, we locate and fix a few issues involving the refinement language, its operational semantics and preexisting proofs. Moreover, we extend the refinement calculus of Hayes et al. to cover indexed parallel composition, non-atomic evaluation of expressions within specifications, and assignment to indexed arrays. These extensions are illustrated via concrete examples. Special attention is given to design decisions that simplify the application of the mechanised theory. For example, we leave part of the design of the expression language in the hands of the user, at the cost of requiring the user to define the notion of undefinedness for unary and binary operators; and we formalise a notion of indexed parallelism that is parametric in the type of the indexes, deliberately, to simplify the formalisation of algorithms. Additionally, we use stratification to reduce the number of cases in simulation proofs involving the operational semantics. Finally, we also use the algebra to discuss the role of types in program derivation.

    Computer science I like: proceedings of miniconference on 4.11.2011


    Classification and Management of Computational Resources of Robotic Swarms and the Overcoming of their Constraints

    Swarm robotics is a relatively new and multidisciplinary research field with many potential applications (e.g., collective exploration or precision agriculture). Nevertheless, it has not been able to transition from the academic environment to the real world. While there are many potential reasons, one is that many swarm robots are designed to be relatively simple, which often results in reduced communication and computation capabilities. However, the investigation of such limitations has largely been overlooked. This thesis looks into one such constraint: the computational constraints of swarm robots (i.e., of swarm robotics platforms). To this end, the work first proposes a computational index that quantifies computational resources. Based on this index, a quantitative study of 5273 devices shows that swarm robots provide fewer resources than many other robots or devices. Next, an operating system with a novel dual-execution model is proposed and shown to outperform two other robotic system software stacks. Moreover, the results show that the choice of system software determines the computational overhead and, therefore, how many resources are available to robotic software. As communication can be a key aspect of a robot's behaviour, this work demonstrates the modelling, implementation, and study of an optical communication system with a novel dynamic detector. The detector improves the quality of service by orders of magnitude (i.e., it makes the communication more reliable). In addition, this work investigates general communication properties, such as scalability and the effects of mobility, and provides recommendations for the use of such optical communication systems in swarm robotics. Finally, an approach is shown by which the computational constraints of individual robots can be overcome by distributing data and processing across multiple robots.

    Tools for efficient Deep Learning

    In the era of Deep Learning (DL), there is a fast-growing demand for building and deploying Deep Neural Networks (DNNs) on various platforms. This thesis proposes five tools that address the challenges of designing DNNs that are efficient in time, in resources and in power consumption. We first present Aegis and SPGC, which address the challenges of improving the memory efficiency of DL training and inference. Aegis makes mixed-precision training (MPT) stabler through layer-wise gradient scaling. Empirical experiments show that Aegis can improve MPT accuracy by up to 4%. SPGC focuses on structured pruning: replacing standard convolution with group convolution (GConv) to avoid irregular sparsity. SPGC formulates GConv pruning as a channel permutation problem and proposes a novel heuristic polynomial-time algorithm. Common DNNs pruned by SPGC achieve up to 1% higher accuracy than prior work. This thesis also addresses the challenges lying in the gap between DNN descriptions and executables, with Polygeist for software and POLSCA for hardware. Several novel techniques, e.g. statement splitting and memory partitioning, are explored and used to extend polyhedral optimisation. Polygeist speeds up sequential and parallel software execution by 2.53x and 9.47x on Polybench/C. POLSCA achieves a 1.5x speedup over hardware designs generated directly from high-level synthesis on Polybench/C. Moreover, this thesis presents Deacon, a framework that generates FPGA-based DNN accelerators with streaming architectures and advanced pipelining techniques, addressing the challenges posed by heterogeneous convolutions and residual connections. Deacon provides fine-grained pipelining, graph-level optimisation, and heuristic exploration by graph colouring. Compared with prior designs, Deacon improves resource/power efficiency by 1.2x/3.5x for MobileNets and 1.0x/2.8x for SqueezeNets. All these tools are open source, and some have already gained public engagement. We believe they can make efficient deep learning applications easier to build and deploy.
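
    As an illustration of the structured sparsity SPGC targets, the PyTorch snippet below replaces a standard convolution with a group convolution preceded by a channel permutation; the random permutation is a placeholder for the one SPGC's heuristic algorithm would actually choose.

```python
import torch
import torch.nn as nn

# Standard convolution: every output channel sees all 64 input channels.
std_conv = nn.Conv2d(64, 64, kernel_size=3, padding=1)

# Group convolution with g groups: channels are split into g blocks and
# each output block only sees its own input block -- a regular sparsity
# pattern, unlike irregular element-wise pruning.
g = 4
gconv = nn.Conv2d(64, 64, kernel_size=3, padding=1, groups=g)

# A channel permutation places channels that need to interact in the
# same group; torch.randperm is a stand-in for SPGC's chosen permutation.
perm = torch.randperm(64)
x = torch.randn(1, 64, 32, 32)
y = gconv(x[:, perm])        # permute channels, then grouped convolution
```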