196 research outputs found

    Transparent synchronous dataflow: a functional paradigm for systems modelling and optimisation

    Get PDF
    System modelling is the use of mathematical formalisms to model real world systems for the purpose of analysis, simulation and prediction. One of the most common ways to model a system is to create the dataflow among its various components. There are two main approaches on how dataflow graphs are constructed in these system modelling frameworks: ‘define-and-run’ vs ‘define-by-run’. The former approach first creates a dataflow graph and then executes it by pushing data into it. The latter however constructs the graph while computing with data on-the-fly. ‘Define-and-run’ is usually more efficient because many graph optimisations can be applied; ‘Define-by-run’ however handles dynamic models better. This thesis aims to develop a new functional paradigm for systems modelling and optimisation that exhibits properties of both approaches where dataflow graphs are dynamic but efficient. We propose a new functional language, namely transparent synchronous dataflow (TSD), where dataflow graphs are constructed transparently with imperative commands to manipulate them explicitly; together with a synchronous mode of change propagation. The semantics of the language is designed on top of an unconventional graph abstract machine, Dynamic Geometry of Interaction Machine (DGoIM), which is natural for manipulating dataflow graphs. By using this semantics, the language is proved to be sound and efficient. Several experimental implementations were also created, including a native compiler for DGoIM and OCaml implementations for TSD

    Deployment of Deep Neural Networks on Dedicated Hardware Accelerators

    Get PDF
    Deep Neural Networks (DNNs) have established themselves as powerful tools for a wide range of complex tasks, for example computer vision or natural language processing. DNNs are notoriously demanding on compute resources and as a result, dedicated hardware accelerators for all use cases are developed. Different accelerators provide solutions from hyper scaling cloud environments for the training of DNNs to inference devices in embedded systems. They implement intrinsics for complex operations directly in hardware. A common example are intrinsics for matrix multiplication. However, there exists a gap between the ecosystems of applications for deep learning practitioners and hardware accelerators. HowDNNs can efficiently utilize the specialized hardware intrinsics is still mainly defined by human hardware and software experts. Methods to automatically utilize hardware intrinsics in DNN operators are a subject of active research. Existing literature often works with transformationdriven approaches, which aim to establish a sequence of program rewrites and data-layout transformations such that the hardware intrinsic can be used to compute the operator. However, the complexity this of task has not yet been explored, especially for less frequently used operators like Capsule Routing. And not only the implementation of DNN operators with intrinsics is challenging, also their optimization on the target device is difficult. Hardware-in-the-loop tools are often used for this problem. They use latency measurements of implementations candidates to find the fastest one. However, specialized accelerators can have memory and programming limitations, so that not every arithmetically correct implementation is a valid program for the accelerator. These invalid implementations can lead to unnecessary long the optimization time. This work investigates the complexity of transformation-driven processes to automatically embed hardware intrinsics into DNN operators. It is explored with a custom, graph-based intermediate representation (IR). While operators like Fully Connected Layers can be handled with reasonable effort, increasing operator complexity or advanced data-layout transformation can lead to scaling issues. Building on these insights, this work proposes a novel method to embed hardware intrinsics into DNN operators. It is based on a dataflow analysis. The dataflow embedding method allows the exploration of how intrinsics and operators match without explicit transformations. From the results it can derive the data layout and program structure necessary to compute the operator with the intrinsic. A prototype implementation for a dedicated hardware accelerator demonstrates state-of-the art performance for a wide range of convolutions, while being agnostic to the data layout. For some operators in the benchmark, the presented method can also generate alternative implementation strategies to improve hardware utilization, resulting in a geo-mean speed-up of ×2.813 while reducing the memory footprint. Lastly, by curating the initial set of possible implementations for the hardware-in-the-loop optimization, the median timeto- solution is reduced by a factor of ×2.40. At the same time, the possibility to have prolonged searches due a bad initial set of implementations is reduced, improving the optimization’s robustness by ×2.35

    Self-Supervised Learning to Prove Equivalence Between Straight-Line Programs via Rewrite Rules

    Full text link
    We target the problem of automatically synthesizing proofs of semantic equivalence between two programs made of sequences of statements. We represent programs using abstract syntax trees (AST), where a given set of semantics-preserving rewrite rules can be applied on a specific AST pattern to generate a transformed and semantically equivalent program. In our system, two programs are equivalent if there exists a sequence of application of these rewrite rules that leads to rewriting one program into the other. We propose a neural network architecture based on a transformer model to generate proofs of equivalence between program pairs. The system outputs a sequence of rewrites, and the validity of the sequence is simply checked by verifying it can be applied. If no valid sequence is produced by the neural network, the system reports the programs as non-equivalent, ensuring by design no programs may be incorrectly reported as equivalent. Our system is fully implemented for a given grammar which can represent straight-line programs with function calls and multiple types. To efficiently train the system to generate such sequences, we develop an original incremental training technique, named self-supervised sample selection. We extensively study the effectiveness of this novel training approach on proofs of increasing complexity and length. Our system, S4Eq, achieves 97% proof success on a curated dataset of 10,000 pairs of equivalent programsComment: 30 pages including appendi

    Understanding Legacy Workflows through Runtime Trace Analysis

    Get PDF
    abstract: When scientific software is written to specify processes, it takes the form of a workflow, and is often written in an ad-hoc manner in a dynamic programming language. There is a proliferation of legacy workflows implemented by non-expert programmers due to the accessibility of dynamic languages. Unfortunately, ad-hoc workflows lack a structured description as provided by specialized management systems, making ad-hoc workflow maintenance and reuse difficult, and motivating the need for analysis methods. The analysis of ad-hoc workflows using compiler techniques does not address dynamic languages - a program has so few constrains that its behavior cannot be predicted. In contrast, workflow provenance tracking has had success using run-time techniques to record data. The aim of this work is to develop a new analysis method for extracting workflow structure at run-time, thus avoiding issues with dynamics. The method captures the dataflow of an ad-hoc workflow through its execution and abstracts it with a process for simplifying repetition. An instrumentation system first processes the workflow to produce an instrumented version, capable of logging events, which is then executed on an input to produce a trace. The trace undergoes dataflow construction to produce a provenance graph. The dataflow is examined for equivalent regions, which are collected into a single unit. The workflow is thus characterized in terms of its treatment of an input. Unlike other methods, a run-time approach characterizes the workflow's actual behavior; including elements which static analysis cannot predict (for example, code dynamically evaluated based on input parameters). This also enables the characterization of dataflow through external tools. The contributions of this work are: a run-time method for recording a provenance graph from an ad-hoc Python workflow, and a method to analyze the structure of a workflow from provenance. Methods are implemented in Python and are demonstrated on real world Python workflows. These contributions enable users to derive graph structure from workflows. Empowered by a graphical view, users can better understand a legacy workflow. This makes the wealth of legacy ad-hoc workflows accessible, enabling workflow reuse instead of investing time and resources into creating a workflow.Dissertation/ThesisMasters Thesis Computer Science 201

    Towards Implicit Parallel Programming for Systems

    Get PDF
    Multi-core processors require a program to be decomposable into independent parts that can execute in parallel in order to scale performance with the number of cores. But parallel programming is hard especially when the program requires state, which many system programs use for optimization, such as for example a cache to reduce disk I/O. Most prevalent parallel programming models do not support a notion of state and require the programmer to synchronize state access manually, i.e., outside the realms of an associated optimizing compiler. This prevents the compiler to introduce parallelism automatically and requires the programmer to optimize the program manually. In this dissertation, we propose a programming language/compiler co-design to provide a new programming model for implicit parallel programming with state and a compiler that can optimize the program for a parallel execution. We define the notion of a stateful function along with their composition and control structures. An example implementation of a highly scalable server shows that stateful functions smoothly integrate into existing programming language concepts, such as object-oriented programming and programming with structs. Our programming model is also highly practical and allows to gradually adapt existing code bases. As a case study, we implemented a new data processing core for the Hadoop Map/Reduce system to overcome existing performance bottlenecks. Our lambda-calculus-based compiler automatically extracts parallelism without changing the program's semantics. We added further domain-specific semantic-preserving transformations that reduce I/O calls for microservice programs. The runtime format of a program is a dataflow graph that can be executed in parallel, performs concurrent I/O and allows for non-blocking live updates

    Towards Implicit Parallel Programming for Systems

    Get PDF
    Multi-core processors require a program to be decomposable into independent parts that can execute in parallel in order to scale performance with the number of cores. But parallel programming is hard especially when the program requires state, which many system programs use for optimization, such as for example a cache to reduce disk I/O. Most prevalent parallel programming models do not support a notion of state and require the programmer to synchronize state access manually, i.e., outside the realms of an associated optimizing compiler. This prevents the compiler to introduce parallelism automatically and requires the programmer to optimize the program manually. In this dissertation, we propose a programming language/compiler co-design to provide a new programming model for implicit parallel programming with state and a compiler that can optimize the program for a parallel execution. We define the notion of a stateful function along with their composition and control structures. An example implementation of a highly scalable server shows that stateful functions smoothly integrate into existing programming language concepts, such as object-oriented programming and programming with structs. Our programming model is also highly practical and allows to gradually adapt existing code bases. As a case study, we implemented a new data processing core for the Hadoop Map/Reduce system to overcome existing performance bottlenecks. Our lambda-calculus-based compiler automatically extracts parallelism without changing the program's semantics. We added further domain-specific semantic-preserving transformations that reduce I/O calls for microservice programs. The runtime format of a program is a dataflow graph that can be executed in parallel, performs concurrent I/O and allows for non-blocking live updates

    Towards Implicit Parallel Programming for Systems

    Get PDF
    Multi-core processors require a program to be decomposable into independent parts that can execute in parallel in order to scale performance with the number of cores. But parallel programming is hard especially when the program requires state, which many system programs use for optimization, such as for example a cache to reduce disk I/O. Most prevalent parallel programming models do not support a notion of state and require the programmer to synchronize state access manually, i.e., outside the realms of an associated optimizing compiler. This prevents the compiler to introduce parallelism automatically and requires the programmer to optimize the program manually. In this dissertation, we propose a programming language/compiler co-design to provide a new programming model for implicit parallel programming with state and a compiler that can optimize the program for a parallel execution. We define the notion of a stateful function along with their composition and control structures. An example implementation of a highly scalable server shows that stateful functions smoothly integrate into existing programming language concepts, such as object-oriented programming and programming with structs. Our programming model is also highly practical and allows to gradually adapt existing code bases. As a case study, we implemented a new data processing core for the Hadoop Map/Reduce system to overcome existing performance bottlenecks. Our lambda-calculus-based compiler automatically extracts parallelism without changing the program's semantics. We added further domain-specific semantic-preserving transformations that reduce I/O calls for microservice programs. The runtime format of a program is a dataflow graph that can be executed in parallel, performs concurrent I/O and allows for non-blocking live updates
    • …
    corecore