196 research outputs found
Transparent synchronous dataflow: a functional paradigm for systems modelling and optimisation
System modelling is the use of mathematical formalisms to model real world systems for the purpose of analysis, simulation and prediction. One of the most common ways to model a system is to create the dataflow among its various components. There are two main approaches on how dataflow graphs are constructed in these system modelling frameworks: ‘define-and-run’ vs ‘define-by-run’. The former approach first creates a dataflow graph and then executes it by pushing data into it. The latter however constructs the graph while computing with data on-the-fly. ‘Define-and-run’ is usually more efficient because many graph optimisations can be applied; ‘Define-by-run’ however handles dynamic models better. This thesis aims to develop a new functional paradigm for systems modelling and optimisation that exhibits properties of both approaches where dataflow graphs are dynamic but efficient.
We propose a new functional language, namely transparent synchronous dataflow (TSD), where dataflow graphs are constructed transparently with imperative commands to manipulate them explicitly; together with a synchronous mode of change propagation. The semantics of the language is designed on top of an unconventional graph abstract machine, Dynamic Geometry of Interaction Machine (DGoIM), which is natural for manipulating dataflow graphs. By using this semantics, the language is proved to be sound and efficient. Several experimental implementations were also created, including a native compiler for DGoIM and OCaml implementations for TSD
Deployment of Deep Neural Networks on Dedicated Hardware Accelerators
Deep Neural Networks (DNNs) have established themselves as powerful tools for
a wide range of complex tasks, for example computer vision or natural language
processing. DNNs are notoriously demanding on compute resources and as a
result, dedicated hardware accelerators for all use cases are developed. Different
accelerators provide solutions from hyper scaling cloud environments for the
training of DNNs to inference devices in embedded systems. They implement
intrinsics for complex operations directly in hardware. A common example
are intrinsics for matrix multiplication. However, there exists a gap between
the ecosystems of applications for deep learning practitioners and hardware
accelerators. HowDNNs can efficiently utilize the specialized hardware intrinsics
is still mainly defined by human hardware and software experts.
Methods to automatically utilize hardware intrinsics in DNN operators are a
subject of active research. Existing literature often works with transformationdriven
approaches, which aim to establish a sequence of program rewrites and
data-layout transformations such that the hardware intrinsic can be used to
compute the operator. However, the complexity this of task has not yet been
explored, especially for less frequently used operators like Capsule Routing. And
not only the implementation of DNN operators with intrinsics is challenging,
also their optimization on the target device is difficult. Hardware-in-the-loop
tools are often used for this problem. They use latency measurements of implementations
candidates to find the fastest one. However, specialized accelerators
can have memory and programming limitations, so that not every arithmetically
correct implementation is a valid program for the accelerator. These invalid
implementations can lead to unnecessary long the optimization time.
This work investigates the complexity of transformation-driven processes to
automatically embed hardware intrinsics into DNN operators. It is explored
with a custom, graph-based intermediate representation (IR). While operators
like Fully Connected Layers can be handled with reasonable effort, increasing
operator complexity or advanced data-layout transformation can lead to scaling issues.
Building on these insights, this work proposes a novel method to embed
hardware intrinsics into DNN operators. It is based on a dataflow analysis.
The dataflow embedding method allows the exploration of how intrinsics and
operators match without explicit transformations. From the results it can derive
the data layout and program structure necessary to compute the operator with
the intrinsic. A prototype implementation for a dedicated hardware accelerator
demonstrates state-of-the art performance for a wide range of convolutions, while
being agnostic to the data layout. For some operators in the benchmark, the
presented method can also generate alternative implementation strategies to
improve hardware utilization, resulting in a geo-mean speed-up of ×2.813 while
reducing the memory footprint. Lastly, by curating the initial set of possible
implementations for the hardware-in-the-loop optimization, the median timeto-
solution is reduced by a factor of ×2.40. At the same time, the possibility to
have prolonged searches due a bad initial set of implementations is reduced,
improving the optimization’s robustness by ×2.35
Self-Supervised Learning to Prove Equivalence Between Straight-Line Programs via Rewrite Rules
We target the problem of automatically synthesizing proofs of semantic
equivalence between two programs made of sequences of statements. We represent
programs using abstract syntax trees (AST), where a given set of
semantics-preserving rewrite rules can be applied on a specific AST pattern to
generate a transformed and semantically equivalent program. In our system, two
programs are equivalent if there exists a sequence of application of these
rewrite rules that leads to rewriting one program into the other. We propose a
neural network architecture based on a transformer model to generate proofs of
equivalence between program pairs. The system outputs a sequence of rewrites,
and the validity of the sequence is simply checked by verifying it can be
applied. If no valid sequence is produced by the neural network, the system
reports the programs as non-equivalent, ensuring by design no programs may be
incorrectly reported as equivalent. Our system is fully implemented for a given
grammar which can represent straight-line programs with function calls and
multiple types. To efficiently train the system to generate such sequences, we
develop an original incremental training technique, named self-supervised
sample selection. We extensively study the effectiveness of this novel training
approach on proofs of increasing complexity and length. Our system, S4Eq,
achieves 97% proof success on a curated dataset of 10,000 pairs of equivalent
programsComment: 30 pages including appendi
Understanding Legacy Workflows through Runtime Trace Analysis
abstract: When scientific software is written to specify processes, it takes the form of a workflow, and is often written in an ad-hoc manner in a dynamic programming language. There is a proliferation of legacy workflows implemented by non-expert programmers due to the accessibility of dynamic languages. Unfortunately, ad-hoc workflows lack a structured description as provided by specialized management systems, making ad-hoc workflow maintenance and reuse difficult, and motivating the need for analysis methods. The analysis of ad-hoc workflows using compiler techniques does not address dynamic languages - a program has so few constrains that its behavior cannot be predicted. In contrast, workflow provenance tracking has had success using run-time techniques to record data. The aim of this work is to develop a new analysis method for extracting workflow structure at run-time, thus avoiding issues with dynamics.
The method captures the dataflow of an ad-hoc workflow through its execution and abstracts it with a process for simplifying repetition. An instrumentation system first processes the workflow to produce an instrumented version, capable of logging events, which is then executed on an input to produce a trace. The trace undergoes dataflow construction to produce a provenance graph. The dataflow is examined for equivalent regions, which are collected into a single unit. The workflow is thus characterized in terms of its treatment of an input. Unlike other methods, a run-time approach characterizes the workflow's actual behavior; including elements which static analysis cannot predict (for example, code dynamically evaluated based on input parameters). This also enables the characterization of dataflow through external tools.
The contributions of this work are: a run-time method for recording a provenance graph from an ad-hoc Python workflow, and a method to analyze the structure of a workflow from provenance. Methods are implemented in Python and are demonstrated on real world Python workflows. These contributions enable users to derive graph structure from workflows. Empowered by a graphical view, users can better understand a legacy workflow. This makes the wealth of legacy ad-hoc workflows accessible, enabling workflow reuse instead of investing time and resources into creating a workflow.Dissertation/ThesisMasters Thesis Computer Science 201
Towards Implicit Parallel Programming for Systems
Multi-core processors require a program to be decomposable into independent parts that can execute in parallel in order to scale performance with the number of cores. But parallel programming is hard especially when the program requires state, which many system programs use for optimization, such as for example a cache to reduce disk I/O. Most prevalent parallel programming models do not support a notion of state and require the programmer to synchronize state access manually, i.e., outside the realms of an associated optimizing compiler. This prevents the compiler to introduce parallelism automatically and requires the programmer to optimize the program manually.
In this dissertation, we propose a programming language/compiler co-design to provide a new programming model for implicit parallel programming with state and a compiler that can optimize the program for a parallel execution.
We define the notion of a stateful function along with their composition and control structures. An example implementation of a highly scalable server shows that stateful functions smoothly integrate into existing programming language concepts, such as object-oriented programming and programming with structs. Our programming model is also highly practical and allows to gradually adapt existing code bases. As a case study, we implemented a new data processing core for the Hadoop Map/Reduce system to overcome existing performance bottlenecks. Our lambda-calculus-based compiler automatically extracts parallelism without changing the program's semantics. We added further domain-specific semantic-preserving transformations that reduce I/O calls for microservice programs. The runtime format of a program is a dataflow graph that can be executed in parallel, performs concurrent I/O and allows for non-blocking live updates
Towards Implicit Parallel Programming for Systems
Multi-core processors require a program to be decomposable into independent parts that can execute in parallel in order to scale performance with the number of cores. But parallel programming is hard especially when the program requires state, which many system programs use for optimization, such as for example a cache to reduce disk I/O. Most prevalent parallel programming models do not support a notion of state and require the programmer to synchronize state access manually, i.e., outside the realms of an associated optimizing compiler. This prevents the compiler to introduce parallelism automatically and requires the programmer to optimize the program manually.
In this dissertation, we propose a programming language/compiler co-design to provide a new programming model for implicit parallel programming with state and a compiler that can optimize the program for a parallel execution.
We define the notion of a stateful function along with their composition and control structures. An example implementation of a highly scalable server shows that stateful functions smoothly integrate into existing programming language concepts, such as object-oriented programming and programming with structs. Our programming model is also highly practical and allows to gradually adapt existing code bases. As a case study, we implemented a new data processing core for the Hadoop Map/Reduce system to overcome existing performance bottlenecks. Our lambda-calculus-based compiler automatically extracts parallelism without changing the program's semantics. We added further domain-specific semantic-preserving transformations that reduce I/O calls for microservice programs. The runtime format of a program is a dataflow graph that can be executed in parallel, performs concurrent I/O and allows for non-blocking live updates
Towards Implicit Parallel Programming for Systems
Multi-core processors require a program to be decomposable into independent parts that can execute in parallel in order to scale performance with the number of cores. But parallel programming is hard especially when the program requires state, which many system programs use for optimization, such as for example a cache to reduce disk I/O. Most prevalent parallel programming models do not support a notion of state and require the programmer to synchronize state access manually, i.e., outside the realms of an associated optimizing compiler. This prevents the compiler to introduce parallelism automatically and requires the programmer to optimize the program manually.
In this dissertation, we propose a programming language/compiler co-design to provide a new programming model for implicit parallel programming with state and a compiler that can optimize the program for a parallel execution.
We define the notion of a stateful function along with their composition and control structures. An example implementation of a highly scalable server shows that stateful functions smoothly integrate into existing programming language concepts, such as object-oriented programming and programming with structs. Our programming model is also highly practical and allows to gradually adapt existing code bases. As a case study, we implemented a new data processing core for the Hadoop Map/Reduce system to overcome existing performance bottlenecks. Our lambda-calculus-based compiler automatically extracts parallelism without changing the program's semantics. We added further domain-specific semantic-preserving transformations that reduce I/O calls for microservice programs. The runtime format of a program is a dataflow graph that can be executed in parallel, performs concurrent I/O and allows for non-blocking live updates
- …