We, the City
Given their unchecked neoliberal restructuring, Berlin and Istanbul have been exposed to various forms of political polarisation and social injustice over the last decade. As a result, the struggle for affordable housing, access to public space, fair working conditions, ecological justice and the right to different ways of life has intensified. Various forms of resistance "from below" have challenged the relationship between local governments and social movements, questioning where and how the city's political problems arise. In a mixture of dialogues, essays and critical reflections, this book explores the ways in which residents of Berlin and Istanbul experience, express and resist the physical, political and normative reorganisation of their cities. It poses the question: Who is the We in We, the City?
Guided rewriting and constraint satisfaction for parallel GPU code generation
Graphics Processing Units (GPUs) are notoriously hard to optimise for manually due to their scheduling and memory hierarchies. What is needed are good automatic code generators and optimisers for such parallel hardware. Functional approaches such as Accelerate, Futhark and LIFT leverage a high-level algorithmic Intermediate Representation (IR) to expose parallelism and abstract the implementation details away from the user. However, producing efficient code for a given accelerator remains challenging. Existing code generators depend either on user input to choose a subset of hard-coded optimisations or on automated exploration of the implementation search space. The former suffers from a lack of extensibility, while the latter is too costly due to the size of the search space. A hybrid approach is needed, where a space of valid implementations is built automatically and explored with the aid of human expertise.
This thesis presents a solution combining user-guided rewriting and automatically generated constraints to produce high-performance code. The first contribution is an automatic tuning technique to find a balance between performance and memory consumption. Leveraging its functional patterns, the LIFT compiler is empowered to infer tuning constraints and limit the search to valid tuning combinations only.
Next, the thesis reframes parallelisation as a constraint satisfaction problem. Parallelisation constraints are extracted automatically from the input expression, and a solver is used to identify valid rewritings. By capturing the scheduling restrictions of the GPU in the context of a given program, the constraints truncate the search space to valid parallel mappings only. A synchronisation barrier insertion technique is proposed to prevent data races and improve the efficiency of the generated parallel mappings.
The final contribution of this thesis is the guided rewriting method, where the user encodes a design space of structural transformations using high-level IR nodes called rewrite points. These strongly typed pragmas express macro rewrites and expose design choices as explorable parameters. The thesis proposes a small set of reusable rewrite points to achieve tiling, cache locality, data reuse and memory optimisation.
A comparison with the handwritten kernels of the vendor-provided ARM Compute Library and with the TVM code generator demonstrates the effectiveness of this thesis' contributions. With convolution as a use case, LIFT-generated direct and GEMM-based convolution implementations are shown to perform on par with the state-of-the-art solutions on a mobile GPU. Overall, this thesis demonstrates that a functional IR lends itself well to user-guided and automatic rewriting for high-performance code generation.
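The idea of treating parallelisation as constraint satisfaction can be illustrated with a toy sketch. This is not LIFT's solver or API: the level names, the three constraints and the function names below are assumptions chosen for illustration. Each loop in a nest is assigned a level, and only assignments that respect the toy scheduling rules are kept.

```python
from itertools import product

# Hypothetical mapping levels for a toy GPU model (an assumption, not LIFT's API).
LEVELS = ("workgroup", "local", "sequential")


def valid(mapping):
    """Check toy scheduling constraints on a loop nest, listed outermost first."""
    # Constraint 1: each parallel level may be used at most once.
    for level in ("workgroup", "local"):
        if mapping.count(level) > 1:
            return False
    # Constraint 2: a "local" loop must be nested inside a "workgroup" loop.
    if "local" in mapping:
        outer_loops = mapping[: mapping.index("local")]
        if "workgroup" not in outer_loops:
            return False
    return True


def valid_mappings(depth):
    """Enumerate every level assignment for `depth` nested loops and keep the valid ones."""
    return [m for m in product(LEVELS, repeat=depth) if valid(m)]


if __name__ == "__main__":
    for mapping in valid_mappings(2):
        print(mapping)
```

Brute-force enumeration is used here only to keep the sketch short; a real solver would prune the space rather than filter it after the fact.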
Design and Implementation of a Portable Framework for Application Decomposition and Deployment in Edge-Cloud Systems
The emergence of cyber-physical systems has brought about a significant increase in complexity and heterogeneity in the infrastructure on which these systems are deployed. One particular example of this complexity is the interplay between cloud, fog, and edge computing. However, the complexity of these systems can pose challenges when it comes to implementing self-organizing mechanisms, which are often designed to work on flat networks. Therefore, it is essential to separate the application logic from the specific deployment aspects to promote reusability and flexibility in infrastructure exploitation.
To address this issue, a novel approach called "pulverization" has been proposed. This approach involves breaking down the system into smaller computational units, which can then be deployed on the available infrastructure.
In this thesis, the design and implementation of a portable framework that enables the "pulverization" of cyber-physical systems are presented.
The main objective of the framework is to pave the way for the deployment of cyber-physical systems in the edge-cloud continuum by reducing the complexity of the infrastructure and opportunistically exploiting the heterogeneous resources available on it. Different scenarios are presented to highlight the effectiveness of the framework across heterogeneous infrastructures and devices.
Current limitations and future work are examined to identify improvement areas for the framework.
Erasure in dependently typed programming
It is important to reduce the cost of correctness in programming. Dependent types
and related techniques, such as type-driven programming, offer ways to do so.
Some parts of dependently typed programs constitute evidence of their type-correctness
and, once checked, are unnecessary for execution. These parts can easily
become asymptotically larger than the remaining runtime-useful computation, which
can cause linear-time algorithms to run in exponential time, or worse. It would be
unacceptable, and contradict our goal of reducing the cost of correctness, to make
programs run slower by only describing them more precisely.
Current systems cannot erase such computation satisfactorily. By modelling
erasure indirectly through type universes or irrelevance, they impose the limitations
of these means on erasure. Some useless computation then cannot be erased and
idiomatic programs remain asymptotically sub-optimal.
This dissertation explains why we need erasure, that it is different from other
concepts like irrelevance, and proposes two ways of erasing non-computational data.
One is an untyped flow-based useless variable elimination, adapted for dependently
typed languages, currently implemented in the Idris 1 compiler.
The other is the main contribution of the dissertation: a dependently typed core
calculus with erasure annotations, full dependent pattern matching, and an algorithm
that infers erasure annotations from unannotated (or partially annotated) programs.
I show that erasure in well-typed programs is sound in that it commutes with
single-step reduction. Assuming the Church-Rosser property of reduction, I show
that properties such as Subject Reduction hold, which extends the soundness result
to multi-step reduction. I also show that the presented erasure inference is sound
and complete with respect to the typing rules; that this approach can be extended
with various forms of erasure polymorphism; that it works well with monadic I/O
and foreign functions; and that it is effective in that it not only removes the runtime
overhead caused by dependent typing in the presented examples, but can also shorten
compilation times.
"This work was supported by the University of St Andrews (School of Computer
Science)." -- Acknowledgement
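The flow-based useless variable elimination mentioned in this abstract can be illustrated with a small fixed-point sketch. This is not the Idris 1 implementation: the program encoding, the example function `index` and the helper names are assumptions made for illustration. A parameter counts as used at runtime if the body inspects it directly, or if it is passed to a callee parameter that is itself used; everything else is a candidate for erasure.

```python
def used_params(functions):
    """Compute, per function, the parameters that are needed at runtime.

    `functions` maps a name to (params, direct_uses, calls), where `calls`
    is a list of (callee, argument_variables) pairs.
    """
    used = {name: set() for name in functions}
    changed = True
    while changed:  # iterate until no function gains a new used parameter
        changed = False
        for name, (params, direct_uses, calls) in functions.items():
            new = {v for v in direct_uses if v in params}
            for callee, args in calls:
                callee_params = functions[callee][0]
                for param, arg in zip(callee_params, args):
                    # An argument is used only if the callee uses that position.
                    if param in used[callee] and arg in params:
                        new.add(arg)
            if not new <= used[name]:
                used[name] |= new
                changed = True
    return used


def erasable(functions):
    """Parameters that are never needed at runtime."""
    used = used_params(functions)
    return {name: set(functions[name][0]) - used[name] for name in functions}


# Hypothetical example: vector indexing where `n` is only a length index.
# It is passed along in the recursive call but never inspected.
PROGRAM = {
    "index": (["n", "i", "xs"], ["i", "xs"], [("index", ["n", "i", "xs"])]),
}

if __name__ == "__main__":
    print(erasable(PROGRAM))
```

The sketch is untyped, in the spirit of the first approach described above; it says nothing about the annotated core calculus that forms the dissertation's main contribution.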
Analysing and Reducing Costs of Deep Learning Compiler Auto-tuning
Deep Learning (DL) is significantly impacting many industries, including automotive, retail and medicine, enabling autonomous driving, recommender systems and genomics modelling, amongst other applications. At the same time, demand for complex and fast DL models is continually growing. The most capable models tend to exhibit the highest operational costs, primarily due to their large computational resource footprint and the inefficient utilisation of computational resources by DL systems. In an attempt to tackle these problems, DL compilers and auto-tuners emerged, automating the traditionally manual task of DL model performance optimisation. While auto-tuning improves model inference speed, it is a costly process, which limits its wider adoption within DL deployment pipelines. The high operational costs associated with DL auto-tuning have multiple causes. During operation, DL auto-tuners explore large search spaces consisting of billions of tensor programs to propose potential candidates that improve DL model inference latency. Subsequently, DL auto-tuners measure candidate performance in isolation on the target device, which constitutes the majority of auto-tuning compute time. Suboptimal candidate proposals, combined with their serial measurement on an isolated target device, lead to prolonged optimisation time and reduced resource availability, ultimately reducing the cost-efficiency of the process. In this thesis, we investigate the reasons behind prolonged DL auto-tuning and quantify their impact on the optimisation costs, revealing directions for improved DL auto-tuner design. Based on these insights, we propose two complementary systems: Trimmer and DOPpler. Trimmer improves tensor program search efficacy by filtering out poorly performing candidates, and controls end-to-end auto-tuning using cost objectives while monitoring optimisation cost. Simultaneously, DOPpler breaks long-held assumptions about serial candidate measurements by successfully parallelising them intra-device, with minimal penalty to optimisation quality. Through extensive experimental evaluation of both systems, we demonstrate that they significantly improve the cost-efficiency of auto-tuning (up to 50.5%) across a plethora of tensor operators, DL models, auto-tuners and target devices.
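The two ideas attributed to Trimmer above, filtering weak candidates before measurement and stopping on a cost objective, can be sketched as a generic tuning loop. This is not Trimmer's actual design or interface: the function `tune`, its parameters and the fixed per-measurement cost are all assumptions made for illustration.

```python
def tune(candidates, predict, measure, keep_ratio=0.5,
         cost_per_measurement=1.0, budget=10.0):
    """Measure only the most promising candidates until the budget runs out.

    predict(c) returns a cheap latency estimate (a hypothetical cost model);
    measure(c) returns the latency observed on the target device.
    """
    # Filter: rank by predicted latency and keep the best fraction.
    ranked = sorted(candidates, key=predict)
    kept = ranked[: max(1, int(len(ranked) * keep_ratio))]

    best, spent = None, 0.0
    for candidate in kept:
        # Cost objective: stop before exceeding the measurement budget.
        if spent + cost_per_measurement > budget:
            break
        latency = measure(candidate)
        spent += cost_per_measurement
        if best is None or latency < best[1]:
            best = (candidate, latency)
    return best, spent


if __name__ == "__main__":
    # Toy stand-ins: the candidate's value doubles as its latency.
    result = tune([8, 1, 4, 2], predict=lambda c: c,
                  measure=lambda c: float(c), budget=1.5)
    print(result)
```

Measurements in this sketch run one after another; parallelising them on a single device, as DOPpler is described as doing, is outside its scope.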
Supporting Custom Instructions with the LLVM Compiler for RISC-V Processor
The rise of hardware accelerators with custom instructions necessitates
custom compiler backends supporting these accelerators. This study provides
detailed analyses of LLVM and its RISC-V backend, supplemented with case
studies providing an end-to-end overview of the mentioned transformations.
We argue that instruction design should consider both the hardware and software
design spaces. The necessary compiler modifications may indicate that the
instruction is not well designed and needs to be reconsidered. We also note that
RISC-V standard extensions provide exemplary instructions that can guide
instruction designers.
In this study, the process of adding a custom instruction to the compiler is
split into two parts: assembler support and pattern matching support. Without
pattern matching support, conventional software requires manually written
inline assembly for the accelerator, which is not scalable. While it is trivial
to add assembler support regardless of the instruction semantics, pattern
matching support is not. Adding pattern matching support, and choosing the
right stage for the modification, requires knowledge of the internal
transformations in the compiler. This study delves deep into pattern matching
and presents multiple ways to approach the problem of pattern matching support.
The study also argues that, depending on the pattern's complexity, higher-level
transformations, e.g. at the IR level, can be more maintainable than changes in
the Instruction Selection phase.
Comment: Electronics and Communication Engineering B.Sc. Graduation Project.
Source can be found in https://github.com/eymay/Senior-Design-Projec
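What pattern matching support means can be illustrated with a toy rewrite over expression trees. This is not LLVM code and does not use TableGen or any LLVM API: the tuple encoding, the `mac` (multiply-accumulate) instruction and the function `select` are hypothetical. The sketch replaces an addition whose operand is a multiplication with a single custom instruction, so that no inline assembly is needed in the source program.

```python
def select(expr):
    """Rewrite add(mul(a, b), c) into a hypothetical custom mac(a, b, c).

    Expressions are nested tuples such as ("add", ("mul", "a", "b"), "c");
    leaves are plain strings.
    """
    if isinstance(expr, str):
        return expr
    op, *args = expr
    # Rewrite operands first so that nested patterns are matched bottom-up.
    args = [select(arg) for arg in args]
    if op == "add" and len(args) == 2:
        lhs, rhs = args
        if isinstance(lhs, tuple) and lhs[0] == "mul":
            return ("mac", lhs[1], lhs[2], rhs)
        if isinstance(rhs, tuple) and rhs[0] == "mul":
            return ("mac", rhs[1], rhs[2], lhs)
    return (op, *args)


if __name__ == "__main__":
    print(select(("add", ("mul", "a", "b"), "c")))
```

A pattern this simple fits a tree matcher; more complex patterns are where the abstract suggests that a higher-level transformation may be easier to maintain.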