178 research outputs found
LIPIcs, Volume 261, ICALP 2023, Complete Volume
Analysing and Reducing Costs of Deep Learning Compiler Auto-tuning
Deep Learning (DL) is significantly impacting many industries, including automotive, retail and medicine, enabling autonomous driving, recommender systems and genomics modelling, amongst other applications. At the same time, demand for complex and fast DL models is continually growing. The most capable models tend to exhibit the highest operational costs, primarily due to their large computational resource footprint and inefficient utilisation of computational resources employed by DL systems. In an attempt to tackle these problems, DL compilers and auto-tuners emerged, automating the traditionally manual task of DL model performance optimisation. While auto-tuning improves model inference speed, it is a costly process, which limits its wider adoption within DL deployment pipelines. The high operational costs associated with DL auto-tuning have multiple causes. During operation, DL auto-tuners explore large search spaces consisting of billions of tensor programs to propose potential candidates that improve DL model inference latency. Subsequently, DL auto-tuners measure candidate performance in isolation on the target device, which constitutes the majority of auto-tuning compute time. Suboptimal candidate proposals, combined with their serial measurement on an isolated target device, lead to prolonged optimisation time and reduced resource availability, ultimately reducing the cost-efficiency of the process. In this thesis, we investigate the reasons behind prolonged DL auto-tuning and quantify their impact on the optimisation costs, revealing directions for improved DL auto-tuner design. Based on these insights, we propose two complementary systems: Trimmer and DOPpler. Trimmer improves tensor program search efficacy by filtering out poorly performing candidates, and controls end-to-end auto-tuning using cost objectives, monitoring optimisation cost.
Simultaneously, DOPpler breaks long-held assumptions about serial candidate measurements by successfully parallelising them intra-device, with minimal penalty to optimisation quality. Through extensive experimental evaluation of both systems, we demonstrate that they significantly improve the cost-efficiency of auto-tuning (up to 50.5%) across a plethora of tensor operators, DL models, auto-tuners and target devices.
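The idea of filtering out poorly performing candidates before they are measured can be illustrated with a minimal sketch. This is not Trimmer's actual mechanism, which the abstract does not describe; the function name `filter_candidates`, the `predict_latency` callback and the `keep_fraction` parameter are hypothetical and stand in for whatever cost estimate an auto-tuner has available.

```python
from typing import Callable, List, Sequence, TypeVar

C = TypeVar("C")  # a candidate tensor program, in any representation


def filter_candidates(
    candidates: Sequence[C],
    predict_latency: Callable[[C], float],  # hypothetical cost estimate
    keep_fraction: float = 0.25,            # assumed cut-off, not from the thesis
) -> List[C]:
    """Keep only the candidates with the lowest predicted latency.

    Candidates that are dropped here never reach the target device,
    which is where the abstract says most auto-tuning time is spent.
    """
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    if not candidates:
        return []
    ranked = sorted(candidates, key=predict_latency)
    keep = max(1, int(len(ranked) * keep_fraction))  # always keep at least one
    return ranked[:keep]
```

With four candidates and `keep_fraction=0.5`, only the two with the lowest predicted latency would go on to be measured on the device.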
LIPIcs, Volume 274, ESA 2023, Complete Volume
Fundamentals
Volume 1 establishes the foundations of this new field. It goes through all the steps from data collection, their summary and clustering, to different aspects of resource-aware learning, i.e., hardware, memory, energy, and communication awareness. Machine learning methods are inspected with respect to resource requirements and how to enhance scalability on diverse computing architectures ranging from embedded systems to large computing clusters.
Automatic generation of highly concurrent, hierarchical and heterogeneous cache coherence protocols from atomic specifications
Cache coherence protocols are often specified using only stable states and atomic transactions
for a single cache hierarchy level. Designing highly-concurrent, hierarchical and heterogeneous directory cache coherence protocols from these atomic specifications for modern
multicore architectures is a complicated task. To overcome these design challenges we have
developed the novel *Gen algorithms (ProtoGen, HieraGen and HeteroGen).
Using the *Gen
algorithms, highly-concurrent, hierarchical and heterogeneous cache coherence protocols can
be automatically generated for a wide range of atomic input stable state protocol (SSP) specifications, including the MOESI variants, as well as for protocols that are targeted towards
Total Store Order and Release Consistency. In addition, for each *Gen algorithm we have
developed and published an eponymous tool.
The ProtoGen tool takes as input a single SSP (i.e., no concurrency) generating the corresponding protocol for a multicore architecture with non-atomic transactions. The ProtoGen
algorithm automatically enforces the correct interleaving of conflicting coherence transactions
for a given atomic coherence protocol specification.
HieraGen is a tool for automatically generating hierarchical cache coherence protocols.
Its inputs are SSPs for each level of the hierarchy and its output is a highly concurrent
hierarchical protocol. HieraGen thus reduces the complexity that architects face by offloading
the challenging task of composing protocols and managing concurrency.
HeteroGen is a tool for automatically generating heterogeneous protocols that adhere to
precise consistency models. As input, HeteroGen takes SSPs of the per-cluster coherence
protocols, each of which satisfies its own per-cluster consistency model. The output is a
concurrent (i.e., with transient states) heterogeneous protocol that satisfies a precisely defined
consistency model that we refer to as a compound consistency model.
To validate the correctness of the *Gen algorithms, the generated output protocols were
verified for safety and deadlock freedom using a model checker. To verify the correctness
of protocols that need to adhere to a specific compound consistency model generated by
HeteroGen, novel litmus tests for multiple compound consistency models were developed.
The protocols automatically generated using the *Gen tools have a comparable or better
performance than manually generated cache coherence protocols, often discovering opportunities to reduce stalls. Thus, the *Gen tools reduce the complexity that architects face by
offloading the challenging tasks of composing protocols and managing concurrency.
Computer Aided Verification
This open access two-volume set LNCS 13371 and 13372 constitutes the refereed proceedings of the 34th International Conference on Computer Aided Verification, CAV 2022, which was held in Haifa, Israel, in August 2022. The 40 full papers presented together with 9 tool papers and 2 case studies were carefully reviewed and selected from 209 submissions. The papers were organized in the following topical sections: Part I: Invited papers; formal methods for probabilistic programs; formal methods for neural networks; software verification and model checking; hyperproperties and security; formal methods for hardware, cyber-physical, and hybrid systems. Part II: Probabilistic techniques; automata and logic; deductive verification and decision procedures; machine learning; synthesis and concurrency. This is an open access book.
Online learning on the programmable dataplane
This thesis makes the case for managing computer networks with data-driven methods (automated statistical inference and control based on measurement data and runtime observations) and argues for their tight integration with programmable dataplane hardware to make management decisions faster and from more precise data. Optimisation, defence, and measurement of networked infrastructure are each challenging tasks in their own right, which are currently dominated by the use of hand-crafted heuristic methods. These become harder to reason about and deploy as networks scale in rates and number of forwarding elements, and their design requires expert knowledge and care around unexpected protocol interactions. This makes tailored, per-deployment or per-workload solutions infeasible to develop. Recent advances in machine learning offer capable function approximation and closed-loop control which suit many of these tasks. New, programmable dataplane hardware enables more agility in the network: runtime reprogrammability, precise traffic measurement, and low-latency on-path processing. The synthesis of these two developments allows complex decisions to be made on previously unusable state, and made quicker by offloading inference to the network.
To justify this argument, I advance the state of the art in data-driven defence of networks, novel dataplane-friendly online reinforcement learning algorithms, and in-network data reduction to allow classification of switch-scale data. Each requires co-design aware of the network, and of the failure modes of systems and carried traffic. To make online learning possible in the dataplane, I use fixed-point arithmetic and modify classical (non-neural) approaches to take advantage of the SmartNIC compute model and make use of rich device-local state. I show that data-driven solutions still require great care to correctly design, but with the right domain expertise they can improve on pathological cases in DDoS defence, such as protecting legitimate UDP traffic. In-network aggregation to histograms is shown to enable accurate classification from fine temporal effects, and allows hosts to scale such classification to far larger flow counts and traffic volume. Moving reinforcement learning to the dataplane is shown to offer substantial benefits to state-action latency and online learning throughput versus host machines, allowing policies to react faster to fine-grained network events. The dataplane environment is key in making reactive online learning feasible. To port further algorithms and learnt functions, I collate and analyse the strengths of current and future hardware designs, as well as individual algorithms.
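As an illustration of what learning with fixed-point arithmetic involves, the sketch below performs one temporal-difference style value update using only integer operations. It is a generic example, not the algorithm from the thesis: the Q16.16 number format, the function names and the update rule are all assumptions chosen for illustration.

```python
FRAC_BITS = 16          # assumed Q16.16 format; the abstract does not name one
ONE = 1 << FRAC_BITS    # fixed-point representation of 1.0


def to_fixed(x: float) -> int:
    """Convert a float to fixed point, rounding to the nearest step."""
    return int(round(x * ONE))


def to_float(q: int) -> float:
    """Convert a fixed-point value back to a float (for inspection only)."""
    return q / ONE


def fx_mul(a: int, b: int) -> int:
    """Multiply two fixed-point values.

    The raw product carries 2 * FRAC_BITS fractional bits, so it is
    shifted right to return to FRAC_BITS.
    """
    return (a * b) >> FRAC_BITS


def td_update(value: int, reward: int, next_value: int,
              alpha: int, gamma: int) -> int:
    """One integer-only step of v <- v + alpha * (r + gamma * v' - v)."""
    target = reward + fx_mul(gamma, next_value)
    return value + fx_mul(alpha, target - value)
```

All arguments to `td_update` are already in fixed point, so the update itself needs only integer addition, multiplication and shifts, the kind of operations typically available on devices without floating-point units.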
Proceedings of the 22nd Conference on Formal Methods in Computer-Aided Design – FMCAD 2022
The Conference on Formal Methods in Computer-Aided Design (FMCAD) is an annual conference on the theory and applications of formal methods in hardware and system verification. FMCAD provides a leading forum to researchers in academia and industry for presenting and discussing groundbreaking methods, technologies, theoretical results, and tools for reasoning formally about computing systems. FMCAD covers formal aspects of computer-aided system design including verification, specification, synthesis, and testing.
Shader optimization and specialization
In the field of real-time graphics for computer games, performance has a significant effect on the player’s enjoyment and immersion. Graphics processing units (GPUs) are
hardware accelerators that run small parallelized shader programs to speed up computationally expensive rendering calculations. This thesis examines optimizing shader
programs and explores ways in which data patterns on both the CPU and GPU can be
analyzed to automatically speed up rendering in games.
Initially, the effect of traditional compiler optimizations on shader source-code
was explored. Techniques such as loop unrolling or arithmetic reassociation provided
speed-ups on several devices, but different GPU hardware responded differently to
each set of optimizations. Analyzing execution traces from numerous popular PC
games revealed that much of the data passed from CPU-based API calls to GPU-based
shaders is either unused, or remains constant. A system was developed to capture this
constant data and fold it into the shaders’ source-code. Re-running the game’s rendering code using these specialized shader variants resulted in performance improvements
in several commercial games without impacting their visual quality.
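Folding constant data into shader source can be pictured with a small sketch. It is not the system built in the thesis: it assumes GLSL-style `uniform float NAME;` declarations, handles only scalar floats, and the function name `specialize_shader` is hypothetical.

```python
import re
from typing import Dict


def specialize_shader(source: str, constants: Dict[str, float]) -> str:
    """Rewrite `uniform float NAME;` as `const float NAME = VALUE;`.

    Only uniforms listed in `constants` (values observed to stay fixed)
    are folded; every other declaration is left untouched.
    """
    def fold(match: "re.Match[str]") -> str:
        name = match.group(1)
        if name in constants:
            # float(...) ensures a literal such as 1.0 rather than 1
            return f"const float {name} = {float(constants[name])!r};"
        return match.group(0)

    return re.sub(r"uniform\s+float\s+(\w+)\s*;", fold, source)
```

Once a value is a compile-time constant, the shader compiler is free to propagate it and remove computation that depends on it, which is the kind of saving the abstract attributes to specialized shader variants.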