LIPIcs, Volume 251, ITCS 2023, Complete Volume
GHOST: A Graph Neural Network Accelerator using Silicon Photonics
Graph neural networks (GNNs) have emerged as a powerful approach for
modelling and learning from graph-structured data. Multiple fields have since
benefitted enormously from the capabilities of GNNs, such as recommendation
systems, social network analysis, drug discovery, and robotics. However,
accelerating and efficiently processing GNNs require a unique approach that
goes beyond conventional artificial neural network accelerators, due to the
substantial computational and memory requirements of GNNs. The slowdown of
scaling in CMOS platforms also motivates a search for alternative
implementation substrates. In this paper, we present GHOST, the first
silicon-photonic hardware accelerator for GNNs. GHOST efficiently alleviates
the costs associated with both vertex-centric and edge-centric operations. It
implements separately the three main stages involved in running GNNs in the
optical domain, allowing it to be used for the inference of various widely used
GNN models and architectures, such as graph convolution networks and graph
attention networks. Our simulation studies indicate that GHOST exhibits at
least 10.2x better throughput and 3.8x better energy efficiency when compared
to GPU, TPU, CPU and multiple state-of-the-art GNN hardware accelerators.
Guided rewriting and constraint satisfaction for parallel GPU code generation
Graphics Processing Units (GPUs) are notoriously hard to optimise for manually due to their scheduling and memory hierarchies. What is needed are good automatic code generators and optimisers for such parallel hardware. Functional approaches such as Accelerate, Futhark and LIFT leverage a high-level algorithmic Intermediate Representation (IR) to expose parallelism and abstract the implementation details away from the user. However, producing efficient code for a given accelerator remains challenging. Existing code generators depend either on user input to choose a subset of hard-coded optimizations or on automated exploration of the implementation search space. The former suffers from a lack of extensibility, while the latter is too costly due to the size of the search space. A hybrid approach is needed, where a space of valid implementations is built automatically and explored with the aid of human expertise.
This thesis presents a solution combining user-guided rewriting and automatically generated constraints to produce high-performance code. The first contribution is an automatic tuning technique to find a balance between performance and memory consumption. Leveraging its functional patterns, the LIFT compiler is empowered to infer tuning constraints and limit the search to valid tuning combinations only.
Next, the thesis reframes parallelisation as a constraint satisfaction problem. Parallelisation constraints are extracted automatically from the input expression, and a solver is used to identify valid rewriting. The constraints truncate the search space to valid parallel mappings only by capturing the scheduling restrictions of the GPU in the context of a given program. A synchronisation barrier insertion technique is proposed to prevent data races and improve the efficiency of the generated parallel mappings.
The final contribution of this thesis is the guided rewriting method, where the user encodes a design space of structural transformations using high-level IR nodes called rewrite points. These strongly typed pragmas express macro rewrites and expose design choices as explorable parameters. The thesis proposes a small set of reusable rewrite points to achieve tiling, cache locality, data reuse and memory optimisation.
A comparison with the vendor-provided handwritten kernel ARM Compute Library and the TVM code generator demonstrates the effectiveness of this thesis' contributions. With convolution as a use case, LIFT-generated direct and GEMM-based convolution implementations are shown to perform on par with the state-of-the-art solutions on a mobile GPU. Overall, this thesis demonstrates that a functional IR yields well to user-guided and automatic rewriting for high-performance code generation.
Chrion: Optimizing Recurrent Neural Network Inference by Collaboratively Utilizing CPUs and GPUs
Deploying deep learning models in cloud clusters provides efficient and
prompt inference services to accommodate the widespread application of deep
learning. These clusters are usually equipped with host CPUs and accelerators
with distinct responsibilities to handle serving requests, i.e. general-purpose
CPUs for input preprocessing and domain-specific GPUs for forward computation.
Recurrent neural networks play an essential role in handling temporal inputs
and display distinctive computation characteristics because of their high
inter-operator parallelism. Hence, we propose Chrion to optimize recurrent
neural network inference by collaboratively utilizing CPUs and GPUs. We
formulate the model deployment in the CPU-GPU cluster as an NP-hard scheduling
problem of directed acyclic graphs on heterogeneous devices. Given an input
model in the ONNX format and a user-defined SLO requirement, Chrion first
preprocesses the model by model parsing and profiling, and then partitions the
graph to select execution devices for each operator. When an online request
arrives, Chrion performs forward computation according to the graph partition
by executing the operators on the CPU and GPU in parallel. Our experimental
results show that the execution time can be reduced by 19.4% at most in the
latency-optimal pattern and GPU memory footprint by 67.5% in the memory-optimal
pattern compared with the execution on the GPU.
Approximate Computing Survey, Part I: Terminology and Software & Hardware Approximation Techniques
The rapid growth of demanding applications in domains applying multimedia
processing and machine learning has marked a new era for edge and cloud
computing. These applications involve massive data and compute-intensive tasks,
and thus, typical computing paradigms in embedded systems and data centers are
stressed to meet the worldwide demand for high performance. Concurrently, the
landscape of the semiconductor field in the last 15 years has constituted power
as a first-class design concern. As a result, the community of computing
systems is forced to find alternative design approaches to facilitate
high-performance and/or power-efficient computing. Among the examined
solutions, Approximate Computing has attracted an ever-increasing interest,
with research works applying approximations across the entire traditional
computing stack, i.e., at software, hardware, and architectural levels. Over
the last decade, there has been a plethora of approximation techniques in software
(programs, frameworks, compilers, runtimes, languages), hardware (circuits,
accelerators), and architectures (processors, memories). The current article is
Part I of our comprehensive survey on Approximate Computing, and it reviews its
motivation, terminology and principles, and it classifies and presents the
technical details of the state-of-the-art software and hardware approximation
techniques.
Comment: Under Review at ACM Computing Surveys
Flashpoint: A Low-latency Serverless Platform for Deep Learning Inference Serving
Recent breakthroughs in Deep Learning (DL) have led to high demand for executing inferences in interactive services such as ChatGPT and GitHub Copilot. However, these interactive services require low-latency inferences, which can only be met with GPUs and result in exorbitant operating costs. For instance, ChatGPT reportedly requires millions of U.S. dollars in cloud GPUs to serve its 1+ million users. A potential solution to meet low-latency requirements with acceptable costs is to use serverless platforms. These platforms automatically scale resources to meet user demands. However, current serverless systems have long cold starts, which worsen with larger DL models and lead to poor performance during bursts of requests. Meanwhile, the demand for larger and larger DL models makes it more challenging to deliver an acceptable user experience cost-effectively. While current systems over-provision GPUs to address this issue, they incur high costs in idle resources, which greatly reduces the benefit of using a serverless platform.
In this thesis, we introduce Flashpoint, a GPU-based serverless platform that serves DL inferences with low latencies. Flashpoint achieves this by reducing cold start durations, especially for large DL models, making serverless computing feasible for latency-sensitive DL workloads. To reduce cold start durations, Flashpoint reduces download times by sourcing the DL model data from within the compute cluster rather than slow cloud storage. Additionally, Flashpoint minimizes in-cluster network congestion from redundant packet transfers of the same DL model to multiple machines with multicasting. Finally, Flashpoint also reduces cold start durations by automatically partitioning models and deploying them in parallel on multiple machines. The reduced cold start durations achieved by Flashpoint enable the platform to scale resource allocations elastically and complete requests with low latencies without over-provisioning expensive GPU resources.
We perform large-scale data center simulations that were parameterized with measurements from our prototype implementations. We evaluate the system using six state-of-the-art DL models ranging from 499 MB to 11 GB in size. We also measure the performance of the system in representative real-world traces from Twitter and Microsoft Azure. Our results in the full-scale simulations show that Flashpoint achieves an arithmetic mean of 93.51% shorter average cold start durations, leading to 75.42% and 66.90% respective reductions in average and 99th percentile end-to-end request latencies across the DL models with the same amount of resources. These results show that Flashpoint boosts the performance of serving DL inferences on a serverless platform without increasing costs.
Adaptive Message Quantization and Parallelization for Distributed Full-graph GNN Training
Distributed full-graph training of Graph Neural Networks (GNNs) over large
graphs is bandwidth-demanding and time-consuming. Frequent exchanges of node
features, embeddings and embedding gradients (all referred to as messages)
across devices bring significant communication overhead for nodes with remote
neighbors on other devices (marginal nodes) and unnecessary waiting time for
nodes without remote neighbors (central nodes) in the training graph. This
paper proposes an efficient GNN training system, AdaQP, to expedite distributed
full-graph GNN training. We stochastically quantize messages transferred across
devices to lower-precision integers for communication traffic reduction and
advocate communication-computation parallelization between marginal nodes and
central nodes. We provide theoretical analysis to prove fast training
convergence (at the rate of O(T^{-1}) with T being the total number of training
epochs) and design an adaptive quantization bit-width assignment scheme for
each message based on the analysis, targeting a good trade-off between training
convergence and efficiency. Extensive experiments on mainstream graph datasets
show that AdaQP substantially improves distributed full-graph training's
throughput (up to 3.01x) with negligible accuracy drop (at most 0.30%) or even
accuracy improvement (up to 0.19%) in most cases, showing significant
advantages over the state-of-the-art works.
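The stochastic quantization step mentioned in this abstract can be illustrated with a minimal sketch. This is not AdaQP's implementation: the function names, the min/max scaling, and the single fixed bit-width are assumptions made for illustration, and the adaptive per-message bit-width assignment described in the abstract is not modelled. The sketch only shows the general idea of rounding each value up or down at random so that the result is unbiased in expectation.

```python
import random


def stochastic_quantize(values, bits, rng=random):
    """Map floats to unsigned `bits`-bit integer codes with stochastic rounding.

    Hypothetical helper: scales values into [0, 2**bits - 1] using the
    minimum and maximum of the input, then rounds up with probability
    equal to the fractional part.
    """
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = []
    for v in values:
        x = (v - lo) / scale          # position on the integer grid, x >= 0
        base = int(x)                 # floor, since x is non-negative
        frac = x - base
        code = base + (1 if rng.random() < frac else 0)
        codes.append(min(code, levels))  # guard against float overshoot
    return codes, lo, scale


def dequantize(codes, lo, scale):
    """Recover approximate floats from integer codes."""
    return [lo + c * scale for c in codes]


if __name__ == "__main__":
    rng = random.Random(0)
    message = [0.0, 3.5, 7.25, 15.0]
    codes, lo, scale = stochastic_quantize(message, bits=4, rng=rng)
    print(codes, dequantize(codes, lo, scale))
```

Because each value is rounded up with probability equal to its fractional part, the expected dequantized value equals the original, which is the property that typically makes this kind of quantization amenable to convergence analysis.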
Tools for efficient Deep Learning
In the era of Deep Learning (DL), there is a fast-growing demand for building and deploying Deep Neural Networks (DNNs) on various platforms. This thesis proposes five tools to address the challenges for designing DNNs that are efficient in time, in resources and in power consumption.
We first present Aegis and SPGC to address the challenges in improving the memory efficiency of DL training and inference. Aegis makes mixed precision training (MPT) more stable by layer-wise gradient scaling. Empirical experiments show that Aegis can improve MPT accuracy by up to 4%. SPGC focuses on structured pruning: replacing standard convolution with group convolution (GConv) to avoid irregular sparsity. SPGC formulates GConv pruning as a channel permutation problem and proposes a novel heuristic polynomial-time algorithm. Common DNNs pruned by SPGC have up to 1% higher accuracy than prior work.
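As a rough illustration of why replacing standard convolution with group convolution reduces model size, the weight count of a convolution layer shrinks by the number of groups. The helper below is a hypothetical sketch (its name and signature are not from the thesis) and it ignores bias terms; it does not reproduce SPGC's channel permutation algorithm.

```python
def conv_weight_count(c_in, c_out, kernel, groups=1):
    """Number of weights in a 2D convolution with square kernels, without bias.

    With `groups` groups, each output channel only sees c_in / groups
    input channels, so the weight count is divided by `groups`.
    """
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by groups")
    return (c_in // groups) * c_out * kernel * kernel


if __name__ == "__main__":
    dense = conv_weight_count(64, 128, 3)
    grouped = conv_weight_count(64, 128, 3, groups=4)
    print(dense, grouped, dense // grouped)
```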
This thesis also addresses the challenges lying in the gap between DNN descriptions and executables with Polygeist for software and POLSCA for hardware. Many novel techniques, e.g. statement splitting and memory partitioning, are explored and used to expand polyhedral optimisation. Polygeist can speed up sequential and parallel software execution by 2.53 and 9.47 times, respectively, on Polybench/C. POLSCA achieves a 1.5 times speedup over hardware designs directly generated from high-level synthesis on Polybench/C.
Moreover, this thesis presents Deacon, a framework that generates FPGA-based DNN accelerators of streaming architectures with advanced pipelining techniques to address the challenges from heterogeneous convolution and residual connections. Deacon provides fine-grained pipelining, graph-level optimisation, and heuristic exploration by graph colouring. Compared with prior designs, Deacon shows resource/power consumption efficiency improvement of 1.2x/3.5x for MobileNets and 1.0x/2.8x for SqueezeNets.
All these tools are open source, some of which have already gained public engagement. We believe they can make efficient deep learning applications easier to build and deploy.