A Comprehensive Survey on Distributed Training of Graph Neural Networks
Graph neural networks (GNNs) have proven to be a powerful algorithmic model
across broad application fields owing to their effectiveness in learning over
graphs. To scale GNN training to large-scale and ever-growing
graphs, the most promising solution is distributed training, which distributes
the workload of training across multiple computing nodes. At present, research
on distributed GNN training is vast and fast-moving, and the approaches
proposed in these studies diverge significantly. This makes it difficult for
newcomers to build a comprehensive understanding of the workflows,
computational patterns, communication strategies, and optimization techniques
employed in distributed GNN training, creating a pressing need for a survey
that organizes, analyzes, and compares the work in this field. In this
paper, we provide a comprehensive survey of distributed GNN training by
investigating various optimization techniques used in distributed GNN training.
First, approaches to distributed GNN training are classified into several
categories according to their workflows. In addition, their computational
patterns and communication
patterns, as well as the optimization techniques proposed by recent work are
introduced. Second, the software frameworks and hardware platforms of
distributed GNN training are also introduced for a deeper understanding. Third,
distributed GNN training is compared with distributed training of deep neural
networks, emphasizing the uniqueness of distributed GNN training. Finally,
interesting issues and opportunities in this field are discussed. Comment: To appear in Proceedings of the IEEE.
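The workflow such a survey taxonomizes (graph partitioning, neighbor aggregation with remote-feature fetches, and gradient synchronization) can be roughly illustrated with the following minimal, hypothetical sketch; real systems such as DGL or PyG expose very different APIs, and the graph, features, and gradients here are made-up toy values:

```python
# Hypothetical, framework-agnostic sketch of partition-based distributed
# GNN training; real systems (e.g. DGL, PyG) work very differently.

# A small toy graph: node -> list of neighbors.
graph = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
features = {0: 1.0, 1: 2.0, 2: 3.0, 3: 4.0}

# Step 1: partition the nodes across two workers (edge-cut partitioning).
partitions = [[0, 1], [2, 3]]

def aggregate(node):
    """One toy GNN layer: average the node's own feature with the mean
    of its neighbors' features."""
    neigh = [features[n] for n in graph[node]]
    return (features[node] + sum(neigh) / len(neigh)) / 2

# Step 2: each worker computes embeddings for its own nodes; features of
# neighbors owned by other workers must be fetched over the network,
# which is the communication pattern distributed GNN systems optimize.
embeddings = {}
for worker_nodes in partitions:
    for node in worker_nodes:
        embeddings[node] = aggregate(node)

# Step 3: per-worker gradients are synchronized each step, e.g. by an
# all-reduce average (simulated here with a plain mean).
local_grads = [0.4, 0.6]  # one scalar gradient per worker (assumed values)
global_grad = sum(local_grads) / len(local_grads)
print(embeddings[0], global_grad)  # -> 1.75 0.5
```

The sketch makes the survey's two cost centers visible: step 2 is communication-bound when partitions cut many edges, while step 3 is the same gradient all-reduce used in distributed DNN training.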
Proteus: Simulating the Performance of Distributed DNN Training
DNN models are becoming increasingly larger to achieve unprecedented
accuracy, and the accompanying increased computation and memory requirements
necessitate the employment of massive clusters and elaborate parallelization
strategies to accelerate DNN training. In order to better optimize the
performance and analyze the cost, it is indispensable to model the training
throughput of distributed DNN training. However, complex parallelization
strategies and the resulting complex runtime behaviors make it challenging to
construct an accurate performance model. In this paper, we present Proteus, the
first standalone simulator to model the performance of complex parallelization
strategies through simulated execution. Proteus first models complex
parallelization strategies with a unified representation named Strategy Tree.
Then, it compiles the strategy tree into a distributed execution graph and
simulates complex runtime behaviors, such as computation-communication
(comp-comm) overlap and bandwidth
sharing, with a Hierarchical Topo-Aware Executor (HTAE). We finally evaluate
Proteus across a wide variety of DNNs on three hardware configurations.
Experimental results show that Proteus achieves low average prediction error
and preserves the relative order of training throughput across various
parallelization strategies. Compared to state-of-the-art approaches, Proteus
further reduces prediction error.
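To see why comp-comm overlap makes performance modeling hard, the toy cost model below (a hypothetical sketch; this is not Proteus's Strategy Tree or HTAE, and the per-layer times are assumed values) contrasts a naive additive estimate with one that lets each layer's gradient communication overlap with subsequent layers' compute:

```python
# Toy illustration of comp-comm overlap in distributed training cost
# models. Not Proteus's actual simulator; timings are assumed values.

comp = [4.0, 3.0, 5.0]  # per-layer backward compute times (ms), assumed
comm = [2.0, 6.0, 1.0]  # per-layer gradient sync times (ms), assumed

def naive_estimate(comp, comm):
    """Ignore overlap: total time is just the sum of everything."""
    return sum(comp) + sum(comm)

def overlapped_estimate(comp, comm):
    """Pipeline model: compute is serialized on one stream; each layer's
    communication starts once its compute is done and the comm channel
    is free, overlapping with later layers' compute."""
    t_comp, t_comm = 0.0, 0.0
    for c, m in zip(comp, comm):
        t_comp += c                       # compute stream advances
        t_comm = max(t_comm, t_comp) + m  # comm waits for data + channel
    return max(t_comp, t_comm)

print(naive_estimate(comp, comm))       # -> 21.0
print(overlapped_estimate(comp, comm))  # -> 14.0
```

Even this two-stream model shows a one-third gap versus the additive estimate; real simulators must additionally model bandwidth sharing among concurrent transfers, which is what makes a dedicated executor necessary.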
The future of computing beyond Moore's Law.
Moore's Law is a techno-economic model that has enabled the information technology industry to double the performance and functionality of digital electronics roughly every 2 years within a fixed cost, power and area. Advances in silicon lithography have enabled this exponential miniaturization of electronics, but, as transistors reach atomic scale and fabrication costs continue to rise, the classical technological driver that has underpinned Moore's Law for 50 years is failing and is anticipated to flatten by 2025. This article provides an updated view of what a post-exascale system will look like and the challenges ahead, based on our most recent understanding of technology roadmaps. It also discusses the tapering of historical improvements, and how it affects options available to continue scaling of successors to the first exascale machine. Lastly, this article covers the many different opportunities and strategies available to continue computing performance improvements in the absence of historical technology drivers. This article is part of a discussion meeting issue 'Numerical algorithms for high-performance computational science'.
Closing the Gap for Pseudo-Polynomial Strip Packing
Two-dimensional packing problems are a fundamental class of optimization problems, and Strip Packing is one of the most natural and famous among them. Indeed, it can be defined in just one sentence: Given a set of axis-parallel rectangular items and a strip with bounded width and infinite height, the objective is to find a packing of the items into the strip minimizing the packing height. We speak of pseudo-polynomial Strip Packing if we consider algorithms with pseudo-polynomial running time with respect to the width of the strip. It is known that there is no pseudo-polynomial time algorithm for Strip Packing with a ratio better than 5/4 unless P = NP. The best algorithm so far has a ratio of 4/3 + epsilon. In this paper, we close the gap between the inapproximability result and the currently known algorithms by presenting an algorithm with approximation ratio 5/4 + epsilon. The algorithm relies on a new structural result, which is the main accomplishment of this paper. It states that each optimal solution can be transformed with bounded loss in the objective such that it has one of a polynomial number of different forms, thus making the problem tractable by standard techniques, i.e., dynamic programming. To show the conceptual strength of the approach, we extend our result to other problems as well, e.g., Strip Packing with 90-degree rotations and Contiguous Moldable Task Scheduling, and present algorithms with approximation ratio 5/4 + epsilon for these problems as well.
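For a concrete feel of the problem, the classic Next-Fit Decreasing-Height (NFDH) shelf heuristic below packs items onto horizontal shelves. This is only a simple textbook baseline, not the paper's 5/4 + epsilon algorithm, which relies on the structural decomposition and dynamic programming described above; the item set is a made-up example:

```python
# Classic NFDH (Next-Fit Decreasing-Height) shelf heuristic for Strip
# Packing. A simple baseline only; the 5/4 + epsilon result requires a
# much more involved structural argument.

def nfdh(items, strip_width):
    """items: list of (width, height) rectangles. Returns the height of
    the shelf packing produced."""
    # Sort by height, tallest first, so each shelf's height is fixed by
    # its first item and later items on the shelf never stick out.
    items = sorted(items, key=lambda wh: wh[1], reverse=True)
    total_height = 0.0   # height of all closed shelves
    shelf_height = 0.0   # height of the currently open shelf
    x = 0.0              # horizontal fill of the open shelf
    for w, h in items:
        assert w <= strip_width, "item wider than the strip"
        if shelf_height == 0.0 or x + w > strip_width:
            # Item does not fit next to the previous one: close the
            # current shelf and open a new one on top of it.
            total_height += shelf_height
            shelf_height = h
            x = 0.0
        x += w
    return total_height + shelf_height

print(nfdh([(3, 2), (3, 2), (2, 1), (4, 1)], strip_width=6))  # -> 3.0
```

On this toy instance NFDH happens to be optimal, but in general it only guarantees a constant-factor approximation, which is exactly the gap between such simple heuristics and the tight pseudo-polynomial bounds the paper studies.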
Towards efficient end-to-end encryption for container checkpointing systems
Container checkpointing has emerged as a new paradigm for task migration, preemptive scheduling, and elastic scaling of microservices. However, as soon as a snapshot that contains raw memory is exposed through the network or shared storage, sensitive data such as keys and passwords may become compromised. Existing solutions rely on encryption to protect data included in snapshots, but by doing so prevent important performance optimizations such as memory de-duplication and incremental checkpointing. To address these challenges, we design and implement CRIUsec, an efficient end-to-end encryption scheme for container checkpointing systems built on the open-source CRIU (Checkpoint/Restore In Userspace). Our preliminary evaluation shows that CRIUsec integrates seamlessly with popular container platforms (Docker, Podman, Kubernetes), and compared to existing solutions, achieves an average of 1.57× speedup for memory-intensive workloads, and can be up to 100× faster for compute-intensive workloads.
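The tension the abstract describes, that conventional randomized encryption destroys page-level de-duplication, can be sketched as follows. This is an illustrative toy using content-derived keystreams in the style of convergent encryption; it is not CRIUsec's actual scheme, the XOR "cipher" is not secure, and deterministic encryption notoriously leaks page equality:

```python
# Toy model of why randomized snapshot encryption breaks memory
# de-duplication, and how deterministic per-page (convergent-style)
# encryption preserves it. Illustrative only; NOT CRIUsec's design and
# NOT a secure cipher.
import hashlib

PAGE = 8  # toy page size in bytes

def pages(mem):
    return [mem[i:i + PAGE] for i in range(0, len(mem), PAGE)]

def encrypt_whole(image, nonce):
    """Randomized whole-image encryption (keystream depends on position
    and nonce): equal plaintext pages yield unequal ciphertext pages."""
    out = b""
    for i, p in enumerate(pages(image)):
        ks = hashlib.sha256(nonce + i.to_bytes(4, "big")).digest()[:PAGE]
        out += bytes(a ^ b for a, b in zip(p, ks))
    return out

def encrypt_per_page_deterministic(image):
    """Convergent-style: keystream derived from page content, so equal
    pages encrypt identically and can still be de-duplicated."""
    out = []
    for p in pages(image):
        ks = hashlib.sha256(p).digest()[:PAGE]
        out.append(bytes(a ^ b for a, b in zip(p, ks)))
    return out

image = b"AAAAAAAA" * 3 + b"BBBBBBBB"  # three duplicate pages + one unique

unique_plain = len(set(pages(image)))                          # dedups to 2
unique_rand = len(set(pages(encrypt_whole(image, b"nonce"))))  # dedup lost: 4
unique_det = len(set(encrypt_per_page_deterministic(image)))   # preserved: 2
print(unique_plain, unique_rand, unique_det)  # -> 2 4 2
```

The same page-equality property is what makes incremental checkpointing work: only pages whose stored blob changed need to be re-shipped, which a position-randomized ciphertext cannot support.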