322 research outputs found
What does fault tolerant Deep Learning need from MPI?
Deep Learning (DL) algorithms have become the de facto Machine Learning (ML)
algorithm for large scale data analysis. DL algorithms are computationally
expensive - even distributed DL implementations which use MPI require days of
training (model learning) time on commonly studied datasets. Long running DL
applications become susceptible to faults - requiring development of a fault
tolerant system infrastructure, in addition to fault tolerant DL algorithms.
This raises an important question: What is needed from MPI for de- signing
fault tolerant DL implementations? In this paper, we address this problem for
permanent faults. We motivate the need for a fault tolerant MPI specification
by an in-depth consideration of recent innovations in DL algorithms and their
properties, which drive the need for specific fault tolerance features. We
present an in-depth discussion on the suitability of different parallelism
types (model, data and hybrid); a need (or lack thereof) for check-pointing of
any critical data structures; and most importantly, consideration for several
fault tolerance proposals (user-level fault mitigation (ULFM), Reinit) in MPI
and their applicability to fault tolerant DL implementations. We leverage a
distributed memory implementation of Caffe, currently available under the
Machine Learning Toolkit for Extreme Scale (MaTEx). We implement our approaches
by ex- tending MaTEx-Caffe for using ULFM-based implementation. Our evaluation
using the ImageNet dataset and AlexNet, and GoogLeNet neural network topologies
demonstrates the effectiveness of the proposed fault tolerant DL implementation
using OpenMPI based ULFM
TRANSOM: An Efficient Fault-Tolerant System for Training LLMs
Large language models (LLMs) with hundreds of billions or trillions of
parameters, represented by chatGPT, have achieved profound impact on various
fields. However, training LLMs with super-large-scale parameters requires large
high-performance GPU clusters and long training periods lasting for months. Due
to the inevitable hardware and software failures in large-scale clusters,
maintaining uninterrupted and long-duration training is extremely challenging.
As a result, A substantial amount of training time is devoted to task
checkpoint saving and loading, task rescheduling and restart, and task manual
anomaly checks, which greatly harms the overall training efficiency. To address
these issues, we propose TRANSOM, a novel fault-tolerant LLM training system.
In this work, we design three key subsystems: the training pipeline automatic
fault tolerance and recovery mechanism named Transom Operator and Launcher
(TOL), the training task multi-dimensional metric automatic anomaly detection
system named Transom Eagle Eye (TEE), and the training checkpoint asynchronous
access automatic fault tolerance and recovery technology named Transom
Checkpoint Engine (TCE). Here, TOL manages the lifecycle of training tasks,
while TEE is responsible for task monitoring and anomaly reporting. TEE detects
training anomalies and reports them to TOL, who automatically enters the fault
tolerance strategy to eliminate abnormal nodes and restart the training task.
And the asynchronous checkpoint saving and loading functionality provided by
TCE greatly shorten the fault tolerance overhead. The experimental results
indicate that TRANSOM significantly enhances the efficiency of large-scale LLM
training on clusters. Specifically, the pre-training time for GPT3-175B has
been reduced by 28%, while checkpoint saving and loading performance have
improved by a factor of 20.Comment: 14 pages, 9 figure
A recommender system for e-retail
The e-retail sector in South Africa has a significant opportunity to capture a large portion of the country's retail industry. Central to seizing this opportunity is leveraging the advantages that the online setting affords. In particular, the e-retailer can offer an extremely large catalogue of products; far beyond what a traditional retailer is capable of supporting. However, as the catalogue grows, it becomes increasingly difficult for a customer to efficiently discover desirable products. As a consequence, it is important for the e-retailer to develop tools that automatically explore the catalogue for the customer. In this dissertation, we develop a recommender system (RS), whose purpose is to provide suggestions for products that are most likely of interest to a particular customer. There are two primary contributions of this dissertation. First, we describe a set of six characteristics that all effective RS's should possess, namely; accuracy, responsiveness, durability, scalability, model management, and extensibility. Second, we develop an RS that is capable of serving recommendations in an actual e-retail environment. The design of the RS is an attempt to embody the characteristics mentioned above. In addition, to show how the RS supports model selection, we present a proof-of-concept experiment comparing two popular methods for generating recommendations that we implement for this dissertation, namely, implicit matrix factorisation (IMF) and Bayesian personalised ranking (BPR)
Design Space Exploration and Resource Management of Multi/Many-Core Systems
The increasing demand of processing a higher number of applications and related data on computing platforms has resulted in reliance on multi-/many-core chips as they facilitate parallel processing. However, there is a desire for these platforms to be energy-efficient and reliable, and they need to perform secure computations for the interest of the whole community. This book provides perspectives on the aforementioned aspects from leading researchers in terms of state-of-the-art contributions and upcoming trends
Understanding Quantum Technologies 2022
Understanding Quantum Technologies 2022 is a creative-commons ebook that
provides a unique 360 degrees overview of quantum technologies from science and
technology to geopolitical and societal issues. It covers quantum physics
history, quantum physics 101, gate-based quantum computing, quantum computing
engineering (including quantum error corrections and quantum computing
energetics), quantum computing hardware (all qubit types, including quantum
annealing and quantum simulation paradigms, history, science, research,
implementation and vendors), quantum enabling technologies (cryogenics, control
electronics, photonics, components fabs, raw materials), quantum computing
algorithms, software development tools and use cases, unconventional computing
(potential alternatives to quantum and classical computing), quantum
telecommunications and cryptography, quantum sensing, quantum technologies
around the world, quantum technologies societal impact and even quantum fake
sciences. The main audience are computer science engineers, developers and IT
specialists as well as quantum scientists and students who want to acquire a
global view of how quantum technologies work, and particularly quantum
computing. This version is an extensive update to the 2021 edition published in
October 2021.Comment: 1132 pages, 920 figures, Letter forma
- …