SwiftTron: An Efficient Hardware Accelerator for Quantized Transformers
Transformers' compute-intensive operations pose enormous challenges for their
deployment in resource-constrained EdgeAI / tinyML devices. As an established
neural network compression technique, quantization reduces the hardware
computational and memory resources. In particular, fixed-point quantization is
desirable to ease the computations using lightweight blocks, like adders and
multipliers, of the underlying hardware. However, deploying fully-quantized
Transformers on existing general-purpose hardware, generic AI accelerators, or
specialized architectures for Transformers with floating-point units might be
infeasible and/or inefficient.
Towards this, we propose SwiftTron, an efficient specialized hardware
accelerator designed for Quantized Transformers. SwiftTron supports the
execution of different types of Transformers' operations (like Attention,
Softmax, GELU, and Layer Normalization) and accounts for diverse scaling
factors to perform correct computations. We synthesize the complete SwiftTron
architecture in a nm CMOS technology with the ASIC design flow. Our
accelerator executes the RoBERTa-base model in 1.83 ms, while consuming 33.64
mW of power and occupying an area of 273 mm^2. To ease reproducibility, the
RTL of our SwiftTron architecture is released at
https://github.com/albertomarchisio/SwiftTron.
Comment: To appear at the 2023 International Joint Conference on Neural Networks (IJCNN), Queensland, Australia, June 2023.
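The fixed-point arithmetic with scaling factors that the abstract describes can be illustrated with a small sketch. This is plain NumPy with my own function names, a toy illustration of the general technique and not SwiftTron's RTL: tensors are mapped to integers plus a real-valued scale, the matrix multiply runs on integers only, and the scales are combined at the end.

```python
import numpy as np

def quantize(x, n_bits=8):
    """Symmetric fixed-point quantization: integers plus one real-valued scale."""
    scale = float(np.max(np.abs(x))) / (2 ** (n_bits - 1) - 1)
    q = np.round(x / scale).astype(np.int32)
    return q, scale

def int_matmul(qa, sa, qb, sb):
    """Integer-only matrix multiply; the output scale is the product of input scales."""
    return qa @ qb, sa * sb

rng = np.random.default_rng(0)
a = rng.standard_normal((4, 8)).astype(np.float32)
b = rng.standard_normal((8, 3)).astype(np.float32)
qa, sa = quantize(a)
qb, sb = quantize(b)
qc, sc = int_matmul(qa, sa, qb, sb)
approx = qc * sc                                  # dequantize only at the end
err = float(np.max(np.abs(approx - a @ b)))       # small fixed-point error
```

The inner loop needs only integer adders and multipliers, which is exactly why fixed-point quantization suits lightweight hardware blocks.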
ZeRO++: Extremely Efficient Collective Communication for Giant Model Training
Zero Redundancy Optimizer (ZeRO) has been used to train a wide range of large
language models on massive GPU clusters due to its ease of use, efficiency,
and good scalability. However, when training on low-bandwidth clusters, or at
a scale that forces the per-GPU batch size to be small, ZeRO's effective throughput
is limited because of high communication volume from gathering weights in
forward pass, backward pass, and averaging gradients. This paper introduces
three communication volume reduction techniques, which we collectively refer to
as ZeRO++, targeting each of the communication collectives in ZeRO. First is
block-quantization-based all-gather. Second is data remapping that trades off
communication for more memory. Third is a novel all-to-all-based quantized
gradient averaging paradigm as a replacement for the reduce-scatter collective, which
preserves accuracy despite communicating low-precision data. Collectively,
ZeRO++ reduces communication volume of ZeRO by 4x, enabling up to 2.16x better
throughput at 384 GPU scale.
Comment: 12 pages.
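The first technique, block-quantization-based all-gather, can be sketched as follows. This is an illustrative NumPy sketch with hypothetical function names, not the actual ZeRO++ kernels: the tensor is split into blocks, each block gets its own scale, and the wire payload becomes int8 values plus one fp32 scale per block instead of fp32 everywhere.

```python
import numpy as np

def block_quantize(x, block_size=16, n_bits=8):
    """Quantize a 1-D tensor block by block, keeping one fp32 scale per block."""
    qmax = 2 ** (n_bits - 1) - 1
    blocks = x.reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / qmax
    scales = np.where(scales == 0, 1.0, scales).astype(np.float32)
    q = np.round(blocks / scales).astype(np.int8)
    return q, scales

def block_dequantize(q, scales):
    return (q.astype(np.float32) * scales).reshape(-1)

rng = np.random.default_rng(0)
x = rng.standard_normal(64).astype(np.float32)
q, s = block_quantize(x)
x_hat = block_dequantize(q, s)

# Payload on the wire: int8 values plus one fp32 scale per block,
# versus 4 bytes per element for the fp32 original.
bytes_fp32 = x.nbytes
bytes_quant = q.nbytes + s.nbytes
```

Per-block scales keep the quantization error local to each block, which is what makes aggressive low-precision communication tolerable in practice.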
Polynomially Over-Parameterized Convolutional Neural Networks Contain Structured Strong Winning Lottery Tickets
The Strong Lottery Ticket Hypothesis (SLTH) states that randomly-initialised
neural networks likely contain subnetworks that perform well without any
training. Although unstructured pruning has been extensively studied in this
context, its structured counterpart, which can deliver significant
computational and memory efficiency gains, has been largely unexplored. One of
the main reasons for this gap is the limitations of the underlying mathematical
tools used in formal analyses of the SLTH. In this paper, we overcome these
limitations: we leverage recent advances in the multidimensional generalisation
of the Random Subset-Sum Problem and obtain a variant that admits the
stochastic dependencies that arise when addressing structured pruning in the
SLTH. We apply this result to prove, for a wide class of random Convolutional
Neural Networks, the existence of structured subnetworks that can approximate
any sufficiently smaller network.
This result provides the first sub-exponential bound around the SLTH for
structured pruning, opening up new avenues for further research on the
hypothesis and contributing to the understanding of the role of
over-parameterization in deep learning.
Comment: To be published in the 37th Conference on Neural Information Processing Systems (NeurIPS 2023).
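The Random Subset-Sum intuition behind such existence results can be demonstrated numerically: a modest number of random samples already contains a subset whose sum approximates any bounded target well. The sketch below is a toy stdlib-Python illustration of that phenomenon, not the paper's multidimensional variant or its proofs.

```python
import itertools
import random

random.seed(0)
n = 16
samples = [random.uniform(-1, 1) for _ in range(n)]

def closest_subset_sum(samples, target):
    """Exhaustively find the subset whose sum is closest to `target`."""
    best = 0.0  # the empty subset
    for r in range(1, len(samples) + 1):
        for combo in itertools.combinations(samples, r):
            s = sum(combo)
            if abs(s - target) < abs(best - target):
                best = s
    return best

# With only 16 random samples, every target in [-1, 1] is approximated
# closely by some subset -- the error shrinks exponentially in n.
targets = [0.3, -0.7, 0.95]
errors = [abs(closest_subset_sum(samples, t) - t) for t in targets]
```

This exponential precision from polynomially many random numbers is the engine that lets a random over-parameterized network contain an accurate subnetwork.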
Adaptive SGD with Polyak stepsize and Line-search: Robust Convergence and Variance Reduction
The recently proposed stochastic Polyak stepsize (SPS) and stochastic
line-search (SLS) for SGD have shown remarkable effectiveness when training
over-parameterized models. However, in non-interpolation settings, both
algorithms only guarantee convergence to a neighborhood of a solution which may
result in a worse output than the initial guess. While artificially decreasing
the adaptive stepsize has been proposed to address this issue (Orvieto et al.
[2022]), this approach results in slower convergence rates for convex and
over-parameterized models. In this work, we make two contributions: Firstly, we
propose two new variants of SPS and SLS, called AdaSPS and AdaSLS, which
guarantee convergence in non-interpolation settings and maintain sub-linear and
linear convergence rates for convex and strongly convex functions when training
over-parameterized models. AdaSLS requires no knowledge of problem-dependent
parameters, and AdaSPS requires only a lower bound of the optimal function
value as input. Secondly, we equip AdaSPS and AdaSLS with a novel variance
reduction technique and obtain algorithms that require fewer gradient
evaluations to achieve an ε-suboptimality for convex functions, improving
upon the slower rates of AdaSPS and AdaSLS without variance reduction in
the non-interpolation regimes. Moreover, our result
matches the fast rates of AdaSVRG but removes the inner-outer-loop structure,
which is easier to implement and analyze. Finally, numerical experiments on
synthetic and real datasets validate our theory and demonstrate the
effectiveness and robustness of our algorithms.
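The classical stochastic Polyak stepsize that SPS builds on can be sketched as follows. This is a minimal illustration of plain SPS with assumed defaults (c = 0.5 and a stepsize cap), not the proposed AdaSPS/AdaSLS variants; the names are my own.

```python
import numpy as np

def sps_step(x, grad, loss, f_star=0.0, c=0.5, gamma_max=10.0):
    """One SGD step with the stochastic Polyak stepsize:
    gamma = (f_i(x) - f_i^*) / (c * ||grad f_i(x)||^2), capped at gamma_max."""
    g2 = float(np.dot(grad, grad))
    if g2 == 0.0:
        return x
    gamma = min((loss - f_star) / (c * g2), gamma_max)
    return x - gamma * grad

# Toy interpolation problem: a consistent least-squares system, so every
# per-sample minimum f_i^* is 0 and SPS needs no extra problem knowledge.
rng = np.random.default_rng(0)
A = rng.standard_normal((32, 4))
x_true = rng.standard_normal(4)
b = A @ x_true
x = np.zeros(4)
for _ in range(500):
    i = rng.integers(32)
    r = float(A[i] @ x - b[i])
    loss = 0.5 * r ** 2          # per-sample loss; its minimum is 0
    grad = r * A[i]
    x = sps_step(x, grad, loss)
```

In the non-interpolation regime the per-sample minima are no longer zero, which is exactly where plain SPS only reaches a neighborhood of the solution and the AdaSPS/AdaSLS modifications become necessary.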
HongTu: Scalable Full-Graph GNN Training on Multiple GPUs (via communication-optimized CPU data offloading)
Full-graph training on graph neural networks (GNN) has emerged as a promising
training method for its effectiveness. Full-graph training requires extensive
memory and computation resources. To accelerate this training process,
researchers have proposed employing multi-GPU processing. However, the
scalability of existing frameworks is limited as they necessitate maintaining
the training data for every layer in GPU memory. To efficiently train on large
graphs, we present HongTu, a scalable full-graph GNN training system running on
GPU-accelerated platforms. HongTu stores vertex data in CPU memory and offloads
training to GPUs. HongTu employs a memory-efficient full-graph training
framework that reduces runtime memory consumption by using partition-based
training and recomputation-caching-hybrid intermediate data management. To
address the issue of increased host-GPU communication caused by duplicated
neighbor access among partitions, HongTu employs a deduplicated communication
framework that converts the redundant host-GPU communication to efficient
inter/intra-GPU data access. Further, HongTu uses a cost model-guided graph
reorganization method to minimize communication overhead. Experimental results
on a 4×A100 GPU server show that HongTu effectively supports billion-scale
full-graph GNN training while reducing host-GPU data communication by 25%-71%.
Compared to the full-graph GNN system DistGNN running on 16 CPU nodes, HongTu
achieves speedups ranging from 7.8X to 20.2X. For small graphs where the
training data fits into the GPUs, HongTu achieves performance comparable to
existing GPU-based GNN systems.Comment: 28 pages 11 figures, SIGMOD202
Memorization-Dilation: Modeling Neural Collapse Under Label Noise
The notion of neural collapse refers to several emergent phenomena that have
been empirically observed across various canonical classification problems.
During the terminal phase of training a deep neural network, the feature
embeddings of all examples of the same class tend to collapse to a single
representation, and the features of different classes tend to separate as much
representation, and the features of different classes tend to separate as much
as possible. Neural collapse is often studied through a simplified model,
called the unconstrained feature representation, in which the model is assumed
to have "infinite expressivity" and can map each data point to any arbitrary
representation. In this work, we propose a more realistic variant of the
unconstrained feature representation that takes the limited expressivity of the
network into account. Empirical evidence suggests that the memorization of
noisy data points leads to a degradation (dilation) of the neural collapse.
Using a model of the memorization-dilation (M-D) phenomenon, we show one
mechanism by which different losses lead to different performances of the
trained network on noisy data. Our proofs reveal why label smoothing, a
modification of cross-entropy empirically observed to produce a regularization
effect, leads to improved generalization in classification tasks.
Comment: To be published at ICLR 202
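Neural collapse is commonly quantified by comparing within-class and between-class scatter of the feature embeddings: under full collapse the within-class term vanishes while the between-class term stays large. A minimal sketch with my own helper names, not the paper's code:

```python
import numpy as np

def scatter(features, labels):
    """Within-class and between-class scatter of feature embeddings."""
    global_mean = features.mean(axis=0)
    within = between = 0.0
    for c in np.unique(labels):
        cls = features[labels == c]
        mu = cls.mean(axis=0)
        within += float(np.sum((cls - mu) ** 2))
        between += len(cls) * float(np.sum((mu - global_mean) ** 2))
    return within, between

# Fully collapsed features: every example sits exactly at its class mean.
rng = np.random.default_rng(0)
labels = np.repeat(np.arange(3), 10)
class_means = rng.standard_normal((3, 5))
collapsed = class_means[labels]
within, between = scatter(collapsed, labels)
```

The dilation studied in the abstract corresponds to the within-class term growing on memorized noisy examples instead of shrinking to zero.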
Classification of Histological Images of Prostate Cancer
Master's dissertation on "Classification of histological images of prostate cancer", completed by Taras Iliushyk, a student of the Department of Biomedical Cybernetics (FBMI), in specialty 122 "Computer Science" under the educational-professional program "Computer Technology in Biology and Medicine". The work consists of an introduction; four chapters (Review of literature sources, Theoretical part, Analytical part, Practical part); a startup-project chapter; conclusions to each chapter; general conclusions; and a list of references comprising 85 sources. The total length of the work is 97 pages.
Topic relevance. Computer-aided diagnostic systems are a promising field in medical cybernetics and can improve the quality of detecting human pathologies such as prostate cancer.
Research objective. Development of a model for the classification of histological images of prostate cancer.
Object of study. Algorithms for image classification.
Subject of study. Prostate cancer.
Research methods. Computer vision, convolutional neural networks, machine learning techniques.