14 research outputs found

    SwiftTron: An Efficient Hardware Accelerator for Quantized Transformers

    Transformers' compute-intensive operations pose enormous challenges for their deployment in resource-constrained EdgeAI / tinyML devices. As an established neural network compression technique, quantization reduces the hardware computational and memory resources. In particular, fixed-point quantization is desirable to ease the computations using lightweight blocks, like adders and multipliers, of the underlying hardware. However, deploying fully-quantized Transformers on existing general-purpose hardware, generic AI accelerators, or specialized architectures for Transformers with floating-point units might be infeasible and/or inefficient. Towards this, we propose SwiftTron, an efficient specialized hardware accelerator designed for quantized Transformers. SwiftTron supports the execution of different types of Transformer operations (such as Attention, Softmax, GELU, and Layer Normalization) and accounts for diverse scaling factors to perform correct computations. We synthesize the complete SwiftTron architecture in a 65 nm CMOS technology with the ASIC design flow. Our accelerator executes the RoBERTa-base model in 1.83 ns, while consuming 33.64 mW of power and occupying an area of 273 mm^2. To ease reproducibility, the RTL of our SwiftTron architecture is released at https://github.com/albertomarchisio/SwiftTron.
    Comment: To appear at the 2023 International Joint Conference on Neural Networks (IJCNN), Queensland, Australia, June 2023
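    The fixed-point scheme described above can be illustrated with a minimal sketch: both operands are quantized to signed integers with per-tensor scaling factors, the matmul runs entirely in integer arithmetic, and the output scale is the product of the input scales. The helper names and the 8-bit choice are illustrative assumptions, not SwiftTron's actual RTL datapath.

```python
import numpy as np

def quantize(x, num_bits=8):
    """Uniform symmetric quantization to signed fixed-point integers.

    Returns the integer tensor and its scaling factor (hypothetical helper).
    """
    qmax = 2 ** (num_bits - 1) - 1
    scale = np.max(np.abs(x)) / qmax
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int32)
    return q, scale

def int_matmul(qa, sa, qb, sb):
    """Integer-only matmul; the result's scale is the product of input scales."""
    acc = qa.astype(np.int64) @ qb.astype(np.int64)  # integer accumulation
    return acc, sa * sb

rng = np.random.default_rng(0)
A, B = rng.standard_normal((4, 8)), rng.standard_normal((8, 3))
qa, sa = quantize(A)
qb, sb = quantize(B)
acc, s = int_matmul(qa, sa, qb, sb)
error = np.max(np.abs(acc * s - A @ B))  # dequantize only to compare
```

    Keeping track of the scaling factors is exactly the bookkeeping the accelerator must do per operation, since each layer's inputs and weights carry different scales.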

    ZeRO++: Extremely Efficient Collective Communication for Giant Model Training

    Zero Redundancy Optimizer (ZeRO) has been used to train a wide range of large language models on massive GPU clusters due to its ease of use, efficiency, and good scalability. However, when training on low-bandwidth clusters, or at scale which forces the batch size per GPU to be small, ZeRO's effective throughput is limited by the high communication volume from gathering weights in the forward pass and backward pass and from averaging gradients. This paper introduces three communication volume reduction techniques, which we collectively refer to as ZeRO++, targeting each of the communication collectives in ZeRO. First is block-quantization-based all-gather. Second is data remapping that trades off communication for more memory. Third is a novel all-to-all-based quantized gradient averaging paradigm as a replacement for the reduce-scatter collective, which preserves accuracy despite communicating low-precision data. Collectively, ZeRO++ reduces the communication volume of ZeRO by 4x, enabling up to 2.16x better throughput at 384-GPU scale.
    Comment: 12 pages
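    The block quantization mentioned for the all-gather can be sketched as follows: a flat weight tensor is split into fixed-size blocks, each with its own scale, so one large-magnitude region does not destroy precision elsewhere while the transferred payload shrinks from 16-bit floats to 8-bit integers plus a few scales. The block size and values here are toy assumptions, not ZeRO++'s fused CUDA kernels.

```python
import numpy as np

BLOCK = 4  # toy granularity; real systems use far larger blocks

def block_quantize(w, num_bits=8):
    """Quantize a flat tensor in independent blocks, one scale per block."""
    qmax = 2 ** (num_bits - 1) - 1
    blocks = w.reshape(-1, BLOCK)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / qmax
    q = np.round(blocks / scales).astype(np.int8)
    return q, scales

def block_dequantize(q, scales):
    return (q.astype(np.float32) * scales).reshape(-1)

w = np.array([0.01, -0.02, 0.015, 0.005,        # small-magnitude block
              3.0, -2.5, 1.0, 2.0],             # large-magnitude block
             dtype=np.float32)
q, s = block_quantize(w)
w_hat = block_dequantize(q, s)
```

    With a single global scale, the 3.0 entry would force a coarse step of ~0.024 onto the 0.01-scale block; per-block scales keep both blocks accurate.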

    Polynomially Over-Parameterized Convolutional Neural Networks Contain Structured Strong Winning Lottery Tickets

    The Strong Lottery Ticket Hypothesis (SLTH) states that randomly-initialised neural networks likely contain subnetworks that perform well without any training. Although unstructured pruning has been extensively studied in this context, its structured counterpart, which can deliver significant computational and memory efficiency gains, has been largely unexplored. One of the main reasons for this gap is the limitations of the underlying mathematical tools used in formal analyses of the SLTH. In this paper, we overcome these limitations: we leverage recent advances in the multidimensional generalisation of the Random Subset-Sum Problem and obtain a variant that admits the stochastic dependencies that arise when addressing structured pruning in the SLTH. We apply this result to prove, for a wide class of random Convolutional Neural Networks, the existence of structured subnetworks that can approximate any sufficiently smaller network. This result provides the first sub-exponential bound on the SLTH for structured pruning, opening up new avenues for further research on the hypothesis and contributing to the understanding of the role of over-parameterization in deep learning.
    Comment: To be published in the 37th Conference on Neural Information Processing Systems (NeurIPS 2023)
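    The Random Subset-Sum intuition behind such proofs can be demonstrated with a toy brute-force search: given enough i.i.d. random samples, some subset sums land very close to any target in a bounded range, which is why pruning (selecting a subset of random weights) can approximate a given weight. This exhaustive check stands in for the paper's probabilistic argument; the sample count and target are arbitrary choices.

```python
import itertools
import random

def best_subset_sum(samples, target):
    """Exhaustively find the subset of `samples` whose sum is closest to `target`."""
    best, best_err = (), float("inf")
    for r in range(len(samples) + 1):
        for sub in itertools.combinations(samples, r):
            err = abs(sum(sub) - target)
            if err < best_err:
                best, best_err = sub, err
    return best, best_err

random.seed(0)
samples = [random.uniform(-1, 1) for _ in range(14)]  # "random initialisation"
subset, err = best_subset_sum(samples, 0.5)           # "weight to approximate"
```

    With 14 samples there are 2^14 = 16384 candidate subsets, so a near-hit on the target is overwhelmingly likely; the theory makes this precise, with the error shrinking exponentially in the number of samples.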

    Adaptive SGD with Polyak stepsize and Line-search: Robust Convergence and Variance Reduction

    The recently proposed stochastic Polyak stepsize (SPS) and stochastic line-search (SLS) for SGD have shown remarkable effectiveness when training over-parameterized models. However, in non-interpolation settings, both algorithms only guarantee convergence to a neighborhood of a solution, which may result in a worse output than the initial guess. While artificially decreasing the adaptive stepsize has been proposed to address this issue (Orvieto et al. [2022]), this approach results in slower convergence rates for convex and over-parameterized models. In this work, we make two contributions. Firstly, we propose two new variants of SPS and SLS, called AdaSPS and AdaSLS, which guarantee convergence in non-interpolation settings and maintain sub-linear and linear convergence rates for convex and strongly convex functions when training over-parameterized models. AdaSLS requires no knowledge of problem-dependent parameters, and AdaSPS requires only a lower bound of the optimal function value as input. Secondly, we equip AdaSPS and AdaSLS with a novel variance reduction technique and obtain algorithms that require Õ(n + 1/ε) gradient evaluations to achieve an O(ε)-suboptimality for convex functions, which improves upon the slower O(1/ε²) rates of AdaSPS and AdaSLS without variance reduction in the non-interpolation regimes. Moreover, our result matches the fast rates of AdaSVRG but removes the inner-outer-loop structure, which is easier to implement and analyze. Finally, numerical experiments on synthetic and real datasets validate our theory and demonstrate the effectiveness and robustness of our algorithms.
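    The Polyak stepsize rule these methods build on can be sketched in a few lines: at each step the stepsize is γ_t = (f(x_t) − f*) / (c·‖∇f(x_t)‖²), capped at some γ_max. This is a minimal sketch of the vanilla (deterministic, for illustration) SPS rule; AdaSPS itself adds an accumulating denominator not shown here, and the function and constants below are made-up examples.

```python
import numpy as np

def sgd_polyak(grad_f, f, x0, f_star=0.0, steps=100, c=0.5, gamma_max=1.0):
    """Gradient descent with the Polyak stepsize
    gamma_t = (f(x_t) - f_star) / (c * ||grad f(x_t)||^2), capped at gamma_max."""
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        g = grad_f(x)
        gnorm2 = float(g @ g)
        if gnorm2 == 0.0:  # reached a stationary point
            break
        gamma = min((f(x) - f_star) / (c * gnorm2), gamma_max)
        x = x - gamma * g
    return x

# Quadratic f(x) = 0.5 * ||x||^2 with optimal value f* = 0 at the origin.
f = lambda x: 0.5 * float(x @ x)
grad_f = lambda x: x
x = sgd_polyak(grad_f, f, x0=np.ones(5))
```

    Note the rule needs f* (or, for AdaSPS, only a lower bound on it) as input, which is exactly the knowledge requirement the abstract highlights.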

    HongTu: Scalable Full-Graph GNN Training on Multiple GPUs (via communication-optimized CPU data offloading)

    Full-graph training of graph neural networks (GNNs) has emerged as a promising training method for its effectiveness, but it requires extensive memory and computation resources. To accelerate this training process, researchers have proposed employing multi-GPU processing. However, the scalability of existing frameworks is limited, as they necessitate maintaining the training data for every layer in GPU memory. To efficiently train on large graphs, we present HongTu, a scalable full-graph GNN training system running on GPU-accelerated platforms. HongTu stores vertex data in CPU memory and offloads training to GPUs. HongTu employs a memory-efficient full-graph training framework that reduces runtime memory consumption by using partition-based training and recomputation-caching-hybrid intermediate data management. To address the increased host-GPU communication caused by duplicated neighbor access among partitions, HongTu employs a deduplicated communication framework that converts the redundant host-GPU communication into efficient inter/intra-GPU data access. Further, HongTu uses a cost-model-guided graph reorganization method to minimize communication overhead. Experimental results on a 4×A100 GPU server show that HongTu effectively supports billion-scale full-graph GNN training while reducing host-GPU data communication by 25%-71%. Compared to the full-graph GNN system DistGNN running on 16 CPU nodes, HongTu achieves speedups ranging from 7.8× to 20.2×. For small graphs where the training data fits into the GPUs, HongTu achieves performance comparable to existing GPU-based GNN systems.
    Comment: 28 pages, 11 figures, SIGMOD202
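    The deduplication idea can be modeled in a few lines: partitions request overlapping sets of remote vertices, so fetching the union once from host memory and sharing it on-device removes the duplicate transfers. Vertex IDs and the partition layout below are made up for illustration and are not HongTu's actual communication plan.

```python
# Each partition's set of remote vertex IDs it needs from CPU memory.
requests = {
    "partition0": {0, 1, 2, 3, 7},
    "partition1": {2, 3, 4, 5, 7},
    "partition2": {3, 5, 6, 7, 8},
}

# Naive: one host-GPU transfer per (partition, vertex) request.
naive_volume = sum(len(r) for r in requests.values())

# Deduplicated: one host-GPU transfer per distinct vertex; partitions then
# read the shared copy via inter/intra-GPU access instead of re-fetching.
dedup_fetch = set().union(*requests.values())

saved = 1 - len(dedup_fetch) / naive_volume  # fraction of traffic removed
```

    In this toy layout the union holds 9 of the 15 requested IDs, so 40% of the host-GPU traffic disappears, the same effect the paper measures at 25%-71% on real graphs.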

    Memorization-Dilation: Modeling Neural Collapse Under Label Noise

    The notion of neural collapse refers to several emergent phenomena that have been empirically observed across various canonical classification problems. During the terminal phase of training a deep neural network, the feature embeddings of all examples of the same class tend to collapse to a single representation, while the features of different classes tend to separate as much as possible. Neural collapse is often studied through a simplified model, called the unconstrained feature representation, in which the model is assumed to have "infinite expressivity" and can map each data point to any arbitrary representation. In this work, we propose a more realistic variant of the unconstrained feature representation that takes the limited expressivity of the network into account. Empirical evidence suggests that the memorization of noisy data points leads to a degradation (dilation) of the neural collapse. Using a model of the memorization-dilation (M-D) phenomenon, we show one mechanism by which different losses lead to different performances of the trained network on noisy data. Our proofs reveal why label smoothing, a modification of cross-entropy empirically observed to produce a regularization effect, leads to improved generalization in classification tasks.
    Comment: to be published at ICLR 202
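    The collapse-versus-dilation contrast can be quantified with a simple within-class variation statistic: the average squared distance of features to their class mean, which tends to zero under neural collapse and grows when noisy points are memorized. This is a toy diagnostic on synthetic features, not the paper's full M-D model; the dimensions and noise levels are arbitrary assumptions.

```python
import numpy as np

def within_class_variation(features, labels):
    """Mean squared distance of each feature vector to its class mean."""
    total, n = 0.0, 0
    for c in np.unique(labels):
        fc = features[labels == c]
        total += float(((fc - fc.mean(axis=0)) ** 2).sum())
        n += len(fc)
    return total / n

rng = np.random.default_rng(1)
means = rng.standard_normal((3, 8))            # one class mean per class
labels = np.repeat(np.arange(3), 50)
# Collapsed features hug their class mean; "dilated" features (e.g. after
# memorizing label noise) spread out around it.
collapsed = means[labels] + 0.01 * rng.standard_normal((150, 8))
dilated = means[labels] + 0.5 * rng.standard_normal((150, 8))
```

    Comparing the statistic on the two feature sets makes the dilation visible as a strictly larger within-class variation.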

    Classification of Histological Images of Prostate Cancer

    The master's dissertation "Classification of Histological Images of Prostate Cancer" was completed by Taras Iliushyk, a student of the Department of Biomedical Cybernetics, FBMI, in specialty 122 "Computer Science" under the educational and professional program "Computer Technology in Biology and Medicine". It consists of an introduction; four chapters (Review of literature sources, Theoretical part, Analytical part, Practical part) with conclusions to each; a startup-project chapter; general conclusions; and a list of references comprising 85 sources. The total length of the work is 97 pages. Topic's relevance. Computer-aided diagnostic systems are a promising field in medical cybernetics and can improve the quality of detecting human pathologies such as prostate cancer. Research objective. Development of a model for the classification of histological images of prostate cancer. Object of study. Algorithms for image classification. Subject of study. Prostate cancer. Research methods. Computer vision, convolutional neural networks, machine learning techniques.