The Effects of Weight Quantization on Online Federated Learning for the IoT: A Case Study
Many weight quantization approaches have been explored to save communication bandwidth between clients and the server in federated learning on high-end computing machines. However, weight quantization remains underexplored for online federated learning on TinyML devices, which are restricted in mini-batch size, neural network size, and communication method by their severe hardware resource constraints and power budgets. We coin the term Tiny Online Federated Learning (TinyOFL) for online federated learning using TinyML devices in the Internet of Things (IoT). This paper performs a comprehensive analysis of the effects of weight quantization in TinyOFL in terms of accuracy, stability, overfitting, communication efficiency, energy consumption, and delivery time, and extracts practical guidelines on how to apply weight quantization to TinyOFL. Our analysis is supported by a TinyOFL case study with three Arduino Portenta H7 boards running federated learning clients for a keyword spotting task. Our findings include that TinyOFL tolerates more aggressive weight quantization than online learning without FL, without affecting accuracy, thanks to TinyOFL's quasi-batch training property. For example, 7-bit weights achieved accuracy equivalent to 32-bit floating-point weights while reducing communication bandwidth by 4.6×. Overfitting from increasing network width rarely occurs in TinyOFL, but may occur if strong weight quantization is applied. The experiments also showed that TinyOFL applications have a design space in which the accuracy loss due to weight quantization can be compensated by increasing the neural network size.
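As a concrete illustration of the kind of weight quantization studied here, the sketch below uniformly quantizes a weight tensor to n-bit integers before a client uploads it and dequantizes it on the server. It is a minimal sketch under stated assumptions: the symmetric per-tensor scheme, the function names, and the int8 container (real transmission would bit-pack the 7-bit values to realize the bandwidth saving) are illustrative, not the paper's exact method.

```python
# Minimal sketch of uniform n-bit weight quantization for FL communication.
# Assumptions: symmetric, per-tensor scaling; values ride in int8 containers
# (actual transmission would bit-pack the 7-bit values).
import numpy as np

def quantize(weights: np.ndarray, n_bits: int = 7):
    """Map float32 weights to signed n-bit integers plus one scale factor."""
    qmax = 2 ** (n_bits - 1) - 1               # 63 for 7-bit weights
    scale = np.abs(weights).max() / qmax       # per-tensor symmetric scale
    q = np.clip(np.round(weights / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

# Client quantizes before upload; the server dequantizes, then averages.
w = np.random.randn(1000).astype(np.float32)
q, s = quantize(w, n_bits=7)
print("max round-trip error:", np.abs(w - dequantize(q, s)).max())
```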
PandaLM: An Automatic Evaluation Benchmark for LLM Instruction Tuning Optimization
Instruction tuning large language models (LLMs) remains a challenging task,
owing to the complexity of hyperparameter selection and the difficulty involved
in evaluating the tuned models. To determine the optimal hyperparameters, an
automatic, robust, and reliable evaluation benchmark is essential. However,
establishing such a benchmark is not a trivial task due to the challenges
associated with evaluation accuracy and privacy protection. In response to
these challenges, we introduce a judge large language model, named PandaLM,
which is trained to distinguish the superior model among several LLMs.
PandaLM's focus extends beyond just the objective correctness of responses,
which is the main focus of traditional evaluation datasets. It addresses vital
subjective factors such as relative conciseness, clarity, adherence to
instructions, comprehensiveness, and formality. To ensure the reliability of
PandaLM, we collect a diverse human-annotated test dataset, where all contexts
are generated by humans and labels are aligned with human preferences. Our
results indicate that PandaLM-7B achieves 93.75% of GPT-3.5's evaluation
ability and 88.28% of GPT-4's in terms of F1-score on our test dataset. PandaLM
enables the evaluation of LLMs to be fairer at lower cost, as evidenced by the
significant improvements achieved by models tuned through PandaLM compared to
their counterparts trained with Alpaca's default hyperparameters. In addition,
PandaLM does not depend on API-based evaluations, thus avoiding potential data
leakage. All resources of PandaLM are released at
https://github.com/WeOpenML/PandaLM
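For readers who want to try judge-based evaluation, the sketch below shows the general pattern of asking a judge LLM to compare two responses. The Hugging Face hub ID and the prompt template are assumptions for illustration only; the PandaLM repository documents the actual checkpoint names and input format.

```python
# Hypothetical sketch of pairwise judging with a judge LLM. The hub ID and
# prompt template are assumptions; see the PandaLM repo for the real ones.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "WeOpenML/PandaLM-7B-v1"  # assumed checkpoint name

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)

def judge(instruction: str, response_1: str, response_2: str) -> str:
    # Illustrative prompt: the judge weighs conciseness, clarity, adherence
    # to instructions, comprehensiveness, and formality, not just correctness.
    prompt = (
        f"Instruction: {instruction}\n"
        f"Response 1: {response_1}\n"
        f"Response 2: {response_2}\n"
        "Which response is better? Answer 1, 2, or Tie.\n"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=16, do_sample=False)
    return tokenizer.decode(out[0][inputs["input_ids"].shape[1]:],
                            skip_special_tokens=True)
```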
HyperTrack: Neural Combinatorics for High Energy Physics
Combinatorial inverse problems in high energy physics pose enormous
algorithmic challenges. This work presents a new deep-learning-driven
clustering algorithm that utilizes a space-time non-local trainable graph
constructor, a graph neural network, and a set transformer. The model is
trained with loss functions at the graph node, edge and object level, including
contrastive learning and meta-supervision. The algorithm can be applied to
problems such as charged particle tracking, calorimetry, pile-up
discrimination, jet physics, and beyond. We showcase the effectiveness of this
cutting-edge AI approach through particle tracking simulations. The code is
available online.
Comment: CHEP 2023 proceedings, 8 pages.
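The multi-level supervision mentioned above can be pictured as one objective combining node, edge, and object terms. The sketch below is an assumption-laden illustration (binary cross-entropy for nodes and edges, a simplified supervised-contrastive term for objects, ad hoc weights), not HyperTrack's actual loss.

```python
# Illustrative multi-level loss: node + edge classification terms plus a
# simplified supervised-contrastive object term. Weights and the choice of
# individual losses are assumptions, not HyperTrack's implementation.
import torch
import torch.nn.functional as F

def combined_loss(node_logits, node_labels, edge_logits, edge_labels,
                  object_emb, object_ids,
                  w_node=1.0, w_edge=1.0, w_obj=0.1, temperature=0.1):
    l_node = F.binary_cross_entropy_with_logits(node_logits, node_labels)
    l_edge = F.binary_cross_entropy_with_logits(edge_logits, edge_labels)
    # Pull embeddings of hits belonging to the same object (particle) together.
    z = F.normalize(object_emb, dim=1)
    sim = z @ z.t() / temperature
    pos = (object_ids[:, None] == object_ids[None, :]).float()
    pos.fill_diagonal_(0)                        # exclude self-pairs
    diag = torch.eye(len(z), dtype=torch.bool, device=z.device)
    denom = torch.logsumexp(sim.masked_fill(diag, float("-inf")),
                            dim=1, keepdim=True)
    l_obj = -((sim - denom) * pos).sum() / pos.sum().clamp(min=1)
    return w_node * l_node + w_edge * l_edge + w_obj * l_obj
```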
Investigating and Testing Performance Issues in Deep Learning Frameworks
Machine Learning (ML) and Deep Learning (DL) applications are becoming more popular due to the availability of DL frameworks such as PyTorch, Keras, and TensorFlow. Therefore, the quality of DL frameworks is essential to ensure DL/ML application quality. Given the computationally expensive nature of DL tasks (e.g., training), performance is a critical aspect of DL frameworks. However, optimizing DL frameworks may have its own unique challenges due to the peculiarities of DL (e.g., hardware integration and the nature of the computation).
In this thesis, we first aim to better understand performance bugs in DL frameworks by conducting an empirical study. We conduct our study on PyTorch and TensorFlow by mining and studying their performance and non-performance bug reports from their respective GitHub repositories. We find that 1) the proportion of newly reported performance bugs increases faster than that of fixed performance bugs, and the ratio of performance bugs among all bugs increases over time; 2) performance bugs take more time to fix, have larger fix sizes, and attract more community engagement (e.g., discussion) than non-performance bugs; and 3) by studying all performance bug fixes, we manually derived a taxonomy of 12 categories and 19 sub-categories of the root causes of performance bugs in DL frameworks.
We then aim to investigate the potential of differential testing as a viable technique to detect and prevent performance bugs in DL frameworks. To do so, we train and evaluate two state-of-the-art CNN and RNN architectures (i.e., the LeNet-5 architecture on the MNIST dataset and the LSTM architecture on the IMDB movie review dataset), using different DL frameworks (i.e., PyTorch, Keras, and TensorFlow) and different configurations (i.e., the training dataset sample size, the batch size, the number of epochs, the weight initialization technique, the data type, the hardware used, the learning rate, and the dropout rate). To assess the performance of the DL models, we use a variety of performance metrics (i.e., training/inference time, hardware (CPU or GPU) usage during training/inference, and memory (RAM or GPU VRAM) usage during training/inference). Then, we compare the performance of the DL models across the DL frameworks. We train and evaluate 21,870 LeNet-5 models and 21,870 LSTM models across the DL frameworks, for a grand total of 43,740 models; our experiments took over 42 days. We find that 1) differences in performance between DL frameworks, for the same task, may be indicative of a performance optimization opportunity or a performance bug; and 2) our approach is viable when training and evaluating a smaller number of DL models, which makes it more accessible for developers.
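The core of the differential-testing idea can be sketched as a small harness that measures the same workload under each framework and flags large gaps. This is a minimal sketch under stated assumptions: the 2x flagging threshold and the Python-heap memory proxy are illustrative choices, not the thesis's actual criteria.

```python
# Minimal differential-testing harness: time the same training workload in
# each framework, then flag suspicious gaps. The 2x threshold and the use of
# tracemalloc as a memory proxy are illustrative assumptions.
import time
import tracemalloc

def measure(train_fn) -> tuple[float, int]:
    """Return wall-clock seconds and peak Python-heap bytes for one run."""
    tracemalloc.start()
    t0 = time.perf_counter()
    train_fn()
    elapsed = time.perf_counter() - t0
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return elapsed, peak

def flag_gaps(runtimes: dict[str, float], ratio: float = 2.0):
    """Report framework pairs whose runtimes differ by more than `ratio`."""
    names = list(runtimes)
    flagged = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            lo, hi = sorted((runtimes[a], runtimes[b]))
            if hi / lo > ratio:
                flagged.append((a, b, hi / lo))
    return flagged

# Usage: runtimes = {name: measure(fn)[0] for name, fn in train_fns.items()},
# where train_fns maps "pytorch"/"keras"/"tensorflow" to equivalent jobs.
```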
Finally, we present some potential avenues for future work aimed at further studying performance bugs in DL frameworks.
ALBERTA: ALgorithm-Based Error Resilience in Transformer Architectures
Vision Transformers are being increasingly deployed in safety-critical
applications that demand high reliability. It is crucial to ensure the
correctness of their execution in spite of potential errors such as transient
hardware errors. We propose a novel algorithm-based resilience framework called
ALBERTA that allows us to perform end-to-end resilience analysis and protection
of transformer-based architectures. First, our work develops an efficient
process of computing and ranking the resilience of transformer layers. We find
that due to the large size of transformer models, applying traditional network
redundancy to a subset of the most vulnerable layers provides high error
coverage albeit with impractically high overhead. We address this shortcoming
by providing a software-directed, checksum-based error detection technique
aimed at protecting the most vulnerable general matrix multiply (GEMM) layers
in the transformer models that use either floating-point or integer arithmetic.
Results show that our approach achieves over 99% coverage for errors that
result in a mismatch at less than 0.2% computation overhead. Lastly, we
demonstrate the applicability of our framework on various modern GPU
architectures under different numerical precisions. We introduce an efficient
self-correction mechanism for resolving erroneous detections, with an average
overhead of less than 0.002% (and a 2% overhead to resolve each erroneous
detection).
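Checksum-based GEMM protection of this kind follows the classic algorithm-based fault tolerance (ABFT) pattern: the column sums of A and the row sums of B predict the column/row sums of C, so a silent error in the multiply shows up as a checksum mismatch. Below is a minimal NumPy sketch; the tolerance value is an illustrative assumption, and ALBERTA's actual detector differs in detail.

```python
# Minimal ABFT-style checksum check for C = A @ B. The tolerance is an
# illustrative assumption; a real detector calibrates it to the precision.
import numpy as np

def checked_gemm(A: np.ndarray, B: np.ndarray, tol: float = 1e-3):
    """Compute C = A @ B and verify it against row/column checksums."""
    C = A @ B
    col_err = np.abs(A.sum(axis=0) @ B - C.sum(axis=0)).max()
    row_err = np.abs(A @ B.sum(axis=1) - C.sum(axis=1)).max()
    if max(col_err, row_err) > tol:
        # A transient fault corrupted the multiply: trigger self-correction,
        # e.g. recompute the affected layer.
        raise RuntimeError("GEMM checksum mismatch")
    return C

A = np.random.randn(64, 128).astype(np.float32)
B = np.random.randn(128, 32).astype(np.float32)
C = checked_gemm(A, B)   # passes for a fault-free multiply
```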
Flover: A Temporal Fusion Framework for Efficient Autoregressive Model Parallel Inference
Autoregressive models, despite their commendable performance in a myriad of
generative tasks, face challenges stemming from their inherently sequential
structure. Inference on these models, by design, harnesses a temporal
dependency, where the current token's probability distribution is conditioned
on preceding tokens. This inherent characteristic severely impedes
computational efficiency during inference, as a typical inference request can
require thousands of tokens, and generating each token requires loading the
entire model weights, making inference memory-bound. The large
overhead becomes profound in real deployment where requests arrive randomly,
necessitating various generation lengths. Existing solutions, such as dynamic
batching and concurrent instances, introduce significant response delays and
bandwidth contention, falling short of achieving optimal latency and
throughput. To address these shortcomings, we propose Flover -- a temporal
fusion framework for efficiently inferring multiple requests in parallel. We
deconstruct the general generation pipeline into pre-processing and token
generation, and equip the framework with a dedicated work scheduler for fusing
the generation process temporally across all requests. By orchestrating the
token-level parallelism, Flover exhibits optimal hardware efficiency and
significantly spares the system resources. By further employing a fast buffer
reordering algorithm that allows memory eviction of finished tasks, it brings
over 11x inference speedup on GPT and 16x on LLAMA compared to the cutting-edge
solutions provided by NVIDIA FasterTransformer. Crucially, by leveraging the
advanced tensor parallel technique, Flover proves efficacious across diverse
computational landscapes, from single-GPU setups to distributed scenarios,
thereby offering robust performance optimization that adapts to variable use
cases.
Comment: In Proceedings of the 30th IEEE International Conference on High Performance Computing, Data, and Analytics (HiPC).
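The temporal-fusion idea is easiest to see in pseudocode: rather than batching whole requests, the scheduler advances every in-flight request by one token per model call, admits new arrivals mid-stream, and evicts finished ones so their buffer slots become reusable. The sketch below is a toy illustration of that scheduling pattern, not Flover's implementation.

```python
# Toy sketch of temporal fusion: one fused forward pass per step advances
# all in-flight requests by one token; finished requests are evicted.
from dataclasses import dataclass, field

@dataclass
class Request:
    prompt_ids: list[int]
    max_new_tokens: int
    generated: list[int] = field(default_factory=list)

def temporal_fusion_loop(active, pending, step_fn):
    """step_fn(sequences) returns one next-token id per active request."""
    while active or pending:
        active.extend(pending)           # admit requests that arrived mid-run
        pending.clear()
        sequences = [r.prompt_ids + r.generated for r in active]
        next_tokens = step_fn(sequences)           # single fused model call
        for r, t in zip(active, next_tokens):
            r.generated.append(t)
        # Evict finished requests immediately (cf. fast buffer reordering).
        active[:] = [r for r in active if len(r.generated) < r.max_new_tokens]

# Toy usage with a dummy "model" that always emits token 0.
reqs = [Request([1, 2, 3], max_new_tokens=4), Request([7], max_new_tokens=2)]
temporal_fusion_loop(reqs, [], step_fn=lambda seqs: [0] * len(seqs))
```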
Hardware-aware training for large-scale and diverse deep learning inference workloads using in-memory computing-based accelerators
Analog in-memory computing (AIMC) -- a promising approach for
energy-efficient acceleration of deep learning workloads -- computes
matrix-vector multiplications (MVMs) but only approximately, due to
nonidealities that often are non-deterministic or nonlinear. This can adversely
impact the achievable deep neural network (DNN) inference accuracy as compared
to a conventional floating point (FP) implementation. While retraining has
previously been suggested to improve robustness, prior work has explored only a
few DNN topologies, using disparate and overly simplified AIMC hardware models.
Here, we use hardware-aware (HWA) training to systematically examine the
accuracy of AIMC for multiple common artificial intelligence (AI) workloads
across multiple DNN topologies, and investigate sensitivity and robustness to a
broad set of nonidealities. By introducing a new and highly realistic AIMC
crossbar model, we improve significantly on earlier retraining approaches. We
show that many large-scale DNNs of various topologies, including convolutional
neural networks (CNNs), recurrent neural networks (RNNs), and transformers, can
in fact be successfully retrained to show iso-accuracy on AIMC. Our results
further suggest that AIMC nonidealities that add noise to the inputs or
outputs, not the weights, have the largest impact on DNN accuracy, and that
RNNs are particularly robust to all nonidealities.
Comment: 35 pages, 7 figures, 5 tables.
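Hardware-aware training of this kind boils down to injecting hardware-like perturbations into the forward pass so the network learns to tolerate them. The sketch below adds Gaussian noise to a linear layer's outputs during training; the noise model and its 4% magnitude are illustrative assumptions, as the paper's crossbar model covers many more nonidealities (IBM's aihwkit provides realistic implementations of this idea).

```python
# Minimal sketch of HWA training: add output noise to the MVM during the
# forward pass so training learns robustness. The Gaussian model and the
# 4% magnitude are illustrative assumptions, not the paper's crossbar model.
import torch
import torch.nn as nn

class NoisyLinear(nn.Linear):
    """Linear layer emulating noisy AIMC matrix-vector multiplication."""

    def __init__(self, in_features, out_features, out_noise=0.04):
        super().__init__(in_features, out_features)
        self.out_noise = out_noise

    def forward(self, x):
        y = super().forward(x)
        if self.training and self.out_noise > 0:
            # Noise re-sampled every pass, scaled to the output magnitude.
            y = y + self.out_noise * y.detach().abs().mean() * torch.randn_like(y)
        return y

# Drop-in replacement during training; at inference this layer is noise-free,
# whereas real AIMC hardware stays noisy (hence the robustness goal).
layer = NoisyLinear(128, 64).train()
out = layer(torch.randn(8, 128))
```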
Tools for efficient Deep Learning
In the era of Deep Learning (DL), there is a fast-growing demand for building and deploying Deep Neural Networks (DNNs) on various platforms. This thesis proposes five tools to address the challenges of designing DNNs that are efficient in time, resources, and power consumption.
We first present Aegis and SPGC to address the challenges of improving the memory efficiency of DL training and inference. Aegis makes mixed precision training (MPT) more stable via layer-wise gradient scaling. Experiments show that Aegis can improve MPT accuracy by up to 4%. SPGC focuses on structured pruning: replacing standard convolution with group convolution (GConv) to avoid irregular sparsity. SPGC formulates GConv pruning as a channel permutation problem and proposes a novel heuristic polynomial-time algorithm. Common DNNs pruned by SPGC achieve up to 1% higher accuracy than prior work.
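To make the layer-wise gradient scaling idea concrete, the sketch below attaches a per-layer scale to each parameter's gradient during backward (so small FP16 gradients do not underflow to zero) and unscales before the optimizer step. The fixed scales and the hook-based mechanism are illustrative assumptions; Aegis chooses its scales differently.

```python
# Illustrative layer-wise gradient scaling for mixed precision training.
# Fixed per-layer scales stand in for whatever adaptive policy Aegis uses.
import torch
import torch.nn as nn

def attach_layer_scales(model: nn.Module, scales: dict[str, float]):
    """Multiply each layer's gradient by its scale during backward."""
    for name, param in model.named_parameters():
        s = scales.get(name.split(".")[0], 1.0)    # scale per top-level module
        param.register_hook(lambda g, s=s: g * s)  # avoids fp16 underflow

def unscale_grads(model: nn.Module, scales: dict[str, float]):
    """Undo the scaling before the optimizer update."""
    for name, param in model.named_parameters():
        if param.grad is not None:
            param.grad /= scales.get(name.split(".")[0], 1.0)

model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 10))
attach_layer_scales(model, {"0": 1024.0, "2": 256.0})
loss = model(torch.randn(4, 64)).sum()
loss.backward()
unscale_grads(model, {"0": 1024.0, "2": 256.0})  # then optimizer.step()
```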
This thesis also addresses the gap between DNN descriptions and executables, with Polygeist for software and POLSCA for hardware. Novel techniques, e.g. statement splitting and memory partitioning, are explored and used to extend polyhedral optimisation. Polygeist speeds up sequential and parallel software execution by 2.53x and 9.47x respectively on Polybench/C. POLSCA achieves a 1.5x speedup over hardware designs generated directly from high-level synthesis on Polybench/C.
Moreover, this thesis presents Deacon, a framework that generates FPGA-based DNN accelerators of streaming architectures with advanced pipelining techniques to address the challenges from heterogeneous convolution and residual connections. Deacon provides fine-grained pipelining, graph-level optimisation, and heuristic exploration by graph colouring. Compared with prior designs, Deacon shows resource/power consumption efficiency improvement of 1.2x/3.5x for MobileNets and 1.0x/2.8x for SqueezeNets.
All these tools are open source, and some have already gained public engagement. We believe they can make efficient deep learning applications easier to build and deploy.
Memory Efficient Mixed-Precision Optimizers
Traditional optimization methods rely on the use of single-precision floating
point arithmetic, which can be costly in terms of memory size and computing
power. However, mixed precision optimization techniques leverage the use of
both single and half-precision floating point arithmetic to reduce memory
requirements while maintaining model accuracy. We provide here an algorithm to
further reduce memory usage during the training of a model by getting rid of
the floating point copy of the parameters, virtually keeping only
half-precision numbers. We also explore the benefits of getting rid of the
gradient's value by executing the optimizer step during the back-propagation.
In practice, we achieve up to 25% lower peak memory use and 15% faster training
while maintaining the same level of accuracy.
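The second idea, executing the optimizer step during back-propagation, can be reproduced with PyTorch's post-accumulate-grad hooks (available since PyTorch 2.1): each parameter is updated and its gradient freed as soon as that gradient is ready, so all gradients never coexist in memory. The plain SGD update below is a minimal sketch; it does not reproduce the paper's removal of the floating-point parameter copy.

```python
# Minimal sketch: run the optimizer step inside backward via PyTorch's
# post-accumulate-grad hooks (PyTorch >= 2.1), freeing each gradient
# immediately so peak memory drops. Plain SGD stands in for the paper's
# mixed-precision optimizer.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))

def make_hook(lr: float = 1e-2):
    def hook(param: torch.Tensor):
        # Fires as soon as this parameter's gradient is accumulated.
        with torch.no_grad():
            param -= lr * param.grad
        param.grad = None          # free the gradient right away
    return hook

for p in model.parameters():
    p.register_post_accumulate_grad_hook(make_hook())

x, y = torch.randn(32, 512), torch.randint(0, 10, (32,))
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()   # updates happen during this call; no optimizer.step()
```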
LeBenchmark 2.0: a Standardized, Replicable and Enhanced Framework for Self-supervised Representations of French Speech
Self-supervised learning (SSL) is at the origin of unprecedented improvements
in many different domains including computer vision and natural language
processing. Speech processing has benefited drastically from SSL, as most of the
current domain-related tasks are now being approached with pre-trained models.
This work introduces LeBenchmark 2.0, an open-source framework for assessing and
building SSL-equipped French speech technologies. It includes documented,
large-scale and heterogeneous corpora with up to 14,000 hours of heterogeneous
speech, ten pre-trained SSL wav2vec 2.0 models containing from 26 million to
one billion learnable parameters shared with the community, and an evaluation
protocol made of six downstream tasks to complement existing benchmarks.
LeBenchmark 2.0 also presents unique perspectives on pre-trained SSL models for
speech with the investigation of frozen versus fine-tuned downstream models,
task-agnostic versus task-specific pre-trained models as well as a discussion
on the carbon footprint of large-scale model training.
Comment: Under submission at Computer Speech and Language. Preprint allowed.
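A typical way to consume such released checkpoints is feature extraction through Hugging Face transformers, sketched below. The hub ID is an assumption for illustration; the LeBenchmark collection documents the exact checkpoint names and sizes.

```python
# Sketch of extracting SSL representations from a released French
# wav2vec 2.0 model. The hub ID is an assumed example; check the
# LeBenchmark collection for actual checkpoint names.
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

MODEL_ID = "LeBenchmark/wav2vec2-FR-7K-large"  # assumed checkpoint name

extractor = Wav2Vec2FeatureExtractor.from_pretrained(MODEL_ID)
model = Wav2Vec2Model.from_pretrained(MODEL_ID).eval()

waveform = torch.randn(16000)  # 1 s of 16 kHz audio as a stand-in
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    frames = model(**inputs).last_hidden_state   # (1, n_frames, hidden_dim)
print(frames.shape)
```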