XEngine : Optimal Tensor Rematerialization for Neural Networks in Heterogeneous Environments
Memory efficiency is crucial when training deep neural networks on resource-restricted devices. During backpropagation, forward tensors are used to calculate gradients. Instead of keeping all of these dependencies in memory until they are reused in backpropagation, some forward tensors can be discarded and later recomputed from saved tensors, so-called checkpoints. This allows, in particular, resource-constrained
heterogeneous environments to make use of all available compute devices. Unfortunately, defining
these checkpoints is a non-trivial problem and poses a challenge to the programmer: improper or excessive
recomputation negates the benefit of checkpointing.
In this article, we present XEngine, an approach that schedules network operators to heterogeneous devices
in low memory environments by determining checkpoints and recomputations of tensors. Our approach
selects suitable resources per timestep and operator and optimizes the end-to-end time for neural networks
taking the memory limitation of each device into account. For this, we formulate a mixed-integer quadratic
program (MIQP) to schedule operators of deep learning networks on heterogeneous systems. We compare
our MIQP solver XEngine against Checkmate [12], a mixed-integer linear programming (MILP) approach
that solves recomputation on a single device. Our solver finds solutions that are up to 22.5% faster than the
fastest Checkmate schedule in which the network is computed exclusively on a single device. We also find
valid schedules for networks making use of both central processing units and graphics processing units if
memory limitations do not allow scheduling exclusively to the graphics processing unit.
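The memory/recompute trade-off at the heart of checkpointing can be illustrated with a back-of-the-envelope cost model (a hedged sketch, not XEngine's MIQP formulation): for a linear chain of n layers, keeping only every k-th activation as a checkpoint lowers peak activation memory at the price of extra forward recomputations. The function name and the unit-memory, unit-cost assumptions below are illustrative.

```python
# Rough cost model for chain checkpointing (Chen et al.-style heuristic),
# NOT the MIQP scheduling formulation of the paper above.

def checkpoint_costs(n, k):
    """For a chain of n layers with a checkpoint at every k-th layer,
    return (approx_peak_live_activations, extra_forward_recomputations).
    Assumes each activation occupies one memory unit."""
    checkpoints = n // k + 1       # activations retained in memory
    segment = min(k, n)            # longest segment recomputed at once
    peak = checkpoints + segment   # retained checkpoints + one live segment
    extra = n - checkpoints        # each dropped activation is redone once
    return peak, extra

# Storing everything gives peak ~ n with zero recomputation; choosing
# k ~ sqrt(n) gives peak ~ 2*sqrt(n) for roughly n extra forward ops.
print(checkpoint_costs(16, 4))
```

For n = 16 and k = 4 the sketch yields a peak of 9 activation units instead of 17, at the cost of 11 recomputed forwards, which is the kind of trade-off the schedulers above optimize exactly rather than heuristically.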
POET: Training Neural Networks on Tiny Devices with Integrated Rematerialization and Paging
Fine-tuning models on edge devices like mobile phones would enable
privacy-preserving personalization over sensitive data. However, edge training
has historically been limited to relatively small models with simple
architectures because training is both memory and energy intensive. We present
POET, an algorithm to enable training large neural networks on memory-scarce
battery-operated edge devices. POET jointly optimizes the integrated search
spaces of rematerialization and paging, two algorithms to reduce the
memory consumption of backpropagation. Given a memory budget and a run-time
constraint, we formulate a mixed-integer linear program (MILP) for
energy-optimal training. Our approach enables training significantly larger
models on embedded devices and reduces energy consumption, without altering
the mathematical correctness of backpropagation. We demonstrate that it
is possible to fine-tune both ResNet-18 and BERT within the memory constraints
of a Cortex-M class embedded device while outperforming current edge training
methods in energy efficiency. POET is an open-source project available at
https://github.com/ShishirPatil/poet.
Comment: Proceedings of the 39th International Conference on Machine Learning, 2022 (ICML 2022).
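As a rough illustration of the joint decision space (not POET's actual MILP), every tensor evicted to meet a memory budget can either be rematerialized, paying recompute energy, or paged out, paying DMA energy, and the cheaper option can be picked per tensor. The tuple layout `(size_bytes, recompute_energy, page_energy)` and the largest-first greedy policy are assumptions of this sketch.

```python
# Hedged sketch of a per-tensor remat-vs-page choice under a memory budget.
# POET solves this jointly and optimally via an MILP; this greedy stand-in
# only illustrates the shape of the decision.

def plan(tensors, resident_budget):
    """tensors: list of (size_bytes, recompute_energy, page_energy).
    Returns (eviction_choices, total_eviction_energy)."""
    resident, evictions, energy = 0, [], 0.0
    # Consider the largest tensors first; evict whatever no longer fits.
    for size, e_rec, e_page in sorted(tensors, reverse=True):
        if resident + size > resident_budget:
            choice = "remat" if e_rec <= e_page else "page"
            evictions.append(choice)
            energy += min(e_rec, e_page)   # pay the cheaper energy cost
        else:
            resident += size               # tensor stays in memory
    return evictions, energy

print(plan([(8, 5.0, 3.0), (4, 1.0, 2.0), (2, 9.0, 9.0)], resident_budget=6))
```

In the example, the 8-byte tensor does not fit in the 6-byte budget and is paged (3.0 energy units beats 5.0 for recompute), while the other two remain resident.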
An Evaluation of Memory Optimization Methods for Training Neural Networks
As models continue to grow in size, the development of memory optimization
methods (MOMs) has emerged as a solution to address the memory bottleneck
encountered when training large models. To comprehensively examine the
practical value of various MOMs, we have conducted a thorough analysis of
existing literature from a systems perspective. Our analysis has revealed a
notable challenge within the research community: the absence of standardized
metrics for effectively evaluating the efficacy of MOMs. The scarcity of
informative evaluation metrics hinders the ability of researchers and
practitioners to compare and benchmark different approaches reliably.
Consequently, drawing definitive conclusions and making informed decisions
regarding the selection and application of MOMs becomes a challenging endeavor.
To address the challenge, this paper summarizes the scenarios in which MOMs
prove advantageous for model training. We propose the use of distinct
evaluation metrics under different scenarios. By employing these metrics, we
evaluate the prevailing MOMs and find that their benefits are not universal. We
present insights derived from experiments and discuss the circumstances in
which they can be advantageous.
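Two of the scenario-specific quantities such an evaluation needs, memory saving and throughput overhead, are simple ratios between a baseline run and a run with a memory optimization method enabled. The helper names below are illustrative, not metrics defined by the paper.

```python
# Illustrative evaluation helpers: relative memory saving and relative
# slowdown of a memory optimization method (MOM) versus a baseline run.

def memory_saving(mem_baseline_gb, mem_mom_gb):
    """Fraction of peak memory saved by the MOM (0.5 == half the memory)."""
    return 1 - mem_mom_gb / mem_baseline_gb

def time_overhead(time_baseline_s, time_mom_s):
    """Relative slowdown introduced by the MOM (0.2 == 20% slower)."""
    return time_mom_s / time_baseline_s - 1

# A MOM that halves memory for a 20% slowdown:
print(memory_saving(16, 8), time_overhead(100.0, 120.0))
```

Whether that trade is worthwhile depends on the scenario, which is exactly the paper's point: the same numbers can be a clear win when they unlock a larger batch size and a clear loss when memory was not the bottleneck.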
Rockmate: an Efficient, Fast, Automatic and Generic Tool for Re-materialization in PyTorch
We propose Rockmate to control the memory requirements when training PyTorch
DNN models. Rockmate is an automatic tool that starts from the model code and
generates an equivalent model, using a predefined amount of memory for
activations, at the cost of a few re-computations. Rockmate automatically
detects the structure of computational and data dependencies and rewrites the
initial model as a sequence of complex blocks. We show that such a structure is
widespread and can be found in many models in the literature (Transformer-based
models, ResNet, RegNets, ...). This structure allows us to solve the problem in
a fast and efficient way, using an adaptation of Checkmate (too slow on the
whole model but general) at the level of individual blocks and an adaptation of
Rotor (fast but limited to sequential models) at the level of the sequence
itself. We show through experiments on many models that Rockmate is as fast as
Rotor and as efficient as Checkmate, and that in many cases it achieves
significantly lower memory consumption for activations (by a factor of 2 to
5) at a rather negligible overhead (on the order of 10% to 20%). Rockmate is
open source and available at https://github.com/topal-team/rockmate
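The sequence-level half of this two-level idea can be sketched as a small Rotor-style dynamic program over a chain of blocks: for each block output, either keep it (consuming one slot of the memory budget) or drop it and pay its recomputation cost in the backward pass. The unit-memory assumption and the cost model are illustrative, not Rockmate's.

```python
# Minimal Rotor-flavoured DP over a sequential chain of blocks; inside each
# block Rockmate would additionally run a Checkmate-style ILP, which this
# sketch omits.

from functools import lru_cache

def best_overhead(block_costs, budget):
    """Minimum extra forward cost to backprop a chain whose i-th block
    output occupies one memory unit, keeping at most `budget` outputs."""
    @lru_cache(None)
    def dp(i, kept):
        if i == len(block_costs):
            return 0.0
        # Option 1: keep block i's output if the budget allows it.
        keep = dp(i + 1, kept + 1) if kept < budget else float("inf")
        # Option 2: drop it and recompute block i during backward.
        drop = block_costs[i] + dp(i + 1, kept)
        return min(keep, drop)
    return dp(0, 0)

# Three blocks with recompute costs 3, 1, 2 and room to keep two outputs:
# the DP drops only the cheapest block, for an overhead of 1.0.
print(best_overhead([3.0, 1.0, 2.0], budget=2))
```

With a budget of three, everything fits and the overhead falls to zero; the real tools solve the same kind of trade-off with far richer memory and cost models.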
Efficient Combination of Rematerialization and Offloading for Training DNNs
Rematerialization and offloading are two well-known strategies to save memory during the training phase of deep neural networks, allowing data scientists to consider larger models, batch sizes, or higher-resolution data. Rematerialization trades memory for computation time, whereas offloading trades memory for data movement. As these two resources are independent, it is appealing to combine both strategies to save even more memory. We precisely model the costs and constraints corresponding to deep learning frameworks such as PyTorch or TensorFlow, propose optimal algorithms to find a valid sequence of memory-constrained operations, and evaluate the performance of the proposed algorithms on realistic networks and computation platforms. Our experiments show that the possibility to offload can remove one third of the overhead of rematerialization, and that together they can reduce the memory used for activations by a factor of 4 to 6, with an overhead below 20%.
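The intuition that offloading removes part of the rematerialization overhead can be sketched with a simple overlap model: a transfer to host memory costs time, but whatever portion of it runs concurrently with forward compute is hidden, so only the exposed remainder adds to the rematerialization overhead. The linear-bandwidth and full-overlap assumptions below are illustrative, not the paper's cost model.

```python
# Hedged overlap model for combining rematerialization with offloading.
# Only the part of the transfer that cannot hide behind compute is paid.

def overhead_with_offload(remat_cost, offload_bytes, bandwidth, compute_time):
    """remat_cost: residual recomputation time; offload_bytes / bandwidth:
    transfer time; compute_time: forward compute available for overlap."""
    transfer = offload_bytes / bandwidth
    hidden = min(transfer, compute_time)    # overlapped with compute, free
    return remat_cost + (transfer - hidden) # only the exposed part remains

# 10 bytes over bandwidth 5 fully hides behind 4s of compute: overhead
# stays at the 3s of residual rematerialization.
print(overhead_with_offload(3.0, 10.0, 5.0, 4.0))
```

Once transfers exceed what compute can absorb (e.g. 40 bytes at the same bandwidth), the exposed remainder reappears as overhead, which is why the optimal mix of the two strategies is a genuine optimization problem rather than a free lunch.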