FetusMapV2: Enhanced Fetal Pose Estimation in 3D Ultrasound
Fetal pose estimation in 3D ultrasound (US) involves identifying a set of
associated fetal anatomical landmarks. Its primary objective is to provide
comprehensive information about the fetus through landmark connections, thus
benefiting various critical applications, such as biometric measurements, plane
localization, and fetal movement monitoring. However, accurately estimating the
3D fetal pose in US volume has several challenges, including poor image
quality, limited GPU memory for tackling high dimensional data, symmetrical or
ambiguous anatomical structures, and considerable variations in fetal poses. In
this study, we propose a novel 3D fetal pose estimation framework (called
FetusMapV2) to overcome the above challenges. Our contribution is three-fold.
First, we propose a heuristic scheme that explores the complementary network
structure-unconstrained and activation-unreserved GPU memory management
approaches, which can enlarge the input image resolution for better results
under limited GPU memory. Second, we design a novel Pair Loss to mitigate
confusion caused by symmetrical and similar anatomical structures. It separates
the hidden classification task from the landmark localization task and thus
progressively eases model learning. Last, we propose shape-prior-based
self-supervised learning that selects the relatively stable landmarks to refine
the pose online. Extensive experiments and diverse applications on a
large-scale fetal US dataset including 1000 volumes with 22 landmarks per
volume demonstrate that our method outperforms other strong competitors.
Comment: 16 pages, 11 figures, accepted by Medical Image Analysis (2023)
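The Pair Loss above is described only at a high level; as a purely illustrative, hypothetical PyTorch sketch of the general idea (penalizing left/right confusion between symmetric landmarks separately from the localization target), one could write something like the following. The heatmap/target tensor layout, the `pairs` list, and the margin are assumptions for illustration, not the paper's actual formulation.

```python
import torch
import torch.nn.functional as F

def pair_confusion_loss(heatmaps, targets, pairs, margin=0.1):
    """Hypothetical sketch: for each symmetric landmark pair (i, j), push
    channel i to respond more strongly at its own true location than at the
    mirrored landmark's location (and vice versa), so the "which side" decision
    is handled separately from localization accuracy.

    heatmaps: (B, K, D, H, W) predicted 3D heatmaps
    targets:  (B, K, 3) integer voxel coordinates of the K landmarks
    pairs:    list of (i, j) index tuples of symmetric landmarks
    """
    b_idx = torch.arange(heatmaps.size(0), device=heatmaps.device)
    loss = heatmaps.new_zeros(())
    for i, j in pairs:
        zi, yi, xi = targets[:, i].unbind(dim=1)
        zj, yj, xj = targets[:, j].unbind(dim=1)
        own_i, cross_i = heatmaps[b_idx, i, zi, yi, xi], heatmaps[b_idx, i, zj, yj, xj]
        own_j, cross_j = heatmaps[b_idx, j, zj, yj, xj], heatmaps[b_idx, j, zi, yi, xi]
        # margin ranking: each channel should prefer its own landmark location
        loss = loss + F.relu(margin - (own_i - cross_i)).mean()
        loss = loss + F.relu(margin - (own_j - cross_j)).mean()
    return loss / max(len(pairs), 1)
```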
XEngine : Optimal Tensor Rematerialization for Neural Networks in Heterogeneous Environments
Memory efficiency is crucial in training deep learning networks on resource-restricted devices. During backpropagation, forward tensors are used to calculate gradients. Despite the option of keeping those dependencies in memory until they are reused in backpropagation, some forward tensors can be discarded and recomputed later from saved tensors, so-called checkpoints. In particular, this allows resource-constrained
heterogeneous environments to make use of all available compute devices. Unfortunately, the definition of
these checkpoints is a non-trivial problem and poses a challenge to the programmer—improper or excessive
recomputations negate the benefit of checkpointing.
In this article, we present XEngine, an approach that schedules network operators to heterogeneous devices
in low memory environments by determining checkpoints and recomputations of tensors. Our approach
selects suitable resources per timestep and operator and optimizes the end-to-end time for neural networks
taking the memory limitation of each device into account. For this, we formulate a mixed-integer quadratic
program (MIQP) to schedule operators of deep learning networks on heterogeneous systems. We compare
our MIQP solver XEngine against Checkmate [12], a mixed-integer linear programming (MILP) approach
that solves recomputation on a single device. Our solver finds solutions that are up to 22.5% faster than the
fastest Checkmate schedule in which the network is computed exclusively on a single device. We also find
valid schedules for networks making use of both central processing units and graphics processing units if
memory limitations do not allow scheduling exclusively to the graphics processing unit.
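XEngine's actual formulation is a mixed-integer quadratic program over operators, timesteps, devices, and recomputation decisions; as a deliberately tiny, purely illustrative stand-in for that style of model, the sketch below uses the open-source pulp MILP library to place a handful of operators on either a CPU or a GPU under a GPU memory budget. All numbers and the simplified linear objective are assumptions for illustration, not XEngine's model.

```python
import pulp

# Hypothetical toy instance: 5 operators, each placed on the CPU or the GPU.
gpu_time = [1.0, 2.0, 4.0, 3.0, 1.5]   # seconds if placed on the GPU
cpu_time = [4.0, 7.0, 15.0, 9.0, 5.0]  # seconds if placed on the CPU
gpu_mem  = [2.0, 3.0, 6.0, 4.0, 2.0]   # GB of activations kept on the GPU
GPU_MEM_LIMIT = 10.0                    # GB available on the GPU

prob = pulp.LpProblem("toy_operator_placement", pulp.LpMinimize)
on_gpu = [pulp.LpVariable(f"on_gpu_{i}", cat="Binary") for i in range(5)]

# Minimize total runtime if operators execute one after another on their device.
prob += pulp.lpSum(gpu_time[i] * on_gpu[i] + cpu_time[i] * (1 - on_gpu[i])
                   for i in range(5))
# Activations of GPU-resident operators must fit into GPU memory.
prob += pulp.lpSum(gpu_mem[i] * on_gpu[i] for i in range(5)) <= GPU_MEM_LIMIT

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print([int(v.value()) for v in on_gpu], pulp.value(prob.objective))
```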
Parallel Architectures for Planetary Exploration Requirements (PAPER)
The Parallel Architectures for Planetary Exploration Requirements (PAPER) project is essentially research oriented towards technology insertion issues for NASA's unmanned planetary probes. It was initiated to complement and augment the long-term efforts for space exploration with particular reference to NASA/LaRC's (NASA Langley Research Center) research needs for planetary exploration missions of the mid and late 1990s. The requirements for space missions as given in the somewhat dated Advanced Information Processing Systems (AIPS) requirements document are contrasted with the new requirements from JPL/Caltech involving sensor data capture and scene analysis. It is shown that more stringent requirements have arisen as a result of technological advancements. Two possible architectures, the AIPS Proof of Concept (POC) configuration and the MAX Fault-tolerant dataflow multiprocessor, were evaluated. The main observation was that the AIPS design is biased towards fault tolerance and may not be an ideal architecture for planetary and deep space probes due to high cost and complexity. The MAX concept appears to be a promising candidate, except that more detailed information is required. The feasibility of adding neural computation capability to this architecture needs to be studied. Key impact issues for architectural design of computing systems meant for planetary missions were also identified.
Classifying Sequences of Extreme Length with Constant Memory Applied to Malware Detection
Recent works within machine learning have been tackling inputs of
ever-increasing size, with cybersecurity presenting sequence classification
problems of particularly extreme lengths. In the case of Windows executable
malware detection, inputs may exceed 100 MB, which corresponds to a time
series with T = 100,000,000 steps. To date, the closest approach to handling
such a task is MalConv, a convolutional neural network capable of processing up
to T = 2,000,000 steps. The O(T) memory of CNNs has prevented
further application of CNNs to malware. In this work, we develop a new approach
to temporal max pooling that makes the required memory invariant to the
sequence length T. This makes MalConv 116× more memory efficient, and
up to 25.8× faster to train on its original dataset, while removing the
input length restrictions to MalConv. We re-invest these gains into improving
the MalConv architecture by developing a new Global Channel Gating design,
giving us an attention mechanism capable of learning feature interactions
across 100 million time steps in an efficient manner, a capability lacked by
the original MalConv CNN. Our implementation can be found at
https://github.com/NeuromorphicComputationResearchProgram/MalConv2
Comment: To appear in AAAI 2021
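To give a concrete feel for why a time-wise max pooling can be made memory-invariant to the sequence length, here is a minimal, forward-only PyTorch sketch that scans a very long byte sequence in fixed-size chunks and keeps only a running per-channel maximum, so peak memory depends on the chunk size rather than on T. The embedding and convolution sizes are placeholders, and the gradient-side bookkeeping that MalConv2 performs is omitted here.

```python
import torch

embed = torch.nn.Embedding(257, 8)               # 256 byte values + 1 padding slot
conv = torch.nn.Conv1d(8, 128, kernel_size=512, stride=512)

@torch.no_grad()
def global_max_over_time(byte_seq, chunk_len=1 << 20):
    """Scan a (T,) byte tensor chunk by chunk; memory is O(chunk_len), not O(T).
    chunk_len is a multiple of the stride, so windows never straddle a boundary."""
    running_max = torch.full((128,), float("-inf"))
    for start in range(0, byte_seq.numel(), chunk_len):
        chunk = byte_seq[start:start + chunk_len]
        if chunk.numel() < conv.kernel_size[0]:   # skip a too-short trailing chunk
            break
        x = embed(chunk).t().unsqueeze(0)          # (1, 8, chunk_len)
        feats = conv(x)                            # (1, 128, chunk_len // 512)
        running_max = torch.maximum(running_max, feats.amax(dim=2).squeeze(0))
    return running_max

bytes_ = torch.randint(0, 256, (10 * (1 << 20),))  # stand-in for a ~10 MB executable
print(global_max_over_time(bytes_).shape)           # torch.Size([128])
```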
Rockmate: an Efficient, Fast, Automatic and Generic Tool for Re-materialization in PyTorch
We propose Rockmate to control the memory requirements when training PyTorch
DNN models. Rockmate is an automatic tool that starts from the model code and
generates an equivalent model, using a predefined amount of memory for
activations, at the cost of a few re-computations. Rockmate automatically
detects the structure of computational and data dependencies and rewrites the
initial model as a sequence of complex blocks. We show that such a structure is
widespread and can be found in many models in the literature (Transformer based
models, ResNet, RegNets,...). This structure allows us to solve the problem in
a fast and efficient way, using an adaptation of Checkmate (too slow on the
whole model but general) at the level of individual blocks and an adaptation of
Rotor (fast but limited to sequential models) at the level of the sequence
itself. We show through experiments on many models that Rockmate is as fast as
Rotor and as efficient as Checkmate, and that in many cases it achieves
significantly lower memory consumption for activations (by a factor of 2 to 5)
at a rather negligible overhead (on the order of 10% to 20%). Rockmate is
open source and available at https://github.com/topal-team/rockmate
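Rockmate chooses the re-computation plan automatically; for readers who want to see what fixed, block-level re-materialization looks like in plain PyTorch, the built-in torch.utils.checkpoint.checkpoint_sequential applies the same discard-and-recompute idea to a sequential model. The layer sizes and the number of segments below are arbitrary, and this is not Rockmate's optimized schedule.

```python
import torch
from torch.utils.checkpoint import checkpoint_sequential

# A purely sequential model split into 4 checkpointed segments: only the
# segment boundaries stay in memory; activations inside each segment are
# recomputed during the backward pass.
model = torch.nn.Sequential(*[
    torch.nn.Sequential(torch.nn.Linear(2048, 2048), torch.nn.ReLU())
    for _ in range(16)
])

x = torch.randn(8, 2048, requires_grad=True)
y = checkpoint_sequential(model, 4, x, use_reentrant=False)
y.sum().backward()
```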
Optimal GPU-CPU Offloading Strategies for Deep Neural Network Training
Training Deep Neural Networks is known to be an expensive operation, both in terms of computational cost and memory load. Indeed, during training, all intermediate layer outputs (called activations) computed during the forward phase must be stored until the corresponding gradient has been computed in the backward phase. These memory requirements sometimes prevent the use of larger batch sizes and deeper networks, and can therefore limit both convergence speed and accuracy. Recent works have proposed to offload some of the computed forward activations from the memory of the GPU to the memory of the CPU. This requires determining which activations should be offloaded and when these transfers from and to the memory of the GPU should take place. We prove that this problem is NP-hard in the strong sense, and we propose two heuristics based on relaxations of the problem. We perform extensive experimental evaluation on standard Deep Neural Networks. We compare the performance of our heuristics against previous approaches from the literature, showing that they achieve much better performance in a wide variety of situations.
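While the paper above studies when such CPU offloads should happen, the basic mechanism is available in recent PyTorch releases through torch.autograd.graph.save_on_cpu, which moves every activation saved for backward to CPU memory and copies it back on demand. The snippet below is a minimal sketch of that mechanism only (model and sizes are placeholders), not of the offloading heuristics proposed in the paper.

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096), torch.nn.ReLU(),
    torch.nn.Linear(4096, 4096), torch.nn.ReLU(),
).cuda()

x = torch.randn(64, 4096, device="cuda")

# Every tensor saved for backward is offloaded to pinned CPU memory during the
# forward pass and transferred back to the GPU when its gradient is computed.
with torch.autograd.graph.save_on_cpu(pin_memory=True):
    loss = model(x).sum()
loss.backward()
```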
An Approach Based on Linear Programming for Model Parallelism
The training phase in Deep Neural Networks has become an important source of computing resource usage and, because of the resulting volume of computation, it is crucial to perform it efficiently on parallel architectures. Even today, data parallelism is the most widely used method, but the associated requirement to replicate all the weights on the totality of computation resources poses problems of memory at the level of each node and of collective communications at the level of the platform. In this context, model parallelism, which consists in distributing the different layers of the network over the computing nodes, is an attractive alternative. Indeed, it is expected to better distribute weights (to cope with memory problems) and it does not imply large collective communications since only forward activations are communicated. However, to be efficient, it must be combined with a pipelined/streaming approach, which leads in turn to new memory costs. The goal of this paper is to model these memory costs in detail and to show that it is possible to formalize this optimization problem as an Integer Linear Program (ILP).
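For context, the simplest (unpipelined) form of the model parallelism discussed above just places consecutive layers on different devices and transfers only the forward activation between them; the paper's ILP is about choosing such a placement and its pipelining optimally, which the toy sketch below does not attempt. Device names and layer sizes are placeholders.

```python
import torch

class TwoStageModel(torch.nn.Module):
    """Toy model parallelism: the first stage lives on cuda:0, the second on
    cuda:1; only the forward activation crosses the device boundary."""
    def __init__(self):
        super().__init__()
        self.stage0 = torch.nn.Sequential(
            torch.nn.Linear(1024, 1024), torch.nn.ReLU()).to("cuda:0")
        self.stage1 = torch.nn.Sequential(
            torch.nn.Linear(1024, 10)).to("cuda:1")

    def forward(self, x):
        h = self.stage0(x.to("cuda:0"))
        return self.stage1(h.to("cuda:1"))   # activation transfer between GPUs

model = TwoStageModel()
out = model(torch.randn(32, 1024))
print(out.shape)  # torch.Size([32, 10])
```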