A Practical Layer-Parallel Training Algorithm for Residual Networks
Gradient-based algorithms for training ResNets typically require a forward
pass of the input data, followed by back-propagation of the objective gradient to
update the parameters, both of which are time-consuming for deep ResNets. To break the
dependencies between modules in both the forward and backward modes,
auxiliary-variable methods such as the penalty and augmented Lagrangian (AL)
approaches have attracted much interest lately due to their ability to exploit
layer-wise parallelism. However, we observe that large communication overhead
and the lack of data augmentation are two key challenges of these methods, which
may lead to a low speedup ratio and an accuracy drop across multiple compute
devices. Inspired by the optimal control formulation of ResNets, we propose a
novel serial-parallel hybrid training strategy to enable the use of data
augmentation, together with downsampling filters to reduce the communication
cost. The proposed strategy first trains the network parameters by solving a
succession of independent sub-problems in parallel and then corrects the
network parameters through a full serial forward-backward propagation of data.
Such a strategy can be applied to most of the existing layer-parallel training
methods using auxiliary variables. As an example, we validate the proposed
strategy using penalty and AL methods on ResNet and WideResNet across MNIST,
CIFAR-10 and CIFAR-100 datasets, achieving significant speedup over the
traditional layer-serial training methods while maintaining comparable
accuracy.
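A minimal sketch of the hybrid schedule described above, in a toy setting where every stage is a single linear residual block and the decoupling mechanism is a quadratic penalty on auxiliary activations; the stage count, penalty weight rho, step size, and inner-iteration counts are illustrative assumptions, not the authors' implementation.

```python
# Toy serial-parallel hybrid schedule: alternate (i) a parallel phase that minimizes
# a penalized, stage-decoupled objective and (ii) a serial correction pass that runs
# ordinary forward/backward propagation through the whole network.
import numpy as np

rng = np.random.default_rng(0)
d, n_stages, batch = 8, 4, 32
rho, lr = 1.0, 0.05

W = [0.01 * rng.standard_normal((d, d)) for _ in range(n_stages)]  # stage weights
x0 = rng.standard_normal((d, batch))                               # input batch
target = rng.standard_normal((d, batch))                           # regression target

def forward(W, x0):
    xs = [x0]
    for Wk in W:
        xs.append(xs[-1] + Wk @ xs[-1])        # residual block: x_{k+1} = x_k + W_k x_k
    return xs

a = forward(W, x0)    # auxiliary activations a_0 .. a_L used to decouple the stages

for outer in range(20):
    # Parallel phase: minimize  loss(a_L) + rho/2 * sum_k ||a_{k+1} - (a_k + W_k a_k)||^2
    # jointly over the stage weights and the auxiliary activations. Each update below
    # touches only stage k and its neighbouring auxiliaries, so the sweep over k is
    # embarrassingly parallel (one stage per device in a real system).
    for _ in range(5):
        new_W = [Wk.copy() for Wk in W]
        new_a = [ak.copy() for ak in a]
        for k in range(n_stages):
            res_k = (a[k] + W[k] @ a[k]) - a[k + 1]          # coupling residual of stage k
            new_W[k] -= lr * (rho * res_k @ a[k].T)
            grad_a = -rho * res_k                             # from the k-th penalty term
            if k + 1 < n_stages:                              # from the next penalty term
                res_next = (a[k + 1] + W[k + 1] @ a[k + 1]) - a[k + 2]
                grad_a += rho * (res_next + W[k + 1].T @ res_next)
            else:                                             # last auxiliary carries the loss
                grad_a += (a[-1] - target) / batch
            new_a[k + 1] -= lr * grad_a
        W, a = new_W, new_a

    # Serial correction phase: one full forward/backward pass over the data (this is
    # where freshly augmented samples would be injected in practice), after which the
    # auxiliary activations are refreshed from the corrected network.
    xs = forward(W, x0)
    g = (xs[-1] - target) / batch                             # gradient of the MSE loss
    for k in reversed(range(n_stages)):
        grad_W = g @ xs[k].T
        g = g + W[k].T @ g                                    # back-prop through x + W_k x
        W[k] -= lr * grad_W
    a = forward(W, x0)
```

The parallel sweep only needs one stage and its neighbouring auxiliaries per update, which is what exposes layer-wise parallelism, while the serial pass both corrects the parameters and provides the natural place to apply data augmentation, as described in the abstract.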
DessiLBI: Exploring Structural Sparsity of Deep Networks via Differential Inclusion Paths
Over-parameterization is ubiquitous nowadays in training neural networks to
benefit both optimization in seeking global optima and generalization in
reducing prediction error. However, compressive networks are desired in many
real-world applications, and direct training of small networks may be trapped in
local optima. In this paper, instead of pruning or distilling
over-parameterized models to compressive ones, we propose a new approach based
on differential inclusions of inverse scale spaces. Specifically, it generates
a family of models, from simple to complex, by coupling a pair of parameters to
simultaneously train an over-parameterized deep model and the structural
sparsity on the weights of its fully connected and convolutional layers.
Such a differential inclusion scheme has a simple discretization, proposed as
Deep structurally splitting Linearized Bregman Iteration (DessiLBI), for which a
global convergence analysis in deep learning is established: from any
initialization, the algorithmic iterates converge to a critical point of the
empirical risk. Experimental evidence shows that DessiLBI achieves performance
comparable to, and sometimes better than, competitive optimizers in exploring the
structural sparsity of several widely used backbones on benchmark datasets.
Remarkably, with early stopping, DessiLBI unveils "winning tickets" in early
epochs: an effective sparse structure with test accuracy comparable to fully
trained over-parameterized models.
Comment: conference, 23 pages. https://github.com/corwinliu9669/dS2LB
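A minimal sketch of a coupled (W, Gamma) linearized Bregman iteration in the spirit of the scheme described above, on a toy sparse regression problem rather than a deep network; the step size alpha, damping kappa, and coupling strength nu are assumed values chosen for illustration, and this is not the authors' DessiLBI code (see the linked repository for that).

```python
# Toy inverse-scale-space path: a dense variable W fits the data while a coupled
# sparse variable Gamma is driven by a linearized Bregman iteration, so that
# coordinates enter Gamma's support gradually as training proceeds.
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 20
X = rng.standard_normal((n, p))
w_true = np.zeros(p)
w_true[:3] = [3.0, -2.0, 1.5]                     # sparse ground truth
y = X @ w_true + 0.01 * rng.standard_normal(n)

alpha, kappa, nu = 0.001, 10.0, 0.1
W = np.zeros(p)          # dense (over-parameterized) weights
Gamma = np.zeros(p)      # sparse structural variable on the path
z = np.zeros(p)          # sub-gradient variable accumulated by the Bregman iteration

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

for it in range(5000):
    # gradient of the data loss plus the coupling term ||W - Gamma||^2 / (2 nu)
    grad_W = X.T @ (X @ W - y) / n + (W - Gamma) / nu
    W -= alpha * grad_W
    # Gamma sees the objective only through the coupling; z integrates that gradient
    z -= alpha * (Gamma - W) / nu
    Gamma = kappa * soft_threshold(z, 1.0)        # support grows as |z| crosses the threshold

print("recovered support of Gamma:", np.nonzero(Gamma)[0])
```

Because the support of Gamma grows gradually along the path while W stays dense, stopping early yields a sparse structure, which mirrors the "winning tickets" observation in the abstract.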
Learning DNN networks using un-rectifying ReLU with compressed sensing application
The un-rectifying technique expresses a non-linear point-wise activation
function as a data-dependent variable, which means that the activation variable
along with its input and output can all be employed in optimization. The ReLU
network in this study was un-rectified, meaning that its activation functions
could be replaced with data-dependent activation variables expressed as
equations and constraints. The discrete nature of the activation variables
associated with un-rectifying ReLUs allows the reformulation of deep learning
problems as problems of combinatorial optimization. However, we demonstrate
that the optimal solution to a combinatorial optimization problem can be
preserved by relaxing the discrete domains of activation variables to closed
intervals. This makes it easier to learn a network using methods developed for
real-domain constrained optimization. We also demonstrate that by introducing
data-dependent slack variables as constraints, it is possible to optimize a
network based on the augmented Lagrangian approach. This means that our method
can theoretically achieve global convergence, with all limit points being
critical points of the learning problem. In experiments, our novel approach to
solving the compressed sensing recovery problem achieved state-of-the-art
performance when applied to the MNIST database and to natural images.
Comment: 35 pages, 6 figures
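A schematic sketch of the augmented-Lagrangian alternation suggested above, not the paper's exact formulation: in a toy one-hidden-layer network, the ReLU is un-rectified into a data-dependent activation variable d, its discrete domain {0, 1} is relaxed to the interval [0, 1], and the equation h = d * (W1 @ x) becomes an explicit constraint handled with multipliers. The extra constraints tying d to the ReLU's sign pattern are omitted for brevity, and all sizes, step sizes, and the penalty parameter rho are assumptions chosen for illustration.

```python
# Augmented-Lagrangian sketch: primal gradient steps on (W1, W2, h, d) with the
# relaxed activation variable projected onto [0, 1], followed by dual ascent on the
# multipliers for the constraint h - d * (W1 @ x) = 0.
import numpy as np

rng = np.random.default_rng(0)
d_in, p, n = 4, 6, 20
x = rng.standard_normal((d_in, n))
y = rng.standard_normal((1, n))

W1 = 0.1 * rng.standard_normal((p, d_in))
W2 = 0.1 * rng.standard_normal((1, p))
z = W1 @ x
d = (z > 0).astype(float)        # initialize d from the usual ReLU on/off pattern
h = d * z                        # auxiliary hidden output, treated as a free variable
lam = np.zeros_like(h)           # multipliers for the constraint h - d * z = 0
rho, lr = 1.0, 0.01

for it in range(500):
    z = W1 @ x
    c = h - d * z                 # constraint residual
    r = (W2 @ h - y) / n          # data-fit residual of the loss 0.5 * ||W2 h - y||^2 / n
    m = lam + rho * c             # shifted multiplier appearing in every gradient
    gW2 = r @ h.T
    gh = W2.T @ r + m
    gd = -m * z
    gW1 = (-m * d) @ x.T
    # primal steps, with the relaxed activation variable kept inside [0, 1]
    W2 -= lr * gW2
    h -= lr * gh
    d = np.clip(d - lr * gd, 0.0, 1.0)
    W1 -= lr * gW1
    # dual ascent on the multipliers
    lam += rho * (h - d * (W1 @ x))
```

Each outer iteration takes plain gradient steps on the primal variables, projects the relaxed activation variable back onto its closed interval, and then updates the multipliers, which is the general pattern the abstract describes for optimizing a network with data-dependent activation variables under the augmented Lagrangian approach.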