Novel neural architectures & algorithms for efficient inference
In the last decade, the machine learning community embraced deep neural networks (DNNs) wholeheartedly with the advent of neural architectures such as recurrent neural networks (RNNs), convolutional neural networks (CNNs), and transformers. These models power many applications, such as ChatGPT and Imagen, and have achieved state-of-the-art (SOTA) performance on many vision, speech, and language modeling tasks. However, SOTA performance comes at a cost: large model size, compute-intensive training, increased inference latency, and higher working memory. This thesis aims to improve the resource efficiency of neural architectures, i.e., to significantly reduce the computational, storage, and energy consumption of a DNN without any significant loss in performance.
Towards this goal, we explore novel neural architectures as well as training algorithms that allow low-capacity models to achieve near SOTA performance. We divide this thesis into two dimensions: \textit{Efficient Low Complexity Models}, and \textit{Input Hardness Adaptive Models}.
Along the first dimension, i.e., \textit{Efficient Low Complexity Models}, we improve DNN performance by addressing instabilities in the existing architectures and training methods. We propose novel neural architectures inspired by ordinary differential equations (ODEs) to reinforce input signals and attend to salient feature regions. In addition, we show that carefully designed training schemes improve the performance of existing neural networks. We divide this exploration into two parts:
\textsc{(a) Efficient Low Complexity RNNs.} We improve RNN resource efficiency by addressing poor gradients, noise amplification, and the shortcomings of Back-Propagation Through Time (BPTT) training. First, we improve RNNs by solving ODEs that eliminate vanishing and exploding gradients during training. To do so, we present Incremental Recurrent Neural Networks (iRNNs), which keep track of increments in the equilibrium surface. Next, we propose Time Adaptive RNNs, which mitigate the noise propagation issue in RNNs by modulating the time constants in the ODE-based transition function. We empirically demonstrate the superiority of ODE-based neural architectures over existing RNNs. Finally, we propose the Forward Propagation Through Time (FPTT) algorithm for training RNNs and show that FPTT yields significant gains compared to the more conventional BPTT scheme.
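As a rough illustration of the ODE view of recurrence described above, the sketch below discretizes a standard continuous-time RNN with an explicit Euler step; the cell, dimensions, and step size are hypothetical stand-ins, not the iRNN or Time Adaptive RNN formulations themselves.

```python
import numpy as np

def ode_rnn_step(h, x, W_h, W_x, b, dt=0.1):
    """One explicit-Euler step of a continuous-time recurrent cell
    integrating dh/dt = -h + tanh(W_h h + W_x x + b). The -h leakage
    pulls the state toward an input-dependent equilibrium."""
    dh = -h + np.tanh(W_h @ h + W_x @ x + b)
    return h + dt * dh

rng = np.random.default_rng(0)
d, m = 4, 3
W_h = rng.normal(scale=0.5, size=(d, d))
W_x = rng.normal(scale=0.5, size=(d, m))
b = np.zeros(d)

h = np.zeros(d)
for _ in range(50):
    h = ode_rnn_step(h, np.ones(m), W_h, W_x, b)
```

Because each step is a convex-like blend of the old state and a bounded nonlinearity, the state stays bounded, hinting at the stability benefits ODE-based designs aim for.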
\textsc{(b) Efficient Low Complexity CNNs.} Next, we improve CNN architectures by reducing their resource usage. They require greater depth to generate high-level features, resulting in computationally expensive models. We design a novel residual block, the Global layer, that constrains the input and output features by approximately solving partial differential equations (PDEs). It yields better receptive fields than traditional convolutional blocks and thus results in shallower networks. Further, we reduce the model footprint by enforcing a novel inductive bias that formulates the output of a residual block as a spatial interpolation between high-compute anchor pixels and low-compute cheaper pixels. This results in spatially interpolated convolutional blocks (SI-CNNs) that have better compute and performance trade-offs. Finally, we propose an algorithm that enforces various distributional constraints during training in order to achieve better generalization. We refer to this scheme as distributionally constrained learning (DCL).
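The anchor-pixel idea behind the spatially interpolated blocks can be caricatured as follows: run an expensive per-pixel operation only on a coarse anchor grid and bilinearly interpolate everywhere else. This is a minimal sketch under assumed names (`expensive_feature` is a placeholder), not the SI-CNN block itself.

```python
import numpy as np

def expensive_feature(patch):
    # Placeholder for a high-compute per-pixel operation.
    return np.tanh(patch).mean()

def interpolated_features(img, stride=2):
    """Compute features only at anchor pixels on a coarse grid, then
    bilinearly interpolate to every other pixel."""
    H, W = img.shape
    ys = np.arange(0, H, stride)
    xs = np.arange(0, W, stride)
    anchors = np.array([[expensive_feature(img[max(y - 1, 0):y + 2,
                                               max(x - 1, 0):x + 2])
                         for x in xs] for y in ys])
    out = np.empty((H, W))
    for y in range(H):
        for x in range(W):
            fy = min(y / stride, len(ys) - 1.0)
            fx = min(x / stride, len(xs) - 1.0)
            y0, x0 = int(fy), int(fx)
            y1, x1 = min(y0 + 1, len(ys) - 1), min(x0 + 1, len(xs) - 1)
            wy, wx = fy - y0, fx - x0
            out[y, x] = ((1 - wy) * (1 - wx) * anchors[y0, x0]
                         + (1 - wy) * wx * anchors[y0, x1]
                         + wy * (1 - wx) * anchors[y1, x0]
                         + wy * wx * anchors[y1, x1])
    return out

rng = np.random.default_rng(0)
img = rng.normal(size=(8, 8))
out = interpolated_features(img)
```

With stride 2, only a quarter of the pixels pay the expensive compute, which conveys the compute/performance trade-off the block exploits.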
In the second dimension, i.e., \textit{Input Hardness Adaptive Models}, we introduce the notion of the hardness of any input relative to any architecture. In the first dimension, a neural network allocates the same resources, such as compute, storage, and working memory, for all the inputs. It inherently assumes that all examples are equally hard for a model. In this dimension, we challenge this assumption using input hardness as our reasoning that some inputs are relatively easy for a network to predict compared to others. Input hardness enables us to create selective classifiers wherein a low-capacity network handles simple inputs while abstaining from a prediction on the complex inputs. Next, we create hybrid models that route the hard inputs from the low-capacity abstaining network to a high-capacity expert model. We design various architectures that adhere to this hybrid inference style. Further, input hardness enables us to selectively distill the knowledge of a high-capacity model into a low-capacity model by cleverly discarding hard inputs during the distillation procedure.
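A minimal sketch of the selective/hybrid inference style described above, with toy stand-in models and a hypothetical confidence threshold:

```python
import numpy as np

def hybrid_predict(x, small_model, large_model, threshold=0.8):
    """Route by input hardness: predict with the small model when it
    is confident, otherwise abstain and defer to the large model."""
    probs = small_model(x)
    if probs.max() >= threshold:
        return int(np.argmax(probs)), "small"
    return int(np.argmax(large_model(x))), "large"

# Toy stand-in models over three classes.
small = lambda x: (np.array([0.9, 0.05, 0.05]) if x > 0
                   else np.array([0.4, 0.35, 0.25]))
large = lambda x: np.array([0.1, 0.8, 0.1])

easy_pred, easy_route = hybrid_predict(1.0, small, large)
hard_pred, hard_route = hybrid_predict(-1.0, small, large)
```

Easy inputs never touch the expensive model, so average inference cost drops while hard inputs retain the large model's accuracy.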
Finally, we conclude this thesis by sketching out various interesting future research directions that emerge as extensions of the different ideas explored in this work.
A Survey of Geometric Optimization for Deep Learning: From Euclidean Space to Riemannian Manifold
Although Deep Learning (DL) has achieved success in complex Artificial
Intelligence (AI) tasks, it suffers from various notorious problems (e.g.,
feature redundancy, and vanishing or exploding gradients), since updating
parameters in Euclidean space cannot fully exploit the geometric structure of
the solution space. As a promising alternative solution, Riemannian-based DL
uses geometric optimization to update parameters on Riemannian manifolds and
can leverage the underlying geometric information. Accordingly, this article
presents a comprehensive survey of applying geometric optimization in DL. At
first, this article introduces the basic procedure of geometric
optimization, including various geometric optimizers and key concepts of
Riemannian manifolds. Subsequently, this article investigates the application of
geometric optimization in different DL networks for various AI tasks, e.g.,
convolutional neural networks, recurrent neural networks, transfer learning, and
optimal transport. Additionally, typical public toolboxes that implement
optimization on manifolds are also discussed. Finally, this article makes a
performance comparison between different deep geometric optimization methods
under image recognition scenarios.
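As a concrete instance of the geometric optimization procedure the survey covers, the sketch below performs Riemannian gradient descent on the unit sphere (tangent-space projection followed by a normalization retraction); the objective, step size, and iteration count are illustrative choices.

```python
import numpy as np

def riemannian_step_sphere(w, euclid_grad, lr=0.1):
    """One Riemannian gradient step on the unit sphere: project the
    Euclidean gradient onto the tangent space at w, step, then
    retract back to the manifold by renormalizing."""
    rgrad = euclid_grad - np.dot(euclid_grad, w) * w  # tangent projection
    w_new = w - lr * rgrad
    return w_new / np.linalg.norm(w_new)              # retraction

# Minimize f(w) = w^T A w on the sphere; the minimizer is the
# eigenvector of A with the smallest eigenvalue (here the z-axis).
A = np.diag([3.0, 2.0, 0.5])
w = np.ones(3) / np.sqrt(3.0)
for _ in range(200):
    w = riemannian_step_sphere(w, 2.0 * A @ w)
```

The unit-norm constraint is maintained exactly at every iterate, which is precisely what updating in plain Euclidean space cannot guarantee.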
Geometric Clifford Algebra Networks
We propose Geometric Clifford Algebra Networks (GCANs) for modeling dynamical
systems. GCANs are based on symmetry group transformations using geometric
(Clifford) algebras. We first review the quintessence of modern (plane-based)
geometric algebra, which builds on isometries encoded as elements of the
group. We then propose the concept of group action
layers, which linearly combine object transformations using pre-specified group
actions. Together with a new activation and normalization scheme, these layers
serve as adjustable geometric templates that can be refined via
gradient descent. Theoretical advantages are strongly reflected in the modeling
of three-dimensional rigid body transformations as well as large-scale fluid
dynamics simulations, showing significantly improved performance over
traditional methods.
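A loose sketch of the group-action-layer idea in plain linear algebra: pre-specified group actions (here z-axis rotation matrices standing in for the geometric-algebra transformations) are linearly combined with learnable mixing weights. This is an analogy, not the paper's Clifford-algebra implementation.

```python
import numpy as np

def rotation_z(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

class GroupActionLayer:
    """Linearly combine pre-specified group actions of the input.
    Only the mixing weights would be learned; each summand is an
    exact symmetry transformation of the data."""
    def __init__(self, thetas, weights):
        self.actions = [rotation_z(t) for t in thetas]
        self.weights = np.asarray(weights, dtype=float)

    def __call__(self, x):
        return sum(w * (R @ x) for w, R in zip(self.weights, self.actions))

layer = GroupActionLayer(thetas=[0.0, np.pi / 2], weights=[0.5, 0.5])
y = layer(np.array([1.0, 0.0, 0.0]))
```

Because each summand is a genuine group element, the layer bakes the symmetry into its structure rather than hoping gradient descent discovers it.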
Training recurrent neural networks via forward propagation through time
Back-propagation through time (BPTT) has been
widely used for training Recurrent Neural Networks
(RNNs). BPTT updates RNN parameters
on an instance by back-propagating the error
in time over the entire sequence length, and
as a result, leads to poor trainability due to the
well-known gradient explosion/decay phenomena.
While a number of prior works have proposed to
mitigate vanishing/explosion effect through careful
RNN architecture design, these RNN variants
still train with BPTT. We propose a novel forward-propagation
algorithm, FPTT, where at each time step,
for an instance, we update RNN parameters by
optimizing an instantaneous risk function. Our
proposed risk is a regularization penalty at time
t that evolves dynamically based on previously
observed losses, and allows for RNN parameter
updates to converge to a stationary solution
of the empirical RNN objective. We consider
both sequence-to-sequence and terminal-loss
problems. Empirically, FPTT outperforms
BPTT on a number of well-known benchmark
tasks, thus enabling architectures like LSTMs to
solve long-range dependency problems.
http://proceedings.mlr.press/v139/kag21a/kag21a.pd
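A schematic reading of the FPTT idea, hedged: at each time step the parameters take one gradient step on an instantaneous risk (the current loss plus a quadratic pull toward a running summary), and the summary is then refreshed from the new parameters and gradient. The constants and update form below are a simplified reconstruction, not a verbatim reimplementation of the paper's algorithm.

```python
import numpy as np

def fptt_sketch(loss_grad, w0, alpha=1.0, lr=0.1, steps=100):
    """FPTT-style streaming updates (schematic): at each time step,
    take a gradient step on the instantaneous risk -- the current
    loss plus a quadratic penalty toward a running summary w_bar --
    then refresh w_bar from the new parameters and gradient."""
    w = np.array(w0, dtype=float)
    w_bar = w.copy()
    for t in range(steps):
        g = loss_grad(t, w) + alpha * (w - w_bar)  # instantaneous risk grad
        w = w - lr * g
        w_bar = 0.5 * (w_bar + w) - (1.0 / (2.0 * alpha)) * loss_grad(t, w)
    return w

# Toy stream: every loss is (w - 2)^2 / 2, whose gradient is w - 2.
w = fptt_sketch(lambda t, w: w - 2.0, [0.0])
```

Note that no gradient is ever propagated backward over the sequence: each update uses only the current time step, which is what removes the usual BPTT memory and gradient-decay issues.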
Proceedings of the 19th Sound and Music Computing Conference
Proceedings of the 19th Sound and Music Computing Conference - June 5-12, 2022 - Saint-Étienne (France).
https://smc22.grame.f
Dissipative Deep Neural Dynamical Systems
In this paper, we provide sufficient conditions for dissipativity and local
asymptotic stability of discrete-time dynamical systems parametrized by deep
neural networks. We leverage the representation of neural networks as pointwise
affine maps, thus exposing their local linear operators and making them
accessible to classical system analytic and design methods. This allows us to
"crack open the black box" of the neural dynamical system's behavior by
evaluating their dissipativity, and estimating their stationary points and
state-space partitioning. We relate the norms of these local linear operators
to the energy stored in the dissipative system with supply rates represented by
their aggregate bias terms. Empirically, we analyze the variance in dynamical
behavior and eigenvalue spectra of these local linear operators with varying
weight factorizations, activation functions, bias terms, and depths.
Comment: Under review at IEEE Open Journal of Control Systems
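The pointwise-affine-map view used above can be made concrete: within the ReLU linear region containing an input, the network computes exactly A x + c, and A is the local linear operator whose norms and eigenvalues such analyses examine. A minimal sketch for a fully connected ReLU network:

```python
import numpy as np

def local_affine_map(Ws, bs, x):
    """Return (A, c, y) such that y = A @ x + c equals the ReLU
    network's output on the linear region containing x. A is the
    product of the weights masked by the active ReLU pattern."""
    A, c, a = np.eye(x.size), np.zeros(x.size), x
    for i, (W, b) in enumerate(zip(Ws, bs)):
        pre = W @ a + b
        A, c = W @ A, W @ c + b
        if i < len(Ws) - 1:                       # hidden layers: ReLU
            D = np.diag((pre > 0).astype(float))  # active-unit mask
            A, c, a = D @ A, D @ c, np.maximum(pre, 0.0)
        else:                                     # linear output layer
            a = pre
    return A, c, a

rng = np.random.default_rng(1)
Ws = [rng.normal(size=(5, 3)), rng.normal(size=(3, 5))]
bs = [rng.normal(size=5), rng.normal(size=3)]
x = rng.normal(size=3)
A, c, y = local_affine_map(Ws, bs, x)
```

Once A and c are exposed, classical tools (eigenvalue spectra, operator norms, fixed-point estimates) apply directly to the neural dynamical system on that region.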
Householder-Absolute Neural Layers For High Variability and Deep Trainability
We propose a new architecture for artificial neural networks called
Householder-absolute neural layers, or Han-layers for short, that use
Householder reflectors as weight matrices and the absolute-value function for
activation. Han-layers, functioning as fully connected layers, are motivated by
recent results on neural-network variability and are designed to increase
activation ratio and reduce the chance of Collapse to Constants. Neural
networks constructed chiefly from Han-layers are called HanNets. By
construction, HanNets enjoy a theoretical guarantee that vanishing or exploding
gradients never occur. We conduct several proof-of-concept experiments. Some
surprising results obtained on styled test problems suggest that, under certain
conditions, HanNets exhibit an unusual ability to produce nearly perfect
solutions unattainable by fully connected networks. Experiments on regression
datasets show that HanNets can significantly reduce the number of model
parameters while maintaining or improving the level of generalization accuracy.
In addition, by adding a few Han-layers into the pre-classification FC-layer of
a convolutional neural network, we are able to quickly improve a
state-of-the-art result on the CIFAR-10 dataset. These proof-of-concept results are
sufficient to necessitate further studies on HanNets to understand their
capacities and limits, and to exploit their potential in real-world
applications.
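A minimal sketch of a Han-layer as described: a Householder reflection H = I - 2uu^T (orthogonal by construction) followed by the elementwise absolute value. Both maps preserve the norm of the signal, which conveys the intuition behind the no-vanishing/exploding-gradient guarantee; the parametrization details here are illustrative.

```python
import numpy as np

def han_layer(x, u):
    """Householder-absolute layer: reflect with H = I - 2 u u^T
    (u normalized, so H is orthogonal), then apply |.| elementwise.
    Both operations preserve the Euclidean norm of the signal."""
    u = u / np.linalg.norm(u)
    y = x - 2.0 * u * (u @ x)  # H @ x without forming H explicitly
    return np.abs(y)

rng = np.random.default_rng(2)
x = rng.normal(size=6)
u = rng.normal(size=6)
y = han_layer(x, u)
```

A Householder reflector is also parameter-cheap: a width-n fully connected weight matrix costs n^2 parameters, while the reflector needs only the n entries of u.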
RNN training along locally optimal trajectories via Frank-Wolfe algorithm
We propose a novel and efficient training method for RNNs that iteratively seeks a local minimum on the loss surface within a small region and leverages this directional vector for the update in an outer loop. We utilize the Frank-Wolfe (FW) algorithm in this context. Although FW implicitly involves normalized gradients, which can lead to a slow convergence rate, we develop a novel RNN training method for which, even with this additional cost, the overall training cost is empirically observed to be lower than that of backpropagation. Our method amounts to a new Frank-Wolfe variant that is in essence an SGD algorithm with a restart scheme. We prove that under certain conditions our algorithm has a sublinear convergence rate of O(1/ϵ) for error ϵ. We then conduct empirical experiments on several benchmark datasets, including those that exhibit long-term dependencies, and show significant performance improvements. We also experiment with deep RNN architectures and show efficient training performance. Finally, we demonstrate that our training method is robust to noisy data.
https://doi.org/10.1109/icpr48806.2021.9412188
Accepted manuscript
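For reference, here is the generic Frank-Wolfe step the method builds on, sketched over an l1 ball (a textbook instance, not the paper's RNN-specific inner/outer-loop scheme): each iteration solves a linear subproblem over the constraint set, whose solution is a signed vertex, and moves toward it.

```python
import numpy as np

def frank_wolfe_l1(grad_f, x0, radius=1.0, steps=2000):
    """Plain Frank-Wolfe over an l1 ball: each iteration solves the
    linear subproblem min_{||s||_1 <= radius} <grad, s> (solution: a
    signed vertex) and moves toward it with the standard 2/(t+2) step."""
    x = np.array(x0, dtype=float)
    for t in range(steps):
        g = grad_f(x)
        s = np.zeros_like(x)
        i = int(np.argmax(np.abs(g)))
        s[i] = -radius * np.sign(g[i])   # best vertex of the l1 ball
        x = x + (2.0 / (t + 2.0)) * (s - x)
    return x

# Minimize f(x) = ||x - b||^2 / 2 with b inside the ball.
b = np.array([0.3, -0.2, 0.1])
x = frank_wolfe_l1(lambda x: x - b, np.zeros(3))
```

Because every iterate is a convex combination of feasible points, the method stays inside the constraint set without any projection step.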