A Neural ODE Interpretation of Transformer Layers
Transformer layers, which use an alternating pattern of multi-head attention
and multi-layer perceptron (MLP) layers, provide an effective tool for a
variety of machine learning problems. Because transformer layers use residual
connections to avoid the problem of vanishing gradients, they can be viewed as
the numerical integration of a differential equation. In this extended
abstract, we build upon this connection and propose a modification of the
internal architecture of a transformer layer. The proposed model places the
multi-head attention sublayer and the MLP sublayer in parallel. Our
experiments show that this simple modification improves the performance of
transformer networks in multiple tasks. Moreover, for the image classification
task, we show that using neural ODE solvers with a sophisticated integration
scheme further improves performance.
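A minimal PyTorch sketch of this parallel layout (illustrative only, not the authors' code; the class name and dimensions are assumptions): both sublayers read the same normalized input and share a single residual connection, so the update has the shape of one explicit Euler step x + f(x).

```python
# Sketch (not the authors' code): a transformer layer whose attention and
# MLP sublayers are applied in parallel rather than sequentially.
import torch
import torch.nn as nn

class ParallelTransformerLayer(nn.Module):
    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm(x)                  # shared pre-norm input
        attn_out, _ = self.attn(h, h, h)  # attention branch
        mlp_out = self.mlp(h)             # MLP branch, evaluated in parallel
        return x + attn_out + mlp_out     # one residual (Euler-like) update
```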
STEER: Simple Temporal Regularization For Neural ODEs
Training Neural Ordinary Differential Equations (ODEs) is often
computationally expensive. Indeed, computing the forward pass of such models
involves solving an ODE which can become arbitrarily complex during training.
Recent works have shown that regularizing the dynamics of the ODE can partially
alleviate this. In this paper we propose a new regularization technique:
randomly sampling the end time of the ODE during training. The proposed
regularization is simple to implement, has negligible overhead and is effective
across a wide variety of tasks. Further, the technique is orthogonal to several
other methods proposed to regularize the dynamics of ODEs and as such can be
used in conjunction with them. We show through experiments on normalizing
flows, time series models and image recognition that the proposed
regularization can significantly decrease training time and even improve
performance over baseline models.
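A minimal sketch of the end-time sampling idea (assumptions: torchdiffeq supplies the solver, `dynamics` is any module computing f(t, x), and the interval half-width `b` is a hypothetical hyperparameter name):

```python
# Sketch of STEER-style temporal regularization: integrate to a randomly
# sampled end time during training instead of a fixed horizon T.
import torch
from torchdiffeq import odeint  # pip install torchdiffeq

def steer_forward(dynamics, x0, t_end=1.0, b=0.5, training=True):
    """Solve dx/dt = dynamics(t, x) from t=0 to a (possibly random) end time."""
    if training:
        # Sample the end time uniformly from (t_end - b, t_end + b), b < t_end.
        t_end = t_end + (2.0 * torch.rand(()).item() - 1.0) * b
    t = torch.tensor([0.0, t_end])
    return odeint(dynamics, x0, t)[-1]  # state at the sampled end time
```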
PAC bounds of continuous Linear Parameter-Varying systems related to neural ODEs
We consider the problem of learning Neural Ordinary Differential Equations
(neural ODEs) within the context of Linear Parameter-Varying (LPV) systems in
continuous time. LPV systems contain bilinear systems, which are known to be
universal approximators for non-linear systems. Moreover, a large class of
neural ODEs can be embedded into LPV systems. As our main contribution, we
provide Probably Approximately Correct (PAC) bounds under stability for LPV
systems related to neural ODEs. The resulting bounds have the advantage that
they do not depend on the integration interval.
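For reference, a continuous-time LPV system has the standard affine form sketched below (standard control-theoretic notation, not taken from the paper); a bilinear system is recovered when the scheduling signal p(t) coincides with the input u(t).

```latex
% Continuous-time LPV system: the state matrix depends affinely on a
% scheduling signal p(t) (standard notation, assumed here).
\dot{x}(t) = A\bigl(p(t)\bigr)\,x(t) + B\,u(t),
\qquad
A(p) = A_0 + \sum_{i=1}^{k} p_i A_i .
```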
"Hey, that's not an ODE": Faster ODE Adjoints with 12 Lines of Code
Neural differential equations may be trained by backpropagating gradients via
the adjoint method, which involves solving another differential equation,
typically using an adaptive-step-size numerical solver. A proposed
step is accepted if its error, \emph{relative to some norm}, is sufficiently
small; else it is rejected, the step is shrunk, and the process is repeated.
Here, we demonstrate that the particular structure of the adjoint equations
makes the usual choices of norm (such as $L^2$) unnecessarily stringent.
Replacing it with a more appropriate (semi)norm means that fewer steps are
unnecessarily rejected and backpropagation is made faster. This requires only minor code
modifications. Experiments on a wide range of tasks---including time series,
generative modeling, and physical control---demonstrate a median reduction of
40% in function evaluations. On some problems we see as many as 62% fewer
function evaluations, so that the overall training time is roughly halved.
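A sketch of what this change can look like in user code, assuming a recent torchdiffeq (version 0.2 or later, which exposes a seminorm as an adjoint option; check your version's documentation). The toy dynamics module is illustrative only.

```python
import torch
import torch.nn as nn
from torchdiffeq import odeint_adjoint  # pip install torchdiffeq

class Dynamics(nn.Module):
    """Toy vector field f(t, y); any nn.Module with this signature works."""
    def __init__(self, dim: int = 4):
        super().__init__()
        self.net = nn.Linear(dim, dim)

    def forward(self, t, y):
        return torch.tanh(self.net(y))

func, y0 = Dynamics(), torch.randn(4)
t = torch.linspace(0.0, 1.0, 10)
# The backward (adjoint) solve accepts/rejects steps under a seminorm that
# ignores the parameter-gradient components, so fewer steps are rejected.
ys = odeint_adjoint(func, y0, t, adjoint_options=dict(norm="seminorm"))
ys[-1].sum().backward()
```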
Characteristic Neural Ordinary Differential Equations
We propose Characteristic-Neural Ordinary Differential Equations (C-NODEs), a
framework for extending Neural Ordinary Differential Equations (NODEs) beyond
ODEs. While NODEs model the evolution of latent variables as the solution to
an ODE, C-NODEs model the evolution of latent variables as the solution of
a family of first-order quasi-linear partial differential equations (PDEs)
along curves on which the PDEs reduce to ODEs, referred to as characteristic
curves. This in turn allows the application of the standard framework for
solving ODEs, namely the adjoint method. Learning optimal characteristic
curves for a given task improves both performance and computational
efficiency compared with state-of-the-art NODE models. We prove that the
C-NODE framework extends the
classical NODE on classification tasks by demonstrating explicit C-NODE
representable functions not expressible by NODEs. Additionally, we present
C-NODE-based continuous normalizing flows, which describe the density evolution
of latent variables along multiple dimensions. Empirical results demonstrate
the improvements provided by the proposed method for classification and density
estimation on the CIFAR-10, SVHN, and MNIST datasets under a computational
budget similar to that of existing NODE methods. The results also provide
empirical evidence that the learned curves improve the efficiency of the
system, requiring fewer parameters and function evaluations than the baselines.
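For context, the classical method of characteristics underlying this construction can be sketched as follows (standard notation, not the paper's): a first-order quasi-linear PDE reduces to a system of ODEs along its characteristic curves.

```latex
% Method of characteristics: the quasi-linear PDE
%   \sum_i a_i(x, u)\, \partial_{x_i} u = c(x, u)
% becomes a system of ODEs along characteristic curves s \mapsto x(s):
\frac{dx_i}{ds} = a_i\bigl(x(s), u(s)\bigr),
\qquad
\frac{du}{ds} = c\bigl(x(s), u(s)\bigr).
```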