Normal Factor Graphs and Holographic Transformations
This paper stands at the intersection of two distinct lines of research. One
line is "holographic algorithms," a powerful approach introduced by Valiant for
solving various counting problems in computer science; the other is "normal
factor graphs," an elegant framework proposed by Forney for representing codes
defined on graphs. We introduce the notion of holographic transformations for
normal factor graphs, and establish a very general theorem, called the
generalized Holant theorem, which relates a normal factor graph to its
holographic transformation. We show that the generalized Holant theorem on the
one hand underlies the principle of holographic algorithms, and on the other
hand reduces to a general duality theorem for normal factor graphs, a special
case of which was first proved by Forney. In the course of our development, we
formalize a new semantics for normal factor graphs, which highlights various
linear algebraic properties that potentially enable the use of normal factor
graphs as a linear algebraic tool. Comment: To appear in IEEE Trans. Inform. Theory.
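As a hedged illustration of the mechanism at work (a single-edge instance, not the paper's general statement): let $f$ and $g$ be two local functions over a finite alphabet, joined by one internal edge, and let $T$ be any invertible matrix indexed by that alphabet. Inserting $T\,T^{-1}$ on the edge and absorbing each factor into the adjacent function leaves the sum-of-products form unchanged,

\[
\sum_{x} f(x)\,g(x)
\;=\;
\sum_{x'} \Bigl(\sum_{x} f(x)\,T_{x,x'}\Bigr)\Bigl(\sum_{y} (T^{-1})_{x',y}\,g(y)\Bigr),
\]

since $\sum_{x'} T_{x,x'}\,(T^{-1})_{x',y} = \delta_{x,y}$. Roughly speaking, the generalized Holant theorem lifts this edge-level invariance to the exterior function of an entire normal factor graph.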
MixUp as Locally Linear Out-Of-Manifold Regularization
MixUp is a recently proposed data-augmentation scheme, which linearly
interpolates a random pair of training examples and correspondingly the one-hot
representations of their labels. Training deep neural networks with such
additional data has been shown to significantly improve the predictive
accuracy of current state-of-the-art models. The power of MixUp, however, has
primarily been established empirically, and its working mechanism and
effectiveness have not been explained in any depth. In this paper, we develop an understanding of MixUp as
a form of "out-of-manifold regularization", which imposes certain "local
linearity" constraints on the model's input space beyond the data manifold.
This analysis enables us to identify a limitation of MixUp, which we call
"manifold intrusion". In a nutshell, manifold intrusion in MixUp is a form of
under-fitting resulting from conflicts between the synthetic labels of the
mixed-up examples and the labels of original training data. Such a phenomenon
usually happens when the parameters controlling the generation of mixing
policies are not sufficiently fine-tuned on the training data. To address this
issue, we propose a novel adaptive version of MixUp, where the mixing policies
are automatically learned from the data using an additional network and
objective function designed to avoid manifold intrusion. The proposed
regularizer, AdaMixUp, is empirically evaluated on several benchmark datasets.
Extensive experiments demonstrate that AdaMixUp improves upon MixUp when
applied to state-of-the-art deep classification models. Comment: Accepted by AAAI 2019.
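For concreteness, here is a minimal sketch of the baseline MixUp interpolation that AdaMixUp builds on; the function name, the Beta(alpha, alpha) mixing coefficient, and the NumPy framing are illustrative assumptions, and AdaMixUp replaces the fixed sampling below with a mixing policy learned by an additional network.

    import numpy as np

    def mixup_pair(x1, y1, x2, y2, alpha=0.2, rng=None):
        # Baseline MixUp: sample a mixing coefficient from Beta(alpha, alpha)
        # and take the same convex combination of the two inputs and of their
        # one-hot labels. AdaMixUp learns the mixing policy instead, so as to
        # avoid manifold intrusion.
        rng = rng or np.random.default_rng()
        lam = rng.beta(alpha, alpha)
        x_mix = lam * x1 + (1.0 - lam) * x2   # interpolated input
        y_mix = lam * y1 + (1.0 - lam) * y2   # interpolated soft label
        return x_mix, y_mix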
Tighter Information-Theoretic Generalization Bounds from Supersamples
In this work, we present a variety of novel information-theoretic
generalization bounds for learning algorithms, from the supersample setting of
Steinke & Zakynthinou (2020), i.e., the setting of the "conditional mutual
information" framework. Our development proceeds by projecting the loss pair
(obtained from a training instance and a testing instance) down to a single
number and correlating loss values with a Rademacher sequence (and its shifted
variants). The presented bounds include square-root bounds, fast-rate bounds
(including those based on variance and sharpness), and bounds for interpolating
algorithms. We show, theoretically or empirically, that these bounds are
tighter than all information-theoretic bounds known to date in the same
supersample setting. Comment: Accepted to ICML 2023.
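For reference, and with the caveat that the notation here is schematic rather than the paper's own: in the supersample setting, a table $\widetilde{Z} \in \mathcal{Z}^{n \times 2}$ of $2n$ i.i.d. examples is drawn, uniform selector bits $U \in \{0,1\}^n$ pick one entry per row as the training set $S$, and the baseline square-root bound of Steinke & Zakynthinou (2020) for a loss bounded in $[0,1]$ reads

\[
\bigl|\,\mathbb{E}\bigl[L_{\mu}(W) - L_{S}(W)\bigr]\,\bigr|
\;\le\;
\sqrt{\frac{2}{n}\, I\bigl(W; U \mid \widetilde{Z}\bigr)}.
\]

The bounds presented in this paper tighten such results by projecting each per-row loss pair to a single number and correlating it with a Rademacher sequence.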
Two Facets of SDE Under an Information-Theoretic Lens: Generalization of SGD via Training Trajectories and via Terminal States
Stochastic differential equations (SDEs) have recently been shown to
characterize well the dynamics of training machine learning models with SGD. This
provides two opportunities for better understanding the generalization
behaviour of SGD through its SDE approximation. First, under the SDE
characterization, SGD may be regarded as full-batch gradient descent with
Gaussian gradient noise. This allows the application of the generalization
bounds developed by Xu & Raginsky (2017) to analyzing the generalization
behaviour of SGD, resulting in upper bounds in terms of the mutual information
between the training set and the training trajectory. Second, under mild
assumptions, it is possible to obtain an estimate of the steady-state weight
distribution of SDE. Using this estimate, we apply the PAC-Bayes-like
information-theoretic bounds developed in both Xu & Raginsky (2017) and Negrea
et al. (2019) to obtain generalization upper bounds in terms of the KL
divergence of the steady-state weight distribution of SGD with respect to
a prior distribution. Among various options, one may choose the prior as the
steady-state weight distribution obtained by SGD on the same training set but
with one example held out. In this case, the bound can be elegantly expressed
using the influence function (Koh & Liang, 2017), which suggests that the
generalization of SGD is related to its stability. Various insights
are presented along the development of these bounds, which are subsequently
validated numerically.
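Two standard ingredients referenced above, written schematically (the precise assumptions and constants are those of the cited works and are not reproduced here): a commonly used SDE approximation of SGD with loss $L$, learning rate $\eta$, batch size $b$ and gradient-noise covariance $\Sigma(w)$,

\[
dW_t \;=\; -\nabla L(W_t)\,dt \;+\; \sqrt{\tfrac{\eta}{b}}\;\Sigma(W_t)^{1/2}\,dB_t,
\]

and the mutual-information bound of Xu & Raginsky (2017) for a $\sigma$-sub-Gaussian loss,

\[
\bigl|\,\mathbb{E}[\operatorname{gen}(\mu, A)]\,\bigr| \;\le\; \sqrt{\frac{2\sigma^{2}}{n}\, I(S; W)},
\]

where $S$ is the training set of size $n$ and $W$ the weights returned by the algorithm $A$.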
ifMixup: Interpolating Graph Pair to Regularize Graph Classification
We present a simple yet effective interpolation-based regularization
technique, aiming to improve the generalization of Graph Neural Networks (GNNs)
on supervised graph classification. We leverage Mixup, an effective regularizer
for vision, where random sample pairs and their labels are interpolated to
create synthetic images for training. Unlike images with grid-like coordinates,
graphs have arbitrary structure and topology, which can be very sensitive to
any modification that alters the graph's semantic meaning. This poses two
unanswered questions for Mixup-like regularization schemes: Can we directly mix
up a pair of graph inputs? If so, how well does such a mixing strategy regularize
the learning of GNNs? To answer these two questions, we propose ifMixup, which
first adds dummy nodes to make two graphs have the same input size and then
simultaneously performs linear interpolation between the aligned node feature
vectors and the aligned edge representations of the two graphs. We empirically
show that such a simple mixing scheme can effectively regularize the
classification learning, yielding predictive accuracy superior to that of popular
graph augmentation and GNN methods. Comment: To appear in AAAI202
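A minimal sketch of the mixing step described above, under the assumption of dense adjacency matrices, zero-padded dummy nodes, and a shared node-feature dimension; the function and variable names are illustrative, not taken from the paper's code.

    import numpy as np

    def ifmixup_pair(A1, X1, y1, A2, X2, y2, lam=0.5):
        # Pad both graphs with dummy (all-zero) nodes so they share a node
        # count, then linearly interpolate node features, edge (adjacency)
        # representations, and one-hot graph labels.
        n = max(X1.shape[0], X2.shape[0])
        d = X1.shape[1]                       # assumes X1 and X2 share this dim

        def pad(A, X):
            Xp = np.zeros((n, d)); Xp[:X.shape[0]] = X
            Ap = np.zeros((n, n)); Ap[:A.shape[0], :A.shape[1]] = A
            return Ap, Xp

        A1p, X1p = pad(A1, X1)
        A2p, X2p = pad(A2, X2)
        A_mix = lam * A1p + (1.0 - lam) * A2p   # mixed edge representation
        X_mix = lam * X1p + (1.0 - lam) * X2p   # mixed node features
        y_mix = lam * y1 + (1.0 - lam) * y2     # mixed label
        return A_mix, X_mix, y_mix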