38 research outputs found
A Bayesian Data Augmentation Approach for Learning Deep Models
Data augmentation is an essential part of the training process applied to
deep learning models. The motivation is that a robust training process for deep
learning models depends on large annotated datasets, which are expensive to
acquire, store and process. A reasonable alternative is therefore to
automatically generate new annotated training samples using a process
known as data augmentation. The dominant data augmentation approach in the
field assumes that new training samples can be obtained via random geometric or
appearance transformations applied to annotated training samples, but this is a
strong assumption because it is unclear if this is a reliable generative model
for producing new training samples. In this paper, we provide a novel Bayesian
formulation to data augmentation, where new annotated training points are
treated as missing variables and generated based on the distribution learned
from the training set. For learning, we introduce a theoretically sound
algorithm --- generalised Monte Carlo expectation maximisation, and demonstrate
one possible implementation via an extension of the Generative Adversarial
Network (GAN). Classification results on MNIST, CIFAR-10 and CIFAR-100 show
that our proposed method outperforms the current dominant data augmentation
approach mentioned above; the results also show that our approach produces
better classification results than similar GAN models.
Comment: Accepted to NISP 201
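For orientation, the "dominant" baseline that this abstract questions (random geometric transformations applied to annotated samples) can be sketched as below. This is a minimal illustration and not the paper's Bayesian method; the function name `random_flip_and_shift`, the toy image, and the choice of a horizontal flip plus a zero-padded shift are all assumptions made for the example.

```python
import random

def random_flip_and_shift(image, rng, max_shift=1):
    """Return a randomly flipped and horizontally shifted copy of a 2-D image.

    `image` is a list of equal-length rows. The label of the sample is
    assumed to be unchanged by the transformation, which is exactly the
    assumption the abstract calls into question.
    """
    rows = [list(r) for r in image]
    if rng.random() < 0.5:
        rows = [r[::-1] for r in rows]          # horizontal flip
    shift = rng.randint(-max_shift, max_shift)  # horizontal shift in pixels
    width = len(rows[0])
    shifted = []
    for r in rows:
        if shift > 0:
            r = [0] * shift + r[:width - shift]  # shift right, pad with zeros
        elif shift < 0:
            r = r[-shift:] + [0] * (-shift)      # shift left, pad with zeros
        shifted.append(r)
    return shifted

rng = random.Random(0)
image = [[1, 2, 3],
         [4, 5, 6]]   # hypothetical 2x3 "image"
label = 7             # hypothetical label, carried over unchanged
augmented = [(random_flip_and_shift(image, rng), label) for _ in range(5)]
```

Each augmented sample keeps the original label, so the quality of the new data rests entirely on whether the chosen transformations really preserve the class.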
Syntax-aware Data Augmentation for Neural Machine Translation
Data augmentation improves neural machine translation (NMT) performance by
generating additional bilingual data. In this paper, we propose a novel data
augmentation strategy for neural machine translation. Unlike existing data
augmentation methods, which simply choose words for modification with the same
probability across different sentences, we set a sentence-specific probability
for word selection by considering each word's role in the sentence. We use the
dependency parse tree of the input sentence as an effective clue to determine
the selection probability of every word in each sentence. Our proposed method
is evaluated on the WMT14 English-to-German dataset and the IWSLT14
German-to-English dataset. The results of extensive experiments show that our
proposed syntax-aware data augmentation method can effectively boost existing
sentence-independent methods, yielding significant improvements in translation
performance.
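One way to picture sentence-specific selection probabilities is the sketch below. The abstract only says that the dependency parse tree is used as a clue; the rule used here (words deeper in the tree receive a higher probability of being modified, via a softmax over depth) is an assumption for illustration, as are the function name `selection_probabilities` and the `temperature` parameter.

```python
import math

def selection_probabilities(heads, temperature=1.0):
    """Map a dependency tree to per-word selection probabilities.

    heads[i] is the index of the head of token i, or -1 for the root.
    Assumption: deeper tokens are more likely to be selected.
    """
    def depth(i):
        d = 0
        while heads[i] != -1:   # walk up to the root
            i = heads[i]
            d += 1
        return d

    depths = [depth(i) for i in range(len(heads))]
    weights = [math.exp(d / temperature) for d in depths]
    total = sum(weights)
    return [w / total for w in weights]

# Toy sentence "the cat sat": "sat" is the root, "cat" attaches to "sat",
# and "the" attaches to "cat".
heads = [1, 2, -1]
probs = selection_probabilities(heads)
```

Because the probabilities are computed from each sentence's own tree, two sentences of the same length can end up with different selection distributions, in contrast to a single uniform rate.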
G2R Bound: A Generalization Bound for Supervised Learning from GAN-Synthetic Data
Performing supervised learning from the data synthesized by using Generative
Adversarial Networks (GANs), dubbed GAN-synthetic data, has two important
applications. First, GANs may generate more labeled training data, which may
help improve classification accuracy. Second, in scenarios where real data
cannot be released outside certain premises for privacy and/or security
reasons, using GAN-synthetic data to conduct training is a plausible
alternative. This paper proposes a generalization bound to guarantee the
generalization capability of a classifier learning from GAN-synthetic data.
This generalization bound helps developers gauge the generalization gap between
learning from synthetic data and testing on real data, and can therefore
provide clues for improving the generalization capability.
Population Based Augmentation: Efficient Learning of Augmentation Policy Schedules
A key challenge in leveraging data augmentation for neural network training
is choosing an effective augmentation policy from a large search space of
candidate operations. Properly chosen augmentation policies can lead to
significant generalization improvements; however, state-of-the-art approaches
such as AutoAugment are computationally infeasible to run for the ordinary
user. In this paper, we introduce a new data augmentation algorithm, Population
Based Augmentation (PBA), which generates nonstationary augmentation policy
schedules instead of a fixed augmentation policy. We show that PBA can match
the performance of AutoAugment on CIFAR-10, CIFAR-100, and SVHN, with three
orders of magnitude less overall compute. On CIFAR-10 we achieve a mean test
error of 1.46%, which is a slight improvement upon the current
state-of-the-art. The code for PBA is open source and is available at
https://github.com/arcelien/pba.
Comment: ICML 201
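The difference between a fixed policy and a nonstationary schedule can be sketched as a lookup from epoch to policy. This is only an illustration of the data structure, not the PBA search procedure or the code in the linked repository; the operation names, probabilities, magnitudes, and the helper `policy_at` are hypothetical.

```python
from bisect import bisect_right

# Hypothetical schedule: (start_epoch, {operation: (probability, magnitude)}).
# A fixed policy would be a schedule with a single entry.
schedule = [
    (0,  {"rotate": (0.0, 0)}),
    (20, {"rotate": (0.2, 3)}),
    (60, {"rotate": (0.5, 7), "shear_x": (0.3, 4)}),
]

def policy_at(schedule, epoch):
    """Return the augmentation policy that is active at the given epoch."""
    starts = [start for start, _ in schedule]
    if epoch < starts[0]:
        raise ValueError("epoch precedes the start of the schedule")
    idx = bisect_right(starts, epoch) - 1   # last entry with start <= epoch
    return schedule[idx][1]

early = policy_at(schedule, 5)
late = policy_at(schedule, 100)
```

A training loop would call `policy_at` once per epoch and apply the returned operations, so the augmentation strength can change as training progresses.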
DADA: Deep Adversarial Data Augmentation for Extremely Low Data Regime Classification
Deep learning has revolutionized the performance of classification, but
meanwhile demands sufficient labeled data for training. Given insufficient
data, while many techniques have been developed to help combat overfitting, the
challenge remains if one tries to train deep networks, especially in the
ill-posed extremely low data regimes: only a small set of labeled data is
available, and nothing else, not even unlabeled data. Such regimes arise
from practical situations where not only data labeling but also data collection
itself is expensive. We propose a deep adversarial data augmentation (DADA)
technique to address the problem, in which we elaborately formulate data
augmentation as a problem of training a class-conditional and supervised
generative adversarial network (GAN). Specifically, a new discriminator loss is
proposed to fit the goal of data augmentation, through which both real and
augmented samples are enforced to contribute to and be consistent in finding
the decision boundaries. Tailored training techniques are developed
accordingly. To quantitatively validate its effectiveness, we first perform
extensive simulations to show that DADA substantially outperforms both
traditional data augmentation and a few GAN-based options. We then extend
experiments to three real-world small labeled datasets where existing data
augmentation and/or transfer learning strategies are either less effective or
infeasible. All results endorse the superior capability of DADA in enhancing
the generalization ability of deep networks trained in practical extremely low
data regimes. Source code is available at
https://github.com/SchafferZhang/DADA.
Comment: 15 pages, 5 figure
Learning to Generate Synthetic Data via Compositing
We present a task-aware approach to synthetic data generation. Our framework
employs a trainable synthesizer network that is optimized to produce meaningful
training samples by assessing the strengths and weaknesses of a `target'
network. The synthesizer and target networks are trained in an adversarial
manner wherein each network is updated with a goal to outdo the other.
Additionally, we ensure the synthesizer generates realistic data by pairing it
with a discriminator trained on real-world images. Further, to make the target
classifier invariant to blending artefacts, we introduce these artefacts to
background regions of the training images so the target does not over-fit to
them.
We demonstrate the efficacy of our approach by applying it to different
target networks including a classification network on AffNIST, and two object
detection networks (SSD, Faster-RCNN) on different datasets. On the AffNIST
benchmark, our approach is able to surpass the baseline results with just half
the training examples. On the VOC person detection benchmark, we show
improvements of up to 2.7% as a result of our data augmentation. Similarly on
the GMU detection benchmark, we report a performance boost of 3.5% in mAP over
the baseline method, outperforming the previous state of the art approaches by
up to 7.5% on specific categories.
Comment: Accepted to CVPR 2019, supplementary material include
Robustness and Overfitting Behavior of Implicit Background Models
In this paper, we examine the overfitting behavior of image classification
models modified with Implicit Background Estimation (SCrIBE), which transforms
them into weakly supervised segmentation models that provide spatial domain
visualizations without affecting performance. Using the segmentation masks, we
derive an overfit detection criterion that does not require testing labels. In
addition, we assess the change in model performance, calibration, and
segmentation masks after applying data augmentations as overfitting reduction
measures and testing on various types of distorted images.
Comment: 6 pages, 3 figures, accepted to IEEE International Conference on
Image Processing (ICIP
Efficient Training of Deep Convolutional Neural Networks by Augmentation in Embedding Space
Recent advances in the field of artificial intelligence have been made
possible by deep neural networks. In applications where data are scarce,
transfer learning and data augmentation techniques are commonly used to improve
the generalization of deep learning models. However, fine-tuning a transfer
model with data augmentation in the raw input space has a high computational
cost, because the full network must run for every augmented input. This is particularly
critical when large models are implemented on embedded devices with limited
computational and energy resources. In this work, we propose a method that
replaces the augmentation in the raw input space with an approximate one that
acts purely in the embedding space. Our experimental results show that the
proposed method drastically reduces the computation, while the accuracy of
models is negligibly compromised.
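The computational argument can be illustrated with a toy sketch: the expensive backbone runs once per input, and the augmented variants are produced cheaply from the stored embedding. The abstract does not say which approximation the authors use in the embedding space; additive Gaussian noise is a stand-in assumed here, and `CountingBackbone`, `augment_embedding`, and the toy embedding function are hypothetical.

```python
import random

class CountingBackbone:
    """Toy stand-in for a frozen feature extractor that counts its calls."""
    def __init__(self):
        self.calls = 0

    def __call__(self, x):
        self.calls += 1
        return [sum(x), max(x) - min(x)]   # toy 2-d embedding

def augment_embedding(z, rng, scale=0.05):
    """Assumed stand-in augmentation: Gaussian noise added to the embedding."""
    return [v + rng.gauss(0.0, scale) for v in z]

rng = random.Random(0)
backbone = CountingBackbone()
inputs = [[0.1, 0.4, 0.2], [0.9, 0.3, 0.5]]   # hypothetical raw inputs
n_aug = 4

# One backbone pass per input, then n_aug cheap variants per embedding.
embeddings = [backbone(x) for x in inputs]
augmented = [[augment_embedding(z, rng) for _ in range(n_aug)]
             for z in embeddings]
```

With raw-input augmentation the backbone would run `len(inputs) * n_aug` times; here it runs only `len(inputs)` times, and only the layers after the embedding would need to process the augmented variants.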
Using BibTeX to Automatically Generate Labeled Data for Citation Field Extraction
Accurate parsing of citation reference strings is crucial to automatically
construct scholarly databases such as Google Scholar or Semantic Scholar.
Citation field extraction (CFE) is precisely this task: given a reference,
label which tokens refer to the authors, venue, title, editor, journal, pages,
etc. Most methods for CFE are supervised and rely on training from labeled
datasets that are quite small compared to the great variety of reference
formats. BibTeX, the widely used reference management tool, provides a natural
method to automatically generate and label training data for CFE. In this
paper, we describe a technique for using BibTeX to generate, automatically, a
large-scale (41M labeled strings) labeled dataset that is four orders of
magnitude larger than the current largest CFE dataset, namely the UMass
Citation Field Extraction dataset [Anzaroot and McCallum, 2013]. We
experimentally demonstrate how our dataset can be used to improve the
performance of the UMass CFE using a RoBERTa-based [Liu et al., 2019] model. In
comparison to previous SoTA, we achieve a 24.48% relative error reduction,
achieving span-level F1-scores of 96.3%.
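The reason BibTeX yields labels for free is that every rendered token originates from a known field. A minimal sketch of that idea follows; it does not use real BibTeX styles or the authors' pipeline, and the entry contents, the field order, and the function `render_labeled` are hypothetical.

```python
def render_labeled(entry, field_order=("author", "title", "journal", "year")):
    """Render a BibTeX-like entry into tokens with one field label per token.

    Assumption: a real pipeline would render through many citation styles;
    this sketch simply concatenates the fields in a fixed order.
    """
    tokens, labels = [], []
    for field in field_order:
        value = entry.get(field)
        if not value:
            continue                   # skip fields the entry lacks
        for tok in value.split():
            tokens.append(tok)
            labels.append(field)       # the label comes from the source field
    return tokens, labels

# Hypothetical entry, not taken from any real bibliography.
entry = {
    "author": "A. Author",
    "title": "An Example Title",
    "journal": "Journal of Examples",
    "year": "2020",
}
tokens, labels = render_labeled(entry)
```

Varying the field order and punctuation across many entries is what would produce the variety of reference formats that the abstract says small hand-labeled datasets lack.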
This dataset does not exist: training models from generated images
Current generative networks are increasingly proficient in generating
high-resolution realistic images. These generative networks, especially the
conditional ones, can potentially become a great tool for providing new image
datasets. This naturally raises the question: Can we train a classifier only on
the generated data? This potential availability of nearly unlimited amounts of
training data challenges standard practices for training machine learning
models, which have been crafted across the years for limited and fixed size
datasets. In this work we investigate this question and its related challenges.
We identify ways to significantly improve performance over naive training
on randomly generated images with regular heuristics. We propose three
standalone techniques that can be applied at different stages of the pipeline,
i.e., data generation, training on generated data, and deploying on real data.
We evaluate our proposed approaches on a subset of the ImageNet dataset and
show encouraging results compared to classifiers trained on real images.