In this paper, we present a simple yet effective method (ABSGD) for
addressing the data imbalance issue in deep learning. Our method is a simple
modification to momentum SGD where we leverage an attentional mechanism to
assign an individual importance weight to each gradient in the mini-batch.
Unlike many existing heuristic-driven methods for tackling data imbalance, our
method is grounded in {\it theoretically justified distributionally robust
optimization (DRO)} and is guaranteed to converge to a stationary point of
an information-regularized DRO problem. The individual-level weight of a
sampled data point is proportional to the exponential of its scaled loss
value, where the scaling factor is interpreted as the
regularization parameter in the framework of information-regularized DRO.
Compared with existing class-level weighting schemes, our method can capture
the diversity between individual examples within each class. Compared with
existing individual-level weighting methods using meta-learning that require
three backward propagations for computing mini-batch stochastic gradients, our
method is more efficient with only one backward propagation at each iteration
as in standard deep learning methods. To balance between the learning of
feature extraction layers and the learning of the classifier layer, we employ a
two-stage method that uses SGD for pretraining followed by ABSGD for learning a
robust classifier and finetuning lower layers. Our empirical studies on several
benchmark datasets demonstrate the effectiveness of the proposed method.

Comment: 29 pages, 10 figures
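
The loss-based weighting described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that the weights are normalized to sum to one over the current mini-batch, which the abstract does not specify, and the names `attention_weights`, `weighted_gradient`, and `lam` (the scaling factor, i.e. the regularization parameter) are hypothetical.

```python
import math


def attention_weights(losses, lam):
    """Return weights proportional to exp(loss / lam).

    Assumption: weights are normalized over the mini-batch so they sum
    to 1. The abstract only states proportionality.
    """
    if lam <= 0:
        raise ValueError("lam must be positive")
    scaled = [loss / lam for loss in losses]
    shift = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_gradient(per_example_grads, weights):
    """Combine per-example gradients (lists of floats) using the weights."""
    dim = len(per_example_grads[0])
    return [
        sum(w * g[j] for w, g in zip(weights, per_example_grads))
        for j in range(dim)
    ]


# Illustrative values only: the high-loss example receives the largest weight.
example_weights = attention_weights([0.1, 0.1, 2.0], lam=1.0)
```

Under this sketch, a small `lam` concentrates the weight on the highest-loss examples, while a large `lam` approaches uniform weighting, which corresponds to an ordinary mini-batch average. In an automatic-differentiation framework the same weighted gradient would typically be obtained from a single backward pass over the weighted sum of losses with the weights treated as constants.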