Auto-Encoding Scene Graphs for Image Captioning
We propose Scene Graph Auto-Encoder (SGAE) that incorporates the language
inductive bias into the encoder-decoder image captioning framework for more
human-like captions. Intuitively, humans use this inductive bias to compose
collocations and draw contextual inferences in discourse. For example, when we
see the relation `person on bike', it is natural to replace `on' with `ride'
and to infer `person riding bike on a road', even though the `road' is not
visually evident. Exploiting such bias as a language prior is therefore
expected to make conventional encoder-decoder models less likely to overfit to
the dataset bias and more focused on reasoning. Specifically, we use the scene
graph --- a directed graph in which an object node is connected to adjective
nodes and relationship nodes --- to represent the complex structural layout of
both the image and the sentence. In the textual domain, we use SGAE to learn a
dictionary that helps to reconstruct sentences in a sentence -> scene graph ->
dictionary -> sentence pipeline, where the dictionary encodes the desired
language prior; in the vision-language domain, the shared dictionary guides
the encoder-decoder in the image -> scene graph -> dictionary -> sentence
pipeline. Thanks to the scene graph
representation and shared dictionary, the inductive bias is transferred across
domains in principle. We validate the effectiveness of SGAE on the challenging
MS-COCO image captioning benchmark: our SGAE-based single model achieves a new
state-of-the-art CIDEr-D score on the Karpathy split, and a competitive
CIDEr-D (c40) score on the official server even when compared against ensemble
models.
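As a rough illustration of the directed scene graph described above (object nodes linked to adjective and relationship nodes), here is a minimal sketch in Python; the class and method names are invented for this example and are not the authors' code:

```python
from dataclasses import dataclass, field

# Minimal sketch of a directed scene graph: object nodes carry adjective
# (attribute) nodes and are linked to other objects by relationship nodes.
@dataclass
class SceneGraph:
    objects: set = field(default_factory=set)
    attributes: dict = field(default_factory=dict)   # object -> set of adjectives
    relations: set = field(default_factory=set)      # (subject, predicate, object)

    def add_object(self, name, *adjectives):
        self.objects.add(name)
        self.attributes.setdefault(name, set()).update(adjectives)

    def add_relation(self, subj, predicate, obj):
        self.add_object(subj)
        self.add_object(obj)
        self.relations.add((subj, predicate, obj))

# The `person on bike' example from the abstract: a language prior could
# rewrite the generic predicate `on' into the collocation `ride' and add the
# inferred context node `road', even when the road is not visible.
g = SceneGraph()
g.add_relation("person", "ride", "bike")
g.add_relation("person", "on", "road")
print(sorted(g.relations))
```

In a real captioning pipeline the nodes would be detector outputs and the graph would be encoded by a graph network; this toy structure only shows the layout the abstract refers to.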
Invariant Feature Learning for Generalized Long-Tailed Classification
Existing long-tailed classification (LT) methods focus only on tackling the
class-wise imbalance, where head classes have more samples than tail classes,
but overlook the attribute-wise imbalance. In fact, even if the classes are
balanced, the samples within each class may still be long-tailed due to
varying attributes. Note that the latter is fundamentally more ubiquitous and
challenging than the former, because attributes are not only implicit
(unlabeled) in most datasets but also combinatorially complex, making them
prohibitively expensive to balance. Therefore, we introduce a novel research
problem, Generalized Long-Tailed classification (GLT), to jointly consider
both kinds of imbalance.
By "generalized", we mean that a GLT method should naturally solve the
traditional LT, but not vice versa. Not surprisingly, we find that most
class-wise LT methods degenerate on our two proposed benchmarks, ImageNet-GLT
and MSCOCO-GLT. We argue that this is because they over-emphasize adjusting
the class distribution while neglecting to learn attribute-invariant features.
To this end, we propose an Invariant Feature Learning (IFL) method as the first
strong baseline for GLT. IFL first discovers environments with divergent
intra-class distributions from the imperfect predictions and then learns
invariant features across them. Promisingly, as an improved feature backbone,
IFL boosts the entire LT line-up: one-/two-stage re-balancing, augmentation,
and ensemble methods. Codes and benchmarks are available on Github:
https://github.com/KaihuaTang/Generalized-Long-Tailed-Benchmarks.pytorch
Comment: Accepted to ECCV 2022.
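The two-step IFL recipe above (discover divergent intra-class environments, then learn features invariant across them) can be caricatured numerically. In this sketch, environments are split by model confidence and invariance is measured as the variance of per-environment losses; both choices are simplifying assumptions for illustration, not the paper's actual procedure:

```python
import numpy as np

def split_environments(confidences, threshold=0.5):
    """Partition intra-class samples into a high- and a low-confidence
    environment, standing in for 'divergent intra-class distributions'."""
    high = confidences >= threshold
    return high, ~high

def invariance_penalty(losses, env_masks):
    """Variance of the mean loss across environments: 0 when the model is
    perfectly invariant, large when one environment is much harder."""
    env_means = [losses[m].mean() for m in env_masks if m.any()]
    return float(np.var(env_means))

# Toy intra-class batch: two confident/easy samples, two hard ones.
conf = np.array([0.9, 0.8, 0.2, 0.1])
loss = np.array([0.1, 0.2, 1.0, 1.2])
hi, lo = split_environments(conf)
print(invariance_penalty(loss, [hi, lo]))  # large: features are not invariant
```

In training, this penalty would be added to the classification loss so that the backbone is pushed to perform equally well on all discovered environments.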
Class Is Invariant to Context and Vice Versa: On Learning Invariance for Out-Of-Distribution Generalization
Out-Of-Distribution generalization (OOD) is all about learning invariance
against environmental changes. If the context in every class is evenly
distributed, OOD would be trivial because the context can be easily removed due
to an underlying principle: class is invariant to context. However, collecting
such a balanced dataset is impractical. Learning on imbalanced data biases the
model toward context and thus hurts OOD generalization. Therefore, the key to
OOD is context balance. We argue that the assumption widely adopted in prior
work, namely that the context bias can be directly annotated or estimated from
biased class predictions, renders the recovered context incomplete or even
incorrect. In contrast, we point out the ever-overlooked other side of the
above principle: context is also invariant to class, which motivates us to
treat the classes (which are already labeled) as the varying environments for
resolving context bias (without context labels). We implement this idea by
minimizing a contrastive loss over intra-class sample similarity while
enforcing this similarity to be invariant across all classes.
On benchmarks with various context biases and domain gaps, we show that a
simple re-weighting based classifier equipped with our context estimation
achieves state-of-the-art performance. We provide theoretical justifications
in the Appendix and code at https://github.com/simpleshinobu/IRMCon.
Comment: Accepted by ECCV 2022.
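The core idea, that intra-class similarity should be invariant across classes when it captures context rather than class, can be sketched as a toy computation. Here plain cosine similarity stands in for the paper's contrastive loss; that substitution is an assumption made for brevity:

```python
import numpy as np

def mean_intraclass_similarity(features, labels, cls):
    """Average pairwise cosine similarity among samples of one class."""
    f = features[labels == cls]
    f = f / np.linalg.norm(f, axis=1, keepdims=True)
    sim = f @ f.T
    iu = np.triu_indices(len(f), k=1)  # upper triangle: each pair once
    return sim[iu].mean()

def cross_class_invariance(features, labels):
    """Variance of intra-class similarity across classes; each class acts as
    an environment, so a small value means the similarity (and hence the
    context it reflects) is invariant to class."""
    sims = [mean_intraclass_similarity(features, labels, c)
            for c in np.unique(labels)]
    return float(np.var(sims))
```

In the actual method this quantity would serve as an invariance penalty on a contrastively trained context branch; the toy version only shows which statistic is being equalized.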
Identifying Hard Noise in Long-Tailed Sample Distribution
Conventional de-noising methods rely on the assumption that all samples are
independent and identically distributed, so the resulting classifier, though
disturbed by noise, can still easily identify the noisy samples as outliers of
the training distribution. However, this assumption is unrealistic for
large-scale data, which is inevitably long-tailed. Such imbalanced training
data makes a classifier less discriminative for the tail classes, whose
previously "easy" noises now turn into "hard" ones: they are almost as
outlying as the clean tail samples. We introduce this new challenge as Noisy Long-Tailed
Classification (NLT). Not surprisingly, we find that most de-noising methods
fail to identify the hard noises, resulting in a significant performance drop on
the three proposed NLT benchmarks: ImageNet-NLT, Animal10-NLT, and Food101-NLT.
To this end, we design an iterative noisy learning framework called
Hard-to-Easy (H2E). Our bootstrapping philosophy is to first learn a
classifier as a noise identifier that is invariant to class and context
distributional changes, reducing the "hard" noises to "easy" ones, whose
removal further improves the
invariance. Experimental results show that our H2E outperforms state-of-the-art
de-noising methods and their ablations on long-tailed settings while
maintaining a stable performance on the conventional balanced settings.
Datasets and codes are available at https://github.com/yxymessi/H2E-Framework
Comment: Accepted to ECCV 2022 (Oral).
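The iterative Hard-to-Easy loop, score samples with the current noise identifier, drop the lowest-scoring fraction, and refit, can be sketched as follows. Scoring each sample by cosine similarity to its class mean is a stand-in assumption here; the paper's identifier is an invariantly trained classifier:

```python
import numpy as np

def class_mean_scores(X, y):
    """Cosine score of each sample against its own class mean; suspected
    noises score low because they sit far from the class centroid."""
    scores = np.empty(len(X))
    for c in np.unique(y):
        m = X[y == c].mean(axis=0)
        s = X[y == c] @ m / (np.linalg.norm(X[y == c], axis=1)
                             * np.linalg.norm(m) + 1e-8)
        scores[y == c] = s
    return scores

def hard_to_easy(X, y, rounds=3, drop_frac=0.1):
    """Bootstrapping loop: each round removes the lowest-scoring samples,
    which makes the next round's class means (and scores) cleaner."""
    keep = np.ones(len(X), dtype=bool)
    for _ in range(rounds):
        scores = class_mean_scores(X[keep], y[keep])
        cutoff = np.quantile(scores, drop_frac)
        idx = np.flatnonzero(keep)
        keep[idx[scores < cutoff]] = False  # drop suspected noise
    return keep
```

Each removal round makes the remaining distribution easier, so previously "hard" noises become identifiable, which is the bootstrapping intuition the abstract describes.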