8 research outputs found
Learning Representations Toward the Understanding of Out-of-Distribution for Neural Networks
Data-driven representations achieve powerful generalization performance in diverse information processing tasks. However, the generalization is often limited to test data from the same distribution as the training data (in-distribution, ID). In addition, neural networks often make overconfident and incorrect predictions for data outside the training distribution, called out-of-distribution (OOD) data. In this dissertation, we develop representations that can characterize OOD data for neural networks and utilize this characterization to generalize efficiently to OOD data. We categorize data-driven representations based on information flow in neural networks and develop novel gradient-based representations. In particular, we utilize backpropagated gradients to represent what a neural network has not learned from the data. The capability of gradient-based representations for OOD characterization is comprehensively analyzed in comparison with standard activation-based representations. We also utilize a regularization technique for the gradient-based representations to better characterize OOD data. Finally, we develop activation-based representations learned with auxiliary information to generalize efficiently to OOD data. We use an unsupervised learning framework to learn aligned representations of visual and attribute data. These aligned representations are utilized to calibrate the overconfident predictions toward ID classes, and the generalization performance is validated in the application of generalized zero-shot learning (GZSL). The developed GZSL method, GatingAE, achieves state-of-the-art performance in generalizing to OOD with significantly fewer model parameters than other state-of-the-art methods.
Ph.D
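The gating idea described above can be sketched as follows. This is an illustrative, hypothetical calibration scheme, not the actual GatingAE implementation; `gate_prob` stands in for a learned estimate that an input belongs to a seen (ID) class.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def gated_prediction(seen_logits, unseen_logits, gate_prob):
    """Hypothetical sketch: down-weight overconfident seen-class (ID)
    scores with a gate estimating the probability that the sample comes
    from a seen class. Returns the index of the predicted class in the
    concatenated [seen classes, unseen classes] label space."""
    seen = softmax(np.asarray(seen_logits)) * gate_prob
    unseen = softmax(np.asarray(unseen_logits)) * (1.0 - gate_prob)
    return int(np.concatenate([seen, unseen]).argmax())

# A high gate keeps the confident seen-class prediction; a low gate
# shifts probability mass toward the unseen classes.
print(gated_prediction([2.0, 1.0], [0.0, 3.0], 0.9))
print(gated_prediction([2.0, 1.0], [0.0, 3.0], 0.1))
```

Without the gate, the overconfident seen-class logits would always dominate; the gate rebalances the two label spaces before the argmax.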
Distorted Representation Space Characterization Through Backpropagated Gradients
In this paper, we utilize weight gradients from backpropagation to
characterize the representation space learned by deep learning algorithms. We
demonstrate the utility of such gradients in applications including perceptual
image quality assessment and out-of-distribution classification. The
applications are chosen to validate the effectiveness of gradients as features
when the test image distribution is distorted from the train image
distribution. In both applications, the proposed gradient-based features
outperform activation-based features. In image quality assessment, the proposed
approach is compared with other state-of-the-art approaches and is generally
the top-performing method on the TID2013 and MULTI-LIVE databases in terms of
accuracy, consistency, linearity, and monotonic behavior. Finally, we analyze
the effect of regularization on gradients using CURE-TSR dataset for
out-of-distribution classification.
Comment: 5 pages, 5 figures, 2 tables, ICIP 201
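As a toy illustration of the gradient features above, consider a linear model trained to reconstruct in-distribution inputs: the norm of the backpropagated weight gradient of the reconstruction loss can serve as an OOD score. This is a minimal sketch of the idea under that simplifying assumption, not the paper's networks or datasets.

```python
import numpy as np

def gradient_ood_score(W, x):
    """Norm of the backpropagated weight gradient of the reconstruction
    loss 0.5 * ||x - W @ x||^2; a large gradient marks an input the
    model has not learned to reconstruct (a candidate OOD sample)."""
    x = np.asarray(x, dtype=float)
    err = x - W @ x           # reconstruction error
    grad = -np.outer(err, x)  # dL/dW for the loss above
    return float(np.linalg.norm(grad))

# Toy model that has "learned" only the first coordinate of its inputs.
W = np.diag([1.0, 0.0])
id_score = gradient_ood_score(W, [1.0, 0.0])   # in-distribution input
ood_score = gradient_ood_score(W, [0.0, 1.0])  # out-of-distribution input
print(id_score < ood_score)
```

The in-distribution input is reconstructed perfectly, so its gradient vanishes, while the out-of-distribution input produces a nonzero gradient.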
Masked Vision and Language Modeling for Multi-modal Representation Learning
In this paper, we study how to use masked signal modeling in vision and
language (V+L) representation learning. Instead of developing masked language
modeling (MLM) and masked image modeling (MIM) independently, we propose to
build joint masked vision and language modeling, where the masked signal of one
modality is reconstructed with the help from another modality. This is
motivated by the nature of image-text paired data: the image and the text
convey almost the same information, but in different formats. The
masked signal reconstruction of one modality conditioned on another modality
can also implicitly learn cross-modal alignment between language tokens and
image patches. Our experiments on various V+L tasks show that the proposed
method not only achieves state-of-the-art performance when using a large amount
of data, but also outperforms competing methods by a significant margin in
limited-training-data regimes.
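The joint masking scheme can be sketched as follows: mask tokens in each modality, reconstruct them conditioned on the other, unmasked modality, and sum the two reconstruction losses. The `reconstruct` function stands in for the actual cross-modal model; everything here is an illustrative assumption, not the paper's architecture.

```python
import numpy as np

def mask_tokens(tokens, ratio, rng):
    """Zero out a random fraction of token vectors; return the masked copy and mask."""
    n = tokens.shape[0]
    mask = np.zeros(n, dtype=bool)
    mask[rng.choice(n, size=max(1, int(n * ratio)), replace=False)] = True
    masked = tokens.copy()
    masked[mask] = 0.0
    return masked, mask

def joint_masked_loss(img_tokens, txt_tokens, reconstruct, ratio=0.3, seed=0):
    """Masked image modeling conditioned on text plus masked language
    modeling conditioned on the image, combined into one joint objective."""
    rng = np.random.default_rng(seed)
    m_img, im = mask_tokens(img_tokens, ratio, rng)
    m_txt, tm = mask_tokens(txt_tokens, ratio, rng)
    img_hat = reconstruct(m_img, txt_tokens)  # reconstruct image from text
    txt_hat = reconstruct(m_txt, img_tokens)  # reconstruct text from image
    mim = np.mean((img_hat[im] - img_tokens[im]) ** 2)  # masked-image loss
    mlm = np.mean((txt_hat[tm] - txt_tokens[tm]) ** 2)  # masked-text loss
    return mim + mlm

# If the paired modalities carry identical information and the "model"
# simply copies the conditioning modality, the joint loss vanishes.
tokens = np.random.default_rng(1).normal(size=(4, 3))
copy_other = lambda masked, cond: cond
print(joint_masked_loss(tokens, tokens.copy(), copy_other))
```

The zero-loss toy case mirrors the abstract's motivation: because the two modalities convey almost the same information, each can supervise the other's masked reconstruction.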
Generate then Select: Open-ended Visual Question Answering Guided by World Knowledge
The open-ended Visual Question Answering (VQA) task requires AI models to
jointly reason over visual and natural language inputs using world knowledge.
Recently, pre-trained Language Models (PLM) such as GPT-3 have been applied to
the task and shown to be powerful world knowledge sources. However, these
methods suffer from low knowledge coverage caused by PLM bias (the tendency
to generate certain tokens over other tokens regardless of prompt changes)
and high dependency on PLM quality: only models using GPT-3 can achieve the
best result.
To address the aforementioned challenges, we propose RASO: a new VQA pipeline
that deploys a generate-then-select strategy guided by world knowledge for the
first time. Rather than following the de facto standard of training a multi-modal
model that directly generates the VQA answer, RASO first adopts a PLM to generate
all the possible answers, and then trains a lightweight answer selection model
to pick the correct one. As shown in our analysis, RASO expands the knowledge
coverage from in-domain training data by a large margin. We provide extensive
experimentation and show the effectiveness of our pipeline by advancing the
state-of-the-art by 4.1% on OK-VQA, without additional computation cost. Code
and models are released at http://cogcomp.org/page/publication_view/1010
Comment: Accepted to ACL 2023 Finding
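The generate-then-select strategy reduces to the following control flow. The function names are hypothetical placeholders for the PLM and the selection model, not RASO's actual API.

```python
def generate_then_select(question, image_feat, generate_candidates, score):
    """Hypothetical sketch of a generate-then-select VQA pipeline:
    a PLM proposes candidate answers, then a lightweight selection
    model picks the best one given the question and image features."""
    candidates = generate_candidates(question)  # generation step (PLM)
    # selection step: score each candidate and keep the best
    return max(candidates, key=lambda a: score(question, image_feat, a))

# Dummy stand-ins for the PLM and the selection model.
propose = lambda q: ["red", "blue", "green"]
select_score = lambda q, img, a: {"red": 0.2, "blue": 0.7, "green": 0.1}[a]
print(generate_then_select("what color is the sky?", None, propose, select_score))
```

Splitting generation from selection is what lets the selection model stay lightweight: it only ranks a short candidate list instead of generating answers over the full vocabulary.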