Learning Convolutional Text Representations for Visual Question Answering
Visual question answering is a recently proposed artificial intelligence task
that requires a deep understanding of both images and texts. In deep learning,
images are typically modeled through convolutional neural networks, and texts
are typically modeled through recurrent neural networks. While the requirement
for modeling images is similar to traditional computer vision tasks, such as
object recognition and image classification, visual question answering raises a
different need for textual representation as compared to other natural language
processing tasks. In this work, we perform a detailed analysis on natural
language questions in visual question answering. Based on the analysis, we
propose to rely on convolutional neural networks for learning textual
representations. By exploring the various properties of convolutional neural
networks specialized for text data, such as width and depth, we present our
"CNN Inception + Gate" model. We show that our model improves question
representations and thus the overall accuracy of visual question answering
models. We also show that the text representation requirement in visual
question answering is more complicated and comprehensive than that in
conventional natural language processing tasks, making it a better task to
evaluate textual representation methods. Shallow models like fastText, which
can obtain results comparable to deep learning models in tasks like text
classification, are not suitable for visual question answering.
Comment: Conference paper at SDM 2018. https://github.com/divelab/sva
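As a rough illustration, the sketch below approximates such a question encoder with the two ingredients the abstract names: inception-style parallel 1D convolutions of several widths over word embeddings, followed by a GLU-style gate and max-pooling over time. It is a minimal, hypothetical reconstruction with our own class and parameter names, not the released "CNN Inception + Gate" model (see the linked repository for that).

import torch
import torch.nn as nn

class GatedInceptionTextEncoder(nn.Module):
    """Hypothetical sketch: multi-width convolutions plus gating over word embeddings."""
    def __init__(self, vocab_size, embed_dim=300, channels=256, widths=(1, 2, 3)):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        # Each branch emits 2*channels so its output can be split into value and gate.
        self.branches = nn.ModuleList(
            nn.Conv1d(embed_dim, 2 * channels, kernel_size=w, padding=w // 2)
            for w in widths
        )

    def forward(self, token_ids):                      # (batch, seq_len)
        x = self.embed(token_ids).transpose(1, 2)      # (batch, embed_dim, seq_len)
        pooled = []
        for conv in self.branches:
            value, gate = conv(x).chunk(2, dim=1)      # gated linear unit
            h = value * torch.sigmoid(gate)
            pooled.append(h.max(dim=2).values)         # max-pool over time
        return torch.cat(pooled, dim=1)                # fixed-size question vector

encoder = GatedInceptionTextEncoder(vocab_size=10000)
questions = torch.randint(1, 10000, (4, 14))           # a batch of 4 tokenized questions
print(encoder(questions).shape)                        # torch.Size([4, 768])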
Feature Type Analysis in Automated Genre Classification
In this paper, we compare classifiers based on language-model, image, and stylistic features for automated genre classification. Most previous studies in genre classification have built models on an amalgamated representation of a document that combines a multitude of features. In such models, the roles of individual features cannot be separated, which makes it difficult to determine how to improve a classifier that performs poorly on particular genres. By independently modeling and comparing classifiers built on three feature types, describing visual, stylistic, and topical properties, we demonstrate that different genres have distinctive feature strengths.
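The comparison setup can be pictured with a short, hedged sketch: one classifier is fit per feature type and the per-genre scores are inspected separately. The feature matrices below are random placeholders standing in for the paper's actual language-model, image, and stylistic features.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
n_docs = 200
genres = rng.integers(0, 3, size=n_docs)               # three genre labels

feature_sets = {                                       # placeholders, not the paper's features
    "language_model": rng.normal(size=(n_docs, 50)),
    "image":          rng.normal(size=(n_docs, 20)),
    "stylistic":      rng.normal(size=(n_docs, 10)),
}

for name, X in feature_sets.items():
    # One independent classifier per feature type, so per-genre strengths stay separable.
    preds = cross_val_predict(LogisticRegression(max_iter=1000), X, genres, cv=5)
    print(f"=== {name} features ===")
    print(classification_report(genres, preds, zero_division=0))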
Image-based Text Classification using 2D Convolutional Neural Networks
We propose a new approach to text classification
in which we consider the input text as an image and apply
2D Convolutional Neural Networks to learn the local and
global semantics of the sentences from the variations of the
visual patterns of words. Our approach demonstrates that
it is possible to get semantically meaningful features from
images with text without using optical character recognition
and sequential processing pipelines, techniques that traditional
natural language processing algorithms require. To validate
our approach, we present results for two applications: text
classification and dialog modeling. Using a 2D Convolutional
Neural Network, we were able to outperform the state-of-the-art
accuracy results for a Chinese text classification task and
achieved promising results on seven English text classification
tasks. Furthermore, our approach outperformed memory
networks without match types when using out-of-vocabulary
entities from Task 4 of the bAbI dialog dataset.
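A minimal sketch of the idea (not the authors' code) is shown below: a sentence is rendered to a grayscale image with PIL and passed through an ordinary 2D CNN, with no tokenizer, OCR, or sequential processing involved. The network and rendering parameters here are illustrative.

import numpy as np
import torch
import torch.nn as nn
from PIL import Image, ImageDraw

def render_text(text, width=256, height=32):
    """Draw the sentence on a blank grayscale canvas and return a (1, H, W) tensor."""
    img = Image.new("L", (width, height), color=255)
    ImageDraw.Draw(img).text((2, 8), text, fill=0)     # default PIL bitmap font
    return torch.from_numpy(np.array(img)).float().div(255).unsqueeze(0)

class TextImageCNN(nn.Module):
    def __init__(self, num_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, x):                              # (batch, 1, H, W)
        return self.classifier(self.features(x).flatten(1))

x = render_text("is this review positive or negative?").unsqueeze(0)
print(TextImageCNN()(x).shape)                         # torch.Size([1, 2])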
Borrowing Human Senses: Comment-Aware Self-Training for Social Media Multimodal Classification
Social media creates massive amounts of multimedia content with paired images
and text every day, presenting a pressing need to automate vision-and-language
understanding for various multimodal classification tasks. Compared to the
commonly studied visual-lingual data, social media posts tend to exhibit
more implicit image-text relations. To better bridge the cross-modal semantics
therein, we capture hinting features from user comments, which are retrieved
by jointly leveraging visual and lingual similarity. The
classification tasks are then explored via self-training in a teacher-student
framework, motivated by the usually limited scale of labeled data in existing
benchmarks. Extensive experiments are conducted on four multimodal social
media benchmarks covering image-text relation classification, sarcasm detection,
sentiment classification, and hate speech detection. The results show that our
method further advances the performance of previous state-of-the-art models,
which do not employ comment modeling or self-training.
Comment: accepted to EMNLP 202
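The teacher-student self-training loop described above can be sketched as follows. Comment retrieval and the multimodal encoders are abstracted behind placeholder callables, all names are illustrative rather than the paper's API, and the confidence threshold for keeping pseudo-labels is one common choice rather than necessarily the paper's.

import torch
import torch.nn.functional as F

def self_train(teacher, student, labeled_loader, unlabeled_loader, optimizer,
               threshold=0.9, rounds=3):
    """Sketch: the teacher pseudo-labels unlabeled posts; the student trains on both."""
    for _ in range(rounds):
        # 1) Teacher assigns pseudo-labels to unlabeled posts (image + text + comments).
        pseudo = []
        teacher.eval()
        with torch.no_grad():
            for feats in unlabeled_loader:             # fused multimodal features
                probs = F.softmax(teacher(feats), dim=-1)
                conf, labels = probs.max(dim=-1)
                keep = conf >= threshold               # keep only confident predictions
                if keep.any():
                    pseudo.append((feats[keep], labels[keep]))
        # 2) Student trains on gold labels plus the confident pseudo-labels.
        student.train()
        for feats, labels in list(labeled_loader) + pseudo:
            loss = F.cross_entropy(student(feats), labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        # 3) The student becomes the next round's teacher.
        teacher.load_state_dict(student.state_dict())
    return student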
Chain of Thought Prompt Tuning in Vision Language Models
Language-Image Pre-training has demonstrated promising results on zero-shot
and few-shot downstream tasks by prompting visual models with natural language
prompts. However, most recent studies use only a single prompt for tuning,
neglecting the inherent step-by-step cognitive reasoning process that humans
conduct in complex task settings, for example, when processing images from
unfamiliar domains. Chain of Thought is a simple and effective approximation of
the human reasoning process and has been proven useful for natural language
processing (NLP) tasks. Based on this cognitive intuition, we believe that
conducting effective reasoning is also an important problem in visual tasks,
and a chain of thought could be a solution to this problem. In this work, we
propose a novel chain of thought prompt tuning for vision-language modeling.
Extensive experiments show that our method not only generalizes better in image
classification tasks, transfers better beyond a single dataset, and
achieves stronger domain generalization, but also performs much better
in image-text retrieval and visual question answering, which require more
reasoning capabilities. We are the first to successfully adapt chain-of-thought
prompting to a setting that combines visual and textual embeddings. We will
release our code.
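A heavily simplified sketch of the chaining idea appears below: each reasoning step gets its own learnable soft prompt, and the text features produced at step t condition the prompt used at step t+1 through a small trainable projection, while the pre-trained text encoder stays frozen. The module and its interfaces are our assumptions for illustration, not the authors' implementation.

import torch
import torch.nn as nn

class ChainOfThoughtPrompts(nn.Module):
    """Sketch: a chain of soft prompts, each conditioned on the previous step's output."""
    def __init__(self, text_encoder, steps=3, prompt_len=4, dim=512):
        super().__init__()
        self.text_encoder = text_encoder               # assumed: frozen, maps token embeddings
        self.prompts = nn.Parameter(0.02 * torch.randn(steps, prompt_len, dim))
        self.link = nn.Linear(dim, prompt_len * dim)   # carries step t's output to step t+1

    def forward(self, class_token_embeds):             # (num_classes, n_tok, dim)
        carry = 0.0
        for step_prompt in self.prompts:               # one pass per reasoning step
            prompt = step_prompt + carry               # condition on the previous step
            tokens = torch.cat(
                [prompt.expand(class_token_embeds.size(0), -1, -1), class_token_embeds],
                dim=1,
            )
            text_feat = self.text_encoder(tokens)      # assumed to return (num_classes, dim)
            carry = self.link(text_feat.mean(0)).view(-1, text_feat.size(-1))
        return text_feat                               # final-step class embeddings

Scoring would then proceed as in CLIP-style models: cosine similarity between image features and the final-step text features, with only the prompts and the link layer trained.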
BEiT: BERT Pre-Training of Image Transformers
We introduce a self-supervised vision representation model BEiT, which stands
for Bidirectional Encoder representation from Image Transformers. Following
BERT developed in the natural language processing area, we propose a masked
image modeling task to pretrain vision Transformers. Specifically, each image
has two views in our pre-training, i.e., image patches (such as 16x16 pixels)
and visual tokens (i.e., discrete tokens). We first "tokenize" the original
image into visual tokens. Then we randomly mask some image patches and feed them
into the backbone Transformer. The pre-training objective is to recover the
original visual tokens based on the corrupted image patches. After pre-training
BEiT, we directly fine-tune the model parameters on downstream tasks by
appending task layers on top of the pretrained encoder. Experimental results on
image classification and semantic segmentation show that our model achieves
competitive results with previous pre-training methods. For example, base-size
BEiT achieves 83.2% top-1 accuracy on ImageNet-1K, significantly outperforming
from-scratch DeiT training (81.8%) with the same setup. Moreover, large-size
BEiT obtains 86.3% using only ImageNet-1K, even outperforming ViT-L with
supervised pre-training on ImageNet-22K (85.2%). The code and pretrained models
are available at https://aka.ms/beit.
Comment: A Path to the BERT Moment of CV. Work in progress
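A condensed sketch of the masked image modeling objective is given below. The tokenizer (a dVAE in the paper) and the Transformer backbone are stand-in modules here, and masked patches are simply zeroed rather than replaced with the learnable mask embedding BEiT actually uses.

import torch
import torch.nn.functional as F

def masked_image_modeling_step(patches, tokenizer, backbone, head, mask_ratio=0.4):
    """patches: (batch, num_patches, patch_dim) flattened 16x16 image patches."""
    with torch.no_grad():
        visual_tokens = tokenizer(patches)             # (batch, num_patches) discrete token ids
    mask = torch.rand(patches.shape[:2], device=patches.device) < mask_ratio
    corrupted = patches.masked_fill(mask.unsqueeze(-1), 0.0)   # corrupt the masked patches
    hidden = backbone(corrupted)                       # (batch, num_patches, hidden_dim)
    logits = head(hidden)                              # (batch, num_patches, token_vocab)
    # Recover the original visual tokens, but only at the masked positions.
    return F.cross_entropy(logits[mask], visual_tokens[mask])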
Frozen Transformers in Language Models Are Effective Visual Encoder Layers
This paper reveals that large language models (LLMs), despite being trained
solely on textual data, are surprisingly strong encoders for purely visual
tasks in the absence of language. Even more intriguingly, this can be achieved
by a simple yet previously overlooked strategy -- employing a frozen
transformer block from pre-trained LLMs as a constituent encoder layer to
directly process visual tokens. Our work pushes the boundaries of leveraging
LLMs for computer vision tasks, significantly departing from conventional
practices that typically necessitate a multi-modal vision-language setup with
associated language prompts, inputs, or outputs. We demonstrate that our
approach consistently enhances performance across a diverse range of tasks,
encompassing pure 2D and 3D visual recognition tasks (e.g., image and point
cloud classification), temporal modeling tasks (e.g., action recognition),
non-semantic tasks (e.g., motion forecasting), and multi-modal tasks (e.g.,
2D/3D visual question answering and image-text retrieval). Such improvements
are a general phenomenon, applicable to various types of LLMs (e.g., LLaMA and
OPT) and different LLM transformer blocks. We additionally propose the
information filtering hypothesis to explain the effectiveness of pre-trained
LLMs in visual encoding -- the pre-trained LLM transformer blocks discern
informative visual tokens and further amplify their effect. This hypothesis is
empirically supported by the observation that the feature activation, after
training with LLM transformer blocks, exhibits a stronger focus on relevant
regions. We hope that our work inspires new perspectives on utilizing LLMs and
deepening our understanding of their underlying mechanisms. Code is available
at https://github.com/ziqipang/LM4VisualEncoding.
Comment: 23 pages, 13 figures. Code at https://github.com/ziqipang/LM4VisualEncoding
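The recipe can be sketched as below, using a frozen block from a Hugging Face OPT checkpoint (the abstract names OPT and LLaMA); the projection layers, model name, and dimensions are illustrative assumptions, and the authors' actual code is in the repository above.

import torch
import torch.nn as nn
from transformers import OPTModel

class FrozenLLMVisualHead(nn.Module):
    """Sketch: a frozen pre-trained LLM transformer block applied to visual tokens."""
    def __init__(self, vit_dim=768, num_classes=1000, llm_name="facebook/opt-125m"):
        super().__init__()
        llm = OPTModel.from_pretrained(llm_name)
        self.block = llm.decoder.layers[-1]            # one pre-trained transformer block
        for p in self.block.parameters():              # keep it frozen
            p.requires_grad = False
        llm_dim = llm.config.hidden_size
        self.proj_in = nn.Linear(vit_dim, llm_dim)     # trainable adapters around the block
        self.proj_out = nn.Linear(llm_dim, vit_dim)
        self.head = nn.Linear(vit_dim, num_classes)

    def forward(self, vit_tokens):                     # (batch, num_tokens, vit_dim) from a ViT
        h = self.proj_in(vit_tokens)
        h = self.block(h)[0]                           # frozen LLM block processes visual tokens
        h = self.proj_out(h)
        return self.head(h.mean(dim=1))                # pool over tokens, then classify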