10,152 research outputs found
VQS: Linking Segmentations to Questions and Answers for Supervised Attention in VQA and Question-Focused Semantic Segmentation
Rich and dense human labeled datasets are among the main enabling factors for
the recent advance on vision-language understanding. Many seemingly distant
annotations (e.g., semantic segmentation and visual question answering (VQA))
are inherently connected in that they reveal different levels and perspectives
of human understandings about the same visual scenes --- and even the same set
of images (e.g., of COCO). The popularity of COCO correlates those annotations
and tasks. Explicitly linking them up may significantly benefit both individual
tasks and the unified vision and language modeling. We present the preliminary
work of linking the instance segmentations provided by COCO to the questions
and answers (QAs) in the VQA dataset, and name the collected links visual
questions and segmentation answers (VQS). They transfer human supervision
between the previously separate tasks, offer more effective leverage to
existing problems, and also open the door for new research problems and models.
We study two applications of the VQS data in this paper: supervised attention
for VQA and a novel question-focused semantic segmentation task. For the
former, we obtain state-of-the-art results on the VQA real multiple-choice task
by simply augmenting the multilayer perceptrons with some attention features
that are learned using the segmentation-QA links as explicit supervision. To
put the latter in perspective, we study two plausible methods and compare them
to an oracle method assuming that the instance segmentations are given at the
test stage.Comment: To appear on ICCV 201
Multimedia information technology and the annotation of video
The state of the art in multimedia information technology has not progressed to the point where a single solution is available to meet all reasonable needs of documentalists and users of video archives. In general, we do not have an optimistic view of the usability of new technology in this domain, but digitization and digital power can be expected to cause a small revolution in the area of video archiving. The volume of data leads to two views of the future: on the pessimistic side, overload of data will cause lack of annotation capacity, and on the optimistic side, there will be enough data from which to learn selected concepts that can be deployed to support automatic annotation. At the threshold of this interesting era, we make an attempt to describe the state of the art in technology. We sample the progress in text, sound, and image processing, as well as in machine learning
Improving Small Object Proposals for Company Logo Detection
Many modern approaches for object detection are two-staged pipelines. The
first stage identifies regions of interest which are then classified in the
second stage. Faster R-CNN is such an approach for object detection which
combines both stages into a single pipeline. In this paper we apply Faster
R-CNN to the task of company logo detection. Motivated by its weak performance
on small object instances, we examine in detail both the proposal and the
classification stage with respect to a wide range of object sizes. We
investigate the influence of feature map resolution on the performance of those
stages.
Based on theoretical considerations, we introduce an improved scheme for
generating anchor proposals and propose a modification to Faster R-CNN which
leverages higher-resolution feature maps for small objects. We evaluate our
approach on the FlickrLogos dataset improving the RPN performance from 0.52 to
0.71 (MABO) and the detection performance from 0.52 to 0.67 (mAP).Comment: 8 Pages, ICMR 201
- …