Conditional Image-Text Embedding Networks
This paper presents an approach for grounding phrases in images which jointly
learns multiple text-conditioned embeddings in a single end-to-end model. In
order to differentiate text phrases into semantically distinct subspaces, we
propose a concept weight branch that automatically assigns phrases to
embeddings, whereas prior works predefine such assignments. Our proposed
solution simplifies the representation requirements for individual embeddings
and allows the underrepresented concepts to take advantage of the shared
representations before feeding them into concept-specific layers. Comprehensive
experiments verify the effectiveness of our approach across three phrase
grounding datasets, Flickr30K Entities, ReferIt Game, and Visual Genome, where
we obtain 4%, 3%, and 4% improvements, respectively, in grounding performance
over a strong region-phrase embedding baseline.
Comment: ECCV 2018 accepted paper
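As a rough illustration of the idea (not the authors' implementation), the sketch below shows, in PyTorch, how a concept weight branch could softly assign a phrase to several text-conditioned embeddings; the module names, dimensions, and the use of cosine similarity are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConceptWeightedEmbedding(nn.Module):
    """Illustrative sketch: K text-conditioned embeddings plus a concept
    weight branch that softly assigns each phrase to the embeddings."""

    def __init__(self, phrase_dim, region_dim, joint_dim=256, num_concepts=4):
        super().__init__()
        self.phrase_proj = nn.ModuleList(
            [nn.Linear(phrase_dim, joint_dim) for _ in range(num_concepts)])
        self.region_proj = nn.ModuleList(
            [nn.Linear(region_dim, joint_dim) for _ in range(num_concepts)])
        # Concept weight branch: predicts a distribution over the K embeddings
        # from the phrase representation alone.
        self.concept_weights = nn.Linear(phrase_dim, num_concepts)

    def forward(self, phrase_feat, region_feats):
        # phrase_feat: (B, phrase_dim); region_feats: (B, R, region_dim)
        w = F.softmax(self.concept_weights(phrase_feat), dim=-1)       # (B, K)
        sims = []
        for k in range(len(self.phrase_proj)):
            p = F.normalize(self.phrase_proj[k](phrase_feat), dim=-1)   # (B, D)
            r = F.normalize(self.region_proj[k](region_feats), dim=-1)  # (B, R, D)
            sims.append(torch.einsum('bd,brd->br', p, r))               # (B, R)
        sims = torch.stack(sims, dim=-1)                                 # (B, R, K)
        # Final region-phrase score: concept-weighted sum of per-embedding scores.
        return (sims * w.unsqueeze(1)).sum(-1)                           # (B, R)
```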
End-to-End Localization and Ranking for Relative Attributes
We propose an end-to-end deep convolutional network to simultaneously
localize and rank relative visual attributes, given only weakly-supervised
pairwise image comparisons. Unlike previous methods, our network jointly learns
the attribute's features, localization, and ranker. The localization module of
our network discovers the most informative image region for the attribute,
which is then used by the ranking module to learn a ranking model of the
attribute. Our end-to-end framework also significantly speeds up processing and
is much faster than previous methods. We show state-of-the-art ranking results
on various relative attribute datasets, and our qualitative localization
results clearly demonstrate our network's ability to learn meaningful image
patches.
Comment: Appears in European Conference on Computer Vision (ECCV), 201
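The pairwise supervision can be made concrete with a small sketch: a RankNet-style loss over weakly supervised image comparisons, assuming the network outputs a scalar attribute score per image. This is illustrative only; the paper's exact ranking loss and localization module are not reproduced here.

```python
import torch
import torch.nn.functional as F

def pairwise_ranking_loss(score_a, score_b, target):
    """RankNet-style loss for weakly supervised pairwise comparisons.

    score_a, score_b: predicted attribute strengths for the two images, shape (B,).
    target: 1.0 if image A shows more of the attribute, 0.0 if image B does,
            0.5 if the pair is labelled as roughly equal.
    """
    # Probability that A outranks B under a Bradley-Terry-style model.
    p_a_over_b = torch.sigmoid(score_a - score_b)
    return F.binary_cross_entropy(p_a_over_b, target)

# Example: three pairs, the last annotated as equal.
a = torch.tensor([2.1, 0.3, 1.0])
b = torch.tensor([1.5, 0.9, 1.0])
t = torch.tensor([1.0, 0.0, 0.5])
print(pairwise_ranking_loss(a, b, t))
```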
Unsupervised Holistic Image Generation from Key Local Patches
We introduce a new problem of generating an image based on a small number of
key local patches without any geometric prior. In this work, key local patches
are defined as informative regions of the target object or scene. This is a
challenging problem since it requires generating realistic images and
predicting locations of parts at the same time. We construct adversarial
networks to tackle this problem. A generator network generates a fake image as
well as a mask based on the encoder-decoder framework. On the other hand, a
discriminator network aims to detect fake images. The network is trained with
three losses to consider spatial, appearance, and adversarial information. The
spatial loss determines whether the locations of predicted parts are correct.
The appearance loss encourages the input patches to be reproduced in the output
image with little modification. The adversarial loss ensures that output images
look realistic.
The proposed network is trained without supervisory signals since no labels of
key parts are required. Experimental results on six datasets demonstrate that
the proposed algorithm performs favorably on challenging objects and scenes.
Comment: 16 pages
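For intuition, a hedged sketch of how the three losses might be combined for the generator is given below; the specific loss forms, weights, and tensor layouts are assumptions rather than the paper's definitions.

```python
import torch
import torch.nn.functional as F

def generator_loss(fake_img, pred_mask, true_mask, input_patches, patch_boxes,
                   d_fake_logits, w_spatial=10.0, w_app=10.0, w_adv=1.0):
    """Illustrative combination of the three losses described above.
    The weights and exact loss forms are assumptions, not the paper's."""
    # Spatial loss: predicted part locations (a mask) should match the
    # locations implied by the input patches.
    spatial = F.binary_cross_entropy_with_logits(pred_mask, true_mask)

    # Appearance loss: input patches should reappear in the output image
    # at their predicted locations with little modification.
    app = 0.0
    for patch, (y, x, h, w) in zip(input_patches, patch_boxes):
        app = app + F.l1_loss(fake_img[..., y:y + h, x:x + w], patch)
    app = app / max(len(input_patches), 1)

    # Adversarial loss: fool the discriminator into calling the image real.
    adv = F.binary_cross_entropy_with_logits(
        d_fake_logits, torch.ones_like(d_fake_logits))

    return w_spatial * spatial + w_app * app + w_adv * adv
```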
Many-shot from Low-shot: Learning to Annotate using Mixed Supervision for Object Detection
Object detection has witnessed significant progress by relying on large,
manually annotated datasets. Annotating such datasets is highly time consuming
and expensive, which motivates the development of weakly supervised and
few-shot object detection methods. However, these methods largely underperform
their strongly supervised counterparts, as weak training signals \emph{often}
result in partial or oversized detections. Towards solving this problem, we
introduce, for the first time, an online annotation module (OAM)
that learns to generate a many-shot set of \emph{reliable} annotations from a
larger volume of weakly labelled images. Our OAM can be jointly trained with
any fully supervised two-stage object detection method, providing additional
training annotations on the fly. This results in a fully end-to-end strategy
that only requires a low-shot set of fully annotated images. The integration of
the OAM with Fast(er) R-CNN improves their performance in terms of mAP and
AP50 on the PASCAL VOC 2007 and MS-COCO benchmarks, and significantly
outperforms competing methods using mixed supervision.
Comment: Accepted at ECCV 2020. Camera-ready version and Appendices
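To make the role of the OAM concrete, the toy sketch below generates pseudo-annotations from weakly labelled images by keeping confident detections that agree with the image-level labels. The actual OAM is a learned module trained jointly with the detector, so this simple thresholding rule is only an illustration of what "annotating on the fly" means; the function, its arguments, and the threshold are all hypothetical.

```python
import torch

def generate_online_annotations(detections, image_labels, score_thresh=0.8):
    """Toy stand-in for an online annotation module: turn weakly labelled
    images (image-level classes only) into box annotations by keeping only
    confident detections whose class appears among the image-level labels.

    detections: list of dicts with 'boxes' (N, 4), 'scores' (N,), 'labels' (N,)
    image_labels: list of sets of class ids present in each image
    """
    pseudo_annotations = []
    for det, present in zip(detections, image_labels):
        label_ok = torch.tensor([int(l) in present for l in det['labels']])
        keep = (det['scores'] > score_thresh) & label_ok
        pseudo_annotations.append({'boxes': det['boxes'][keep],
                                   'labels': det['labels'][keep]})
    return pseudo_annotations
```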
Image Co-localization by Mimicking a Good Detector's Confidence Score Distribution
Given a set of images containing objects from the same category, the task of
image co-localization is to identify and localize each instance. This paper
shows that this problem can be solved by a simple but intriguing idea, that is,
a common object detector can be learnt by making its detection confidence
scores distributed like those of a strongly supervised detector. More
specifically, we observe that given a set of object proposals extracted from an
image that contains the object of interest, an accurate strongly supervised
object detector should give high scores to only a small minority of proposals,
and low scores to most of them. Thus, we devise an entropy-based objective
function to enforce the above property when learning the common object
detector. Once the detector is learnt, we resort to a segmentation approach to
refine the localization. We show that despite its simplicity, our approach
outperforms state-of-the-art methods.
Comment: Accepted to Proc. European Conf. Computer Vision 201
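The entropy idea is easy to state in code: per image, take the softmax of the proposal scores and penalise its entropy, so that only a few proposals can receive high confidence. The snippet below is a sketch of that idea; the paper's exact objective may differ.

```python
import torch
import torch.nn.functional as F

def score_entropy(proposal_scores):
    """Entropy of the detector's confidence distribution over one image's
    proposals. Minimising it pushes the detector towards the behaviour of a
    strongly supervised one: a few high-scoring proposals, low scores for
    the rest."""
    p = F.softmax(proposal_scores, dim=0)
    return -(p * torch.log(p + 1e-12)).sum()

# A peaked score distribution has lower entropy than a flat one.
flat = torch.zeros(100)
peaked = torch.zeros(100)
peaked[3] = 10.0
print(score_entropy(flat), score_entropy(peaked))
```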
A Diagram Is Worth A Dozen Images
Diagrams are common tools for representing complex concepts, relationships
and events, often when it would be difficult to portray the same information
with natural images. Understanding natural images has been extensively studied
in computer vision, while diagram understanding has received little attention.
In this paper, we study the problem of diagram interpretation and reasoning,
the challenging task of identifying the structure of a diagram and the
semantics of its constituents and their relationships. We introduce Diagram
Parse Graphs (DPG) as our representation to model the structure of diagrams. We
define syntactic parsing of diagrams as learning to infer DPGs for diagrams and
study semantic interpretation and reasoning of diagrams in the context of
diagram question answering. We devise an LSTM-based method for syntactic
parsing of diagrams and introduce a DPG-based attention model for diagram
question answering. We compile a new dataset of diagrams with exhaustive
annotations of constituents and relationships for over 5,000 diagrams and
15,000 questions and answers. Our results show the significance of our models
for syntactic parsing and question answering in diagrams using DPGs.
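As a rough illustration of what a DPG might look like as a data structure (the field names and relation types below are mine, not the paper's schema):

```python
from dataclasses import dataclass, field

@dataclass
class Constituent:
    id: int
    kind: str        # e.g. 'blob', 'text', 'arrow', 'arrow_head'
    box: tuple       # (x1, y1, x2, y2) in image coordinates
    text: str = ""   # OCR string for text constituents

@dataclass
class DiagramParseGraph:
    """Minimal stand-in for a Diagram Parse Graph: constituents as nodes,
    typed relationships as directed edges."""
    nodes: list = field(default_factory=list)
    edges: list = field(default_factory=list)   # (src_id, dst_id, relation)

    def add_relation(self, src, dst, relation):
        self.edges.append((src.id, dst.id, relation))

# Hypothetical example: a text label naming a diagram element.
dpg = DiagramParseGraph()
caterpillar = Constituent(0, 'blob', (10, 20, 80, 60))
label = Constituent(1, 'text', (90, 25, 150, 40), text='larva')
dpg.nodes += [caterpillar, label]
dpg.add_relation(label, caterpillar, 'names')
```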
Visual place recognition using landmark distribution descriptors
Recent work by Suenderhauf et al. [1] demonstrated improved visual place
recognition using proposal regions coupled with features from convolutional
neural networks (CNN) to match landmarks between views. In this work we extend
the approach by introducing descriptors built from landmark features which also
encode the spatial distribution of the landmarks within a view. Matching
descriptors then enforces consistency of the relative positions of landmarks
between views. This has a significant impact on performance. For example, in
experiments on 10 image-pair datasets, each consisting of 200 urban locations
with significant differences in viewing positions and conditions, we recorded
average precision of around 70% (at 100% recall), compared with 58% obtained
using whole image CNN features and 50% for the method in [1].
Comment: 13 pages
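One way to picture a landmark distribution descriptor (this construction is an assumption, not necessarily the one used in the paper) is to pool landmark CNN features into a coarse spatial grid, so that matching two descriptors implicitly checks that landmarks occupy consistent relative positions within the view:

```python
import numpy as np

def landmark_distribution_descriptor(features, boxes, image_size, grid=8):
    """Sketch of a descriptor coupling landmark CNN features with the spatial
    distribution of the landmarks in a view.

    features: (N, D) CNN features, one per landmark proposal
    boxes:    (N, 4) landmark boxes as (x1, y1, x2, y2)
    image_size: (height, width) of the view
    """
    h, w = image_size
    desc = np.zeros((grid, grid, features.shape[1]), dtype=np.float32)
    for f, (x1, y1, x2, y2) in zip(features, boxes):
        # Accumulate each landmark's feature into the grid cell of its centre.
        gx = min(int(((x1 + x2) / 2) / w * grid), grid - 1)
        gy = min(int(((y1 + y2) / 2) / h * grid), grid - 1)
        desc[gy, gx] += f
    desc = desc.reshape(-1)
    return desc / (np.linalg.norm(desc) + 1e-12)

def match_score(desc_a, desc_b):
    # Cosine similarity between two views' descriptors.
    return float(np.dot(desc_a, desc_b))
```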
Post-Turing Methodology: Breaking the Wall on the Way to Artificial General Intelligence
This article offers comprehensive criticism of the Turing test and develops quality criteria for new artificial general intelligence (AGI) assessment tests. It is shown that the prerequisites A. Turing drew upon when reducing personality and human consciousness to “suitable branches of thought” reflected the engineering level of his time. In fact, the Turing “imitation game” employed only symbolic communication and ignored the physical world. This paper suggests that by restricting thinking ability to symbolic systems alone Turing unknowingly constructed “the wall” that excludes any possibility of transition from a complex observable phenomenon to an abstract image or concept. It is, therefore, sensible to factor in new requirements for AI (artificial intelligence) maturity assessment when approaching the Turing test. Such AI must support all forms of communication with a human being, and it should be able to comprehend abstract images and specify concepts as well as participate in social practices.
Viewpoint-Free Photography for Virtual Reality
Viewpoint-free photography, i.e., interactively controlling the viewpoint of a photograph after capture, is a standing challenge. In this thesis, we investigate algorithms to enable viewpoint-free photography for virtual reality (VR) from casual capture, i.e., from footage easily captured with consumer cameras.

We build on an extensive body of work in image-based rendering (IBR). Given images of an object or scene, IBR methods aim to predict the appearance of an image taken from a novel perspective. Most IBR methods focus on full or near-interpolation, where the output viewpoints either lie directly between captured images, or nearby. These methods are not suitable for VR, where the user has a significant range of motion and can look in all directions. Thus, it is essential to create viewpoint-free photos with a wide field-of-view and sufficient positional freedom to cover the range of motion a user might experience in VR.

We focus on two VR experiences: 1) Seated VR experiences, where the user can lean in different directions. This simplifies the problem, as the scene is only observed from a small range of viewpoints. Thus, we focus on easy capture, showing how to turn panorama-style capture into 3D photos, a simple representation for viewpoint-free photos, and also how to speed up processing so users can see the final result on-site. 2) Room-scale VR experiences, where the user can explore vastly different perspectives. This is challenging: more input footage is needed, maintaining real-time display rates becomes difficult, and view-dependent appearance and object backsides need to be modelled, all while preventing noticeable mistakes. We address these challenges by (1) creating refined geometry for each input photograph, (2) using a fast tiled rendering algorithm to achieve real-time display rates, and (3) using a convolutional neural network to hide visual mistakes during compositing.

Overall, we provide evidence that viewpoint-free photography is feasible from casual capture. We thoroughly compare with the state-of-the-art, showing that our methods achieve both a numerical improvement and a clear increase in visual quality for both seated and room-scale VR experiences.
End-to-end training of object class detectors for mean average precision
We present a method for training CNN-based object class detectors directly
using mean average precision (mAP) as the training loss, in a truly end-to-end
fashion that includes non-maximum suppression (NMS) at training time. This
contrasts with the traditional approach of training a CNN for a window
classification loss, then applying NMS only at test time, when mAP is used as
the evaluation metric in place of classification accuracy. However, mAP
following NMS forms a piecewise-constant structured loss over thousands of
windows, with gradients that do not convey useful information for gradient
descent. Hence, we define new, general gradient-like quantities for piecewise
constant functions, which have wide applicability. We describe how to calculate
these efficiently for mAP following NMS, enabling us to train a detector based on
Fast R-CNN directly for mAP. This model achieves equivalent performance to the
standard Fast R-CNN on the PASCAL VOC 2007 and 2012 datasets, while being
conceptually more appealing as the very same model and loss are used at both
training and test time.
Comment: This version has minor additions to results (ablation study) and discussion
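To illustrate what a "gradient-like quantity" for a piecewise-constant loss could look like, the toy sketch below probes how the loss responds to perturbing each detection score. The paper derives its quantities differently and computes them far more efficiently, so this is only a conceptual stand-in.

```python
import numpy as np

def pseudo_gradient(loss_fn, scores, delta=1.0):
    """Finite-difference style 'gradient-like quantity' for a piecewise-constant
    loss such as mAP computed after NMS: perturb each detection score by a
    finite amount and record how the loss responds.

    loss_fn: maps a score vector to a scalar loss (e.g. 1 - mAP after NMS)
    scores:  float array of detection scores
    """
    grad = np.zeros_like(scores)
    base = loss_fn(scores)
    for i in range(len(scores)):
        up = scores.copy()
        down = scores.copy()
        up[i] += delta
        down[i] -= delta
        # Central difference over a finite step; nonzero whenever the
        # perturbation moves the scores across a decision boundary.
        grad[i] = (loss_fn(up) - loss_fn(down)) / (2 * delta)
    return grad, base
```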