Joint learning of object and action detectors
While most existing approaches for detection in videos focus on objects or human actions separately, we aim at jointly detecting objects performing actions, such as cat eating or dog jumping. We introduce an end-to-end multitask objective that jointly learns object-action relationships. We compare it with different training objectives, validate its effectiveness for detecting object-action pairs in videos, and show that both tasks of object and action detection benefit from this joint learning. Moreover, the proposed architecture can be used for zero-shot learning of actions: our multitask objective leverages the commonalities of an action performed by different objects, e.g. dog and cat jumping, enabling the detection of actions of an object without training on these object-action pairs. In experiments on the A2D dataset [50], we obtain state-of-the-art results on segmentation of object-action pairs. We finally apply our multitask architecture to detect visual relationships between objects in images of the VRD dataset [24].
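The abstract does not detail the objective itself; as a minimal sketch, one plausible instantiation is a shared box-feature backbone feeding separate object and action heads trained with summed cross-entropies (all names here, e.g. `JointObjectActionHead` and `lambda_act`, are hypothetical, not from the paper):

```python
import torch.nn as nn
import torch.nn.functional as F

class JointObjectActionHead(nn.Module):
    """Hypothetical joint head: shared box features feed two classifiers."""
    def __init__(self, feat_dim, num_objects, num_actions):
        super().__init__()
        self.obj_head = nn.Linear(feat_dim, num_objects)  # e.g. cat, dog, ...
        self.act_head = nn.Linear(feat_dim, num_actions)  # e.g. eating, jumping, ...

    def forward(self, box_feats):
        return self.obj_head(box_feats), self.act_head(box_feats)

def multitask_loss(obj_logits, act_logits, obj_labels, act_labels, lambda_act=1.0):
    # Summed per-task cross-entropies over shared features: each task
    # regularizes the other, and factorizing objects x actions (rather than
    # classifying whole pairs) is what allows zero-shot pairs at test time.
    return (F.cross_entropy(obj_logits, obj_labels)
            + lambda_act * F.cross_entropy(act_logits, act_labels))
```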
Localizing spatially and temporally objects and actions in videos
The rise of deep learning has facilitated remarkable progress in video understanding.
This thesis addresses three important tasks of video understanding: video object detection,
joint object and action detection, and spatio-temporal action localization.
Object class detection is one of the most important challenges in computer vision.
Object detectors are usually trained on bounding-boxes from still images. Recently,
video has been used as an alternative source of data. Yet, training an object detector
on one domain (either still images or videos) and testing on the other one results in a
significant performance gap compared to training and testing on the same domain. In
the first part of this thesis, we examine the reasons behind this performance gap. We
define and evaluate several domain shift factors: spatial location accuracy, appearance
diversity, image quality, aspect distribution, and object size and camera framing. We
examine the impact of these factors by comparing the detection performance before
and after cancelling them out. The results show that all five factors affect the performance
of the detectors and their combined effect explains the performance gap.
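As a rough sketch of this cancellation protocol (the `train`, `evaluate`, and `cancel` callables are hypothetical placeholders for the thesis's actual experimental machinery):

```python
# Hypothetical sketch of the gap-decomposition protocol described above:
# train on one domain, test on the other, then re-measure after
# cancelling out each domain-shift factor in turn.
FACTORS = ["spatial_location_accuracy", "appearance_diversity",
           "image_quality", "aspect_distribution", "object_size_and_framing"]

def decompose_gap(train_set, test_set, train, evaluate, cancel):
    """train() fits a detector, evaluate() returns mAP on test_set,
    cancel() equalizes one factor between the two domains."""
    base = evaluate(train(train_set), test_set)       # cross-domain mAP
    gains = {}
    for factor in FACTORS:
        detector = train(cancel(train_set, factor))   # factor removed
        gains[factor] = evaluate(detector, test_set) - base
    return base, gains  # how much each factor accounts for the gap
```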
While most existing approaches for detection in videos focus on objects or human
actions separately, in the second part of this thesis we aim at detecting non-human-centric actions, i.e., objects performing actions, such as cat eating or dog jumping. We
introduce an end-to-end multitask objective that jointly learns object-action relationships.
We compare it with different training objectives, validate its effectiveness for
detecting object-action pairs in videos, and show that both tasks of object and action
detection benefit from this joint learning. In experiments on the A2D dataset [Xu et al.,
2015], we obtain state-of-the-art results on segmentation of object-action pairs.
In the third part, we are the first to propose an action tubelet detector that leverages
the temporal continuity of videos instead of operating at the frame level, as state-of-the-art approaches do. In the same way that modern detectors rely on anchor boxes, our tubelet detector is based on anchor cuboids: it takes as input a sequence of frames and outputs tubelets, i.e., sequences of bounding boxes with associated scores. Our tubelet detector outperforms the state of the art on the UCF-Sports [Rodriguez et al., 2008], J-HMDB [Jhuang et al., 2013a], and UCF-101 [Soomro et al., 2012] action localization datasets, especially at high overlap thresholds. The improvement in detection performance is explained by both more accurate scores and more precise localization.
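A tubelet here is simply a sequence of per-frame boxes sharing one score. As an illustrative sketch, assuming the standard anchor-box delta parameterization (the thesis's exact encoding may differ):

```python
import torch

def anchor_cuboid(box, num_frames):
    """Replicate one anchor box (x1, y1, x2, y2) across a clip of frames,
    giving an anchor cuboid of shape (num_frames, 4)."""
    return box.unsqueeze(0).expand(num_frames, 4).clone()

def decode_tubelet(cuboid, deltas):
    """Apply per-frame regression deltas (dx, dy, dw, dh) to an anchor
    cuboid, yielding a tubelet: one box per frame. Standard box decoding,
    shown for illustration only."""
    x1, y1, x2, y2 = cuboid.unbind(dim=1)
    w, h = x2 - x1, y2 - y1
    cx, cy = x1 + 0.5 * w, y1 + 0.5 * h
    dx, dy, dw, dh = deltas.unbind(dim=1)
    cx, cy = cx + dx * w, cy + dy * h
    w, h = w * torch.exp(dw), h * torch.exp(dh)
    return torch.stack([cx - 0.5 * w, cy - 0.5 * h,
                        cx + 0.5 * w, cy + 0.5 * h], dim=1)  # (num_frames, 4)
```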
No Spare Parts: Sharing Part Detectors for Image Categorization
This work aims for image categorization using a representation of distinctive
parts. Different from existing part-based work, we argue that parts are
naturally shared between image categories and should be modeled as such. We
motivate our approach with a quantitative and qualitative analysis by
backtracking where selected parts come from. Our analysis shows that, in addition to the category-specific parts defining the class, parts coming from the background context and parts from other image categories improve categorization performance. Part selection should therefore not be done separately for each category,
but instead be shared and optimized over all categories. To incorporate part
sharing between categories, we present an algorithm based on AdaBoost to
jointly optimize part sharing and selection, as well as fusion with the global
image representation. We achieve results competitive with the state of the art on object, scene, and action categories, further improving over deep convolutional neural networks.
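As a toy sketch of what AdaBoost-style joint part selection can look like (a simplified stand-in, not the paper's exact algorithm; decision stumps at threshold zero and the one-vs-rest setup are assumptions):

```python
import numpy as np

def shared_part_boosting(part_scores, labels, num_rounds=10, eps=1e-9):
    """Each round selects ONE part detector jointly for all categories,
    so parts are shared rather than picked per category.

    part_scores: (P, N) responses of P candidate part detectors on N images.
    labels:      (C, N) one-vs-rest labels in {-1, +1} for C categories.
    """
    preds = np.where(part_scores > 0, 1.0, -1.0)   # stumps at threshold 0
    C, N = labels.shape
    w = np.full((C, N), 1.0 / N)                   # per-category example weights
    model = []
    for _ in range(num_rounds):
        mismatch = preds[:, None, :] != labels[None, :, :]   # (P, C, N)
        err = (mismatch * w[None, :, :]).sum(axis=2)          # (P, C)
        best = int(err.sum(axis=1).argmin())  # best part summed over ALL categories
        alpha = 0.5 * np.log((1 - err[best] + eps) / (err[best] + eps))  # (C,)
        model.append((best, alpha))
        w *= np.exp(-alpha[:, None] * labels * preds[best][None, :])
        w /= w.sum(axis=1, keepdims=True)      # renormalize per category
    return model  # shared parts with per-category weights
```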
Programmable Agents
We build deep RL agents that execute declarative programs expressed in formal
language. The agents learn to ground the terms in this language in their
environment, and can generalize their behavior at test time to execute new
programs that refer to objects that were not referenced during training. The
agents develop disentangled interpretable representations that allow them to
generalize to a wide variety of zero-shot semantic tasks.
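Concretely, such programs can be pictured as fuzzy-logic formulas over learned per-object groundings; a toy sketch (all names and the min/max semantics are illustrative assumptions, not the paper's implementation):

```python
from typing import Callable, Dict, List

Obj = Dict[str, float]  # learned detector scores in [0, 1], per object

def term(name: str) -> Callable[[Obj], float]:
    return lambda obj: obj[name]             # grounding learned at train time

def AND(p: Callable, q: Callable) -> Callable[[Obj], float]:
    return lambda obj: min(p(obj), q(obj))   # soft conjunction

def satisfied(program: Callable[[Obj], float], scene: List[Obj]) -> float:
    # A program holds if some object in the scene satisfies it.
    return max(program(obj) for obj in scene)

# "the red sphere" as a program over learned groundings; an unseen
# combination like AND(term("BLUE"), term("CUBE")) needs no retraining.
program = AND(term("RED"), term("SPHERE"))
scene = [{"RED": 0.9, "SPHERE": 0.1}, {"RED": 0.8, "SPHERE": 0.95}]
print(satisfied(program, scene))  # -> 0.8
```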