Towards Scene Understanding with Detailed 3D Object Representations
Current approaches to semantic image and scene understanding typically employ
rather simple object representations such as 2D or 3D bounding boxes. While
such coarse models are robust and allow for reliable object detection, they
discard much of the information about objects' 3D shape and pose, and thus do
not lend themselves well to higher-level reasoning. Here, we propose to base
scene understanding on a high-resolution object representation. An object class
- in our case cars - is modeled as a deformable 3D wireframe, which enables
fine-grained modeling at the level of individual vertices and faces. We augment
that model to explicitly include vertex-level occlusion, and embed all
instances in a common coordinate frame, in order to infer and exploit
object-object interactions. Specifically, from a single view we jointly
estimate the shapes and poses of multiple objects in a common 3D frame. A
ground plane in that frame is estimated by consensus among different objects,
which significantly stabilizes monocular 3D pose estimation. The fine-grained
model, in conjunction with the explicit 3D scene model, further allows one to
infer part-level occlusions between the modeled objects, as well as occlusions
by other, unmodeled scene elements. To demonstrate the benefits of such
detailed object class models in the context of scene understanding we
systematically evaluate our approach on the challenging KITTI street scene
dataset. The experiments show that the model's ability to utilize image
evidence at the level of individual parts improves monocular 3D pose estimation
w.r.t. both location and (continuous) viewpoint.
Comment: International Journal of Computer Vision (appeared online on 4
November 2014). Online version:
http://link.springer.com/article/10.1007/s11263-014-0780-
Factoring Shape, Pose, and Layout from the 2D Image of a 3D Scene
The goal of this paper is to take a single 2D image of a scene and recover
the 3D structure in terms of a small set of factors: a layout representing the
enclosing surfaces as well as a set of objects represented in terms of shape
and pose. We propose a convolutional neural network-based approach to predict
this representation and benchmark it on a large dataset of indoor scenes. Our
experiments evaluate a number of practical design questions, demonstrate that
we can infer this representation, and quantitatively and qualitatively
demonstrate its merits compared to alternate representations.
Comment: Project URL with code: https://shubhtuls.github.io/factored3
Action Recognition in Videos: from Motion Capture Labs to the Web
This paper presents a survey of human action recognition approaches based on
visual data recorded from a single video camera. We propose an organizing
framework that highlights the evolution of the area, with techniques
moving from heavily constrained motion capture scenarios towards more
challenging, realistic, "in the wild" videos. The proposed organization is
based on the representation used as input for the recognition task, emphasizing
the hypotheses assumed and thus the constraints imposed on the type of video
that each technique is able to address. Making the hypotheses and constraints
explicit makes the framework particularly useful for selecting a method for a
given application. Another advantage of the proposed organization is that it
allows the newest approaches to be categorized seamlessly alongside traditional
ones, while providing an insightful perspective on the evolution of the action
recognition task up to now. That perspective is the basis for the discussion at
the end of the paper, where we also present the main open issues in the area.
Comment: Preprint submitted to CVIU, survey paper, 46 pages, 2 figures, 4
tables
De/construction sites: Romans and the digital playground
The Roman world as attested to archaeologically and as interacted with today has its expression in a great many computational and other media. The place of visualisation within this has been paramount. This paper argues that the process of digitally constructing the Roman world and the exploration of the resultant models are useful methods for interpretation and influential factors in the creation of a popular Roman aesthetic. Furthermore, it suggests ways in which novel computational techniques enable the systematic deconstruction of such models, in turn re-purposing the many extant representations of Roman architecture and material culture.
HoME: a Household Multimodal Environment
We introduce HoME: a Household Multimodal Environment for artificial agents
to learn from vision, audio, semantics, physics, and interaction with objects
and other agents, all within a realistic context. HoME integrates over 45,000
diverse 3D house layouts based on the SUNCG dataset, a scale which may
facilitate learning, generalization, and transfer. HoME is an open-source,
OpenAI Gym-compatible platform extensible to tasks in reinforcement learning,
language grounding, sound-based navigation, robotics, multi-agent learning, and
more. We hope HoME better enables artificial agents to learn as humans do: in
an interactive, multimodal, and richly contextualized setting.
Comment: Presented at NIPS 2017's Visually-Grounded Interaction and Language
Workshop