Streaming and User Behaviour in Omnidirectional Videos
Omnidirectional videos (ODVs) have gone beyond the passive paradigm of traditional video,
offering higher degrees of immersion and interaction. The revolutionary novelty of this technology is the possibility for users to interact with the surrounding environment, and to feel a
sense of engagement and presence in a virtual space. Users are clearly the main driving force of
immersive applications, and consequently the services need to be properly tailored to them.
In this context, this chapter highlights the importance of the new role of users in ODV streaming applications, and thus the need for understanding their behaviour while navigating within
ODVs. A comprehensive overview of the research efforts aimed at advancing ODV streaming
systems is also presented. In particular, the state-of-the-art solutions under examination in this
chapter are distinguished in terms of system-centric and user-centric streaming approaches: the
former is a fairly straightforward extension of well-established solutions from the
2D video pipeline, while the latter benefits from an understanding of users' behaviour
to enable more personalised ODV streaming.
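As a rough illustration of the user-centric idea, the sketch below implements viewport-adaptive tile bitrate allocation, one common way behaviour-aware ODV streaming is realised in practice; the function names, bitrate values, and field-of-view threshold are assumptions of this sketch, not a prescription from the chapter.

    import math

    def angular_distance(yaw_a, pitch_a, yaw_b, pitch_b):
        # Great-circle angle (radians) between two viewing directions on the sphere.
        cos_d = (math.sin(pitch_a) * math.sin(pitch_b)
                 + math.cos(pitch_a) * math.cos(pitch_b) * math.cos(yaw_a - yaw_b))
        return math.acos(max(-1.0, min(1.0, cos_d)))

    def allocate_tile_bitrates(tile_centers, pred_yaw, pred_pitch,
                               high_kbps=4000, mid_kbps=1500, low_kbps=300,
                               fov_radius=math.radians(55)):
        # Assign each tile a bitrate based on its angular distance to the predicted viewport.
        bitrates = []
        for yaw, pitch in tile_centers:
            d = angular_distance(yaw, pitch, pred_yaw, pred_pitch)
            if d <= fov_radius:           # tile inside the predicted field of view
                bitrates.append(high_kbps)
            elif d <= 2 * fov_radius:     # margin around the viewport
                bitrates.append(mid_kbps)
            else:                         # tiles the user is unlikely to look at
                bitrates.append(low_kbps)
        return bitrates

A viewport predictor trained on recorded user trajectories would supply pred_yaw and pred_pitch ahead of each segment request, which is where the understanding of user behaviour enters the pipeline.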
Self-Supervised Relative Depth Learning for Urban Scene Understanding
As an agent moves through the world, the apparent motion of scene elements is
(usually) inversely proportional to their depth. It is natural for a learning
agent to associate image patterns with the magnitude of their displacement over
time: as the agent moves, faraway mountains don't move much; nearby trees move
a lot. This natural relationship between the appearance of objects and their
motion is a rich source of information about the world. In this work, we start
by training a deep network, using fully automatic supervision, to predict
relative scene depth from single images. The relative depth training images are
automatically derived from simple videos of cars moving through a scene, using
recent motion segmentation techniques, and no human-provided labels. This proxy
task of predicting relative depth from a single image induces features in the
network that result in large improvements in a set of downstream tasks
including semantic segmentation, joint road segmentation and car detection, and
monocular (absolute) depth estimation, over a network trained from scratch. The
improvement on the semantic segmentation task is greater than that produced by
any other automatically supervised method. Moreover, for monocular depth
estimation, our unsupervised pre-training method even outperforms supervised
pre-training with ImageNet. In addition, we demonstrate benefits from learning
to predict (unsupervised) relative depth in the specific videos associated with
various downstream tasks. We adapt to the specific scenes in those tasks in an
unsupervised manner to improve performance. In summary, for semantic
segmentation, we present state-of-the-art results among methods that do not use
supervised pre-training, and we even exceed the performance of supervised
ImageNet pre-trained models for monocular depth estimation, achieving results
that are comparable with state-of-the-art methods.
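The inverse relationship that motivates the proxy task is the standard motion-parallax relation: under a pinhole model, a camera translating laterally by t_x shifts the projection of a static point at depth Z by

    |Δu| = f · t_x / Z,

where f is the focal length. The magnitude of image motion therefore provides a label-free, scale-ambiguous target for relative depth.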
Aria-NeRF: Multimodal Egocentric View Synthesis
We seek to accelerate research in developing rich, multimodal scene models
trained from egocentric data, based on differentiable volumetric ray-tracing
inspired by Neural Radiance Fields (NeRFs). The construction of a NeRF-like
model from an egocentric image sequence plays a pivotal role in understanding
human behavior and holds diverse applications within the realms of VR/AR. Such
egocentric NeRF-like models may be used as realistic simulations, contributing
significantly to the advancement of intelligent agents capable of executing
tasks in the real world. The future of egocentric view synthesis may lead to
novel environment representations going beyond today's NeRFs by augmenting
visual data with multimodal sensors such as IMU for egomotion tracking, audio
sensors to capture surface texture and human language context, and eye-gaze
trackers to infer human attention patterns in the scene. To support and
facilitate the development and evaluation of egocentric multimodal scene
modeling, we present a comprehensive multimodal egocentric video dataset. This
dataset offers a rich collection of sensory data, featuring RGB
images, eye-tracking camera footage, audio recordings from a microphone,
atmospheric pressure readings from a barometer, positional coordinates from
GPS, connectivity details from Wi-Fi and Bluetooth, and dual-frequency IMU
data (1 kHz and 800 Hz) paired with a magnetometer. The
dataset was collected with the Meta Aria Glasses wearable device platform. The
diverse data modalities and the real-world context captured within this dataset
serve as a robust foundation for furthering our understanding of human behavior
and enabling more immersive and intelligent experiences in the realms of VR,
AR, and robotics.
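For concreteness, one plausible way to bundle a single synchronized record from such a dataset is sketched below; the class and field names are hypothetical and simply mirror the modalities listed in the abstract, not the dataset's actual schema.

    from dataclasses import dataclass, field
    from typing import List, Tuple

    @dataclass
    class EgocentricSample:
        # Hypothetical container for one synchronized multimodal record.
        timestamp_ns: int                        # capture time
        rgb_frame: bytes                         # encoded RGB image
        eye_tracking_frame: bytes                # eye-tracking camera image
        audio_chunk: bytes                       # microphone samples
        pressure_pa: float                       # barometer reading
        gps_lat_lon: Tuple[float, float]         # positional coordinates
        wifi_scan: List[str] = field(default_factory=list)       # nearby access points
        bluetooth_scan: List[str] = field(default_factory=list)  # nearby devices
        imu_1khz: List[Tuple[float, ...]] = field(default_factory=list)   # accel/gyro at 1 kHz
        imu_800hz: List[Tuple[float, ...]] = field(default_factory=list)  # accel/gyro at 800 Hz
        magnetometer_ut: Tuple[float, float, float] = (0.0, 0.0, 0.0)     # field vector, microtesla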
Seeing 3D Objects in a Single Image via Self-Supervised Static-Dynamic Disentanglement
Human perception reliably identifies movable and immovable parts of 3D
scenes, and completes the 3D structure of objects and background from
incomplete observations. We learn this skill not via labeled examples, but
simply by observing objects move. In this work, we propose an approach that
observes unlabeled multi-view videos at training time and learns to map a
single image observation of a complex scene, such as a street with cars, to a
3D neural scene representation that is disentangled into movable and immovable
parts while plausibly completing its 3D structure. We separately parameterize
movable and immovable scene parts via 2D neural ground plans. These ground
plans are 2D grids of features aligned with the ground plane that can be
locally decoded into 3D neural radiance fields. Our model is trained
self-supervised via neural rendering. We demonstrate that the structure
inherent to our disentangled 3D representation enables a variety of downstream
tasks in street-scale 3D scenes using simple heuristics, such as extraction of
object-centric 3D representations, novel view synthesis, instance segmentation,
and 3D bounding box prediction, highlighting its value as a backbone for
data-efficient 3D scene understanding models. This disentanglement further
enables scene editing via object manipulation such as deletion, insertion, and
rigid-body motion.
Project page: https://prafullsharma.net/see3d
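A minimal sketch of the ground-plan idea, assuming a learnable 2D feature grid aligned with the ground plane that is bilinearly sampled at a 3D point's footprint and decoded, together with the point's height, into density and colour; the module, shapes, and decoder below are illustrative rather than the paper's implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class GroundPlanField(nn.Module):
        def __init__(self, grid_size=128, feat_dim=32, extent=50.0):
            super().__init__()
            # 2D grid of latent features covering a [-extent, extent]^2 ground area
            self.plan = nn.Parameter(torch.zeros(1, feat_dim, grid_size, grid_size))
            self.extent = extent
            # Small MLP decodes (sampled feature, height) into density and RGB colour
            self.decoder = nn.Sequential(
                nn.Linear(feat_dim + 1, 64), nn.ReLU(),
                nn.Linear(64, 4),  # 1 density channel + 3 colour channels
            )

        def forward(self, points_xyz):
            # points_xyz: (N, 3) world coordinates, with z the height above the ground
            xy = points_xyz[:, :2] / self.extent           # normalize footprint to [-1, 1]
            grid = xy.view(1, -1, 1, 2)                    # shape expected by grid_sample
            feats = F.grid_sample(self.plan, grid, align_corners=True)
            feats = feats.view(self.plan.shape[1], -1).t() # (N, feat_dim)
            h = points_xyz[:, 2:3]
            out = self.decoder(torch.cat([feats, h], dim=-1))
            density = F.softplus(out[:, :1])
            colour = torch.sigmoid(out[:, 1:])
            return density, colour

Keeping separate grids for movable and immovable content, each decoded this way and composited during volume rendering, is one way the described disentanglement could be realised.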
Vision for Social Robots: Human Perception and Pose Estimation
In order to extract the underlying meaning from a scene captured from the surrounding world in a single still image, social robots will need to learn the human ability to detect different objects, understand their arrangement and relationships relative both to their own parts and to each other, and infer the dynamics under which they are evolving. Furthermore, they will need to develop and hold a notion of context to allow assigning different meanings (semantics) to the same visual configuration (syntax) of a scene.
The underlying thread of this Thesis is the investigation of new ways for enabling interactions between social robots and humans, by advancing the visual perception capabilities of robots when they process images and videos in which humans are the main focus of attention.
First, we analyze the general problem of scene understanding, as social robots moving through the world need to be able to interpret scenes without having been assigned a specific preset goal. Throughout this line of research, i) we observe that human actions and interactions which can be visually discriminated from an image follow a very heavy-tailed distribution; ii) we develop an algorithm that can obtain a spatial understanding of a scene by only using cues arising from the effect of perspective on a picture of a person’s face; and iii) we define a novel taxonomy of errors for the task of estimating the 2D body pose of people in images to better explain the behavior of algorithms and highlight their underlying causes of error.
Second, we focus on the specific task of 3D human pose and motion estimation from monocular 2D images using weakly supervised training data, as accurately predicting human pose will open up the possibility of richer interactions between humans and social robots. We show that when 3D ground-truth data is only available in small quantities, or not at all, it is possible to leverage knowledge about the physical properties of the human body, along with additional constraints related to alternative types of supervisory signals, to learn models that can regress the full 3D pose of the human body and predict its motions from monocular 2D images.
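One illustrative instance of the kind of weak supervision described: a 2D reprojection loss on detected keypoints combined with a simple left/right limb-symmetry prior standing in for the physical-body constraints. The joint indices, weights, and orthographic projection below are assumptions of this sketch, not the thesis' actual formulation.

    import torch

    # Pairs of (joint_a, joint_b) forming left/right limbs assumed to have equal length
    LEFT_BONES = [(5, 7), (7, 9), (11, 13), (13, 15)]
    RIGHT_BONES = [(6, 8), (8, 10), (12, 14), (14, 16)]

    def bone_lengths(pose3d, bones):
        # pose3d: (J, 3) predicted 3D joint positions
        return torch.stack([(pose3d[a] - pose3d[b]).norm() for a, b in bones])

    def weak_supervision_loss(pose3d, keypoints2d, symmetry_weight=0.1):
        # Reprojection error under an orthographic camera plus a limb-symmetry prior.
        reproj = (pose3d[:, :2] - keypoints2d).pow(2).sum(dim=-1).mean()
        symmetry = (bone_lengths(pose3d, LEFT_BONES)
                    - bone_lengths(pose3d, RIGHT_BONES)).abs().mean()
        return reproj + symmetry_weight * symmetry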
Taken in its entirety, the intent of this Thesis is to highlight the importance of, and provide novel methodologies for, social robots' ability to interpret their surrounding environment, learn in a way that is robust to low data availability, and generalize previously observed behaviors to unknown situations in a similar way to humans.