WinDB: HMD-free and Distortion-free Panoptic Video Fixation Learning
To date, the widely-adopted way to perform fixation collection in panoptic
video is based on a head-mounted display (HMD), where participants' fixations
are collected while wearing an HMD to explore the given panoptic scene freely.
However, this widely used data collection method is insufficient for training
deep models to accurately predict which regions in a given panoptic video are
most important when it contains intermittent salient events. The main reason is that
there always exist "blind zooms" when using HMD to collect fixations since the
participants cannot keep spinning their heads to explore the entire panoptic
scene all the time. Consequently, the collected fixations tend to be trapped in
some local views, leaving the remaining areas to be the "blind zooms".
Therefore, fixation data collected using HMD-based methods that accumulate
local views cannot accurately represent the overall global importance of
complex panoramic scenes. This paper introduces the auxiliary Window with
Dynamic Blurring (WinDB) fixation collection approach for panoptic video, which
requires no HMD and is free of blind zooms. As a result, the collected fixations
accurately reflect region-wise importance. Using our WinDB approach, we have
released a new PanopticVideo-300 dataset, containing 300 panoptic clips
covering over 225 categories. In addition, we present a simple baseline
design that takes full advantage of PanopticVideo-300 to handle the fixation
shifting problem induced by the blind-zoom-free attribute.
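The region-wise importance signal that blind-zoom-free fixations enable can be illustrated with a minimal sketch: accumulate fixation points into a normalized heat map over an equirectangular grid. This is an illustrative stand-in, not the released WinDB tooling, and it ignores longitude wrap-around at the panorama seam.

```python
import numpy as np

def fixation_map(fixations, height=64, width=128, sigma=3.0):
    """Accumulate (row, col) fixation points into a normalized
    region-wise importance map on an equirectangular grid.
    Hypothetical helper; ignores the 360-degree horizontal seam."""
    ys, xs = np.mgrid[0:height, 0:width]
    m = np.zeros((height, width))
    for r, c in fixations:
        m += np.exp(-((ys - r) ** 2 + (xs - c) ** 2) / (2 * sigma ** 2))
    return m / m.sum()

# two fixation clusters -> two peaks of importance
heat = fixation_map([(32, 20), (32, 100)])
```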
Deep Reinforcement Learning for Active Human Pose Estimation
Most 3d human pose estimation methods assume that input -- be it images of a
scene collected from one or several viewpoints, or from a video -- is given.
Consequently, they focus on estimates leveraging prior knowledge and
measurement by fusing information spatially and/or temporally, whenever
available. In this paper we address the problem of an active observer with
freedom to move and explore the scene spatially -- in `time-freeze' mode --
and/or temporally, by selecting informative viewpoints that improve its
estimation accuracy. Towards this end, we introduce Pose-DRL, a fully trainable
deep reinforcement learning-based active pose estimation architecture which
learns to select appropriate views, in space and time, to feed an underlying
monocular pose estimator. We evaluate our model using single- and multi-target
estimators, achieving strong results in both settings. Our system further learns
automatic stopping conditions in time and transition functions to the next
temporal processing step in videos. In extensive experiments with the Panoptic
multi-view setup, and for complex scenes containing multiple people, we show
that our model learns to select viewpoints that yield significantly more
accurate pose estimates compared to strong multi-view baselines.
Comment: Accepted to The Thirty-Fourth AAAI Conference on Artificial
Intelligence (AAAI-20). Submission updated to include supplementary material.
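The core idea of selecting views that an underlying monocular estimator handles well can be sketched in toy form. Here a greedy pick of the most confident views stands in for the learned Pose-DRL policy, and the per-view confidences are assumed given; none of this is the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
true_pose = np.zeros(3)  # toy 3d joint position

# four camera views, each yielding a noisy monocular estimate
noise_scale = np.array([0.05, 0.3, 0.6, 0.1])
estimates = [true_pose + rng.normal(0.0, s, 3) for s in noise_scale]

def select_views(confidence, k=2):
    """Greedy stand-in for the learned policy: keep the k views
    the system is most confident in."""
    return np.argsort(confidence)[::-1][:k]

confidence = 1.0 / noise_scale          # assumed given by the estimator
chosen = select_views(confidence)
fused = np.average([estimates[i] for i in chosen], axis=0,
                   weights=confidence[chosen])
```

Fusing only the selected low-noise views illustrates why viewpoint selection helps: the final estimate is never dragged toward views the estimator handles poorly.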
Reinforcement Learning for Active Visual Perception
Visual perception refers to automatically recognizing, detecting, or otherwise sensing the content of an image, video or scene. The most common contemporary approach to tackle a visual perception task is to train a deep neural network on a pre-existing dataset which provides examples of task success and failure. Despite remarkable recent progress across a wide range of vision tasks, many standard methodologies are static in that they lack mechanisms for adapting to any particular settings or constraints of the task at hand. The ability to adapt is desirable in many practical scenarios, since the operating regime often differs from the training setup. For example, a robot which has learnt to recognize a static set of training images may perform poorly in real-world settings, where it may view objects from unusual angles or explore poorly illuminated environments. The robot should then ideally be able to actively position itself to observe the scene from viewpoints where it is more confident, or refine its perception with only a limited amount of training data for its present operating conditions.

In this thesis we demonstrate how reinforcement learning (RL) can be integrated with three fundamental visual perception tasks -- object detection, human pose estimation, and semantic segmentation -- in order to make the resulting pipelines more adaptive, accurate and/or faster. In the first part we provide object detectors with the capacity to actively select what parts of a given image to analyze and when to terminate the detection process. Several ideas are proposed and empirically evaluated, such as explicitly including the speed-accuracy trade-off in the training process, which makes it possible to specify this trade-off during inference. In the second part we consider active multi-view 3d human pose estimation in complex scenarios with multiple people.
We explore this in two different contexts: i) active triangulation, which requires carefully observing each body joint from multiple viewpoints, and ii) active viewpoint selection for monocular 3d estimators, which requires considering which viewpoints yield accurate fused estimates when combined. In both settings the viewpoint selection systems face several challenges, such as partial observability resulting, e.g., from occlusions. We show that RL-based methods outperform heuristic ones in accuracy, with negligible computational overhead.

The thesis concludes by establishing a framework for embodied visual active learning in the context of semantic segmentation, where an agent should explore a 3d environment and actively query annotations to refine its visual perception. Our empirical results suggest that reinforcement learning can be successfully applied within this framework as well.
Active and Physics-Based Human Pose Reconstruction
Perceiving humans is an important and complex problem within computer vision. Its significance is derived from its numerous applications, such as human-robot interaction, virtual reality, markerless motion capture, and human tracking for autonomous driving. The difficulty lies in the variability in human appearance, physique, and plausible body poses. In real-world scenes, this is further exacerbated by difficult lighting conditions, partial occlusions, and the depth ambiguity stemming from the loss of information during the 3d to 2d projection. Despite these challenges, significant progress has been made in recent years, primarily due to the expressive power of deep neural networks trained on large datasets. However, creating large-scale datasets with 3d annotations is expensive, and capturing the vast diversity of the real world is demanding. Traditionally, 3d ground truth is captured using motion capture laboratories that require large investments. Furthermore, many laboratories cannot easily accommodate athletic and dynamic motions. This thesis studies three approaches to improving visual perception, with emphasis on human pose estimation, that can complement improvements to the underlying predictor or training data.

The first two papers present active human pose estimation, where a reinforcement learning agent is tasked with selecting informative viewpoints to reconstruct subjects efficiently. The papers discard the common assumption that the input is given and instead allow the agent to move to observe subjects from desirable viewpoints, e.g., those which avoid occlusions and for which the underlying pose estimator has a low prediction error.

The third paper introduces the task of embodied visual active learning, which goes further and assumes that the perceptual model is not pre-trained. Instead, the agent is tasked with exploring its environment and requesting annotations to refine its visual model.
Learning to explore novel scenarios and efficiently request annotation for new data is a step towards life-long learning, where models can evolve beyond what they learned during the initial training phase. We study the problem for segmentation, though the idea is applicable to other perception tasks.

Lastly, the final two papers propose improving human pose estimation by integrating physical constraints. These regularize the reconstructed motions to be physically plausible and serve as a complement to current kinematic approaches. Whether a motion has been observed in the training data or not, the predictions should obey the laws of physics. Through integration with a physical simulator, we demonstrate that we can reduce reconstruction artifacts and enforce, e.g., contact constraints.
Data-Efficient Learning of Semantic Segmentation
Semantic segmentation is a fundamental problem in visual perception with a wide range of applications ranging from robotics to autonomous vehicles, and recent approaches based on deep learning have achieved excellent performance. However, to train such systems there is in general a need for very large datasets of annotated images. In this thesis we investigate and propose methods and setups for which it is possible to use unlabelled data to increase the performance or to use limited application specific data to reduce the need for large datasets when learning semantic segmentation.

In the first paper we study semantic video segmentation. We present a deep end-to-end trainable model that uses propagated labelling information in unlabelled frames in addition to sparsely labelled frames to predict semantic segmentation. Extensive experiments on the CityScapes and CamVid datasets show that the model can improve accuracy and temporal consistency by using extra unlabelled video frames in training and testing.

In the second, third and fourth paper we study active learning for semantic segmentation in an embodied context where navigation is part of the problem. A navigable agent should explore a building and query for the labelling of informative views that increase the visual perception of the agent. In the second paper we introduce the embodied visual active learning problem, and propose and evaluate a range of methods from heuristic baselines to a fully trainable agent using reinforcement learning (RL) on the Matterport3D dataset. We show that the learned agent outperforms several comparable pre-specified baselines. In the third paper we study the embodied visual active learning problem in a lifelong setup, where the visual learning spans the exploration of multiple buildings, and the learning in one scene should influence the active learning in the next, e.g., by not annotating already accurately segmented object classes.
We introduce new methodology to encourage global exploration of scenes, via an RL formulation that combines local navigation with global frontier-based exploration. We show that the RL agent can learn adaptable behaviour, such as annotating less frequently when it has already explored a number of buildings. Finally, we study the embodied visual active learning problem with region-based active learning in the fourth paper. Instead of querying for annotations of a whole image, an agent can query for annotations of just parts of images, and we show that it is significantly more labelling-efficient to annotate regions rather than full images.
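A minimal sketch of the region-based annotation query, assuming per-pixel softmax scores are available: rank fixed-size tiles by mean predictive entropy and query the most uncertain one. This entropy heuristic is illustrative only, not the learned query policy studied in the papers.

```python
import numpy as np

def entropy(probs):
    """Per-pixel predictive entropy of softmax scores, shape (H, W, C)."""
    return -(probs * np.log(probs + 1e-12)).sum(axis=-1)

def top_region(probs, region=4):
    """Return (tile_row, tile_col) of the region x region tile with the
    highest mean entropy -- the tile to query for annotation."""
    h = entropy(probs)
    H, W = h.shape
    tiles = h.reshape(H // region, region, W // region, region).mean(axis=(1, 3))
    return np.unravel_index(tiles.argmax(), tiles.shape)

# example: one maximally uncertain tile among confident predictions
probs = np.tile(np.array([0.99, 0.01]), (8, 8, 1))
probs[0:4, 4:8] = 0.5
query = top_region(probs)
```

Querying only such high-entropy tiles is what makes region-level annotation cheaper than labelling full images.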
Mitigating Distortion to Enable 360° Computer Vision
For tasks on central-perspective images, convolutional neural networks have been a revolutionary innovation. However, their performance degrades as the amount of geometric image distortion increases. This limitation is particularly evident for 360° images. These images capture a 180° x 360° field of view by replacing the imaging plane with the concept of an imaging sphere. Because there is no isometric mapping from this spherical capture format to a planar image representation, all 360° images necessarily suffer from some degree of geometric image distortion, which manifests as local content deformation. This corruptive effect hinders the ability of these groundbreaking computer vision algorithms to enable 360° computer vision, resulting in a performance gap between networks applied to central-perspective images and those applied to spherical images.

This dissertation seeks to better understand the impact that geometric distortion has on convolutional neural networks and to identify spherical image representations that can mitigate its effect. This work argues that there are three requisite properties of any general solution: distortion-mitigation, transferability, and scalability. Bridging the performance gap requires reducing distortion in the image representation, developing tools to directly apply central-perspective image algorithms to spherical data, and ensuring that these algorithms can efficiently process high resolution spherical images.

Drawing insight from the field of cartography, the subdivided regular icosahedron is proposed as a low-distortion alternative to the commonly used equirectangular and cube map spherical image formats. To address the non-Euclidean nature of this representation, a generalization of the standard convolution operation is proposed to map the standard convolutional kernel grid to the structure of any spherical representation. Finally, a new representation is proposed.
Derived from the icosahedron, it represents a spherical image as a set of square, oriented, planar pixel grids rendered tangent to the sphere at the center of each face of the icosahedron. These "tangent images" satisfy all three requisite properties, offering a promising, general solution to the spherical image problem.
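The tangent-image construction rests on the gnomonic projection, which maps spherical coordinates onto a plane tangent to the sphere at a chosen point. A minimal sketch, with angles in radians and the tangent point at (lat0, lon0):

```python
import math

def gnomonic(lat, lon, lat0=0.0, lon0=0.0):
    """Project the spherical point (lat, lon) onto the plane tangent
    to the unit sphere at (lat0, lon0). Valid for points on the same
    hemisphere as the tangent point (cos_c > 0)."""
    cos_c = (math.sin(lat0) * math.sin(lat)
             + math.cos(lat0) * math.cos(lat) * math.cos(lon - lon0))
    x = math.cos(lat) * math.sin(lon - lon0) / cos_c
    y = (math.cos(lat0) * math.sin(lat)
         - math.sin(lat0) * math.cos(lat) * math.cos(lon - lon0)) / cos_c
    return x, y
```

The tangent point itself maps to the plane origin, and distortion grows with angular distance from it, which is why the icosahedron's twenty closely spaced tangent points keep each planar grid nearly distortion-free.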
Learning to compose photos and videos from passive cameras
Photo and video overload is well-known to most computer users. With cameras on mobile devices, it is all too easy to snap images and videos spontaneously, yet it remains much less easy to organize or search through that content later. With increasingly portable wearable and 360° computing platforms, the overload problem is only intensifying. Wearable and 360° cameras passively record everything they observe, unlike traditional cameras that require active human attention to capture images or videos.
In my thesis, I explore the idea of automatically composing photos and videos from unedited videos captured by "passive" cameras. Passive cameras (e.g., wearable cameras, 360° cameras) offer a more relaxing experience to record our visual world but they do not always capture frames that look like intentional human-taken photos. In wearable cameras, many frames will be blurry, contain poorly composed shots, and/or simply have uninteresting content. In 360° cameras, a single omni-directional image captures the entire visual world, and the photographer's intention and attention in that moment are unknown. To this end, I consider the following problems in the context of passive cameras: 1) what visual data to capture and store, 2) how to identify foreground objects, and 3) how to enhance the viewing experience.
First, I explore the problem of finding the best moments in unedited videos. Not everything observed in a wearable camera's video stream is worthy of being captured and stored. People can easily distinguish well-composed moments from accidental shots from a wearable camera. This prompts the question: can a vision system predict the best moments in unedited video? I first study how to find the best moments in terms of short video clips. My key insight is that video segments from shorter user-generated videos are more likely to be highlights than those from longer videos, since users tend to be more selective about the content when capturing shorter videos. Leveraging this insight, I introduce a novel ranking framework to learn video highlight detection from unlabeled videos. Next, I show how to predict snap points in unedited video---that is, those frames that look like intentionally taken photos. I propose a framework to detect snap points that requires no human annotations. The main idea is to construct a generative model of what human-taken photos look like by sampling images posted on the Web. Snapshots that people upload to share publicly online may vary vastly in their content, yet all share the key facet that they were intentional snap point moments. This makes them an ideal source of positive exemplars for our target learning problem. In both settings, despite learning without any explicit labels, my proposed models outperform discriminative baselines trained with labeled data.
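The ranking idea above can be sketched as a margin-based hinge loss over segment scores, where segments from shorter videos are treated as likely highlights. This is an illustrative formulation, not the exact objective used in the thesis.

```python
import numpy as np

def pairwise_ranking_loss(score_short, score_long, margin=1.0):
    """Hinge loss encouraging segments from shorter (likely highlight)
    videos to outscore segments from longer videos by a margin."""
    return float(np.maximum(0.0, margin - score_short + score_long).mean())
```

The loss vanishes once a highlight candidate outscores a non-highlight candidate by the margin, so training only needs the relative (video-length-derived) ordering, never explicit highlight labels.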
Next, I introduce a novel approach to automatically segment foreground objects in images and videos. Identifying key objects is an important intermediate step for automatic photo composition. It is also a prerequisite in graphics applications like image retargeting, production video editing, and rotoscoping. Given an image or video frame, the goal is to determine the likelihood that each pixel is part of a foreground object. I formulate the task as a structured prediction problem of assigning an object/background label to each pixel (pixel objectness), and I propose an end-to-end trainable model that draws on the respective strengths of generic object appearance and motion in a unified framework. Since large-scale video datasets with pixel-level segmentations are scarce, I show how to bootstrap weakly annotated videos together with existing image recognition datasets for training. In addition, I demonstrate how the proposed approach benefits image retrieval and image retargeting. Through experiments on multiple challenging image and video segmentation benchmarks, our method offers consistently strong results and improves the state-of-the-art results for fully automatic segmentation of foreground objects.
Building on the proposed foreground segmentation method, I finally explore how to predict viewing angles to enhance photo composition after identifying those foreground objects. Specifically, I introduce snap angle prediction for 360° panoramas, which are a rich medium, yet notoriously difficult to visualize in the 2D image plane. I explore how intelligent rotations of a spherical image may enable content-aware projection with fewer perceptible distortions. Whereas existing approaches assume the viewpoint is fixed, intuitively some viewing angles within the sphere preserve high-level objects better than others. To discover the relationship between these optimal snap angles and the spherical panorama's content, I develop a reinforcement learning approach for the cubemap projection model. Implemented as a deep recurrent neural network, our method selects a sequence of rotation actions and receives reward for avoiding cube boundaries that overlap with important foreground objects. Our proposed method offers a 5x speedup compared to exhaustive search.
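The exhaustive baseline that the learned policy speeds up can be sketched on a toy equirectangular foreground mask: score each yaw rotation by how many foreground pixels land on cubemap face boundaries, then pick the rotation minimizing that overlap. The boundary layout and reward here are simplified assumptions, not the paper's cubemap geometry.

```python
import numpy as np

def boundary_overlap(fg_mask, yaw_px):
    """Count foreground pixels on the four vertical cubemap face
    boundaries (toy layout: every quarter of the width) after
    rotating the equirectangular mask by yaw_px columns."""
    H, W = fg_mask.shape
    rolled = np.roll(fg_mask, yaw_px, axis=1)
    cols = [(i * W) // 4 for i in range(4)]
    return int(rolled[:, cols].sum())

def best_snap_angle(fg_mask):
    """Exhaustive search over yaw rotations -- the slow baseline
    the learned policy replaces."""
    W = fg_mask.shape[1]
    return min(range(W), key=lambda y: boundary_overlap(fg_mask, y))

# toy mask: a foreground object sitting exactly on a face boundary
fg = np.zeros((4, 8), dtype=int)
fg[:, 0] = 1
```

An RL policy amortizes this search by proposing a short sequence of rotations instead of scoring every yaw, which is the source of the reported speedup.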
Throughout, I validate the strength of the proposed frameworks on multiple challenging datasets against a variety of previously established state-of-the-art methods and other pertinent baselines. Our experiments demonstrate the following: 1) our method can automatically identify the best moments from unedited videos; 2) our segmentation method substantially improves the state-of-the-art on foreground segmentation in images and videos and also benefits automatic photo composition; 3) our viewing angle prediction for 360° imagery can enhance the viewing experience. Although my thesis mainly focuses on passive cameras, a portion of the proposed methods are also applicable to general user generated images and videos.