View Synthesis of Dynamic Scenes based on Deep 3D Mask Volume
Image view synthesis has seen great success in reconstructing photorealistic visuals, thanks to deep learning and various novel representations. The next key step in immersive virtual experiences is view synthesis of dynamic scenes. However, several challenges exist due to the lack of high-quality training datasets and the additional time dimension of videos of dynamic scenes. To address these challenges, we introduce a multi-view video dataset, captured at 120 FPS with a custom 10-camera rig. The dataset contains 96 high-quality scenes showing various visual effects and human interactions in outdoor settings. We develop a new algorithm, Deep 3D Mask Volume, which enables temporally stable view extrapolation from binocular videos of dynamic scenes captured by static cameras. Our algorithm addresses the temporal inconsistency of disocclusions by identifying the error-prone areas with a 3D mask volume and replacing them with static background observed throughout the video. Our method enables manipulation in 3D space as opposed to simple 2D masks. We demonstrate better temporal stability than frame-by-frame static view synthesis methods or those that use 2D masks. The resulting view synthesis videos show minimal flickering artifacts and allow for larger translational movements.
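To make the compositing step concrete, here is a minimal sketch, assuming hypothetical names and shapes (the paper's mask occupies a 3D volume, which is what enables the manipulation in 3D space mentioned above; it is collapsed to a per-frame, per-pixel soft mask here purely for brevity):

```python
# Minimal sketch, not the authors' code: blend per-frame renderings
# with a static background using a soft mask. All shapes and names
# are assumptions for illustration.
import numpy as np

def composite_with_mask(rendered, static_bg, mask):
    """rendered:  (T, H, W, 3) per-frame view-synthesis output
    static_bg:    (H, W, 3)    background observed across the video
    mask:         (T, H, W, 1) in [0, 1]; values near 1 mark
                  error-prone disoccluded regions to be replaced."""
    return mask * static_bg[None] + (1.0 - mask) * rendered

# Example with 8 frames of 240x320 output.
T, H, W = 8, 240, 320
out = composite_with_mask(
    np.random.rand(T, H, W, 3),   # stand-in renderings
    np.random.rand(H, W, 3),      # stand-in background
    np.random.rand(T, H, W, 1),   # stand-in mask
)
assert out.shape == (T, H, W, 3)
```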
Shell-Based Geometric Image and Video Inpainting
The subject of this thesis is a class of fast inpainting methods (image or video) based on the idea of filling the inpainting domain in successive shells from its boundary inwards. Image pixels (or video voxels) are filled by assigning them a color equal to a weighted average of either their already filled neighbors (the "direct" form of the method) or those neighbors plus additional neighbors within the current shell (the "semi-implicit" form). In the direct form, pixels (voxels) in the current shell may be filled independently, but in the semi-implicit form they are filled simultaneously by solving a linear system. We focus in this thesis mainly on the image inpainting case, where the literature contains several methods corresponding to the direct form of the method; the semi-implicit form is introduced here for the first time. These methods effectively differ only in the order in which pixels (voxels) are filled, the weights used for averaging, and the neighborhood that is averaged over. All of them are very fast, but at the same time all of them leave undesirable artifacts such as "kinking" (bending) or blurring of extrapolated isophotes.
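As a concrete illustration, the following is a minimal sketch of the direct form (our own code, using uniform weights over the 8-neighborhood; the methods in this class differ precisely in the fill order, the averaging weights, and the neighborhood):

```python
# Direct shell-based fill: each pass fills the boundary shell of the
# hole with a weighted average of already-filled neighbours.
import numpy as np

OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
           (0, 1), (1, -1), (1, 0), (1, 1)]

def direct_shell_inpaint(image, known):
    """image: (H, W) greyscale array; known: (H, W) bool mask,
    False inside the inpainting domain."""
    img, known = image.copy(), known.copy()
    H, W = img.shape
    while not known.all():
        # Current shell: unfilled pixels with >= 1 filled neighbour.
        shell = []
        for y in range(H):
            for x in range(W):
                if known[y, x]:
                    continue
                nbrs = [(y + dy, x + dx) for dy, dx in OFFSETS
                        if 0 <= y + dy < H and 0 <= x + dx < W
                        and known[y + dy, x + dx]]
                if nbrs:
                    shell.append((y, x, nbrs))
        if not shell:
            break  # remaining pixels have no known boundary
        # Direct form: shell pixels are filled independently; uniform
        # weights here, direction-adapted weights in e.g. Guidefill.
        for y, x, nbrs in shell:
            img[y, x] = np.mean([img[ny, nx] for ny, nx in nbrs])
        for y, x, _ in shell:
            known[y, x] = True
    return img
```

The semi-implicit form would additionally couple the shell pixels to one another, so each pass solves a small linear system instead of independent averages.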
This thesis has two main goals. First, we introduce new algorithms within this class, aimed at reducing or eliminating these artifacts and targeting a specific application: the 3D conversion of images and film. The first part of this thesis is concerned with introducing 3D conversion as well as Guidefill, a method in the above class adapted to the inpainting problems arising in 3D conversion. However, the second and more significant goal of this thesis is to study these algorithms as a class. In particular, we develop a mathematical theory aimed at understanding the origins of the artifacts mentioned above. Through this, we seek to understand which artifacts can be eliminated (and how), and which artifacts are inevitable (and why). Most of the thesis is occupied with this second goal.
Our theory is based on two separate limits. The first is a continuum limit, in which the pixel width h → 0 and the algorithm converges to a partial differential equation. The second is an asymptotic limit in which h is very small but non-zero. This latter limit, which is based on a connection to random walks, relates the inpainted solution to a type of discrete convolution. The former is useful for studying kinking artifacts, while the latter is useful for studying blur. Although all the theoretical work has been done in the context of image inpainting, experimental evidence is presented suggesting a simple generalization to video.
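To convey the flavor of the first limit, the following schematic display is a paraphrase of the kind of statement involved, not the thesis' exact theorem:

```latex
% Schematic paraphrase (our own, stated without the precise
% weight-dependent hypotheses): as the pixel width $h \to 0$, the
% shell-based weighted-average fill converges to the solution of a
% linear transport equation on the inpainting domain $\Omega$,
\[
  \mathbf{c}(x) \cdot \nabla u(x) = 0, \qquad x \in \Omega,
\]
% with boundary data taken from the known image on part of
% $\partial\Omega$. The transport direction $\mathbf{c}$ is induced
% by the averaging weights; isophotes are extrapolated along
% $\mathbf{c}$, and kinking corresponds to a mismatch between
% $\mathbf{c}$ and the incoming isophote direction at the boundary.
```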
Finally, in the last part of the thesis we explore shell-based video inpainting. In particular, we introduce spacetime transport, which is a natural generalization of the ideas of Guidefill and its predecessor, coherence transport, to three dimensions (two spatial dimensions plus one time dimension). Spacetime transport is shown to have much in common with shell-based image inpainting methods. In particular, kinking and blur artifacts persist, and the former may be alleviated in exactly the same way as in two dimensions. At the same time, spacetime transport is shown to be related to optical flow based video inpainting. In particular, a connection is derived between spacetime transport and a generalized Lucas-Kanade optical flow that does not distinguish between time and space. This work was supported by a Cambridge Overseas Scholarship.
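One plausible reading of that final connection, given here as our own paraphrase rather than the thesis' formulation, is that classical Lucas-Kanade privileges time by fixing the temporal component of the motion, while the generalized version estimates a full spacetime direction:

```latex
% Classical Lucas--Kanade solves, over a window $W$,
\[
  \min_{(u,v)} \sum_{(x,y,t) \in W} \bigl( I_x u + I_y v + I_t \bigr)^2,
\]
% implicitly fixing the temporal step to 1. A variant that does not
% distinguish between time and space instead estimates a unit
% direction $g \in \mathbb{R}^3$ in $(x, y, t)$:
\[
  \min_{\lVert g \rVert = 1} \sum_{(x,y,t) \in W}
    \bigl( \nabla_{(x,y,t)} I \cdot g \bigr)^2,
\]
% whose minimiser is the eigenvector of the $3 \times 3$ spacetime
% structure tensor $\sum_W \nabla I \, \nabla I^{\top}$ with the
% smallest eigenvalue.
```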
Guidefill: GPU accelerated, artist guided geometric inpainting for 3D conversion of film
The conversion of traditional film into stereo 3D has become an important problem in the past decade. One of the main bottlenecks is a disocclusion step, which in commercial 3D conversion is usually done by teams of artists armed with a toolbox of inpainting algorithms. A current difficulty is that most available algorithms are either too slow for interactive use or provide no intuitive means for users to tweak the output. In this paper we present a new fast inpainting algorithm based on transporting along automatically detected splines, which the user may edit. Our algorithm is implemented on the GPU and fills the inpainting domain in successive shells that adapt their shape on the fly. In order to allocate GPU resources as efficiently as possible, we propose a parallel algorithm to track the inpainting interface as it evolves, ensuring that no resources are wasted on pixels that are not currently being worked on. Theoretical analyses of the time and processor complexity of our algorithm without and with tracking (as well as numerous numerical experiments) demonstrate the merits of the latter. Our transport mechanism is similar to the one used in coherence transport [F. Bornemann and T. März, J. Math. Imaging Vision, 28 (2007), pp. 259-278; T. März, SIAM J. Imaging Sci., 4 (2011), pp. 981-1000] but improves upon it by correcting a "kinking" phenomenon whereby extrapolated isophotes may bend at the boundary of the inpainting domain. Theoretical results explaining this phenomenon and its resolution are presented. Although our method ignores texture, in many cases this is not a problem due to the thin inpainting domains in 3D conversion. Experimental results show that our method can achieve a visual quality that is competitive with the state of the art while maintaining interactive speeds and providing the user with an intuitive interface to tweak the results. The work of the first author was supported by the Cambridge Commonwealth Trust and the Cambridge Centre for Analysis. The work of the third author was supported by the Leverhulme Trust project Breaking the Nonconvexity Barrier, the EPSRC grants EP/M00483X/1 and EP/N014588/1, the Cantab Capital Institute for the Mathematics of Information, the CHiPS (Horizon 2020 RISE project grant), the Global Alliance project "Statistical and Mathematical Theory of Imaging," and the Alan Turing Institute.
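The interface-tracking idea can be sketched on the CPU as follows (our illustration under simplifying assumptions, not the paper's GPU kernels): only an active set of hole pixels adjacent to filled ones is processed in each pass, so no work is spent on pixels far from the current shell.

```python
# Shell fill with an explicitly tracked interface (uniform weights
# stand in for the paper's spline-guided transport weights).
import numpy as np

OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
           (0, 1), (1, -1), (1, 0), (1, 1)]

def fill_with_tracked_interface(img, known):
    H, W = img.shape
    # Seed: unknown pixels touching at least one known pixel.
    active = {(y, x) for y in range(H) for x in range(W)
              if not known[y, x] and any(
                  0 <= y + dy < H and 0 <= x + dx < W
                  and known[y + dy, x + dx] for dy, dx in OFFSETS)}
    while active:
        for y, x in active:   # shell pixels are independent
            vals = [img[y + dy, x + dx] for dy, dx in OFFSETS
                    if 0 <= y + dy < H and 0 <= x + dx < W
                    and known[y + dy, x + dx]]
            img[y, x] = np.mean(vals)
        for y, x in active:
            known[y, x] = True
        # Advance the interface from the just-filled pixels only,
        # rather than rescanning the whole image.
        active = {(y + dy, x + dx) for y, x in active
                  for dy, dx in OFFSETS
                  if 0 <= y + dy < H and 0 <= x + dx < W
                  and not known[y + dy, x + dx]}
    return img
```

On a GPU, each pass maps naturally to one kernel launch over the active set, which is what makes tracking the interface worthwhile.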
Discontinuity-Aware Base-Mesh Modeling of Depth for Scalable Multiview Image Synthesis and Compression
This thesis is concerned with the challenge of deriving disparity from sparsely communicated depth for performing disparity-compensated view synthesis for compression and rendering of multiview images. The modeling of depth is essential for deducing disparity at view locations where depth is not available and is also critical for visibility reasoning and occlusion handling.
This thesis first explores disparity derivation methods and disparity-compensated view synthesis approaches. Investigations reveal the merits of adopting a piece-wise continuous mesh description of depth for deriving disparity at target view locations to enable disparity-compensated backward warping of texture. Visibility information can be inferred thanks to the correspondence relationship between views that a mesh model provides, while the connectivity of a mesh model assists in resolving depth occlusion.
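For intuition, under standard rectified pinhole-camera assumptions (our notation, not the thesis'), disparity follows from depth as d = fB/Z, and the derived disparity then drives backward warping of reference texture:

```python
# Hedged sketch of disparity derivation and backward texture warping
# for a rectified stereo pair; all parameter values are hypothetical.
import numpy as np

def disparity_from_depth(depth, focal_px, baseline_m):
    """d = f * B / Z, in pixels, for a rectified pair."""
    return focal_px * baseline_m / np.maximum(depth, 1e-6)

def backward_warp_row(reference_row, disparity_row):
    """Resample one scanline of the reference view at x - d(x)."""
    W = reference_row.shape[0]
    src_x = np.clip(np.arange(W) - disparity_row, 0, W - 1)
    x0 = np.floor(src_x).astype(int)   # left sample
    x1 = np.minimum(x0 + 1, W - 1)     # right sample
    a = src_x - x0                     # linear-interpolation weight
    return (1 - a) * reference_row[x0] + a * reference_row[x1]

# A flat scene 2 m away, 800 px focal length, 10 cm baseline
# gives a uniform 40 px disparity.
disp = disparity_from_depth(np.full(640, 2.0), 800.0, 0.1)
warped = backward_warp_row(np.random.rand(640), disp)
```

A mesh-based depth model serves the same role at sparse vertices, with disparity interpolated across faces rather than computed per pixel.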
The recent JPEG 2000 Part-17 extension defines tools for scalable coding of discontinuous media using a breakpoint-dependent DWT, where breakpoints describe discontinuity boundary geometry. This thesis proposes a method to efficiently reconstruct depth coded with JPEG 2000 Part-17 as a piece-wise continuous mesh whose discontinuities are driven by the encoded breakpoints. Results show that the proposed mesh can accurately represent decoded depth, with a complexity that scales with the decoded depth quality.
The piece-wise continuous mesh model anchored at a single viewpoint, or base-view, can be augmented to form a multi-layered structure in which the underlying layers carry depth information for regions that are occluded at the base-view. Such a consolidated mesh representation is termed a base-mesh model and can be projected to many viewpoints to deduce complete, inherently consistent disparity fields between any pair of views. Experimental results demonstrate the superior performance of the base-mesh model in multiview synthesis and compression compared to other state-of-the-art methods, including the JPEG Pleno light field codec. The proposed base-mesh model departs greatly from the conventional pixel-wise or block-wise depth models, and their forward depth mapping for deriving disparity, ingrained in existing multiview processing systems.
When performing disparity-compensated view synthesis, there can be regions for which reference texture is unavailable, and inpainting is required. A new depth-guided texture inpainting algorithm is proposed to restore occluded texture in regions where depth information is either available or can be inferred using the base-mesh model.
Robust density modelling using the Student's t-distribution for human action recognition
The extraction of human features from videos is often inaccurate and prone to outliers. Such outliers can severely affect density modelling when the Gaussian distribution is used as the model, since it is highly sensitive to outliers. The Gaussian distribution is also often used as the base component of graphical models for recognising human actions in videos (hidden Markov models and others), and the presence of outliers can significantly affect the recognition accuracy. In contrast, the Student's t-distribution is more robust to outliers and can be exploited to improve the recognition rate in the presence of abnormal data. In this paper, we present an HMM that uses mixtures of t-distributions as observation probabilities and show, in experiments on two well-known datasets (Weizmann, MuHAVi), a remarkable improvement in classification accuracy. © 2011 IEEE
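A small self-contained illustration of the robustness claim (our example, not the paper's implementation): an outlying observation is penalized quadratically under the Gaussian but only logarithmically under the t, so a few corrupted feature vectors cannot dominate a state's observation likelihood.

```python
# Compare log-densities of an inlier and an outlier under a Gaussian
# and a Student's t with 3 degrees of freedom (same location/scale).
import numpy as np
from scipy import stats

mu, sigma, nu = 0.0, 1.0, 3.0
for x in (0.5, 8.0):
    lg = stats.norm.logpdf(x, loc=mu, scale=sigma)
    lt = stats.t.logpdf(x, nu, loc=mu, scale=sigma)
    print(f"x={x:4.1f}  Gaussian logpdf={lg:7.2f}  t logpdf={lt:7.2f}")
# At x=8 the Gaussian log-density is about -33, the t's about -7:
# the heavy tails keep one outlier from overwhelming the likelihood.
```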
Change blindness: eradication of gestalt strategies
Arrays of eight texture-defined rectangles were used as stimuli in a one-shot change blindness (CB) task in which there was a 50% chance that one rectangle would change orientation between two successive presentations separated by an interval. CB was eliminated by cueing the target rectangle in the first stimulus, reduced by cueing in the interval, and unaffected by cueing in the second presentation. This supports the idea that a representation was formed that persisted through the interval before being 'overwritten' by the second presentation (Landman et al., 2003, Vision Research 43, 149-164). Another possibility is that participants used some kind of grouping or Gestalt strategy. To test this we changed the spatial position of the rectangles in the second presentation by shifting them along imaginary spokes (by ±1 degree) emanating from the central fixation point. There was no significant difference in performance between this and the standard task [F(1,4)=2.565, p=0.185]. This may suggest two things: (i) Gestalt grouping is not used as a strategy in these tasks, and (ii) it gives further weight to the argument that objects may be stored in and retrieved from a pre-attentional store during this task.
Agricultural Structures and Mechanization
In our globalized world, the need to produce safe, high-quality food has increased exponentially in recent decades to meet the growing demands of the world population. This expectation is being met by acting at multiple levels, but mainly through the introduction of new technologies in the agricultural and agri-food sectors. In this context, agricultural, livestock, and agro-industrial buildings and agrarian infrastructure are being built on the basis of sophisticated designs that integrate environmental, landscape, and occupational safety considerations, new construction materials, new facilities, and mechanization with state-of-the-art automatic systems, using calculation models and computer programs. It is necessary to promote research and the dissemination of results in the field of mechanization and agricultural structures, specifically with regard to farm buildings and the rural landscape, land and water use and the environment, power and machinery, information systems and precision farming, processing and post-harvest technology and logistics, energy and non-food production technology, systems engineering and management, and fruit and vegetable cultivation systems. This Special Issue focuses on the role that mechanization and agricultural structures play in the sustained production of high-quality food over time. For this reason, it publishes high-quality, highly interdisciplinary studies from disparate research fields including agriculture, engineering design, calculation and modeling, landscaping, environmentalism, and even ergonomics and occupational risk prevention.
Learning human activities and poses with interconnected data sources
Understanding human actions and poses in images or videos is a challenging problem in computer vision. There are several related topics, such as action recognition, pose estimation, human-object interaction, and activity detection. Knowledge of actions and poses could benefit many applications, including video search, surveillance, auto-tagging, event detection, and human-computer interfaces.

To understand human actions and poses, we need to address several challenges. First, humans can perform an enormous number of poses. For example, simply to move forward, we can crawl, walk, run, or sprint. These poses all look different, and examples are required to cover the variations. Second, the appearance of a person's pose changes when viewed from different angles, so the learned action model needs to cover the variation across views. Third, many actions involve interactions between people and other objects, so we need to account for the appearance changes of those objects as well. Fourth, collecting such data for learning is difficult and expensive. Last, even if we can learn a good model for an action, localizing when and where the action happens in a long video remains difficult due to the large search space.

My key idea for alleviating these obstacles is to discover the underlying patterns that connect the information from different data sources. Why should such patterns exist? The intuition is that all people share the same articulated physical structure. Though we can change our pose, there are common constraints that limit what our pose can be and how it can change over time. All types of human data follow these rules, which can therefore serve as prior knowledge or regularization in our learning framework. If we can exploit these tendencies, we can extract additional information from the data and use it to improve the learning of human actions and poses. In particular, we can find patterns for how our pose varies over time, how our appearance looks from a specific view, what our pose is when we interact with objects with certain properties, and how parts of our body configuration are shared across different poses. Once learned, these patterns can be used to interconnect and extrapolate knowledge between different data sources.

To this end, I propose several new ways to connect human activity data. First, I show how to connect snapshot images and videos by exploring the patterns of how our pose changes over time. Building on this idea, I explore how to connect human poses across multiple views by discovering the correlations between different poses and the latent factors that drive viewpoint variation. In addition, I consider whether there are also patterns connecting our poses and nearby objects when we interact with them, and I explore how the predicted interaction can serve as a cue for existing recognition problems, including image re-targeting and image description generation. Finally, after learning models that effectively incorporate these patterns, I propose a robust approach to efficiently localize when and where a complex action happens in a video sequence. The variants of my proposed approaches offer a good trade-off between computational cost and detection accuracy. My thesis thus exploits various types of underlying patterns in human data.
The discovered structure is used to enhance the understanding of human actions and poses. With the proposed methods, we are able to 1) learn an action from very few snapshots by connecting them to a pool of label-free videos, 2) infer the pose for some views even without any examples by connecting the latent factors between different views, 3) predict the location of an object that a person is interacting with, independently of the type and appearance of that object, and then use the inferred interaction as a cue to improve recognition, and 4) localize an action in a long, complex video. These approaches improve existing frameworks for understanding human actions and poses without extra data-collection cost and broaden the range of problems that we can tackle.