458 research outputs found

    Enforcing Realism and Temporal Consistency for Large-Scale Video Inpainting

    Full text link
    Today, people are consuming more videos than ever before. At the same time, video manipulation has rapidly been gaining traction due to the influence of viral videos, as well as the convenience of editing software. Although video manipulation has legitimate entertainment purposes, it can also be incredibly destructive. In order to understand the positive and negative consequences of media manipulation---as well as to maintain the integrity of mass media---it is important to investigate the capabilities of video manipulation techniques. In this dissertation, we focus on the manipulation task of video inpainting, where the goal is to automatically fill in missing parts of a masked video with semantically relevant content. Inpainting results should possess high visual quality with respect to reconstruction performance, realism, and temporal consistency, i.e., they should faithfully recreate missing contents in a way that resembles the real world and exhibits minimal flickering artifacts. Two major challenges have impeded progress toward improving visual quality: semantic ambiguity and diagnostic evaluation. Semantic ambiguity exists for any masked video due to several plausible explanations of the events in the observed scene; however, prior methods have struggled with ambiguity due to their limited temporal contexts. As for diagnostic evaluation, prior work has overemphasized aggregate analysis on large datasets and underemphasized fine-grained analysis on modern inpainting failure modes; as a result, the expected behaviors of models under specific scenarios have remained poorly understood. Our work improves on both models and evaluation techniques for video inpainting, thereby providing deeper insight into how an inpainting model's design impacts the visual quality of its outputs. To advance state-of-the-art in video inpainting, we propose two novel solutions that improve visual quality by expanding the available temporal context. Our first approach, bi-TAI, intelligently integrates information from multiple frames before and after the desired sequence. It produces more realistic results than prior work, which could only consume limited contextual information. Our second approach, HyperCon, suppresses flickering artifacts from frame-wise processing by identifying and propagating consistencies found in high frame-rate space; we successfully apply it to tasks as disparate as video inpainting and style transfer. Aside from methodological improvements, we also propose two novel evaluation tools to diagnose failure modes of modern video inpainting methods. Our first such contribution is the Moving Symbols dataset, which we use to characterize the sensitivity of a state-of-the-art video prediction model to controllable appearance and motion parameters. Our second contribution is the DEVIL benchmark, which provides a dataset and a comprehensive evaluation scheme to quantify how several semantic properties of the input video and mask affect video inpainting quality. Through models that exploit temporal context---as well as evaluation paradigms that reveal fine-grained failure modes of modern inpainting methods at scale---our contributions enforce better visual quality for video inpainting on a larger scale than prior work. We enable the production of more convincing manipulated videos for data processing and social media needs; we also establish replicable fine-grained analysis techniques to cultivate future progress in the field.PHDComputer Science & EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/169785/1/szetor_1.pd

    Image-Based Rendering Of Real Environments For Virtual Reality

    Get PDF

    Applying image processing techniques to pose estimation and view synthesis.

    Get PDF
    Fung Yiu-fai Phineas.Thesis (M.Phil.)--Chinese University of Hong Kong, 1999.Includes bibliographical references (leaves 142-148).Abstracts in English and Chinese.Chapter 1 --- Introduction --- p.1Chapter 1.1 --- Model-based Pose Estimation --- p.3Chapter 1.1.1 --- Application - 3D Motion Tracking --- p.4Chapter 1.2 --- Image-based View Synthesis --- p.4Chapter 1.3 --- Thesis Contribution --- p.7Chapter 1.4 --- Thesis Outline --- p.8Chapter 2 --- General Background --- p.9Chapter 2.1 --- Notations --- p.9Chapter 2.2 --- Camera Models --- p.10Chapter 2.2.1 --- Generic Camera Model --- p.10Chapter 2.2.2 --- Full-perspective Camera Model --- p.11Chapter 2.2.3 --- Affine Camera Model --- p.12Chapter 2.2.4 --- Weak-perspective Camera Model --- p.13Chapter 2.2.5 --- Paraperspective Camera Model --- p.14Chapter 2.3 --- Model-based Motion Analysis --- p.15Chapter 2.3.1 --- Point Correspondences --- p.16Chapter 2.3.2 --- Line Correspondences --- p.18Chapter 2.3.3 --- Angle Correspondences --- p.19Chapter 2.4 --- Panoramic Representation --- p.20Chapter 2.4.1 --- Static Mosaic --- p.21Chapter 2.4.2 --- Dynamic Mosaic --- p.22Chapter 2.4.3 --- Temporal Pyramid --- p.23Chapter 2.4.4 --- Spatial Pyramid --- p.23Chapter 2.5 --- Image Pre-processing --- p.24Chapter 2.5.1 --- Feature Extraction --- p.24Chapter 2.5.2 --- Spatial Filtering --- p.27Chapter 2.5.3 --- Local Enhancement --- p.31Chapter 2.5.4 --- Dynamic Range Stretching or Compression --- p.32Chapter 2.5.5 --- YIQ Color Model --- p.33Chapter 3 --- Model-based Pose Estimation --- p.35Chapter 3.1 --- Previous Work --- p.35Chapter 3.1.1 --- Estimation from Established Correspondences --- p.36Chapter 3.1.2 --- Direct Estimation from Image Intensities --- p.49Chapter 3.1.3 --- Perspective-3-Point Problem --- p.51Chapter 3.2 --- Our Iterative P3P Algorithm --- p.58Chapter 3.2.1 --- Gauss-Newton Method --- p.60Chapter 3.2.2 --- Dealing with Ambiguity --- p.61Chapter 3.2.3 --- 3D-to-3D Motion Estimation --- p.66Chapter 3.3 --- Experimental Results --- p.68Chapter 3.3.1 --- Synthetic Data --- p.68Chapter 3.3.2 --- Real Images --- p.72Chapter 3.4 --- Discussions --- p.73Chapter 4 --- Panoramic View Analysis --- p.76Chapter 4.1 --- Advanced Mosaic Representation --- p.76Chapter 4.1.1 --- Frame Alignment Policy --- p.77Chapter 4.1.2 --- Multi-resolution Representation --- p.77Chapter 4.1.3 --- Parallax-based Representation --- p.78Chapter 4.1.4 --- Multiple Moving Objects --- p.79Chapter 4.1.5 --- Layers and Tiles --- p.79Chapter 4.2 --- Panorama Construction --- p.79Chapter 4.2.1 --- Image Acquisition --- p.80Chapter 4.2.2 --- Image Alignment --- p.82Chapter 4.2.3 --- Image Integration --- p.88Chapter 4.2.4 --- Significant Residual Estimation --- p.89Chapter 4.3 --- Advanced Alignment Algorithms --- p.90Chapter 4.3.1 --- Patch-based Alignment --- p.91Chapter 4.3.2 --- Global Alignment (Block Adjustment) --- p.92Chapter 4.3.3 --- Local Alignment (Deghosting) --- p.93Chapter 4.4 --- Mosaic Application --- p.94Chapter 4.4.1 --- Visualization Tool --- p.94Chapter 4.4.2 --- Video Manipulation --- p.95Chapter 4.5 --- Experimental Results --- p.96Chapter 5 --- Panoramic Walkthrough --- p.99Chapter 5.1 --- Problem Statement and Notations --- p.100Chapter 5.2 --- Previous Work --- p.101Chapter 5.2.1 --- 3D Modeling and Rendering --- p.102Chapter 5.2.2 --- Branching Movies --- p.103Chapter 5.2.3 --- Texture Window Scaling --- p.104Chapter 5.2.4 --- Problems with Simple Texture Window Scaling --- p.105Chapter 5.3 --- Our Walkthrough Approach --- p.106Chapter 5.3.1 --- Cylindrical Projection onto Image Plane --- p.106Chapter 5.3.2 --- Generating Intermediate Frames --- p.108Chapter 5.3.3 --- Occlusion Handling --- p.114Chapter 5.4 --- Experimental Results --- p.116Chapter 5.5 --- Discussions --- p.116Chapter 6 --- Conclusion --- p.121Chapter A --- Formulation of Fischler and Bolles' Method for P3P Problems --- p.123Chapter B --- Derivation of z1 and z3 in terms of z2 --- p.127Chapter C --- Derivation of e1 and e2 --- p.129Chapter D --- Derivation of the Update Rule for Gauss-Newton Method --- p.130Chapter E --- Proof of (λ1λ2-λ 4)>〉0 --- p.132Chapter F --- Derivation of φ and hi --- p.133Chapter G --- Derivation of w1j to w4j --- p.134Chapter H --- More Experimental Results on Panoramic Stitching Algorithms --- p.138Bibliography --- p.14

    Radiometric calibration methods from image sequences

    Get PDF
    In many computer vision systems, an image of a scene is assumed to directly reflect the scene radiance. However, this is not the case for most cameras as the radiometric response function which is a mapping from the scene radiance to the image brightness is nonlinear. In addition, the exposure settings of the camera are adjusted (often in the auto-exposure mode) according to the dynamic range of the scene changing the appearance of the scene in the images. Vignetting effect which refers to the gradual fading-out of an image at points near its periphery also contributes in changing the scene appearance in images. In this dissertation, I present several algorithms to compute the radiometric properties of a camera which enable us to find the relationship between the image brightness and the scene radiance. First, I introduce an algorithm to compute the vignetting function, the response function, and the exposure values that fully explain the radiometric image formation process from a set of images of a scene taken with different and unknown exposure values. One of the key features of the proposed method is that the movement of the camera is not limited when taking the pictures whereas most existing methods limit the motion of the camera. Then I present a joint feature tracking and radiometric calibration scheme which performs an integrated radiometric calibration in contrast to previous radiometric calibration techniques which require the correspondences as an input which leads to a chicken-and-egg problem as precise tracking requires accurate radiometric calibration. By combining both into an integrated approach we solve this chicken-and-egg problem. Finally, I propose a radiometric calibration method suited for a set of images of an outdoor scene taken at a regular interval over a period of time. This type of data is a challenging problem because the illumination for each image is changing causing the exposure of the camera to change and the conventional radiometric calibration framework cannot be used for this type of data. The proposed methods are applied to radiometrically align images for seamless mosaics and 3D model textures, to create high dynamic range mosaics, and to build an adaptive stereo system

    Synthesizing and Editing Photo-realistic Visual Objects

    Get PDF
    In this thesis we investigate novel methods of synthesizing new images of a deformable visual object using a collection of images of the object. We investigate both parametric and non-parametric methods as well as a combination of the two methods for the problem of image synthesis. Our main focus are complex visual objects, specifically deformable objects and objects with varying numbers of visible parts. We first introduce sketch-driven image synthesis system, which allows the user to draw ellipses and outlines in order to sketch a rough shape of animals as a constraint to the synthesized image. This system interactively provides feedback in the form of ellipse and contour suggestions to the partial sketch of the user. The user's sketch guides the non-parametric synthesis algorithm that blends patches from two exemplar images in a coarse-to-fine fashion to create a final image. We evaluate the method and synthesized images through two user studies. Instead of non-parametric blending of patches, a parametric model of the appearance is more desirable as its appearance representation is shared between all images of the dataset. Hence, we propose Context-Conditioned Component Analysis, a probabilistic generative parametric model, which described images with a linear combination of basis functions. The basis functions are evaluated for each pixel using a context vector computed from the local shape information. We evaluate C-CCA qualitatively and quantitatively on inpainting, appearance transfer and reconstruction tasks. Drawing samples of C-CCA generates novel, globally-coherent images, which, unfortunately, lack high-frequency details due to dimensionality reduction and misalignment. We develop a non-parametric model that enhances the samples of C-CCA with locally-coherent, high-frequency details. The non-parametric model efficiently finds patches from the dataset that match the C-CCA sample and blends the patches together. We analyze the results of the combined method on the datasets of horse and elephant images

    Automatic video segmentation employing object/camera modeling techniques

    Get PDF
    Practically established video compression and storage techniques still process video sequences as rectangular images without further semantic structure. However, humans watching a video sequence immediately recognize acting objects as semantic units. This semantic object separation is currently not reflected in the technical system, making it difficult to manipulate the video at the object level. The realization of object-based manipulation will introduce many new possibilities for working with videos like composing new scenes from pre-existing video objects or enabling user-interaction with the scene. Moreover, object-based video compression, as defined in the MPEG-4 standard, can provide high compression ratios because the foreground objects can be sent independently from the background. In the case that the scene background is static, the background views can even be combined into a large panoramic sprite image, from which the current camera view is extracted. This results in a higher compression ratio since the sprite image for each scene only has to be sent once. A prerequisite for employing object-based video processing is automatic (or at least user-assisted semi-automatic) segmentation of the input video into semantic units, the video objects. This segmentation is a difficult problem because the computer does not have the vast amount of pre-knowledge that humans subconsciously use for object detection. Thus, even the simple definition of the desired output of a segmentation system is difficult. The subject of this thesis is to provide algorithms for segmentation that are applicable to common video material and that are computationally efficient. The thesis is conceptually separated into three parts. In Part I, an automatic segmentation system for general video content is described in detail. Part II introduces object models as a tool to incorporate userdefined knowledge about the objects to be extracted into the segmentation process. Part III concentrates on the modeling of camera motion in order to relate the observed camera motion to real-world camera parameters. The segmentation system that is described in Part I is based on a background-subtraction technique. The pure background image that is required for this technique is synthesized from the input video itself. Sequences that contain rotational camera motion can also be processed since the camera motion is estimated and the input images are aligned into a panoramic scene-background. This approach is fully compatible to the MPEG-4 video-encoding framework, such that the segmentation system can be easily combined with an object-based MPEG-4 video codec. After an introduction to the theory of projective geometry in Chapter 2, which is required for the derivation of camera-motion models, the estimation of camera motion is discussed in Chapters 3 and 4. It is important that the camera-motion estimation is not influenced by foreground object motion. At the same time, the estimation should provide accurate motion parameters such that all input frames can be combined seamlessly into a background image. The core motion estimation is based on a feature-based approach where the motion parameters are determined with a robust-estimation algorithm (RANSAC) in order to distinguish the camera motion from simultaneously visible object motion. Our experiments showed that the robustness of the original RANSAC algorithm in practice does not reach the theoretically predicted performance. An analysis of the problem has revealed that this is caused by numerical instabilities that can be significantly reduced by a modification that we describe in Chapter 4. The synthetization of static-background images is discussed in Chapter 5. In particular, we present a new algorithm for the removal of the foreground objects from the background image such that a pure scene background remains. The proposed algorithm is optimized to synthesize the background even for difficult scenes in which the background is only visible for short periods of time. The problem is solved by clustering the image content for each region over time, such that each cluster comprises static content. Furthermore, it is exploited that the times, in which foreground objects appear in an image region, are similar to the corresponding times of neighboring image areas. The reconstructed background could be used directly as the sprite image in an MPEG-4 video coder. However, we have discovered that the counterintuitive approach of splitting the background into several independent parts can reduce the overall amount of data. In the case of general camera motion, the construction of a single sprite image is even impossible. In Chapter 6, a multi-sprite partitioning algorithm is presented, which separates the video sequence into a number of segments, for which independent sprites are synthesized. The partitioning is computed in such a way that the total area of the resulting sprites is minimized, while simultaneously satisfying additional constraints. These include a limited sprite-buffer size at the decoder, and the restriction that the image resolution in the sprite should never fall below the input-image resolution. The described multisprite approach is fully compatible to the MPEG-4 standard, but provides three advantages. First, any arbitrary rotational camera motion can be processed. Second, the coding-cost for transmitting the sprite images is lower, and finally, the quality of the decoded sprite images is better than in previously proposed sprite-generation algorithms. Segmentation masks for the foreground objects are computed with a change-detection algorithm that compares the pure background image with the input images. A special effect that occurs in the change detection is the problem of image misregistration. Since the change detection compares co-located image pixels in the camera-motion compensated images, a small error in the motion estimation can introduce segmentation errors because non-corresponding pixels are compared. We approach this problem in Chapter 7 by integrating risk-maps into the segmentation algorithm that identify pixels for which misregistration would probably result in errors. For these image areas, the change-detection algorithm is modified to disregard the difference values for the pixels marked in the risk-map. This modification significantly reduces the number of false object detections in fine-textured image areas. The algorithmic building-blocks described above can be combined into a segmentation system in various ways, depending on whether camera motion has to be considered or whether real-time execution is required. These different systems and example applications are discussed in Chapter 8. Part II of the thesis extends the described segmentation system to consider object models in the analysis. Object models allow the user to specify which objects should be extracted from the video. In Chapters 9 and 10, a graph-based object model is presented in which the features of the main object regions are summarized in the graph nodes, and the spatial relations between these regions are expressed with the graph edges. The segmentation algorithm is extended by an object-detection algorithm that searches the input image for the user-defined object model. We provide two objectdetection algorithms. The first one is specific for cartoon sequences and uses an efficient sub-graph matching algorithm, whereas the second processes natural video sequences. With the object-model extension, the segmentation system can be controlled to extract individual objects, even if the input sequence comprises many objects. Chapter 11 proposes an alternative approach to incorporate object models into a segmentation algorithm. The chapter describes a semi-automatic segmentation algorithm, in which the user coarsely marks the object and the computer refines this to the exact object boundary. Afterwards, the object is tracked automatically through the sequence. In this algorithm, the object model is defined as the texture along the object contour. This texture is extracted in the first frame and then used during the object tracking to localize the original object. The core of the algorithm uses a graph representation of the image and a newly developed algorithm for computing shortest circular-paths in planar graphs. The proposed algorithm is faster than the currently known algorithms for this problem, and it can also be applied to many alternative problems like shape matching. Part III of the thesis elaborates on different techniques to derive information about the physical 3-D world from the camera motion. In the segmentation system, we employ camera-motion estimation, but the obtained parameters have no direct physical meaning. Chapter 12 discusses an extension to the camera-motion estimation to factorize the motion parameters into physically meaningful parameters (rotation angles, focal-length) using camera autocalibration techniques. The speciality of the algorithm is that it can process camera motion that spans several sprites by employing the above multi-sprite technique. Consequently, the algorithm can be applied to arbitrary rotational camera motion. For the analysis of video sequences, it is often required to determine and follow the position of the objects. Clearly, the object position in image coordinates provides little information if the viewing direction of the camera is not known. Chapter 13 provides a new algorithm to deduce the transformation between the image coordinates and the real-world coordinates for the special application of sport-video analysis. In sport videos, the camera view can be derived from markings on the playing field. For this reason, we employ a model of the playing field that describes the arrangement of lines. After detecting significant lines in the input image, a combinatorial search is carried out to establish correspondences between lines in the input image and lines in the model. The algorithm requires no information about the specific color of the playing field and it is very robust to occlusions or poor lighting conditions. Moreover, the algorithm is generic in the sense that it can be applied to any type of sport by simply exchanging the model of the playing field. In Chapter 14, we again consider panoramic background images and particularly focus ib their visualization. Apart from the planar backgroundsprites discussed previously, a frequently-used visualization technique for panoramic images are projections onto a cylinder surface which is unwrapped into a rectangular image. However, the disadvantage of this approach is that the viewer has no good orientation in the panoramic image because he looks into all directions at the same time. In order to provide a more intuitive presentation of wide-angle views, we have developed a visualization technique specialized for the case of indoor environments. We present an algorithm to determine the 3-D shape of the room in which the image was captured, or, more generally, to compute a complete floor plan if several panoramic images captured in each of the rooms are provided. Based on the obtained 3-D geometry, a graphical model of the rooms is constructed, where the walls are displayed with textures that are extracted from the panoramic images. This representation enables to conduct virtual walk-throughs in the reconstructed room and therefore, provides a better orientation for the user. Summarizing, we can conclude that all segmentation techniques employ some definition of foreground objects. These definitions are either explicit, using object models like in Part II of this thesis, or they are implicitly defined like in the background synthetization in Part I. The results of this thesis show that implicit descriptions, which extract their definition from video content, work well when the sequence is long enough to extract this information reliably. However, high-level semantics are difficult to integrate into the segmentation approaches that are based on implicit models. Intead, those semantics should be added as postprocessing steps. On the other hand, explicit object models apply semantic pre-knowledge at early stages of the segmentation. Moreover, they can be applied to short video sequences or even still pictures since no background model has to be extracted from the video. The definition of a general object-modeling technique that is widely applicable and that also enables an accurate segmentation remains an important yet challenging problem for further research

    4th Annual Fall Undergraduate Research Symposium

    Get PDF
    • …
    corecore