1,985 research outputs found

    Temporal View Synthesis of Dynamic Scenes through 3D Object Motion Estimation with Multi-Plane Images

    Full text link
    The challenge of graphically rendering high frame-rate videos on low compute devices can be addressed through periodic prediction of future frames to enhance the user experience in virtual reality applications. This is studied through the problem of temporal view synthesis (TVS), where the goal is to predict the next frames of a video given the previous frames and the head poses of the previous and the next frames. In this work, we consider the TVS of dynamic scenes in which both the user and objects are moving. We design a framework that decouples the motion into user and object motion to effectively use the available user motion while predicting the next frames. We predict the motion of objects by isolating and estimating the 3D object motion in the past frames and then extrapolating it. We employ multi-plane images (MPI) as a 3D representation of the scenes and model the object motion as the 3D displacement between the corresponding points in the MPI representation. In order to handle the sparsity in MPIs while estimating the motion, we incorporate partial convolutions and masked correlation layers to estimate corresponding points. The predicted object motion is then integrated with the given user or camera motion to generate the next frame. Using a disocclusion infilling module, we synthesize the regions uncovered due to the camera and object motion. We develop a new synthetic dataset for TVS of dynamic scenes consisting of 800 videos at full HD resolution. We show through experiments on our dataset and the MPI Sintel dataset that our model outperforms all the competing methods in the literature.Comment: To appear in ISMAR 2022; Project website: https://nagabhushansn95.github.io/publications/2022/DeCOMPnet.htm

    Robust density modelling using the student's t-distribution for human action recognition

    Full text link
    The extraction of human features from videos is often inaccurate and prone to outliers. Such outliers can severely affect density modelling when the Gaussian distribution is used as the model since it is highly sensitive to outliers. The Gaussian distribution is also often used as base component of graphical models for recognising human actions in the videos (hidden Markov model and others) and the presence of outliers can significantly affect the recognition accuracy. In contrast, the Student's t-distribution is more robust to outliers and can be exploited to improve the recognition rate in the presence of abnormal data. In this paper, we present an HMM which uses mixtures of t-distributions as observation probabilities and show how experiments over two well-known datasets (Weizmann, MuHAVi) reported a remarkable improvement in classification accuracy. © 2011 IEEE

    Application of Machine Learning within Visual Content Production

    Get PDF
    We are living in an era where digital content is being produced at a dazzling pace. The heterogeneity of contents and contexts is so varied that a numerous amount of applications have been created to respond to people and market demands. The visual content production pipeline is the generalisation of the process that allows a content editor to create and evaluate their product, such as a video, an image, a 3D model, etc. Such data is then displayed on one or more devices such as TVs, PC monitors, virtual reality head-mounted displays, tablets, mobiles, or even smartwatches. Content creation can be simple as clicking a button to film a video and then share it into a social network, or complex as managing a dense user interface full of parameters by using keyboard and mouse to generate a realistic 3D model for a VR game. In this second example, such sophistication results in a steep learning curve for beginner-level users. In contrast, expert users regularly need to refine their skills via expensive lessons, time-consuming tutorials, or experience. Thus, user interaction plays an essential role in the diffusion of content creation software, primarily when it is targeted to untrained people. In particular, with the fast spread of virtual reality devices into the consumer market, new opportunities for designing reliable and intuitive interfaces have been created. Such new interactions need to take a step beyond the point and click interaction typical of the 2D desktop environment. The interactions need to be smart, intuitive and reliable, to interpret 3D gestures and therefore, more accurate algorithms are needed to recognise patterns. In recent years, machine learning and in particular deep learning have achieved outstanding results in many branches of computer science, such as computer graphics and human-computer interface, outperforming algorithms that were considered state of the art, however, there are only fleeting efforts to translate this into virtual reality. In this thesis, we seek to apply and take advantage of deep learning models to two different content production pipeline areas embracing the following subjects of interest: advanced methods for user interaction and visual quality assessment. First, we focus on 3D sketching to retrieve models from an extensive database of complex geometries and textures, while the user is immersed in a virtual environment. We explore both 2D and 3D strokes as tools for model retrieval in VR. Therefore, we implement a novel system for improving accuracy in searching for a 3D model. We contribute an efficient method to describe models through 3D sketch via an iterative descriptor generation, focusing both on accuracy and user experience. To evaluate it, we design a user study to compare different interactions for sketch generation. Second, we explore the combination of sketch input and vocal description to correct and fine-tune the search for 3D models in a database containing fine-grained variation. We analyse sketch and speech queries, identifying a way to incorporate both of them into our system's interaction loop. Third, in the context of the visual content production pipeline, we present a detailed study of visual metrics. We propose a novel method for detecting rendering-based artefacts in images. It exploits analogous deep learning algorithms used when extracting features from sketches

    3D Scene Geometry Estimation from 360^\circ Imagery: A Survey

    Full text link
    This paper provides a comprehensive survey on pioneer and state-of-the-art 3D scene geometry estimation methodologies based on single, two, or multiple images captured under the omnidirectional optics. We first revisit the basic concepts of the spherical camera model, and review the most common acquisition technologies and representation formats suitable for omnidirectional (also called 360^\circ, spherical or panoramic) images and videos. We then survey monocular layout and depth inference approaches, highlighting the recent advances in learning-based solutions suited for spherical data. The classical stereo matching is then revised on the spherical domain, where methodologies for detecting and describing sparse and dense features become crucial. The stereo matching concepts are then extrapolated for multiple view camera setups, categorizing them among light fields, multi-view stereo, and structure from motion (or visual simultaneous localization and mapping). We also compile and discuss commonly adopted datasets and figures of merit indicated for each purpose and list recent results for completeness. We conclude this paper by pointing out current and future trends.Comment: Published in ACM Computing Survey

    Low complexity in-loop perceptual video coding

    Get PDF
    The tradition of broadcast video is today complemented with user generated content, as portable devices support video coding. Similarly, computing is becoming ubiquitous, where Internet of Things (IoT) incorporate heterogeneous networks to communicate with personal and/or infrastructure devices. Irrespective, the emphasises is on bandwidth and processor efficiencies, meaning increasing the signalling options in video encoding. Consequently, assessment for pixel differences applies uniform cost to be processor efficient, in contrast the Human Visual System (HVS) has non-uniform sensitivity based upon lighting, edges and textures. Existing perceptual assessments, are natively incompatible and processor demanding, making perceptual video coding (PVC) unsuitable for these environments. This research allows existing perceptual assessment at the native level using low complexity techniques, before producing new pixel-base image quality assessments (IQAs). To manage these IQAs a framework was developed and implemented in the high efficiency video coding (HEVC) encoder. This resulted in bit-redistribution, where greater bits and smaller partitioning were allocated to perceptually significant regions. Using a HEVC optimised processor the timing increase was < +4% and < +6% for video streaming and recording applications respectively, 1/3 of an existing low complexity PVC solution. Future work should be directed towards perceptual quantisation which offers the potential for perceptual coding gain

    Cortical Dynamics of 3-D Vision and Figure-Ground Pop-Out

    Full text link
    Air Force Office of Scientific Research (90-0175); Defense Advanced Research Projects Agency (90-0083); Office of Naval Research (N00014-91-J-4100
    corecore