    Learning to Deblur and Rotate Motion-Blurred Faces

    We propose a solution to the novel task of rendering sharp videos from new viewpoints given a single motion-blurred image of a face. Our method handles the complexity of face blur by implicitly learning the geometry and motion of faces through joint training on three large datasets: FFHQ and 300VW, which are publicly available, and a new Bern Multi-View Face Dataset (BMFD) that we built. The first two datasets provide a large variety of faces and allow our model to generalize better. BMFD instead allows us to introduce multi-view constraints, which are crucial to synthesizing sharp videos from a new camera view. It consists of high frame rate synchronized videos from multiple views of several subjects displaying a wide range of facial expressions. We use the high frame rate videos to simulate realistic motion blur through averaging. Thanks to this dataset, we train a neural network to reconstruct a 3D video representation from a single image and the corresponding face gaze. We then provide a camera viewpoint relative to the estimated gaze, together with the blurry image, as input to an encoder-decoder network that generates a video of sharp frames from the novel camera viewpoint. We demonstrate our approach on test subjects of our multi-view dataset and on VIDTIMIT.
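
    The abstract states that realistic blur is simulated by averaging consecutive frames of the high frame rate videos. A minimal numpy sketch of that idea follows; the burst length and the assumption that frames are already in linear intensity are mine, not details given in the paper.

```python
import numpy as np

def simulate_motion_blur(burst):
    """Average a burst of sharp, high-frame-rate frames to approximate
    a single motion-blurred exposure.

    burst: array of shape (T, H, W, 3), values in [0, 1], ideally in
    linear intensity so the average mimics sensor integration over the
    exposure window.
    """
    return burst.astype(np.float32).mean(axis=0)

# Example: 15 consecutive sharp frames -> one blurry image (random data here).
frames = np.random.rand(15, 256, 256, 3).astype(np.float32)
blurry = simulate_motion_blur(frames)   # shape (256, 256, 3)
```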

    Multiview Discriminative Geometry Preserving Projection for Image Classification

    In many image classification applications, it is common to extract multiple visual features from different views to describe an image. Since different visual features have their own specific statistical properties and discriminative power for image classification, the conventional solution for multiple-view data is to concatenate these feature vectors into a new feature vector. However, this simple concatenation strategy not only ignores the complementary nature of different views but also suffers from the “curse of dimensionality.” To address this problem, in this paper we propose a novel multiview subspace learning algorithm, named multiview discriminative geometry preserving projection (MDGPP), for feature extraction and classification. MDGPP not only preserves the intraclass geometry and interclass discrimination information under a single view, but also exploits the complementary properties of different views to obtain a low-dimensional optimal consensus embedding via an alternating-optimization-based iterative algorithm. Experimental results on face recognition and facial expression recognition demonstrate the effectiveness of the proposed algorithm.
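
    The abstract describes an alternating-optimization procedure that balances per-view structure against a shared low-dimensional consensus embedding, but it does not give the update equations. The sketch below is therefore only a generic, unsupervised illustration of that alternating pattern (weighted per-view kernels, consensus eigen-embedding, view re-weighting); it is not MDGPP itself, and every name, the linear-kernel choice, and the re-weighting heuristic are assumptions.

```python
import numpy as np

def consensus_embedding(views, dim=2, gamma=2.0, iters=10):
    """Illustrative alternating optimization for a multiview consensus
    embedding (a stand-in for the kind of loop the abstract mentions).

    views : list of (n, d_v) feature matrices, one per view.
    dim   : target embedding dimensionality.
    gamma : > 1, controls how sharply views are re-weighted.
    """
    n = views[0].shape[0]
    H = np.eye(n) - np.ones((n, n)) / n           # centering matrix
    kernels = []
    for X in views:
        K = H @ (X @ X.T) @ H                     # centered linear kernel
        kernels.append(K / np.trace(K))           # scale-normalize each view
    weights = np.ones(len(views)) / len(views)

    for _ in range(iters):
        # Step 1: with view weights fixed, embed the weighted combined kernel.
        K = sum(w * Kv for w, Kv in zip(weights, kernels))
        vals, vecs = np.linalg.eigh(K)
        Y = vecs[:, -dim:] * np.sqrt(np.maximum(vals[-dim:], 0.0))
        # Step 2: with the embedding fixed, re-weight each view by how
        # well its kernel agrees with the consensus embedding.
        align = np.array([np.trace(Y.T @ Kv @ Y) for Kv in kernels])
        weights = align ** (1.0 / (gamma - 1.0))
        weights /= weights.sum()
    return Y, weights
```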

    Exploiting Multimodal Information in Deep Learning

    Humans are good at using multimodal information to perceive and interact with the world; such information includes visual, auditory, and kinesthetic signals. Despite the advances in single-modality deep learning over the past decade, relatively few works have focused on multimodal learning, and most existing multimodal deep learning works consider only a small number of modalities. This dissertation investigates several distinct forms of multimodal learning: multiple visual modalities as input, audio-visual multimodal input, and visual and proprioceptive (kinesthetic) multimodal input. Specifically, in the first project we investigate synthesizing light fields from a single image and estimated depth. In the second project, we investigate face recognition for unconstrained videos with audio-visual multimodal inputs. Finally, we investigate learning to construct and use tools with visual, proprioceptive and kinesthetic multimodal inputs. In the first task, we investigate synthesizing light fields from a single RGB image and its estimated depth. Synthesizing novel views (light fields) from a single image is very challenging since the depth information, which is crucial for view synthesis, is lost. We propose to use a pre-trained model to estimate the depth and then fuse the depth information with the RGB image to generate the light fields. Our experiments show that the multimodal input (RGB image and depth) significantly improves performance over a single-image input. In the second task, we focus on face recognition for low-quality videos. For low-quality videos such as low-resolution online videos and surveillance videos, recognizing faces from video frames alone is very challenging. We propose to use the audio information in the video clip to aid the face recognition task. To this end, we propose the Audio-Visual Aggregation Network (AVAN), which aggregates audio features and visual features using an attention mechanism. Empirical results show that our approach using both visual and audio information significantly improves face recognition accuracy on unconstrained videos. Finally, in the third task, we propose to use visual, proprioceptive and kinesthetic inputs to learn to construct and use tools. Tool use in animals indicates a high level of cognitive capability and, aside from humans, is observed only in a small number of higher mammals and avian species; constructing novel tools is an even more challenging task. Learning this task from visual input alone is difficult, so we propose to use visual and proprioceptive (kinesthetic) inputs to accelerate the learning. We build a physically simulated environment for the tool construction task and introduce a hierarchical reinforcement learning approach to learn to construct tools and reach the target without any prior knowledge. The main contribution of this dissertation is the investigation of multiple scenarios where multimodal processing leads to enhanced performance. We expect the specific methods developed in this work, such as the extraction of hidden modalities (depth), the use of attention, and hierarchical rewards, to help us better understand multimodal processing in deep learning.
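
    AVAN's architecture is not detailed in the abstract beyond aggregating audio and visual features with an attention mechanism, so the following numpy sketch only illustrates that general idea: an audio-derived query attends over per-frame visual features, and the pooled result is fused with the audio feature. The projection matrices, the scaled-dot-product scoring, and the concatenation fusion are all assumptions.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def attention_aggregate(frame_feats, audio_feat, w_q, w_k):
    """Pool per-frame visual features into one clip descriptor, letting
    the audio feature decide how much each frame contributes.

    frame_feats : (T, d_v) visual features, one per video frame.
    audio_feat  : (d_a,)   clip-level audio feature.
    w_q, w_k    : (d_a, d) and (d_v, d) projections into a shared
                  attention space (hypothetical learned parameters).
    """
    q = audio_feat @ w_q                       # (d,)   query from audio
    k = frame_feats @ w_k                      # (T, d) keys from frames
    scores = softmax(k @ q / np.sqrt(q.size))  # (T,)   frame weights
    clip_visual = scores @ frame_feats         # (d_v,) weighted pooling
    return np.concatenate([clip_visual, audio_feat]), scores

# Example with random features and parameters.
rng = np.random.default_rng(0)
fused, weights = attention_aggregate(
    rng.normal(size=(8, 512)),   # 8 frames, 512-d visual features
    rng.normal(size=128),        # 128-d audio feature
    rng.normal(size=(128, 64)),
    rng.normal(size=(512, 64)),
)
```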

    Hyperparameter-free losses for model-based monocular reconstruction

    This work proposes novel hyperparameter-free losses for single-view 3D reconstruction with morphable models (3DMM). We dispense with the hyperparameters used in other works by exploiting geometry, so that the shape of the object and the camera pose are jointly optimized in a single-term expression. This simplification reduces the optimization time and its complexity. Moreover, we propose a novel implicit regularization technique based on random virtual projections that does not require additional 2D or 3D annotations. Our experiments suggest that minimizing a shape reprojection error together with the proposed implicit regularization is especially suitable for applications that require precise alignment between geometry and image spaces, such as augmented reality. We evaluate our losses on a large-scale dataset with 3D ground truth and publish our implementations to facilitate reproducibility and public benchmarking in this field.
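
    The abstract's central quantity is a shape reprojection error that ties 3D geometry to the image, but its exact single-term formulation is not given; the numpy sketch below only shows a generic version of such an error. The pinhole model, the mean-squared form, and all argument names are assumptions, not the paper's loss.

```python
import numpy as np

def reprojection_error(shape_3d, landmarks_2d, K, R, t):
    """Mean squared distance between projected 3D model points and
    their detected 2D image landmarks.

    shape_3d     : (N, 3) model points, e.g. selected 3DMM vertices.
    landmarks_2d : (N, 2) detected landmarks in pixel coordinates.
    K            : (3, 3) camera intrinsics.
    R, t         : (3, 3) rotation and (3,) translation (camera pose).
    """
    cam = shape_3d @ R.T + t            # world -> camera coordinates
    proj = cam @ K.T                    # pinhole projection
    uv = proj[:, :2] / proj[:, 2:3]     # perspective divide
    return np.mean(np.sum((uv - landmarks_2d) ** 2, axis=1))
```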