4,517 research outputs found

    Neural 3D Mesh Renderer

    Full text link
    For modeling the 3D world behind 2D images, which 3D representation is most appropriate? A polygon mesh is a promising candidate for its compactness and geometric properties. However, it is not straightforward to model a polygon mesh from 2D images using neural networks because the conversion from a mesh to an image, or rendering, involves a discrete operation called rasterization, which prevents back-propagation. Therefore, in this work, we propose an approximate gradient for rasterization that enables the integration of rendering into neural networks. Using this renderer, we perform single-image 3D mesh reconstruction with silhouette image supervision and our system outperforms the existing voxel-based approach. Additionally, we perform gradient-based 3D mesh editing operations, such as 2D-to-3D style transfer and 3D DeepDream, with 2D supervision for the first time. These applications demonstrate the potential of the integration of a mesh renderer into neural networks and the effectiveness of our proposed renderer.
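    The obstacle the paper targets is that the discrete inside/outside test of rasterization has zero gradient almost everywhere. Below is a minimal sketch of the general idea (smoothing that test so gradients can flow from pixels back to vertices) using a sigmoid relaxation; this is not the paper's actual approximate-gradient scheme, and the triangle, image size and sharpness parameter are illustrative assumptions.

```python
# Minimal sketch: a differentiable "soft" silhouette of a single 2D triangle.
# A sigmoid relaxation of the half-plane test stands in for the paper's
# approximate-gradient formulation.
import torch

def soft_triangle_silhouette(verts, H=64, W=64, sigma=0.01):
    """verts: (3, 2) CCW triangle vertices in [0, 1]^2.
    Returns an (H, W) occupancy map in (0, 1), differentiable w.r.t. verts."""
    ys, xs = torch.meshgrid(torch.linspace(0, 1, H),
                            torch.linspace(0, 1, W), indexing="ij")
    p = torch.stack([xs, ys], dim=-1)              # (H, W, 2) pixel centres
    occ = torch.ones(H, W)
    for i in range(3):
        a, b = verts[i], verts[(i + 1) % 3]
        e = b - a
        # signed distance of every pixel to the edge line (positive = inside)
        d = (e[0] * (p[..., 1] - a[1]) - e[1] * (p[..., 0] - a[0])) / e.norm()
        occ = occ * torch.sigmoid(d / sigma)       # smooth half-plane test
    return occ

verts = torch.tensor([[0.2, 0.2], [0.8, 0.3], [0.5, 0.8]], requires_grad=True)
loss = soft_triangle_silhouette(verts).mean()      # toy silhouette loss
loss.backward()                                    # gradients reach the vertices
```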

    Spherical Regression: Learning Viewpoints, Surface Normals and 3D Rotations on n-Spheres

    Get PDF
    Many computer vision challenges require continuous outputs, but tend to be solved by discrete classification. The reason is classification's natural containment within a probability $n$-simplex, as defined by the popular softmax activation function. Regular regression lacks such a closed geometry, leading to unstable training and convergence to suboptimal local minima. Starting from this insight we revisit regression in convolutional neural networks. We observe that many continuous output problems in computer vision are naturally contained in closed geometrical manifolds, like the Euler angles in viewpoint estimation or the normals in surface normal estimation. A natural framework for posing such continuous output problems are $n$-spheres, which are naturally closed geometric manifolds defined in the $\mathbb{R}^{(n+1)}$ space. By introducing a spherical exponential mapping on $n$-spheres at the regression output, we obtain well-behaved gradients, leading to stable training. We show how our spherical regression can be utilized for several computer vision challenges, specifically viewpoint estimation, surface normal estimation and 3D rotation estimation. For all these problems our experiments demonstrate the benefit of spherical regression. All paper resources are available at https://github.com/leoshine/Spherical_Regression.
    Comment: CVPR 2019 camera ready
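    The central construction, the spherical exponential mapping, amounts to projecting raw activations onto the unit $n$-sphere. A minimal sketch under that reading follows; note that the full method also predicts the sign of each coordinate with a separate classification branch, which is omitted here.

```python
import torch
import torch.nn as nn

class SphericalExp(nn.Module):
    """Sketch of a spherical-exponential regression head: raw activations o
    are mapped onto the positive orthant of the unit n-sphere via
    p_i = exp(o_i) / sqrt(sum_j exp(2 o_j)), giving bounded outputs and
    well-behaved gradients.  The sign-classification branch of the full
    method is omitted."""
    def forward(self, o):
        e = torch.exp(o)
        return e / e.norm(dim=-1, keepdim=True)

head = SphericalExp()
q = head(torch.randn(8, 4))        # e.g. quaternion magnitudes for 3D rotation
print(q.norm(dim=-1))              # all ones: outputs lie on the 3-sphere
```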

    Learning Descriptors for Object Recognition and 3D Pose Estimation

    Full text link
    Detecting poorly textured objects and estimating their 3D pose reliably is still a very challenging problem. We introduce a simple but powerful approach to computing descriptors for object views that efficiently capture both the object identity and 3D pose. By contrast with previous manifold-based approaches, we can rely on the Euclidean distance to evaluate the similarity between descriptors, and therefore use scalable Nearest Neighbor search methods to efficiently handle a large number of objects under a large range of poses. To achieve this, we train a Convolutional Neural Network to compute these descriptors by enforcing simple similarity and dissimilarity constraints between the descriptors. We show that our constraints nicely untangle the images from different objects and different views into clusters that are not only well-separated but also structured as the corresponding sets of poses: the Euclidean distance between descriptors is large when the descriptors are from different objects, and directly related to the distance between the poses when the descriptors are from the same object. These important properties allow us to outperform state-of-the-art object view representations on challenging RGB and RGB-D data.
    Comment: CVPR 2015
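    The constraints described above can be written as a loss over triplets (an anchor, a same-object view under a nearby pose, and either a different object or a distant pose) plus a pair term. Here is a hedged sketch in that spirit; the margin value is an illustrative assumption.

```python
import torch
import torch.nn.functional as F

def descriptor_loss(anchor, puller, pusher, margin=0.01):
    """Sketch of the similarity/dissimilarity constraints: 'puller' shows the
    same object under a nearby pose, 'pusher' a different object or a distant
    pose.  Plain Euclidean distances are used throughout, which is what makes
    scalable nearest-neighbour search over the descriptors possible."""
    d_pos = (anchor - puller).pow(2).sum(dim=1)
    d_neg = (anchor - pusher).pow(2).sum(dim=1)
    # triplet term: the dissimilar sample must be farther away by a margin
    l_triplet = F.relu(1.0 - d_neg / (d_pos + margin)).mean()
    # pair term: descriptors of the same view under nuisance changes coincide
    l_pair = d_pos.mean()
    return l_triplet + l_pair
```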

    Cognitive Robots for Social Interactions

    Get PDF
    One of my goals is to work towards developing Cognitive Robots, especially with regard to improving the functionalities that facilitate interaction with human beings and their surrounding objects. Any cognitive system designed to serve human beings must be capable of processing social signals and eventually enable efficient prediction and planning of appropriate responses. My main focus during my PhD study is to bridge the gap between the motoric space and the visual space. The discovery of mirror neurons ([RC04]) shows that the visual perception of human motion (visual space) is directly associated with the motor control of the human body (motor space). This discovery poses a large number of challenges in different fields such as computer vision, robotics and neuroscience. One of the fundamental challenges is understanding the mapping between the 2D visual space and 3D motoric control, and further developing building blocks (primitives) of human motion in the visual space as well as in the motor space. First, I present my study on the visual-motoric mapping of human actions, which aims at mapping human actions in 2D videos to a 3D skeletal representation. Second, I present an automatic algorithm to decompose motion capture (MoCap) sequences into synergies, along with the times at which they are executed (or "activated") for each joint. Third, I propose using Granger Causality as a tool to study coordinated actions performed by at least two units; recent scientific studies suggest that the above "action mirroring circuit" might be tuned to action coordination rather than single-action mirroring. Fourth, I present the extraction of key poses in the visual space, which facilitates further study of the "action mirroring circuit". I conclude the dissertation by describing the future of cognitive robotics research.
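    As a concrete illustration of the third contribution, Granger causality between two motion time series can be tested with standard tooling. The series, coupling and lag order below are illustrative assumptions, not data from the dissertation.

```python
# Does one actor's joint trajectory help predict another's?  A positive
# Granger test is used as a proxy for coordinated action between two units.
import numpy as np
from statsmodels.tsa.stattools import grangercausalitytests

rng = np.random.default_rng(0)
leader = rng.standard_normal(500)          # e.g. one actor's wrist angle
follower = np.zeros(500)                   # second actor reacts 2 frames later
follower[2:] = 0.9 * leader[:-2] + 0.1 * rng.standard_normal(498)

# Column order matters: the test asks whether the SECOND column helps
# predict the FIRST beyond the first's own history.
results = grangercausalitytests(np.column_stack([follower, leader]), maxlag=4)
```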

    Computer Aided ECG Analysis - State of the Art and Upcoming Challenges

    Full text link
    In this paper we present current achievements in computer aided ECG analysis and their applicability in the real-world medical diagnosis process. Most of the current work covers the problems of removing noise, detecting heartbeats and rhythm-based analysis. There are some advancements in the detection of particular ECG segments and in beat classification, but with limited evaluation and without clinical approval. This paper presents state-of-the-art advancements in these areas to date. Besides this short computer science and signal processing literature review, the paper covers future challenges regarding ECG signal morphology analysis derived from a review of the medical literature. The paper concludes with identified gaps in current advancements and testing, upcoming challenges for future research, and a suggested bullseye test for morphology analysis evaluation.
    Comment: 7 pages, 3 figures, IEEE EUROCON 2013 International conference on computer as a tool, 1-4 July 2013, Zagreb, Croatia
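    For the heartbeat-detection building block the survey covers, a minimal sketch is shown below; the sampling rate, band edges and thresholds are illustrative assumptions rather than values from the paper.

```python
# Band-pass filter a single-lead ECG to emphasize the QRS complex, then pick
# R peaks with an amplitude threshold and a 200 ms refractory distance.
import numpy as np
from scipy.signal import butter, filtfilt, find_peaks

def detect_r_peaks(ecg, fs=360.0):
    """Return sample indices of R peaks in a raw single-lead ECG trace."""
    b, a = butter(3, [5 / (fs / 2), 15 / (fs / 2)], btype="band")
    filtered = filtfilt(b, a, ecg)
    peaks, _ = find_peaks(filtered,
                          height=2.0 * np.std(filtered),
                          distance=int(0.2 * fs))
    return peaks

# Heart rate then follows from the inter-beat intervals:
# bpm = 60.0 * fs / np.diff(detect_r_peaks(signal, fs))
```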

    Vision-based 3D Pose Retrieval and Reconstruction

    Get PDF
    The analysis of people and the understanding of their motions are key components in many applications such as sports science, biomechanics, medical rehabilitation, animated movie production and the game industry. In this context, the retrieval and reconstruction of articulated 3D human poses are significant sub-elements. In this dissertation, we address the problem of retrieving and reconstructing 3D poses from a monocular video or even from a single RGB image. We propose several data-driven pipelines to retrieve and reconstruct 3D poses by exploiting motion capture data as a prior. The main focus of our proposed approaches is to bridge the gap between the separate media of 3D marker-based recording and the capturing of motions or photographs with a simple RGB camera. In principle, we leverage both media together for efficient 3D pose estimation. We show that our proposed methodologies do not need any synchronized 3D-2D pose-image pairs to retrieve and reconstruct the final 3D poses, and are flexible enough to capture motion in any studio-like indoor environment or outdoor natural environment.

    In the first part of the dissertation, we propose model-based approaches for full-body human motion reconstruction from video input by employing just the 2D joint positions of the four end effectors and the head. We resolve the 3D-2D pose-image cross-model correspondence by developing an intermediate container, the knowledge base, from the motion capture data, which contains information about how people move. It includes the 3D normalized pose space and the corresponding synchronized 2D normalized pose space, created by utilizing a number of virtual cameras. We first detect and track the features of these five joints in the input motion sequences using SURF, MSER and colorMSER feature detectors, which vote for the possible 2D locations of these joints in the video. The extraction of suitable feature sets from both the input control signals and the motion capture data enables us to retrieve the closest instances from the motion capture dataset using fast search and retrieval techniques. We develop a graphical structure, the online lazy neighbourhood graph, to make the similarity search more accurate and robust by exploiting the temporal coherence of the input control signals. The retrieved prior poses are further exploited to stabilize the feature detection and tracking process. Finally, the 3D motion sequences are reconstructed by a non-linear optimizer that takes multiple energy terms into account. We evaluate our approaches with a series of experimental scenarios designed in terms of performing actors, camera viewpoints and noisy inputs. Our methods need only a little preprocessing, and the reconstruction processes run close to real time.

    The second part of the dissertation is dedicated to 3D human pose estimation from a single monocular image. First, we propose an efficient 3D pose retrieval strategy which leads to a novel data-driven approach for reconstructing a 3D human pose from a monocular still image. We design multiple feature sets for global similarity search. At runtime, we search for similar poses in a motion capture dataset within a definite feature space made up of specific joints. We introduce a two-fold method for camera estimation, in which we exploit both the view directions at which we sample the MoCap dataset and the MoCap priors to minimize the projection error. We also benefit from the MoCap priors and the joints' weights to learn a low-dimensional local 3D pose model, which is further constrained by multiple energies to infer the final 3D human pose. We thoroughly evaluate our approach on synthetically generated examples, real internet images and hand-drawn sketches. We achieve state-of-the-art results when the test and MoCap data come from the same dataset, and competitive results when the motion capture data is taken from a different dataset. Second, we propose a dual-source approach for 3D pose estimation from a single RGB image. One major challenge for 3D pose estimation from a single RGB image is the acquisition of sufficient training data; in particular, collecting large amounts of training data that contain unconstrained images annotated with accurate 3D poses is infeasible. We therefore propose to use two independent training sources: the first consists of images with annotated 2D poses, and the second consists of accurate 3D motion capture data. To integrate both sources, we propose a dual-source approach that combines 2D pose estimation with efficient and robust 3D pose retrieval. In our experiments, we show that our approach achieves state-of-the-art results and remains competitive even when the skeleton structures of the two sources differ substantially.

    In the last part of the dissertation, we focus on how the different techniques developed for human motion capture, retrieval and reconstruction can be adapted to quadruped motion capture data, and which new applications may appear. We discuss some particularities that must be considered when capturing the motions of large animals. For retrieval, we derive suitable feature sets to perform fast searches in the MoCap dataset for similar motion segments. At the end, we present a data-driven approach to reconstruct quadruped motions from video input data.
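    The retrieval core shared by these pipelines (normalize the 2D input pose, then query a precomputed index of MoCap poses projected through many virtual cameras) can be sketched as follows; the array shapes, joint count and normalization are illustrative assumptions.

```python
# k-NN retrieval of prior poses from a MoCap-derived knowledge base.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def normalize_pose_2d(joints):
    """Centre and scale-normalize a (J, 2) pose so retrieval is invariant to
    image position and subject size, then flatten it into a feature vector."""
    centred = joints - joints.mean(axis=0)
    return (centred / np.linalg.norm(centred)).ravel()

# knowledge base: 2D projections of MoCap poses through many virtual cameras
mocap_2d = np.random.rand(10000, 15, 2)            # stand-in for real MoCap data
index = NearestNeighbors(n_neighbors=8).fit(
    np.stack([normalize_pose_2d(p) for p in mocap_2d]))

query = normalize_pose_2d(np.random.rand(15, 2))   # pose detected in the video
dists, nn_ids = index.kneighbors(query[None])      # 8 candidate prior poses
```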

    An overview of view-based 2D-3D indexing methods

    Get PDF
    This paper proposes a comprehensive overview of state-of-the-art 2D/3D view-based indexing methods. The principle of 2D/3D indexing methods consists of describing 3D models by means of a set of 2D shape descriptors, associated with a set of corresponding 2D views (under the assumption of a given projection model). Notably, such an approach makes it possible to identify 3D objects of interest from 2D images/videos. An experimental evaluation is also proposed, in order to examine the influence of the number of views and of the associated viewing-angle selection strategies on the retrieval results. Experiments concern both 3D model retrieval and image recognition from a single view. The results obtained show promising performance, with recognition rates from a single view higher than 66%, which opens interesting perspectives in terms of semantic metadata extraction from still images/videos.
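    The indexing principle itself is compact: represent each 3D model by 2D shape descriptors of its rendered views, and recognize a query image by its nearest view. Here is a sketch with Hu moments standing in, as an assumption, for the many 2D descriptors such surveys compare.

```python
# View-based 2D/3D indexing: each model is a set of view descriptors; a query
# image matches the model owning its nearest view.
import cv2
import numpy as np

def view_descriptor(silhouette):
    """7 log-scaled Hu moments of a binary (H, W) uint8 view silhouette."""
    hu = cv2.HuMoments(cv2.moments(silhouette, binaryImage=True)).ravel()
    return -np.sign(hu) * np.log10(np.abs(hu) + 1e-30)

def index_model(views):
    """Descriptor matrix (V, 7) for one model's set of rendered views."""
    return np.stack([view_descriptor(v) for v in views])

def recognize(query_silhouette, models):
    """models: dict model_id -> (V, 7) descriptor matrix.  Returns the id of
    the model whose closest view best matches the query image."""
    d = view_descriptor(query_silhouette)
    return min(models, key=lambda m: np.linalg.norm(models[m] - d, axis=1).min())
```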

    View Selection with Geometric Uncertainty Modeling

    Full text link
    Estimating positions of world points from features observed in images is a key problem in 3D reconstruction, image mosaicking, simultaneous localization and mapping and structure from motion. We consider a special instance in which there is a dominant ground plane $\mathcal{G}$ viewed from a parallel viewing plane $\mathcal{S}$ above it. Such instances commonly arise, for example, in aerial photography. Consider a world point $g \in \mathcal{G}$ and its worst-case reconstruction uncertainty $\varepsilon(g,\mathcal{S})$ obtained by merging \emph{all} possible views of $g$ chosen from $\mathcal{S}$. We first show that one can pick two views $s_p$ and $s_q$ such that the uncertainty $\varepsilon(g,\{s_p,s_q\})$ obtained using only these two views is almost as good as (i.e. within a small constant factor of) $\varepsilon(g,\mathcal{S})$. Next, we extend the result to the entire ground plane $\mathcal{G}$ and show that one can pick a small subset $\mathcal{S}' \subseteq \mathcal{S}$ (which grows only linearly with the area of $\mathcal{G}$) and still obtain a constant-factor approximation, for every point $g \in \mathcal{G}$, to the minimum worst-case estimate obtained by merging all views in $\mathcal{S}$. Finally, we present a multi-resolution view selection method which extends our techniques to non-planar scenes. We show that the method can produce rich and accurate dense reconstructions with a small number of views. Our results provide a view selection mechanism with provable performance guarantees which can drastically increase the speed of scene reconstruction algorithms. In addition to theoretical results, we demonstrate their effectiveness in an application where aerial imagery is used for monitoring farms and orchards.
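    The first claim is easy to probe numerically in a toy 2D version of the setting: cameras on a viewing line observe a ground point through rays with bounded angular error, and the worst-case error of the best pair is compared against merging all views. The noise model and all constants below are illustrative assumptions, not the paper's exact uncertainty model.

```python
# Toy 2D probe of the two-views-suffice claim: least-squares intersection of
# noisy rays, with the worst case approximated by sampling bounded noise.
import itertools
import numpy as np

rng = np.random.default_rng(1)
g = np.array([0.0, 0.0])                              # ground point on G
views = np.stack([np.linspace(-5, 5, 21), np.full(21, 4.0)], axis=1)  # S
DELTA = 0.01                                          # angular error bound

def reconstruct(subset, noise):
    """Least-squares intersection of the noisy rays through g."""
    A, b = np.zeros((2, 2)), np.zeros(2)
    for s, eps in zip(subset, noise):
        theta = np.arctan2(*(g - s)[::-1]) + eps      # perturbed ray angle
        n = np.array([-np.sin(theta), np.cos(theta)]) # ray normal
        A += np.outer(n, n)
        b += np.outer(n, n) @ s
    return np.linalg.solve(A, b)

def worst_case(subset, trials=300):
    return max(np.linalg.norm(reconstruct(subset,
               rng.uniform(-DELTA, DELTA, len(subset))) - g)
               for _ in range(trials))

eps_all = worst_case(views)                           # merge all 21 views
eps_pair = min(worst_case(np.stack(p))                # best single pair
               for p in itertools.combinations(views, 2))
print(f"all views: {eps_all:.4f}   best pair: {eps_pair:.4f}")
```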