6,648 research outputs found

    Joint Training of a Convolutional Network and a Graphical Model for Human Pose Estimation

    Full text link
    This paper proposes a new hybrid architecture that consists of a deep Convolutional Network and a Markov Random Field. We show how this architecture is successfully applied to the challenging problem of articulated human pose estimation in monocular images. The architecture can exploit structural domain constraints such as geometric relationships between body joint locations. We show that joint training of these two model paradigms improves performance and allows us to significantly outperform existing state-of-the-art techniques

    Learning Human Pose Estimation Features with Convolutional Networks

    Full text link
    This paper introduces a new architecture for human pose estimation using a multi- layer convolutional network architecture and a modified learning technique that learns low-level features and higher-level weak spatial models. Unconstrained human pose estimation is one of the hardest problems in computer vision, and our new architecture and learning schema shows significant improvement over the current state-of-the-art results. The main contribution of this paper is showing, for the first time, that a specific variation of deep learning is able to outperform all existing traditional architectures on this task. The paper also discusses several lessons learned while researching alternatives, most notably, that it is possible to learn strong low-level feature detectors on features that might even just cover a few pixels in the image. Higher-level spatial models improve somewhat the overall result, but to a much lesser extent then expected. Many researchers previously argued that the kinematic structure and top-down information is crucial for this domain, but with our purely bottom up, and weak spatial model, we could improve other more complicated architectures that currently produce the best results. This mirrors what many other researchers, like those in the speech recognition, object recognition, and other domains have experienced

    Multi-view Convolutional Neural Networks for 3D Shape Recognition

    Full text link
    A longstanding question in computer vision concerns the representation of 3D shapes for recognition: should 3D shapes be represented with descriptors operating on their native 3D formats, such as voxel grid or polygon mesh, or can they be effectively represented with view-based descriptors? We address this question in the context of learning to recognize 3D shapes from a collection of their rendered views on 2D images. We first present a standard CNN architecture trained to recognize the shapes' rendered views independently of each other, and show that a 3D shape can be recognized even from a single view at an accuracy far higher than using state-of-the-art 3D shape descriptors. Recognition rates further increase when multiple views of the shapes are provided. In addition, we present a novel CNN architecture that combines information from multiple views of a 3D shape into a single and compact shape descriptor offering even better recognition performance. The same architecture can be applied to accurately recognize human hand-drawn sketches of shapes. We conclude that a collection of 2D views can be highly informative for 3D shape recognition and is amenable to emerging CNN architectures and their derivatives.Comment: v1: Initial version. v2: An updated ModelNet40 training/test split is used; results with low-rank Mahalanobis metric learning are added. v3 (ICCV 2015): A second camera setup without the upright orientation assumption is added; some accuracy and mAP numbers are changed slightly because a small issue in mesh rendering related to specularities is fixe

    Gaussian belief propagation for real-time decentralised inference

    Get PDF
    For embodied agents to interact intelligently with their surroundings, they require perception systems that construct persistent 3D representations of their environments. These representations must be rich; capturing 3D geometry, semantics, physical properties, affordances and much more. Constructing the environment representation from sensory observations is done via Bayesian probabilistic inference and in practical systems, inference must take place within the power, compactness and simplicity constraints of real products. Efficient inference within these constraints however remains computationally challenging and current systems often require heavy computational resources while delivering a fraction of the desired capabilities. Decentralised algorithms based on local message passing with in-place processing and storage offer a promising solution to current inference bottlenecks. They are well suited to take advantage of recent rapid developments in distributed asynchronous processing hardware to achieve efficient, scalable and low-power performance. In this thesis, we argue for Gaussian belief propagation (GBP) as a strong algorithmic framework for distributed, generic and incremental probabilistic estimation. GBP operates by passing messages between the nodes on a factor graph and can converge with arbitrary asynchronous message schedules. We envisage the factor graph being the fundamental master environment representation, and GBP the flexible inference tool to compute local in-place probabilistic estimates. In large real-time systems, GBP will act as the `glue' between specialised modules, with attention based processing bringing about local convergence in the graph in a just-in-time manner. This thesis contains several technical and theoretical contributions in the application of GBP to practical real-time inference problems in vision and robotics. Additionally, we implement GBP on novel graph processor hardware and demonstrate breakthrough speeds for bundle adjustment problems. Lastly, we present a prototype system for incrementally creating hierarchical abstract scene graphs by combining neural networks and probabilistic inference via GBP.Open Acces

    Distributed Robotic Vision for Calibration, Localisation, and Mapping

    Get PDF
    This dissertation explores distributed algorithms for calibration, localisation, and mapping in the context of a multi-robot network equipped with cameras and onboard processing, comparing against centralised alternatives where all data is transmitted to a singular external node on which processing occurs. With the rise of large-scale camera networks, and as low-cost on-board processing becomes increasingly feasible in robotics networks, distributed algorithms are becoming important for robustness and scalability. Standard solutions to multi-camera computer vision require the data from all nodes to be processed at a central node which represents a significant single point of failure and incurs infeasible communication costs. Distributed solutions solve these issues by spreading the work over the entire network, operating only on local calculations and direct communication with nearby neighbours. This research considers a framework for a distributed robotic vision platform for calibration, localisation, mapping tasks where three main stages are identified: an initialisation stage where calibration and localisation are performed in a distributed manner, a local tracking stage where visual odometry is performed without inter-robot communication, and a global mapping stage where global alignment and optimisation strategies are applied. In consideration of this framework, this research investigates how algorithms can be developed to produce fundamentally distributed solutions, designed to minimise computational complexity whilst maintaining excellent performance, and designed to operate effectively in the long term. Therefore, three primary objectives are sought aligning with these three stages

    Integrated Inference and Learning of Neural Factors in Structural Support Vector Machines

    Get PDF
    Tackling pattern recognition problems in areas such as computer vision, bioinformatics, speech or text recognition is often done best by taking into account task-specific statistical relations between output variables. In structured prediction, this internal structure is used to predict multiple outputs simultaneously, leading to more accurate and coherent predictions. Structural support vector machines (SSVMs) are nonprobabilistic models that optimize a joint input-output function through margin-based learning. Because SSVMs generally disregard the interplay between unary and interaction factors during the training phase, final parameters are suboptimal. Moreover, its factors are often restricted to linear combinations of input features, limiting its generalization power. To improve prediction accuracy, this paper proposes: (i) Joint inference and learning by integration of back-propagation and loss-augmented inference in SSVM subgradient descent; (ii) Extending SSVM factors to neural networks that form highly nonlinear functions of input features. Image segmentation benchmark results demonstrate improvements over conventional SSVM training methods in terms of accuracy, highlighting the feasibility of end-to-end SSVM training with neural factors

    Two Stream LSTM: A Deep Fusion Framework for Human Action Recognition

    Full text link
    In this paper we address the problem of human action recognition from video sequences. Inspired by the exemplary results obtained via automatic feature learning and deep learning approaches in computer vision, we focus our attention towards learning salient spatial features via a convolutional neural network (CNN) and then map their temporal relationship with the aid of Long-Short-Term-Memory (LSTM) networks. Our contribution in this paper is a deep fusion framework that more effectively exploits spatial features from CNNs with temporal features from LSTM models. We also extensively evaluate their strengths and weaknesses. We find that by combining both the sets of features, the fully connected features effectively act as an attention mechanism to direct the LSTM to interesting parts of the convolutional feature sequence. The significance of our fusion method is its simplicity and effectiveness compared to other state-of-the-art methods. The evaluation results demonstrate that this hierarchical multi stream fusion method has higher performance compared to single stream mapping methods allowing it to achieve high accuracy outperforming current state-of-the-art methods in three widely used databases: UCF11, UCFSports, jHMDB.Comment: Published as a conference paper at WACV 201

    FutureMapping 2: Gaussian Belief Propagation for Spatial AI

    Full text link
    We argue the case for Gaussian Belief Propagation (GBP) as a strong algorithmic framework for the distributed, generic and incremental probabilistic estimation we need in Spatial AI as we aim at high performance smart robots and devices which operate within the constraints of real products. Processor hardware is changing rapidly, and GBP has the right character to take advantage of highly distributed processing and storage while estimating global quantities, as well as great flexibility. We present a detailed tutorial on GBP, relating to the standard factor graph formulation used in robotics and computer vision, and give several simulation examples with code which demonstrate its properties

    Temporally coherent 4D reconstruction of complex dynamic scenes

    Get PDF
    This paper presents an approach for reconstruction of 4D temporally coherent models of complex dynamic scenes. No prior knowledge is required of scene structure or camera calibration allowing reconstruction from multiple moving cameras. Sparse-to-dense temporal correspondence is integrated with joint multi-view segmentation and reconstruction to obtain a complete 4D representation of static and dynamic objects. Temporal coherence is exploited to overcome visual ambiguities resulting in improved reconstruction of complex scenes. Robust joint segmentation and reconstruction of dynamic objects is achieved by introducing a geodesic star convexity constraint. Comparative evaluation is performed on a variety of unstructured indoor and outdoor dynamic scenes with hand-held cameras and multiple people. This demonstrates reconstruction of complete temporally coherent 4D scene models with improved nonrigid object segmentation and shape reconstruction.Comment: To appear in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2016 . Video available at: https://www.youtube.com/watch?v=bm_P13_-Ds
    corecore