43,596 research outputs found

    Self-Supervised Relative Depth Learning for Urban Scene Understanding

    Full text link
    As an agent moves through the world, the apparent motion of scene elements is (usually) inversely proportional to their depth. It is natural for a learning agent to associate image patterns with the magnitude of their displacement over time: as the agent moves, faraway mountains don't move much; nearby trees move a lot. This natural relationship between the appearance of objects and their motion is a rich source of information about the world. In this work, we start by training a deep network, using fully automatic supervision, to predict relative scene depth from single images. The relative depth training images are automatically derived from simple videos of cars moving through a scene, using recent motion segmentation techniques and no human-provided labels. This proxy task of predicting relative depth from a single image induces features in the network that yield large improvements over a network trained from scratch on a set of downstream tasks including semantic segmentation, joint road segmentation and car detection, and monocular (absolute) depth estimation. The improvement on the semantic segmentation task is greater than that produced by any other automatically supervised method. Moreover, for monocular depth estimation, our unsupervised pre-training method even outperforms supervised pre-training with ImageNet. In addition, we demonstrate benefits from learning to predict (unsupervised) relative depth in the specific videos associated with various downstream tasks, adapting to the scenes in those tasks in an unsupervised manner to improve performance. In summary, for semantic segmentation, we present state-of-the-art results among methods that do not use supervised pre-training, and we even exceed the performance of supervised ImageNet pre-trained models for monocular depth estimation, achieving results that are comparable with state-of-the-art methods.
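    A minimal sketch of the core relationship the proxy task exploits: under a roughly translating camera, larger apparent motion means smaller depth, so an inverted, normalized flow magnitude can serve as a relative (ordinal) depth signal. This is an illustration only, not the paper's pipeline; the Farneback flow call and the function name are assumptions.

        # Illustrative sketch (not the paper's pipeline): turn optical-flow
        # magnitude into a relative-depth proxy, assuming a roughly
        # translating camera so that apparent motion ~ 1 / depth.
        import cv2
        import numpy as np

        def relative_depth_from_flow(frame_t, frame_t1):
            """Per-pixel relative depth in [0, 1] (1 = far, 0 = near)."""
            gray_t = cv2.cvtColor(frame_t, cv2.COLOR_BGR2GRAY)
            gray_t1 = cv2.cvtColor(frame_t1, cv2.COLOR_BGR2GRAY)
            # Dense flow between consecutive frames (Farneback as a stand-in
            # for the motion-segmentation signal used in the paper).
            flow = cv2.calcOpticalFlowFarneback(gray_t, gray_t1, None,
                                                0.5, 3, 15, 3, 5, 1.2, 0)
            magnitude = np.linalg.norm(flow, axis=2)
            # Larger apparent motion -> nearer surface; invert and normalize
            # to obtain an ordinal depth target for training.
            inv = 1.0 / (magnitude + 1e-6)
            return (inv - inv.min()) / (inv.max() - inv.min() + 1e-6)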

    Real-Time Human Motion Capture with Multiple Depth Cameras

    Full text link
    Commonly used human motion capture systems require intrusive attachment of markers that are visually tracked with multiple cameras. In this work we present an efficient and inexpensive solution to markerless motion capture using only a few Kinect sensors. Unlike previous work on 3D pose estimation using a single depth camera, we relax constraints on the camera location and do not assume a co-operative user. We apply recent image segmentation techniques to depth images and use curriculum learning to train our system on purely synthetic data. Our method accurately localizes body parts without requiring an explicit shape model. The body joint locations are then recovered by combining evidence from multiple views in real time. We also introduce a dataset of ~6 million synthetic depth frames for pose estimation from multiple cameras and exceed state-of-the-art results on the Berkeley MHAD dataset. Comment: Accepted to Computer and Robot Vision 201
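    As a rough illustration of the multi-view fusion step, per-view 3D joint estimates can be mapped into a common world frame and combined by confidence-weighted averaging. The sketch below assumes that setup; it is not the authors' exact fusion rule, and the argument names are illustrative.

        import numpy as np

        def fuse_joint_estimates(per_view_joints, per_view_conf, cam_to_world):
            """Confidence-weighted fusion of per-view 3D joint estimates.

            per_view_joints: list of (J, 3) arrays in each camera's frame.
            per_view_conf:   list of (J,) confidence arrays in [0, 1].
            cam_to_world:    list of (4, 4) camera-to-world transforms.
            Returns a (J, 3) array of fused world-space joint positions.
            """
            num_joints = per_view_joints[0].shape[0]
            acc = np.zeros((num_joints, 3))
            total = np.zeros((num_joints, 1))
            for joints, conf, T in zip(per_view_joints, per_view_conf, cam_to_world):
                homog = np.hstack([joints, np.ones((num_joints, 1))])  # (J, 4)
                world = (T @ homog.T).T[:, :3]                         # camera -> world
                acc += world * conf[:, None]
                total += conf[:, None]
            return acc / np.maximum(total, 1e-6)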

    Geometry meets semantics for semi-supervised monocular depth estimation

    Full text link
    Depth estimation from a single image represents a very exciting challenge in computer vision. While other image-based depth sensing techniques leverage the geometry between different viewpoints (e.g., stereo or structure from motion), the lack of these cues within a single image renders the monocular depth estimation task ill-posed. At inference time, state-of-the-art encoder-decoder architectures for monocular depth estimation rely on effective feature representations learned at training time. For unsupervised training of these models, geometry has been effectively exploited through image warping losses computed from views acquired by a stereo rig or a moving camera. In this paper, we take a further step forward, showing that learning semantic information from images can effectively improve monocular depth estimation as well. In particular, by leveraging semantically labeled images together with unsupervised signals gained from geometry through an image warping loss, we propose a deep learning approach aimed at joint semantic segmentation and depth estimation. Our overall learning framework is semi-supervised, as we deploy ground-truth data only in the semantic domain. At training time, our network learns a common feature representation for both tasks, and a novel cross-task loss function is proposed. The experimental findings show how jointly tackling depth prediction and semantic segmentation improves depth estimation accuracy. In particular, on the KITTI dataset our network outperforms state-of-the-art methods for monocular depth estimation. Comment: 16 pages, Accepted to ACCV 201
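    The unsupervised geometry signal described above is typically an image warping (photometric reconstruction) loss; combined with a supervised cross-entropy term for semantics it gives a semi-supervised joint objective. The PyTorch sketch below illustrates that combination under assumed names and weights; it is not the paper's exact cross-task loss.

        import torch
        import torch.nn.functional as F

        def warp_right_to_left(right_img, disparity):
            """Warp the right stereo image into the left view using predicted disparity (pixels)."""
            b, _, h, w = right_img.shape
            xs = torch.linspace(-1, 1, w, device=right_img.device).view(1, 1, w).expand(b, h, w)
            ys = torch.linspace(-1, 1, h, device=right_img.device).view(1, h, 1).expand(b, h, w)
            # Shift the horizontal sampling coordinate by the normalized disparity.
            grid = torch.stack([xs - 2.0 * disparity / w, ys], dim=3)
            return F.grid_sample(right_img, grid, align_corners=True)

        def semi_supervised_loss(left_img, right_img, disparity, seg_logits, seg_labels,
                                 w_photo=1.0, w_sem=0.1):
            """Unsupervised photometric term plus supervised semantic term."""
            reconstructed = warp_right_to_left(right_img, disparity)
            photometric = (reconstructed - left_img).abs().mean()
            semantic = F.cross_entropy(seg_logits, seg_labels, ignore_index=255)
            return w_photo * photometric + w_sem * semantic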

    Layered And Feature Based Image Segmentation Using Vector Filtering

    Get PDF
    A sensor is a device that reads an attribute and converts it into a signal that can be readily examined by an observer or instrument. Sensors are used in everyday objects such as touch-sensitive elevator buttons and road traffic monitoring systems, and each sensor offers distinctive capabilities. Many techniques have previously been presented for tracking the objects observed by a sensor; techniques that combine information from diverse sensors are commonly termed data fusion. Previous work addressed object tracking using a Multi-Phase Joint Segmentation-Registration (MP JSR) technique for layered images. Its drawbacks are that MP JSR cannot be applied to natural objects and that its object segmentation is inefficient. To overcome these issues, we present an efficient joint motion segmentation and registration framework with integrated layer-based and feature-based motion estimation for precise data fusion in real image sequences and tracking of objects of interest. Points of interest are segmented with vector filtering using random samples of motion frames to derive candidate regions. The experimental evaluation is conducted with real image sequence samples to assess the effectiveness of data fusion using integrated layer- and feature-based image segmentation and registration of motion frames, in terms of inter-frame prediction, image layers, and image clarity.
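    As a generic illustration of deriving candidate regions from motion vectors (not the MP JSR method or the exact vector-filtering step above), dense flow vectors can be clustered so that pixels with similar motion fall into the same candidate region. The function and parameter choices below are assumptions.

        import cv2
        import numpy as np
        from sklearn.cluster import KMeans

        def candidate_motion_regions(frame_t, frame_t1, n_regions=3):
            """Cluster dense motion vectors into candidate region labels."""
            gray_t = cv2.cvtColor(frame_t, cv2.COLOR_BGR2GRAY)
            gray_t1 = cv2.cvtColor(frame_t1, cv2.COLOR_BGR2GRAY)
            flow = cv2.calcOpticalFlowFarneback(gray_t, gray_t1, None,
                                                0.5, 3, 15, 3, 5, 1.2, 0)
            h, w = flow.shape[:2]
            vectors = flow.reshape(-1, 2)              # one (dx, dy) per pixel
            labels = KMeans(n_clusters=n_regions, n_init=10).fit_predict(vectors)
            return labels.reshape(h, w)                # per-pixel region labels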

    Automated Motion Analysis of Bony Joint Structures from Dynamic Computer Tomography Images: A Multi-Atlas Approach

    Get PDF
    Dynamic computer tomography (CT) is an emerging modality to analyze in vivo joint kinematics at the bone level, but it requires manual bone segmentation and, in some instances, landmark identification. The objective of this study is to present an automated workflow for the assessment of three-dimensional in vivo joint kinematics from dynamic musculoskeletal CT images. The proposed method relies on a multi-atlas, multi-label segmentation and landmark propagation framework to extract bony structures and detect anatomical landmarks on the CT dataset. The segmented structures serve as regions of interest for the subsequent motion estimation across the dynamic sequence. The landmarks are propagated across the dynamic sequence for the construction of bone-embedded reference frames from which kinematic parameters are estimated. We applied our workflow to dynamic CT images obtained from 15 healthy subjects on two different joints: thumb base (n = 5) and knee (n = 10). The proposed method resulted in segmentation accuracies of 0.90 ± 0.01 for the thumb dataset and 0.94 ± 0.02 for the knee, as measured by the Dice score coefficient. In terms of motion estimation, mean differences in Cardan angles between the automated algorithm and manual segmentation and landmark identification performed by an expert were below 1°. Intraclass correlation (ICC) between Cardan angles from the algorithm and results from expert manual landmarks ranged from 0.72 to 0.99 for all joints across all axes. The proposed automated method resulted in reproducible and reliable measurements, enabling the assessment of joint kinematics using 4DCT in clinical routine.
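    The two quantities reported above are standard: the Dice overlap between automatic and manual segmentations, and Cardan angles extracted from the relative rotation between bone-embedded reference frames. The sketch below is illustrative; the XYZ rotation sequence is an assumption, as the decomposition order depends on the joint convention.

        import numpy as np

        def dice_coefficient(seg_a, seg_b):
            """Dice overlap between two binary segmentation masks."""
            seg_a, seg_b = seg_a.astype(bool), seg_b.astype(bool)
            intersection = np.logical_and(seg_a, seg_b).sum()
            return 2.0 * intersection / (seg_a.sum() + seg_b.sum())

        def cardan_angles_xyz(R):
            """Cardan angles (degrees) from a 3x3 rotation matrix, R = Rx @ Ry @ Rz.

            R would typically be the relative rotation between two bone-embedded
            frames at one time point; the XYZ sequence is an assumed convention.
            """
            ry = np.arcsin(np.clip(R[0, 2], -1.0, 1.0))
            rx = np.arctan2(-R[1, 2], R[2, 2])
            rz = np.arctan2(-R[0, 1], R[0, 0])
            return np.degrees([rx, ry, rz])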

    U4D: Unsupervised 4D Dynamic Scene Understanding

    Full text link
    We introduce the first approach to solve the challenging problem of unsupervised 4D visual scene understanding for complex dynamic scenes with multiple interacting people from multi-view video. Our approach simultaneously estimates a detailed model that includes a per-pixel semantically and temporally coherent reconstruction, together with instance-level segmentation exploiting photo-consistency, semantic and motion information. We further leverage recent advances in 3D pose estimation to constrain the joint semantic instance segmentation and 4D temporally coherent reconstruction. This enables per-person semantic instance segmentation of multiple interacting people in complex dynamic scenes. Extensive evaluation of the joint visual scene understanding framework against state-of-the-art methods on challenging indoor and outdoor sequences demonstrates a significant (approx. 40%) improvement in semantic segmentation, reconstruction and scene flow accuracy. Comment: To appear in the IEEE International Conference on Computer Vision (ICCV) 201
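    The photo-consistency cue mentioned above can be pictured as follows: a reconstructed 3D point is projected into every calibrated view, and the spread of the sampled colours scores how consistent the reconstruction is at that point. The sketch below is a simplified illustration with assumed data layouts, not the paper's formulation.

        import numpy as np

        def photo_consistency_cost(point_3d, images, projections):
            """Colour variance of one 3D point sampled across calibrated views.

            point_3d:    (3,) world coordinates.
            images:      list of (H, W, 3) float arrays.
            projections: list of (3, 4) projection matrices P = K [R | t].
            Lower cost means the point is more photo-consistent.
            """
            homog = np.append(point_3d, 1.0)
            samples = []
            for img, P in zip(images, projections):
                u, v, w = P @ homog
                if w <= 0:                              # point behind the camera
                    continue
                x, y = int(round(u / w)), int(round(v / w))
                if 0 <= y < img.shape[0] and 0 <= x < img.shape[1]:
                    samples.append(img[y, x])
            if len(samples) < 2:
                return np.inf                           # not enough views to compare
            return float(np.var(np.stack(samples), axis=0).mean())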

    From uncertainty to adaptivity: multiscale edge detection and image segmentation

    Get PDF
    This thesis presents research on two different tasks in computer vision: edge detection and image segmentation (including texture segmentation and motion field segmentation). The central issue of this thesis is the uncertainty of joint space-frequency image analysis, which motivates the design of adaptive multiscale/multiresolution schemes for edge detection and image segmentation. Edge detectors capture most of the local features in an image, including object boundaries and the details of surface textures. Apart from these edge features, the region properties of surface textures and motion fields are also important for segmenting an image into disjoint regions. The major theoretical achievements of this thesis are twofold. First, a scale parameter for the local processing of an image (e.g. edge detection) is proposed. The corresponding edge behaviour in scale space, referred to as Bounded Diffusion, is the basis of a multiscale edge detector in which the scale is adjusted adaptively according to the local noise level. Second, an adaptive multiresolution clustering scheme is proposed for texture segmentation (referred to as Texture Focusing) and motion field segmentation. In this scheme, the central regions of homogeneous textures (motion fields) are analysed at coarse resolutions so as to achieve a better estimate of the textural content (optical flow), and the border region of a texture (motion field) is analysed at fine resolutions so as to achieve a better estimate of the boundary between textures (moving objects). Both achievements are logical consequences of the uncertainty principle. Four algorithms, comprising a roof edge detector, a multiscale step edge detector, a texture segmentation scheme and a motion field segmentation scheme, are proposed to address various aspects of edge detection and image segmentation. These algorithms have been implemented and extensively evaluated.
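    As an illustration of noise-adaptive scale selection for edge detection (a simplified stand-in for the Bounded Diffusion scheme, with a global rather than local noise estimate and an assumed scale rule), a Gaussian-derivative detector can pick its smoothing scale from an estimate of the image noise level:

        import numpy as np
        from scipy import ndimage

        def adaptive_scale_edges(image, base_sigma=1.0, noise_gain=2.0, threshold=0.1):
            """Gradient-magnitude edge map with a noise-dependent smoothing scale."""
            image = image.astype(float)
            # Crude noise estimate from the median absolute Laplacian response
            # (the thesis adapts the scale locally; this global rule is illustrative).
            residual = ndimage.laplace(image)
            noise_level = np.median(np.abs(residual)) / (np.median(np.abs(image)) + 1e-6)
            sigma = base_sigma + noise_gain * noise_level
            # Gaussian-derivative gradients at the chosen scale.
            gx = ndimage.gaussian_filter(image, sigma, order=[0, 1])
            gy = ndimage.gaussian_filter(image, sigma, order=[1, 0])
            magnitude = np.hypot(gx, gy)
            return magnitude > threshold * magnitude.max()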