453 research outputs found

    Exploiting Long-Term Connectivity and Visual Motion in CRF-based Multi-Person Tracking

    We present a Conditional Random Field (CRF) approach to tracking-by-detection in which we model pairwise factors linking pairs of detections and their hidden labels, as well as higher-order potentials defined in terms of label costs. In contrast to previous works, our method considers long-term connectivity between pairs of detections and models similarities as well as dissimilarities between them based on position, color, and, as a novelty, visual motion cues. We introduce a set of feature-specific confidence scores that aim at weighting feature contributions according to their reliability. Pairwise potential parameters are then learned in an unsupervised way from detections or from tracklets. Label costs are defined so as to penalize the complexity of the labeling, based on prior knowledge about the scene, e.g. about the location of entry/exit zones. Experiments on the PETS 2009, TUD, and CAVIAR datasets demonstrate the validity of our approach, with similar or better performance than recent state-of-the-art algorithms.
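
    A schematic form of such an objective, in our own notation rather than the paper's, is the labeling energy

        E(y) = \sum_{(i,j)} \sum_{f} c_f(i,j)\, \psi_f(x_i, x_j, y_i, y_j) \;+\; \sum_{l} h_l \,[\, \exists i : y_i = l \,]

    where y assigns a track label to each detection, \psi_f scores the (dis)similarity of detections i and j under cue f (position, color, visual motion), c_f(i,j) is the feature-specific confidence weighting that cue, and the label cost h_l is charged once per track l actually used, penalizing labelings more complex than the scene priors (e.g. entry/exit zones) warrant.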

    Long-Term Time-Sensitive Costs for CRF-Based Tracking by Detection

    We present a Conditional Random Field (CRF) approach to tracking-by-detection in which we model pairwise factors linking pairs of detections and their hidden labels, as well as higher-order potentials defined in terms of label costs. Our method considers long-term connectivity between pairs of detections and models cue similarities as well as dissimilarities between them using time-interval-sensitive models. In addition to position, color, and visual motion cues, we investigate in this paper the use of a SURF cue as a structure representation. We take advantage of the MOTChallenge 2016 benchmark to refine our tracking models, evaluate our system, and study the impact of its different parameters on performance.
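
    As a concrete illustration of how time-interval-sensitive costs might be organized (the bin edges, cue names, and function signatures below are our assumptions, not the paper's actual model), one learned reliability weight can be kept per cue and per time-gap bin:

        # Hypothetical sketch: per-cue association weights that depend on the
        # temporal gap between two detections.
        TIME_BIN_EDGES = (1, 5, 25, 125)          # frame-gap boundaries (assumed)
        CUES = ("position", "color", "motion", "surf")

        def bin_index(gap):
            """Map a frame gap between two detections to a time-interval bin."""
            for b, edge in enumerate(TIME_BIN_EDGES):
                if gap <= edge:
                    return b
            return len(TIME_BIN_EDGES)            # overflow bin for long gaps

        def association_cost(frame_a, frame_b, similarities, weights):
            """Combine per-cue similarities in [0, 1] with weights learned
            separately for each (cue, time-interval) pair; lower cost means
            the two detections are more likely the same identity."""
            b = bin_index(abs(frame_b - frame_a))
            return -sum(weights[(cue, b)] * similarities[cue] for cue in CUES)

    Learning a separate weight per interval lets a cue such as position dominate at short gaps, while appearance cues like color or SURF structure take over across longer occlusions.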

    Efficient and Accurate Tracking for Face Diarization via Periodical Detection

    Face diarization, i.e. face tracking and clustering within video documents, is useful and important for video indexing and fast browsing, but it is also a difficult and time-consuming task. In this paper, we address the tracking aspect and propose a novel algorithm with two main contributions. First, we propose an approach that leverages a state-of-the-art deformable part-based model (DPM) face detector within a multi-cue discriminative tracking-by-detection framework that relies on automatically learned, long-term, time-interval-sensitive association costs specific to each document type. Second, to improve performance, we propose an explicit false-alarm removal step at the track level to efficiently filter out wrong detections (and the resulting tracks). Altogether, the method is able to skip frames, i.e. process only 3 to 4 frames per second, thus cutting down computational cost, while performing better than state-of-the-art methods as evaluated on three public benchmarks from different contexts, including movie and broadcast data.
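
    The frame-skipping idea can be sketched as follows; detect_faces, associate, and is_false_alarm are placeholders standing in for the paper's DPM detector, learned association costs, and track-level filter:

        # Sketch of tracking-by-periodical-detection (our structure, not the
        # authors' code): detect on a sparse subset of frames, link detections
        # into tracks, then prune spurious tracks afterwards.
        DETECTION_PERIOD = 8   # process every 8th frame (assumed value)

        def diarize_faces(frames, detect_faces, associate, is_false_alarm):
            tracks = []
            for t, frame in enumerate(frames):
                if t % DETECTION_PERIOD:
                    continue                       # skipped frame: no work at all
                detections = detect_faces(frame)   # e.g. a DPM face detector
                tracks = associate(tracks, detections, t)
            # Track-level false-alarm removal: drop whole tracks built on
            # wrong detections rather than filtering detection by detection.
            return [trk for trk in tracks if not is_false_alarm(trk)]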

    Semantic Mapping of Road Scenes

    The problem of understanding road scenes has been at the forefront of the computer vision community for the last couple of years. Scene understanding enables autonomous systems to navigate and interpret the surroundings in which they operate. It involves reconstructing the scene and estimating the objects present in it, such as ‘vehicles’, ‘road’, ‘pavements’ and ‘buildings’. This thesis focuses on these aspects and proposes solutions to address them.

    First, we propose a solution to generate a dense semantic map from multiple street-level images. This map can be imagined as the bird’s-eye view of the region, with associated semantic labels, for tens of kilometres of street-level data. We generate this overhead semantic view from street-level images, in contrast to existing approaches that use satellite/overhead imagery for classification of urban regions, allowing us to produce a detailed semantic map for a large-scale urban area.

    Then we describe a method to perform large-scale dense 3D reconstruction of road scenes with associated semantic labels. Our method fuses the depth maps generated from stereo pairs across time into a global 3D volume in an online fashion, in order to accommodate arbitrarily long image sequences. The object-class labels estimated from the street-level stereo image sequence are used to annotate the reconstructed volume.

    Then we exploit the scene structure in object-class labelling by performing inference over a meshed representation of the scene. Labelling over the mesh solves two issues. First, images often contain redundant information, with multiple images describing the same scene; processing these images separately is slow, whereas our method is approximately an order of magnitude faster in the inference stage compared to inference in the image domain. Second, multiple images of the same scene often result in inconsistent labelling; by solving a single mesh, we remove this inconsistency across images. Our mesh-based labelling also takes into account the object layout in the scene, which is often ambiguous in the image domain, thereby increasing the accuracy of object labelling.

    Finally, we perform labelling and structure computation through a hierarchical robust P^N Markov Random Field defined on the voxels and super-voxels given by an octree. This allows us to infer the 3D structure and the object-class labels in a principled manner, through bounded approximate minimisation of a well-defined and well-studied energy functional.

    In this thesis, we also introduce two object-labelled datasets created from real-world data. The 15-kilometre Yotta Labelled dataset consists of 8,000 images per camera view of the roadways of the United Kingdom, with a subset annotated with object-class labels; the second dataset comprises ground-truth object labels for the publicly available KITTI dataset. Both datasets are publicly available and we hope will be helpful to the vision research community.
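
    A minimal sketch of the online fusion step, assuming a truncated signed-distance (TSDF-style) voxel representation; the class and weighting scheme below are our illustration, not the thesis implementation:

        import numpy as np

        class FusedVolume:
            """Global 3D volume updated as a running weighted mean, so an
            arbitrarily long stereo sequence costs constant memory per voxel."""

            def __init__(self, shape):
                self.values = np.zeros(shape, dtype=np.float32)   # fused distances
                self.weights = np.zeros(shape, dtype=np.float32)  # observation weight

            def integrate(self, tsdf, weight=1.0):
                """Fold one depth map's signed-distance observations into the
                volume incrementally (online), without storing past frames."""
                w_new = self.weights + weight
                self.values = (self.weights * self.values + weight * tsdf) / np.maximum(w_new, 1e-6)
                self.weights = w_new

    Semantic annotation can then proceed per voxel, e.g. by accumulating a histogram of the street-level object-class votes that project into it.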

    Video Foreground Localization from Traditional Methods to Deep Learning

    These days, detection of Visual Attention Regions (VAR), such as moving objects, has become an integral part of many computer vision applications, viz. pattern recognition, object detection and classification, video surveillance, autonomous driving, human-machine interaction (HMI), and so forth. Moving-object identification using bounding boxes has matured to the level of localizing objects along their rigid borders, a process called foreground localization (FGL). Over the decades, many image segmentation methodologies have been well studied, devised, and extended to suit video FGL. Despite that, the problem of video foreground (FG) segmentation remains an intriguing yet appealing task due to its ill-posed nature and myriad of applications. Maintaining spatial and temporal coherence, particularly at object boundaries, remains challenging and computationally burdensome. It gets even harder when the background has a dynamic nature, like swaying tree branches or a shimmering water body, illumination variations, or shadows cast by the moving objects, or when the video sequences have jittery frames caused by vibrating or unstable camera mounts on a surveillance post or moving robot. At the same time, in the analysis of traffic flow or human activity, the performance of an intelligent system depends substantially on the robustness of its localization of the VAR, i.e., the FG. To this end, the natural question arises: what is the best way to deal with these challenges? Thus, the goal of this thesis is to investigate plausible real-time, performant implementations, from traditional approaches to modern-day deep learning (DL) models, for FGL applicable to many video content-aware applications (VCAA). It focuses mainly on improving existing methodologies by harnessing multimodal spatial and temporal cues for delineated FGL.

    The first part of the dissertation is dedicated to enhancing conventional sample-based and Gaussian mixture model (GMM)-based video FGL using a probability mass function (PMF), temporal median filtering, fusing CIEDE2000 color similarity, color distortion, and illumination measures, and picking an appropriate adaptive threshold to extract the FG pixels. Subjective and objective evaluations show the improvements over a number of similar conventional methods.

    The second part of the thesis focuses on exploiting and improving deep convolutional neural networks (DCNN) for the problem mentioned earlier. Three models akin to encoder-decoder (EnDec) networks are implemented with various innovative strategies to improve the quality of FG segmentation, including double-encoding slow-decoding feature learning, multi-view receptive-field feature fusion, and the incorporation of spatiotemporal cues through long short-term memory (LSTM) units in both the subsampling and upsampling subnetworks. Experimental studies are carried out thoroughly, from baselines to challenging video sequences, to prove the effectiveness of the proposed DCNNs. The analysis demonstrates the architectural efficiency of the proposed models over other methods, while quantitative and qualitative experiments show their competitive performance compared to the state of the art.
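
    The first part's idea of fusing a GMM foreground mask with perceptual color measures under an adaptive threshold can be sketched roughly as follows; OpenCV's stock MOG2 model stands in for the thesis's enhanced GMM, and a simple Lab-space distance replaces the CIEDE2000 / color-distortion / illumination fusion:

        import cv2
        import numpy as np

        mog2 = cv2.createBackgroundSubtractorMOG2(detectShadows=True)

        def foreground_mask(frame_bgr, background_bgr, thresh=12.0):
            """Keep only GMM foreground pixels that are also perceptually far
            from the background model, suppressing shadow and illumination
            artifacts. `thresh` plays the role of the adaptive threshold and
            is fixed here for simplicity."""
            raw = mog2.apply(frame_bgr)               # 0=bg, 127=shadow, 255=fg
            gmm_fg = raw == 255
            lab_f = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2Lab).astype(np.float32)
            lab_b = cv2.cvtColor(background_bgr, cv2.COLOR_BGR2Lab).astype(np.float32)
            color_far = np.linalg.norm(lab_f - lab_b, axis=2) > thresh
            return (gmm_fg & color_far).astype(np.uint8) * 255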

    Deep Spatio-Temporal Random Fields for Efficient Video Segmentation

    In this work we introduce a time- and memory-efficient method for structured prediction that couples neuron decisions across both space and time. We show that we are able to perform exact and efficient inference on a densely connected spatio-temporal graph by capitalizing on recent advances in deep Gaussian Conditional Random Fields (GCRFs). Our method, called VideoGCRF, (a) is efficient, (b) has a unique global minimum, and (c) can be trained end-to-end alongside contemporary deep networks for video understanding. We experiment with multiple connectivity patterns in the temporal domain, and present empirical improvements over strong baselines on the tasks of both semantic and instance segmentation of videos. Our implementation is based on the Caffe2 framework and will be available at https://github.com/siddharthachandra/gcrf-v3.0
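
    A toy rendering of the Gaussian CRF computation (ours, not the authors' Caffe2 code): a positive-semidefinite spatio-temporal coupling A makes I + A symmetric positive definite, so the unique global minimum of the quadratic CRF energy is the solution of a linear system, obtainable exactly, e.g. with conjugate gradients:

        import numpy as np
        from scipy.sparse.linalg import LinearOperator, cg

        def gcrf_inference(unaries, coupling):
            """Solve (I + A) x = b for one class channel, where b holds the
            flattened unary scores of every pixel in every frame and
            coupling(x) applies the spatio-temporal pairwise terms A @ x."""
            n = unaries.size
            system = LinearOperator((n, n), matvec=lambda x: x + coupling(x))
            x, info = cg(system, unaries.ravel(), atol=1e-8)
            if info != 0:
                raise RuntimeError("conjugate gradients did not converge")
            return x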