
    Dynamic tree-structured sparse RPCA via column subset selection for background modeling and foreground detection

    Video analysis often begins with background subtraction, which consists of building a background model that allows foreground pixels to be distinguished. Recent evaluations of background subtraction techniques have demonstrated that these methods still face considerable challenges. Processing the background on a per-pixel basis is not only time-consuming but can also dramatically degrade foreground region detection if region cohesion and contiguity are not considered in the model. We present a new method in which we regard the image sequence as the sum of a low-rank background matrix and a dynamic tree-structured sparse matrix, and solve the decomposition using our approximated Robust Principal Component Analysis (RPCA) method extended to handle camera motion. Furthermore, to reduce the curse of dimensionality and scale, we introduce low-rank background modeling via Column Subset Selection, which reduces the order of complexity, decreases computation time, and eliminates the huge storage requirement for large videos.
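
    To make the low-rank + sparse split above concrete, here is a minimal sketch of the classical principal component pursuit baseline solved with the inexact augmented Lagrangian method; it is not the authors' approximated solver and does not use the tree-structured norm. Video frames are vectorized and stacked as the columns of M, so the low-rank part recovers the static background and the sparse part the moving foreground.

```python
import numpy as np

def shrink(x, tau):
    # Soft thresholding: proximal operator of the l1 norm.
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def svt(x, tau):
    # Singular value thresholding: proximal operator of the nuclear norm.
    u, s, vt = np.linalg.svd(x, full_matrices=False)
    return u @ np.diag(shrink(s, tau)) @ vt

def rpca(m_mat, max_iter=500, tol=1e-7):
    """Principal component pursuit: m_mat ~ low-rank + sparse."""
    rows, cols = m_mat.shape
    lam = 1.0 / np.sqrt(max(rows, cols))             # standard PCP weight
    mu = rows * cols / (4.0 * np.abs(m_mat).sum())   # common step-size heuristic
    norm_m = np.linalg.norm(m_mat)
    low = np.zeros_like(m_mat)
    sparse = np.zeros_like(m_mat)
    dual = np.zeros_like(m_mat)
    for _ in range(max_iter):
        low = svt(m_mat - sparse + dual / mu, 1.0 / mu)
        sparse = shrink(m_mat - low + dual / mu, lam / mu)
        resid = m_mat - low - sparse
        dual += mu * resid
        if np.linalg.norm(resid) / norm_m < tol:
            break
    return low, sparse

# Example: for a grayscale clip of shape (T, H, W),
# M = frames.reshape(T, -1).T gives one column per frame.
```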

    Robust Subspace Estimation via Low-Rank and Sparse Decomposition and Applications in Computer Vision

    Recent advances in robust subspace estimation have made dimensionality reduction and noise and outlier suppression an active area of research, alongside continuous improvements in computer vision applications. Because image and video signals require a high-dimensional representation, their storage, processing, transmission, and analysis is often difficult. It is therefore desirable to obtain a low-dimensional representation of such signals and, at the same time, to correct for corruptions, errors, and outliers, so that the signals can readily be used for later processing.

    Major recent advances in low-rank modelling in this context were initiated by the work of Candùs et al. [17], where the authors provided a solution to the long-standing problem of decomposing a matrix into low-rank and sparse components in a Robust Principal Component Analysis (RPCA) framework. However, for computer vision applications RPCA is often too complex and/or may not yield desirable results. The low-rank component obtained by RPCA usually has an unnecessarily high rank, while certain tasks require lower-dimensional representations. RPCA can robustly estimate noise and outliers and separate them from the low-rank component via a sparse part, but it offers no insight into the structure of the sparse solution, nor any way to further decompose the sparse part into random noise and a structured sparse component, which would be advantageous in many computer vision tasks. Moreover, as video signals are usually captured by a moving camera, obtaining a low-rank component by RPCA becomes impossible.

    In this thesis, novel approximated RPCA algorithms are presented, targeting different shortcomings of RPCA. The RPCA solution was analysed to identify its most time-consuming steps, which were replaced with simpler yet tractable alternatives. The proposed method obtains exactly the desired rank for the low-rank component while estimating a global transformation that describes camera-induced motion. Furthermore, it decomposes the sparse part into a foreground sparse component and a random-noise part that contains no useful information for computer vision processing. The foreground sparse component is obtained by several novel structured sparsity-inducing norms that better encapsulate the pixel structure needed in visual signals. Algorithms for reducing the complexity of low-rank estimation are also proposed, achieving significant complexity reduction without sacrificing the visual representation of video and image information.

    The proposed algorithms are applied to several fundamental computer vision tasks, namely high efficiency video coding, batch image alignment, inpainting and recovery, video stabilisation, background modelling and foreground segmentation, robust subspace clustering and motion estimation, face recognition, and ultra-high-definition image and video super-resolution. The algorithms proposed in this thesis, including batch image alignment and recovery, background modelling and foreground segmentation, robust subspace clustering and motion segmentation, and ultra-high-definition image and video super-resolution, achieve results that are either state-of-the-art or comparable to existing methods.
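
    For reference, the principal component pursuit formulation of Candùs et al. [17] discussed above decomposes an observed matrix M into low-rank and sparse parts as

```latex
\min_{L,\,S} \;\; \|L\|_{*} + \lambda\,\|S\|_{1}
\quad \text{subject to} \quad M = L + S,
\qquad \lambda = \frac{1}{\sqrt{\max(m,n)}},
```

    where \|L\|_{*} is the nuclear norm, \|S\|_{1} is the entrywise l1 norm, and M is m-by-n. The approximated variants developed in the thesis additionally fix the rank of L to a desired target and replace the plain l1 penalty on S with structured sparsity-inducing norms.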

    Hierarchical improvement of foreground segmentation masks in background subtraction

    A plethora of algorithms have been proposed for foreground segmentation, a fundamental stage of many computer vision applications. In this work, we propose a post-processing framework to improve the foreground segmentation performance of background subtraction algorithms. We define a hierarchical framework for extending segmented foreground pixels to undetected foreground object areas and for removing erroneously segmented foreground. First, we create a motion-aware hierarchical image segmentation of each frame that prevents merging foreground and background image regions. Then, we estimate the quality of the foreground mask through the fitness of the binary regions in the mask and the hierarchy of segmented regions. Finally, the improved foreground mask is obtained as an optimal labeling by jointly exploiting foreground quality and spatial color relations in a pixel-wise fully-connected Conditional Random Field. Experiments conducted over four large and heterogeneous datasets with varied challenges (CDNET2014, LASIESTA, SABS and BMC) demonstrate the capability of the proposed framework to improve background subtraction results. This work was partially supported by the Spanish Government (HAVideo, TEC2014-53176-R).
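
    The final labeling step, a pixel-wise fully-connected CRF, can be sketched with the pydensecrf package as below. This is an illustrative stand-in, not the authors' code: the unary term here comes directly from the binary input mask via unary_from_labels, whereas the paper's unary combines mask fitness with the segmentation hierarchy, and all kernel parameters are placeholder values.

```python
import numpy as np
import pydensecrf.densecrf as dcrf
from pydensecrf.utils import unary_from_labels

def refine_mask(img, mask, iters=5):
    """Refine a binary foreground mask with a fully-connected CRF.

    img  : HxWx3 uint8 RGB frame
    mask : HxW bool array, True = foreground
    """
    h, w = mask.shape
    crf = dcrf.DenseCRF2D(w, h, 2)  # two labels: background / foreground

    # Unary term from the (possibly noisy) input mask; gt_prob encodes
    # how much we trust the background subtraction output.
    labels = mask.astype(np.int32)
    crf.setUnaryEnergy(unary_from_labels(labels, 2, gt_prob=0.7,
                                         zero_unsure=False))

    # Pairwise terms: spatial smoothness plus an appearance kernel,
    # which captures the "spatial color relations" of the framework.
    crf.addPairwiseGaussian(sxy=3, compat=3)
    crf.addPairwiseBilateral(sxy=60, srgb=10,
                             rgbim=np.ascontiguousarray(img), compat=10)

    q = crf.inference(iters)  # mean-field inference
    return np.argmax(q, axis=0).reshape(h, w) == 1
```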

    Robust subspace learning for static and dynamic affect and behaviour modelling

    Machine analysis of human affect and behavior in naturalistic contexts has attracted growing attention over the last decade from disciplines ranging from the social and cognitive sciences to machine learning and computer vision. Endowing machines with the ability to seamlessly detect, analyze, model, and predict, as well as simulate and synthesize, manifestations of internal emotional and behavioral states in real-world data is deemed essential for the deployment of next-generation, emotionally and socially competent human-centered interfaces. In this thesis, we are primarily motivated by the problem of modeling, recognizing, and predicting spontaneous expressions of non-verbal human affect and behavior manifested through either low-level facial attributes in static images or high-level semantic events in image sequences. Both visual data and annotations of naturalistic affect and behavior naturally contain noisy measurements of unbounded magnitude at random locations, commonly referred to as 'outliers'. We present machine learning methods that are robust to such gross, sparse noise.

    First, we deal with static analysis of face images, viewing them as a superposition of mutually incoherent, low-complexity components corresponding to facial attributes such as facial identity, expressions, and activation of atomic facial muscle actions. We develop a robust, discriminant dictionary learning framework to extract these components from grossly corrupted training data and combine it with sparse representation to recognize the associated attributes. We demonstrate that our framework can jointly address interrelated classification tasks such as face and facial expression recognition.

    Inspired by the well-documented importance of the temporal aspect in perceiving affect and behavior, we direct the bulk of our research efforts into continuous-time modeling of dimensional affect and social behavior. Having identified a gap in the literature, namely the lack of data containing annotations of social attitudes in continuous time and scale, we first curate a new audio-visual database of multi-party conversations from political debates, annotated frame-by-frame in terms of real-valued conflict intensity, and use it to conduct the first study on continuous-time conflict intensity estimation. Our experimental findings corroborate previous evidence of the inability of existing classifiers to capture the hidden temporal structure of affective and behavioral displays.

    We then present a novel dynamic behavior analysis framework that models temporal dynamics explicitly, based on the natural assumption that continuous-time annotations of smoothly varying affect or behavior can be viewed as the outputs of a low-complexity linear dynamical system when behavioral cues (features) act as system inputs. A novel robust structured rank minimization framework is proposed to estimate the system parameters in the presence of gross corruptions and partially missing data. Experiments on prediction of dimensional conflict and affect, as well as multi-object tracking from detections, validate the effectiveness of our predictive framework and demonstrate for the first time that complex human behavior and affect can be learned and predicted from small training sets of person(s)-specific observations.
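
    The modeling assumption in the last paragraph can be written in standard state-space form (generic notation, not necessarily the symbols used in the thesis):

```latex
x_{t+1} = A\,x_t + B\,u_t, \qquad
y_t = C\,x_t + D\,u_t + e_t,
```

    where u_t are the behavioral-cue features acting as system inputs, y_t are the continuous-valued annotations observed as outputs, x_t is a hidden state of small dimension (the "low complexity"), and e_t is the gross sparse corruption; the proposed robust structured rank minimization recovers the system parameters (A, B, C, D) despite e_t and partially missing entries in y_t.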

    Motion-based vision methods and their applications

    Motion detection is a basic video analytics operation on which many high-level computer vision tasks are built, e.g., pedestrian detection, anomaly detection, scene understanding, and real-time object tracking. Even though a large number of motion detection methods have been proposed over the last decades, some important questions remain unanswered, including: (1) how to separate the foreground from the background accurately, even under extremely challenging circumstances such as strong background motion and illumination changes; (2) how to evaluate different motion detection methods; and (3) how to use the motion information extracted by motion detection to improve high-level computer vision tasks. In this thesis, we address four problems related to motion detection.

    1. How can we benchmark motion detection methods, and on which videos? Current datasets are either too small, with a limited number of scenarios, or provide only bounding-box ground truth indicating the rough location of foreground objects. As a solution, we built the largest fully annotated motion detection dataset in the world, with pixel-accurate ground truth, to evaluate and compare motion detection methods, and organized an international competition on it (CVPR 2014). We also explored various evaluation metrics as well as strategies for combining motion detection methods.

    2. Providing pixel-accurate ground truth is a huge challenge when building a motion detection dataset: automatic and semi-automatic labeling methods are never accurate enough to produce ground-truth-quality results, while manually labeling hundreds of thousands of frames is extremely time-consuming. To solve this problem, we proposed an interactive deep learning method for segmenting moving objects in videos. The proposed method reaches human-level accuracy while reducing labeling time by a factor of 40.

    3. Pedestrian detectors, which are widely used in video analysis, suffer from either false positive or false negative detections depending on how their parameters are tuned, and manually adjusting parameters for a large number of videos is not feasible in practice. To make pedestrian detectors more robust across a large variety of videos, we combined motion detection with various state-of-the-art pedestrian detectors through a novel motion-based nonlinear filtering process that greatly reduces the number of false positives.

    4. Scene background initialization is the process of recovering the background image of a video without the foreground objects in it. Although many methods have been proposed, background modeling remains challenging partly because, as for motion detection, there was no good dataset or benchmarking framework to estimate the performance of such methods. To fix this, we built the largest dataset in the world for this task, proposed an extensive survey and a novel benchmarking framework for scene background initialization, and organized an international competition (ICPR 2016).
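
    As a concrete companion to the evaluation-metric discussion above, the sketch below computes the seven per-video scores used by the changedetection.net (CDNET) benchmark from a predicted mask and its pixel-accurate ground truth; the function name and the zero-division guard are illustrative additions, not part of the benchmark code.

```python
import numpy as np

def cdnet_scores(pred, gt):
    """CDNET-style scores from a predicted foreground mask and its
    pixel-accurate ground truth (both HxW boolean, True = foreground)."""
    tp = np.sum(pred & gt)    # foreground pixels correctly detected
    fp = np.sum(pred & ~gt)   # background pixels flagged as motion
    fn = np.sum(~pred & gt)   # missed foreground pixels
    tn = np.sum(~pred & ~gt)  # background pixels correctly ignored
    eps = 1e-12               # guard against empty masks
    recall = tp / (tp + fn + eps)
    precision = tp / (tp + fp + eps)
    return {
        "Recall": recall,
        "Specificity": tn / (tn + fp + eps),
        "FPR": fp / (fp + tn + eps),
        "FNR": fn / (tp + fn + eps),
        "PWC": 100.0 * (fn + fp) / (tp + fn + fp + tn + eps),
        "Precision": precision,
        "F-Measure": 2 * precision * recall / (precision + recall + eps),
    }
```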