Dynamic tree-structured sparse RPCA via column subset selection for background modeling and foreground detection
Video analysis often begins with background subtraction, which consists of creating a background model from which foreground pixels can be distinguished. Recent evaluations of background subtraction techniques demonstrated that these methods still face considerable challenges. Processing the background on a per-pixel basis is not only time-consuming but can also dramatically degrade foreground region detection if region cohesion and contiguity are not considered in the model. We present a new method in which we regard the image sequence as the sum of a low-rank background matrix and a dynamic tree-structured sparse matrix, and solve the decomposition using our approximated Robust Principal Component Analysis method, extended to handle camera motion. Furthermore, to reduce the curse of dimensionality and scale, we introduce low-rank background modeling via Column Subset Selection, which reduces the order of complexity, decreases computation time, and eliminates the huge storage need for large videos.
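The column-subset idea can be sketched as follows; `cs_background` and the uniform random column selection are illustrative assumptions (the abstract does not specify the selection rule), not the paper's actual method:

```python
import numpy as np

def cs_background(frames, k=10, rng=None):
    """Approximate a low-rank background model by selecting k columns
    (vectorised frames) instead of factorising the full data matrix."""
    rng = np.random.default_rng(rng)
    D = frames  # shape (pixels, n_frames), one column per frame
    idx = rng.choice(D.shape[1], size=k, replace=False)
    C = D[:, idx]                  # column subset
    Q, _ = np.linalg.qr(C)         # orthonormal basis for the subset
    B = Q @ (Q.T @ D)              # project every frame onto the basis
    S = D - B                      # residual: candidate foreground
    return B, S
```

Only the k selected columns are factorised, so the cost scales with k rather than with the number of frames, which is the complexity reduction the abstract describes.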
Robust Subspace Estimation via Low-Rank and Sparse Decomposition and Applications in Computer Vision
Recent advances in robust subspace estimation have made dimensionality reduction and noise and outlier suppression an active research area, alongside continuous improvements in computer vision applications. Because image and video signals require a high-dimensional representation, their storage, processing, transmission, and analysis are often difficult. It is therefore desirable to obtain a low-dimensional representation of such signals while correcting for corruptions, errors, and outliers, so that the signals can be readily used in later processing. Major recent advances in low-rank modelling in this context were initiated by the work of Candès et al. [17], who provided a solution to the long-standing problem of decomposing a matrix into low-rank and sparse components in a Robust Principal Component Analysis (RPCA) framework. For computer vision applications, however, RPCA is often too complex and may not yield desirable results. The low-rank component obtained by RPCA usually has an unnecessarily high rank, while certain tasks require lower-dimensional representations. RPCA can robustly estimate noise and outliers and separate them from the low-rank component via a sparse part, but it offers no insight into the structure of the sparse solution, nor a way to further decompose the sparse part into random noise and a structured sparse component, which would be advantageous in many computer vision tasks. Moreover, since video signals are usually captured by a moving camera, obtaining a low-rank component by RPCA directly becomes impossible. In this thesis, novel Approximated RPCA algorithms are presented, each targeting a different shortcoming of RPCA. RPCA was analysed to identify its most time-consuming steps, which were then replaced with simpler yet tractable alternatives. The proposed method obtains the exact desired rank for the low-rank component while estimating a global transformation that describes camera-induced motion. Furthermore, it decomposes the sparse part into a foreground sparse component and a random-noise part that contains no useful information for computer vision processing. The foreground sparse component is obtained through several novel structured sparsity-inducing norms that better encapsulate the pixel structure needed in visual signals. Moreover, algorithms for reducing the complexity of low-rank estimation are proposed that achieve significant complexity reduction without sacrificing the visual representation of video and image information. The proposed algorithms are applied to several fundamental computer vision tasks, namely high-efficiency video coding, batch image alignment, inpainting and recovery, video stabilisation, background modelling and foreground segmentation, robust subspace clustering and motion estimation, face recognition, and ultra-high-definition image and video super-resolution. The algorithms proposed for batch image alignment and recovery, background modelling and foreground segmentation, robust subspace clustering and motion segmentation, and ultra-high-definition image and video super-resolution achieve results that are either state-of-the-art or comparable to existing methods.
Hierarchical improvement of foreground segmentation masks in background subtraction
A plethora of algorithms have been defined for foreground segmentation, a fundamental stage for many computer vision applications. In this work, we propose a post-processing framework to improve the foreground segmentation performance of background subtraction algorithms. We define a hierarchical framework for extending segmented foreground pixels to undetected foreground object areas and for removing erroneously segmented foreground. Firstly, we create a motion-aware hierarchical image segmentation of each frame that prevents merging foreground and background image regions. Then, we estimate the quality of the foreground mask through the fitness of the binary regions in the mask and the hierarchy of segmented regions. Finally, the improved foreground mask is obtained as an optimal labeling by jointly exploiting foreground quality and spatial color relations in a pixel-wise fully-connected Conditional Random Field. Experiments are conducted over four large and heterogeneous datasets with varied challenges (CDNET2014, LASIESTA, SABS and BMC), demonstrating the capability of the proposed framework to improve background subtraction results. This work was partially supported by the Spanish Government (HAVideo, TEC2014-53176-R).
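The core idea of extending the mask to whole segmented regions and removing spurious detections can be illustrated with a simple per-region vote; `refine_mask` and its thresholds are hypothetical, and the paper's actual final stage is a fully-connected CRF, not this heuristic:

```python
import numpy as np

def refine_mask(mask, labels, tau_fg=0.6, tau_bg=0.1):
    """Region-level refinement of a binary foreground mask.

    mask:   boolean foreground mask from background subtraction
    labels: integer region labels from an image segmentation
    Regions that are mostly foreground are filled entirely;
    regions that are mostly background are cleared entirely.
    """
    out = mask.copy()
    for r in np.unique(labels):
        region = labels == r
        frac = mask[region].mean()   # fraction of region marked foreground
        if frac >= tau_fg:
            out[region] = True       # extend mask to the whole region
        elif frac <= tau_bg:
            out[region] = False      # remove spurious detections
    return out
```

This captures why a motion-aware segmentation matters: if a region mixed foreground and background pixels, such region-level decisions would corrupt the mask rather than improve it.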
Robust subspace learning for static and dynamic affect and behaviour modelling
Machine analysis of human affect and behavior in naturalistic contexts has witnessed growing attention in the last decade from various disciplines ranging from social and cognitive sciences to machine learning and computer vision. Endowing machines with the ability to seamlessly detect, analyze, model, predict as well as simulate and synthesize manifestations of internal emotional and behavioral states in real-world data is deemed essential for the deployment of next-generation, emotionally- and socially-competent human-centered interfaces. In this thesis, we are primarily motivated by the problem of modeling, recognizing and predicting spontaneous expressions of non-verbal human affect and behavior manifested through either low-level facial attributes in static images or high-level semantic events in image sequences. Both visual data and annotations of naturalistic affect and behavior naturally contain noisy measurements of unbounded magnitude at random locations, commonly referred to as "outliers". We present here machine learning methods that are robust to such gross, sparse noise. First, we deal with static analysis of face images, viewing the latter as a superposition of mutually-incoherent, low-complexity components corresponding to facial attributes, such as facial identity, expressions and activation of atomic facial muscle actions. We develop a robust, discriminant dictionary learning framework to extract these components from grossly corrupted training data and combine it with sparse representation to recognize the associated attributes. We demonstrate that our framework can jointly address interrelated classification tasks such as face and facial expression recognition. Inspired by the well-documented importance of the temporal aspect in perceiving affect and behavior, we direct the bulk of our research efforts into continuous-time modeling of dimensional affect and social behavior.
Having identified a gap in the literature, namely the lack of data containing annotations of social attitudes in continuous time and scale, we first curate a new audio-visual database of multi-party conversations from political debates, annotated frame-by-frame in terms of real-valued conflict intensity, and use it to conduct the first study on continuous-time conflict intensity estimation. Our experimental findings corroborate previous evidence indicating the inability of existing classifiers to capture the hidden temporal structure of affective and behavioral displays. We present a novel dynamic behavior analysis framework which models temporal dynamics explicitly, based on the natural assumption that continuous-time annotations of smoothly-varying affect or behavior can be viewed as outputs of a low-complexity linear dynamical system when behavioral cues (features) act as system inputs. A novel robust structured rank minimization framework is proposed to estimate the system parameters in the presence of gross corruptions and partially missing data. Experiments on prediction of dimensional conflict and affect, as well as multi-object tracking from detection, validate the effectiveness of our predictive framework and demonstrate for the first time that complex human behavior and affect can be learned and predicted from small training sets of person(s)-specific observations.
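The assumption that annotations are outputs of a low-complexity linear system driven by behavioral features can be illustrated with an ordinary least-squares fit of a short impulse response; `fit_fir` is a simplified stand-in for the thesis's robust structured rank-minimization estimator and assumes scalar, noise-free signals:

```python
import numpy as np

def fit_fir(u, y, p=5):
    """Least-squares fit of y_t ~ sum_{k=0}^{p-1} h_k * u_{t-k}.

    u: input signal (behavioral feature track)
    y: output signal (continuous-time annotation track)
    p: impulse response length (model complexity)
    """
    T = len(y)
    # Toeplitz-style regressor matrix of delayed inputs (zero-padded)
    U = np.column_stack(
        [np.concatenate([np.zeros(k), u[:T - k]]) for k in range(p)]
    )
    h, *_ = np.linalg.lstsq(U, y, rcond=None)
    return h
```

Unlike plain least squares, the thesis's estimator must additionally cope with gross outliers and missing entries in both u and y, which is what motivates the robust structured rank minimization.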
Motion-based vision methods and their applications
Motion detection is a basic operation frequently used in computer vision, whether for pedestrian detection, anomaly detection, video scene analysis, or real-time object tracking. Although a very large number of papers have been published on the subject, several questions remain open. For example, it is still unclear how to detect moving objects in videos containing hard-to-handle situations such as large background motion and illumination changes. Moreover, there is no consensus on how to quantify the performance of motion detection methods. It is also often difficult to incorporate motion information into high-level operations such as pedestrian detection.
In this thesis, I address four problems related to motion detection:
1. How can motion detection methods be evaluated effectively? To answer this question, we set up an evaluation procedure for such methods. This led to the creation of the world's largest fully (100%) annotated dataset dedicated to motion detection and to the organisation of an international competition (CVPR 2014). I also explored different evaluation metrics as well as strategies for combining motion detection methods.
2. Manually annotating every moving object in a large number of videos is an immense challenge when building a video analysis dataset. Although automatic and semi-automatic segmentation methods exist, they are never accurate enough to produce ground-truth-quality results. To solve this problem, we proposed an interactive deep-learning-based method for segmenting moving objects. The results obtained are as accurate as those produced by a human while being 40 times faster.
3. Pedestrian detection methods are very often used in video analysis. Unfortunately, they sometimes suffer from a large number of false positives or false negatives, depending on how the method's parameters are tuned. To improve the performance of pedestrian detection methods, we proposed a nonlinear filter based on motion detection that greatly reduces the number of false positives.
4. Background initialization is the process of recovering the background image of a video without its moving objects. Although a large number of methods have been proposed, as with motion detection there was no dataset or evaluation procedure for such methods. We therefore built the world's largest dataset for this type of application and organised an international competition (ICPR 2016).
Abstract: Motion detection is a basic video analytic operation on which many high-level computer vision tasks are built, e.g., pedestrian detection, anomaly detection, scene understanding and object tracking. Even though a large number of motion detection methods have been proposed in recent decades, some important questions remain unanswered, including: (1) how to separate the foreground from the background accurately, even under extremely challenging circumstances; (2) how to evaluate different motion detection methods; and (3) how to use the motion information extracted by motion detection to improve high-level computer vision tasks?
In this thesis, we address four problems related to motion detection:
1. How can we benchmark motion detection methods, and on which videos? Current datasets are either too small with a limited number of scenarios, or only provide bounding-box ground truth that indicates the rough location of foreground objects. As a solution, we built the largest and most objective motion detection dataset in the world, with pixel-accurate ground truth, to evaluate and compare motion detection methods. We also explore various evaluation metrics as well as different combination strategies.
2. Providing pixel-accurate ground truth is a huge challenge when building a motion detection dataset. While automatic labeling methods suffer from too high a false detection rate to be used as ground truth, manual labeling of hundreds of thousands of frames is extremely time-consuming. To solve this problem, we proposed an interactive deep learning method for segmenting moving objects from videos. The proposed method can reach human-level accuracy while lowering the labeling time by a factor of 40.
3. Pedestrian detectors always suffer from either false positive or false negative detections, depending on the parameter tuning. Unfortunately, manual adjustment of parameters for a large number of videos is not feasible in practice. In order to make pedestrian detectors more robust across a large variety of videos, we combined motion detection with various state-of-the-art pedestrian detectors. This is done by a novel motion-based nonlinear filtering process which improves detectors by a significant margin.
4. Scene background initialization is the process by which a method tries to recover the RGB background image of a video without foreground objects in it. However, one of the reasons that background modeling is challenging is that there is no good dataset and benchmarking framework to estimate the performance of background modeling methods. To fix this problem, we produced an extensive survey as well as a novel benchmarking framework for scene background initialization.
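Per-pixel evaluation of motion detection masks, in the spirit of the CDNET protocol mentioned above, reduces to counting pixel-level agreements between a predicted mask and the ground truth; this sketch omits CDNET's handling of "unknown" ground-truth labels:

```python
import numpy as np

def cd_metrics(pred, gt):
    """Precision, recall and F-measure for boolean foreground masks."""
    tp = np.sum(pred & gt)     # foreground correctly detected
    fp = np.sum(pred & ~gt)    # background wrongly detected as foreground
    fn = np.sum(~pred & gt)    # foreground missed
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f
```

Dataset-level scores in such benchmarks are then typically obtained by averaging these per-video (or per-category) measures rather than pooling all pixels, so that short and long sequences weigh equally.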