19 research outputs found

    Improving Video Segmentation by Fusing Depth Cues and the Visual Background Extractor (ViBe) Algorithm

    Depth-sensing technology has led to broad applications of inexpensive depth cameras that can capture human motion and scenes in three-dimensional space. Background subtraction algorithms can be improved by fusing color and depth cues, thereby allowing many issues encountered in classical color segmentation to be solved. In this paper, we propose a new fusion method that combines depth and color information for foreground segmentation based on an advanced color-based algorithm. First, a background model and a depth model are developed. Then, based on these models, we propose a new updating strategy that can eliminate ghosting and black shadows almost completely. Extensive experiments have been performed to compare the proposed algorithm with other, conventional RGB-D (Red-Green-Blue and Depth) algorithms. The experimental results suggest that our method extracts foregrounds with higher effectiveness and efficiency
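The abstract does not spell out its fusion or update rules, so the following is only a minimal sketch of a ViBe-style sample test applied to a color channel and a depth channel. The OR-style fusion rule, the radii, the `invalid_depth` marker, and all function names are assumptions made for illustration, not the authors' method.

```python
import random


def count_matches(samples, value, radius):
    """Number of stored background samples within `radius` of the observed value."""
    return sum(1 for s in samples if abs(s - value) <= radius)


def is_foreground(samples, value, radius, min_matches=2):
    """Sample-based test: foreground when too few stored samples resemble the value."""
    return count_matches(samples, value, radius) < min_matches


def fused_foreground(color_samples, depth_samples, color, depth,
                     color_radius=20, depth_radius=40, invalid_depth=0):
    """Assumed fusion rule: either cue may flag foreground; color alone decides
    when the depth reading is invalid (a hole in the depth map)."""
    color_fg = is_foreground(color_samples, color, color_radius)
    if depth == invalid_depth:
        return color_fg
    return color_fg or is_foreground(depth_samples, depth, depth_radius)


def conservative_update(samples, value, rng, subsampling=16):
    """For a pixel classified as background: with probability 1/subsampling,
    overwrite one randomly chosen stored sample with the current value."""
    if rng.randrange(subsampling) == 0:
        samples[rng.randrange(len(samples))] = value


color_model = [100, 102, 98, 101]
depth_model = [1500, 1510, 1495, 1505]
# Camouflaged object: same intensity as the background, but much closer to the camera.
print(fused_foreground(color_model, depth_model, 100, 900))
```

In this sketch a camouflaged pixel is recovered by the depth test, while a pixel with a missing depth reading falls back to the color test.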

    Foreground object segmentation in RGB-D data implemented on GPU

    This paper presents a GPU implementation of two foreground object segmentation algorithms, Gaussian Mixture Model (GMM) and Pixel Based Adaptive Segmenter (PBAS), modified to support RGB-D data. Using colour (RGB) and depth (D) data together improves segmentation accuracy, especially under colour camouflage, illumination changes and shadows. Three GPUs were used to accelerate calculations: embedded NVIDIA Jetson TX2 (Maxwell architecture), mobile NVIDIA GeForce GTX 1050m (Pascal architecture) and efficient NVIDIA RTX 2070 (Turing architecture). Segmentation accuracy comparable to previously published works was obtained. Moreover, the GPU platform enabled real-time image processing. In addition, the system has been adapted to work with two RGB-D sensors: RealSense D415 and D435 from Intel. Comment: 12 pages, 4 figures, submitted to KKA 2020 conference

    Self-supervised Multi-Modal Video Forgery Attack Detection

    Video forgery attacks threaten surveillance systems by replacing the video captures with unrealistic synthesis, which can be powered by the latest augmented reality and virtual reality technologies. From the machine perception aspect, visual objects often have RF signatures that are naturally synchronized with them during recording. In contrast to video captures, the RF signatures are more difficult to attack given their concealed and ubiquitous nature. In this work, we investigate multimodal video forgery attack detection methods using both vision and wireless modalities. Since wireless signal-based human perception is environmentally sensitive, we propose a self-supervised training strategy that enables the system to work without external annotation and thus adapt to different environments. Our method achieves a perfect human detection accuracy and a high forgery attack detection accuracy of 94.38%, which is comparable with supervised methods.


    Detection of Visual Attention Regions (VAR), such as moving objects, has become an integral part of many computer vision applications, including pattern recognition, object detection and classification, video surveillance, autonomous driving, and human-machine interaction (HMI). Moving object identification using bounding boxes has matured to the level of localizing objects along their rigid borders, a process called foreground localization (FGL). Over the decades, many image segmentation methodologies have been studied, devised, and extended to suit video FGL. Even so, video foreground (FG) segmentation remains an intriguing and appealing problem because of its ill-posed nature and myriad applications. Maintaining spatial and temporal coherence, particularly at object boundaries, remains challenging and computationally burdensome. It gets harder when the background is dynamic, as with swaying tree branches or shimmering water, when there are illumination variations or shadows cast by the moving objects, or when the video sequences have jittery frames caused by vibrating or unstable camera mounts on a surveillance post or moving robot. At the same time, in the analysis of traffic flow or human activity, the performance of an intelligent system depends substantially on how robustly it localizes the VAR, i.e., the FG. The natural question is how best to deal with these challenges. The goal of this thesis is therefore to investigate plausible real-time implementations, from traditional approaches to modern deep learning (DL) models, for FGL applicable to many video content-aware applications (VCAA). It focuses mainly on improving existing methodologies by harnessing multimodal spatial and temporal cues for a delineated FGL.
The first part of the dissertation is dedicated to enhancing conventional sample-based and Gaussian mixture model (GMM)-based video FGL using a probability mass function (PMF), temporal median filtering, and the fusion of CIEDE2000 color similarity, color distortion, and illumination measures, together with an appropriate adaptive threshold to extract the FG pixels. Subjective and objective evaluations show the improvements over a number of similar conventional methods. The second part of the thesis focuses on exploiting and improving deep convolutional neural networks (DCNN) for the same problem. Three models akin to an encoder-decoder (EnDec) network are implemented with various strategies to improve the quality of the FG segmentation. The strategies include, but are not limited to, double encoding with slow decoding feature learning, multi-view receptive field feature fusion, and incorporating spatiotemporal cues through long short-term memory (LSTM) units in both the subsampling and upsampling subnetworks. Experimental studies are carried out on conditions ranging from baselines to challenging video sequences to assess the effectiveness of the proposed DCNNs. The analysis demonstrates the architectural efficiency over other methods, while quantitative and qualitative experiments show the competitive performance of the proposed models compared to the state-of-the-art.
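Temporal median filtering with an adaptive threshold, as named in the first part of this abstract, can be illustrated roughly as follows. This is a sketch on tiny grayscale frames stored as nested lists; the particular threshold formula (mean plus `k` standard deviations of the absolute difference) and every function name are assumptions, since the abstract does not say how the thesis chooses its threshold.

```python
from statistics import mean, median, pstdev


def median_background(frames):
    """Per-pixel temporal median over equally sized 2-D grayscale frames."""
    height, width = len(frames[0]), len(frames[0][0])
    return [[median(f[y][x] for f in frames) for x in range(width)]
            for y in range(height)]


def adaptive_threshold(frame, background, k=2.0):
    """Assumed rule: mean + k * standard deviation of the absolute differences."""
    diffs = [abs(p - b)
             for row, bg_row in zip(frame, background)
             for p, b in zip(row, bg_row)]
    return mean(diffs) + k * pstdev(diffs)


def foreground_mask(frame, background, threshold):
    """1 where the frame departs from the background by more than the threshold."""
    return [[1 if abs(p - b) > threshold else 0 for p, b in zip(row, bg_row)]
            for row, bg_row in zip(frame, background)]


empty = [[10, 10, 10], [10, 10, 10], [10, 10, 10]]
with_object = [[10, 10, 10], [10, 110, 10], [10, 10, 10]]

background = median_background([empty, with_object, empty])
threshold = adaptive_threshold(with_object, background)
print(foreground_mask(with_object, background, threshold))
```

The median discards the object because it appears in only one of the three frames, so the background stays clean. The value of `k` is a tuning parameter, and on very small or very sparse frames a large `k` can push the threshold above every difference.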

    Moving Object Detection based on RGBD Information

    This thesis addresses moving object detection, specifically background subtraction. We propose two approaches that use color and depth information to solve the background subtraction problem. First, we propose a framework for improving traditional background subtraction techniques. The framework is based on two data types, color and depth: it obtains preliminary background segmentation results from the depth and RGB channels independently, then uses an algorithm to fuse them into the final result. Experiments on the SBM-RGBD dataset using four methods (ViBe, LOBSTER, SuBSENSE, and PAWCS) show that the proposed framework achieves an impressive performance compared to the original state-of-the-art RGB-based techniques. This dissertation also proposes a novel deep learning model called Deep Multi-Scale Network (DMSN) for background subtraction. This convolutional neural network takes RGB color channels and depth maps as inputs, with which it can fuse semantic and spatial information. Compared with previous deep learning background subtraction techniques, which lack information because they use only RGB channels, our RGBD version can overcome most of the drawbacks, especially in some particular challenges. Further, this study introduces a new scene-independent evaluation protocol for the SBM-RGBD dataset, dedicated to deep learning methods, to set up a competitive platform that includes more challenging situations. The proposed method proved its efficiency in solving background subtraction in complex problems at different levels. The experimental results verify that the proposed work outperforms the state-of-the-art on the SBM-RGBD and GSM datasets.
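The two-stage idea in this abstract (segment RGB and depth independently, then fuse the masks) can be sketched as below. The abstract does not describe the actual fusion algorithm, so the rule used here is an assumption: trust the depth mask wherever the sensor returned a valid reading and fall back to the color mask elsewhere. The function and argument names are hypothetical.

```python
def fuse_masks(rgb_mask, depth_mask, depth_valid):
    """Per pixel: take the depth decision where depth is valid, else the color one.

    All three arguments are equally sized 2-D lists of 0/1 values.
    """
    return [[d if v else c for c, d, v in zip(rgb_row, depth_row, valid_row)]
            for rgb_row, depth_row, valid_row
            in zip(rgb_mask, depth_mask, depth_valid)]


rgb_mask = [[1, 0],     # top-right: camouflaged object missed by color
            [1, 1]]     # bottom-left: shadow wrongly marked as foreground
depth_mask = [[1, 1],
              [0, 0]]
depth_valid = [[1, 1],
               [1, 0]]  # bottom-right: hole in the depth map

print(fuse_masks(rgb_mask, depth_mask, depth_valid))
```

With this rule the camouflaged pixel is recovered, the shadow is suppressed, and the pixel with no depth reading keeps its color-based label.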

    People detection and tracking using a network of low-cost depth cameras

    Automatic people detection is a widely adopted technology that has applications in retail stores, crowd management and surveillance. The goal of this work is to create a general-purpose framework for detecting people indoors. First, studies on people detection, tracking and re-identification are reviewed. The emphasis is on people detection from depth images. Furthermore, an approach based on a network of smart depth cameras is presented. The performance is evaluated with four image sequences, totalling over 20,000 depth images. Experimental results show that simple and lightweight algorithms are very useful in practical applications.

    Active and Physics-Based Human Pose Reconstruction

    Perceiving humans is an important and complex problem within computer vision. Its significance is derived from its numerous applications, such as human-robot interaction, virtual reality, markerless motion capture, and human tracking for autonomous driving. The difficulty lies in the variability in human appearance, physique, and plausible body poses. In real-world scenes, this is further exacerbated by difficult lighting conditions, partial occlusions, and the depth ambiguity stemming from the loss of information during the 3d to 2d projection. Despite these challenges, significant progress has been made in recent years, primarily due to the expressive power of deep neural networks trained on large datasets. However, creating large-scale datasets with 3d annotations is expensive, and capturing the vast diversity of the real world is demanding. Traditionally, 3d ground truth is captured using motion capture laboratories that require large investments. Furthermore, many laboratories cannot easily accommodate athletic and dynamic motions. This thesis studies three approaches to improving visual perception, with emphasis on human pose estimation, that can complement improvements to the underlying predictor or training data. The first two papers present active human pose estimation, where a reinforcement learning agent is tasked with selecting informative viewpoints to reconstruct subjects efficiently. The papers discard the common assumption that the input is given and instead allow the agent to move to observe subjects from desirable viewpoints, e.g., those which avoid occlusions and for which the underlying pose estimator has a low prediction error. The third paper introduces the task of embodied visual active learning, which goes further and assumes that the perceptual model is not pre-trained. Instead, the agent is tasked with exploring its environment and requesting annotations to refine its visual model.
Learning to explore novel scenarios and efficiently request annotation for new data is a step towards life-long learning, where models can evolve beyond what they learned during the initial training phase. We study the problem for segmentation, though the idea is applicable to other perception tasks. Lastly, the final two papers propose improving human pose estimation by integrating physical constraints. These regularize the reconstructed motions to be physically plausible and serve as a complement to current kinematic approaches. Whether a motion has been observed in the training data or not, the predictions should obey the laws of physics. Through integration with a physical simulator, we demonstrate that we can reduce reconstruction artifacts and enforce, e.g., contact constraints

    Multiple Object Tracking in Urban Traffic Scenes

    Multiple object tracking (MOT) is an intensively researched area that has evolved and undergone much innovation over the years because of its potential in many applications to improve our quality of life. In our research project, we are specifically interested in applying MOT to urban traffic scenes to extract accurate road user trajectories, for the eventual improvement of road traffic systems that affect people from all walks of life. Our first contribution is the introduction of class label information as part of the features that describe the targets and associate them across frames, to capture their motion as trajectories in a real environment. We capitalize on information from a deep learning detector that extracts objects of interest prior to the tracking procedure, since we were intrigued by the growing popularity and reported good performance of such detectors. However, despite their promising potential in the literature, we found that the results were disappointing in our experiments. The quality of the extracted input, as postulated, critically affects the quality of the final trajectories obtained as tracking output. Nevertheless, we observed that the class label information, along with its confidence score, is invaluable for our application of urban traffic settings, where the types of road users vary widely. Next, we focused our effort on fusing inputs from two different sources in order to obtain a set of objects with a satisfactory level of accuracy to proceed with the tracking stage. At this point, we worked on the integration of the bounding boxes from a learned multi-class object detector and a background subtraction-based method to resolve issues such as fragmentation and redundant representations of the same object.
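One simple way to combine bounding boxes from a learned detector with boxes from background subtraction is to compare them by intersection over union (IoU). The sketch below is an assumption for illustration, not the thesis's integration procedure: it keeps every detector box and adds a background-subtraction box only when no detector box already covers it, which addresses redundant representations but not fragmentation. The function names and the 0.3 threshold are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def fuse_boxes(detector_boxes, bgs_boxes, iou_threshold=0.3):
    """Keep all detector boxes; add background-subtraction boxes that overlap
    no detector box by at least `iou_threshold`."""
    fused = list(detector_boxes)
    for box in bgs_boxes:
        if all(iou(box, d) < iou_threshold for d in detector_boxes):
            fused.append(box)
    return fused


detector_boxes = [(0, 0, 10, 10)]
bgs_boxes = [(1, 1, 10, 10),      # same object as the detector box
             (20, 20, 30, 30)]    # moving object the detector missed
print(fuse_boxes(detector_boxes, bgs_boxes))
```

Here the duplicate box is dropped and the missed object is added, so the tracker receives one box per object.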

    Deep Attention Models for Human Tracking Using RGBD

    Visual tracking performance has long been limited by the lack of better appearance models. These models fail either where they tend to change rapidly, like in motion-based tracking, or where accurate information of the object may not be available, like in color camouflage (where background and foreground colors are similar). This paper proposes a robust, adaptive appearance model which works accurately in situations of color camouflage, even in the presence of complex natural objects. The proposed model includes depth as an additional feature in a hierarchical modular neural framework for online object tracking. The model adapts to the confusing appearance by identifying the stable property of depth between the target and the surrounding object(s). The depth complements the existing RGB features in scenarios when RGB features fail to adapt, hence becoming unstable over a long duration of time. The parameters of the model are learned efficiently in the Deep network, which consists of three modules: (1) The spatial attention layer, which discards the majority of the background by selecting a region containing the object of interest; (2) the appearance attention layer, which extracts appearance and spatial information about the tracked object; and (3) the state estimation layer, which enables the framework to predict future object appearance and location. Three different models were trained and tested to analyze the effect of depth along with RGB information. Also, a model is proposed to utilize only depth as a standalone input for tracking purposes. The proposed models were also evaluated in real-time using KinectV2 and showed very promising results. The results of our proposed network structures and their comparison with the state-of-the-art RGB tracking model demonstrate that adding depth significantly improves the accuracy of tracking in a more challenging environment (i.e., cluttered and camouflaged environments). 
Furthermore, the results of depth-based models showed that depth data can provide enough information for accurate tracking, even without RGB information