To improve the accuracy of video-based object detection, the proposed multi-modal video surveillance system takes advantage of the different kinds of information provided by visual, thermal and/or depth imaging sensors.
 
The multi-modal object detector of the system can be split into two consecutive parts: the registration and the coverage analysis. The multi-modal image registration is performed using a three-step silhouette-mapping algorithm which detects the rotation, scale and translation between moving objects in the visual, (thermal) infrared and/or depth images. First, moving object silhouettes are extracted to separate the calibration objects, i.e., the foreground, from the static background. Key components are dynamic background subtraction, foreground enhancement and automatic thresholding. Then, 1D contour vectors are generated from the resulting multi-modal silhouettes using silhouette boundary extraction, Cartesian to polar transform and radial vector analysis. Next, to retrieve the rotation angle and the scale factor between the multi-sensor images, these contours are mapped onto each other using circular cross correlation and contour scaling. Finally, the translation between the images is calculated using maximization of binary correlation.
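
As a rough illustration of the contour-mapping step, the following minimal sketch turns silhouette boundary points into a 1D centroid-distance signal, finds the circular shift that best aligns two such signals, and takes the median ratio of the aligned signals as the scale factor. It is not the authors' implementation: the function names, the 360-bin angular sampling and the brute-force correlation are assumptions made for illustration only.

```python
import math
from statistics import median


def centroid_distance(contour, n_bins=360):
    """Turn boundary points (x, y) into a 1D contour centroid distance signal."""
    cx = sum(x for x, _ in contour) / len(contour)
    cy = sum(y for _, y in contour) / len(contour)
    ccd = [0.0] * n_bins
    for x, y in contour:
        # Cartesian to polar transform around the silhouette centroid.
        angle = math.atan2(y - cy, x - cx) % (2.0 * math.pi)
        k = int(angle / (2.0 * math.pi) * n_bins) % n_bins
        # Keep the outermost boundary point per angular bin (assumed rule).
        ccd[k] = max(ccd[k], math.hypot(x - cx, y - cy))
    return ccd


def circular_shift(ccd_a, ccd_b):
    """Return the shift s (in bins) that maximizes sum_k a[k] * b[(k + s) % n]."""
    n = len(ccd_a)

    def score(s):
        return sum(ccd_a[k] * ccd_b[(k + s) % n] for k in range(n))

    # Brute-force circular cross correlation; an FFT would be faster.
    return max(range(n), key=score)


def median_scale(ccd_a, ccd_b, shift):
    """Median of b/a over the aligned signals, ignoring empty angular bins."""
    n = len(ccd_a)
    ratios = [ccd_b[(k + shift) % n] / ccd_a[k]
              for k in range(n)
              if ccd_a[k] > 0.0 and ccd_b[(k + shift) % n] > 0.0]
    return median(ratios)
```

With `n_bins` angular bins, a shift of `s` bins corresponds to a rotation of `s * 360 / n_bins` degrees. The median is used rather than the mean so that a few badly matched contour points do not dominate the scale estimate.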

The silhouette coverage analysis also starts with moving object silhouette extraction. Then, it uses the registration information, i.e., rotation angle, scale factor and translation vector, to map the thermal, depth and visual silhouette images on each other. Finally, the coverage of the resulting multi-modal silhouette map is computed and is analyzed over time to reduce false alarms and to improve object detection.
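
The mapping step can be pictured with a small sketch that applies a rotation angle, a scale factor and a translation vector to silhouette pixel coordinates. The order of the operations, the rotation about the coordinate origin and the names `map_point` and `map_silhouette` are assumptions for illustration; the source does not specify these details.

```python
import math


def map_point(x, y, angle_deg, scale, tx, ty):
    """Rotate (x, y) about the origin, scale it, then translate it."""
    a = math.radians(angle_deg)
    xr = scale * (x * math.cos(a) - y * math.sin(a)) + tx
    yr = scale * (x * math.sin(a) + y * math.cos(a)) + ty
    return xr, yr


def map_silhouette(points, angle_deg, scale, tx, ty):
    """Map silhouette pixel coordinates and round them to the pixel grid."""
    mapped = set()
    for x, y in points:
        xr, yr = map_point(x, y, angle_deg, scale, tx, ty)
        mapped.add((round(xr), round(yr)))
    return mapped
```

Once both silhouettes live in the same pixel grid, their overlap can be counted directly, which is what the coverage computation relies on.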
 
Prior experiments on real-world multi-sensor video sequences indicate that automated multi-modal video surveillance is promising. This paper shows that merging information from multi-modal video further improves the detection results.

De Potter, Pieterjan

Hollemeersch, Charles

Lambert, Peter

Poppe, Chris

Van de Walle, Rik

Van Hoecke, Sofie

Verstockt, Steven

English

Ghent University Academic Bibliography

This item is the archived peer-reviewed author version of:

Silhouette Coverage Analysis for Multi-modal Video Surveillance
S. Verstockt (1, 2), C. Poppe (1), P. De Potter (1), C. Hollemeersch (1), S. Van Hoecke (2), P. Lambert (1), and R. Van de Walle (1)
(1) ELIS Department, Multimedia Lab, Ghent University, IBBT, Belgium
(2) ELIT Lab, University College West Flanders, Ghent University Association, Belgium

To refer to or to cite this work, please use the citation to the published version: Steven Verstockt, Chris Poppe, Pieterjan De Potter, Charles Hollemeersch, Sofie Van Hoecke, Peter Lambert, Rik Van de Walle (2011). Silhouette Coverage Analysis for Multi-modal Video Surveillance. Proceedings of the 29th Progress in Electromagnetics Research Symposium (PIERS), Marrakesh, Morocco, March 20–23, 2011, pp. 1279-1283.

1. INTRODUCTION

The growing demand for security has given rise to the increased use of video surveillance systems in recent years.
Surveillance cameras are rapidly appearing in all sorts of places, and a huge number of visual object detection algorithms, which automatically process these camera images, have been proposed in the literature. However, due to the variability of shape, motion, colors, and patterns of moving objects, and also due to the dynamic character of the background, many of these visual object detectors are still vulnerable to false and missed detections. To avoid the disadvantages of using visual sensors alone, we believe the use of other types of imagery, e.g., thermal infrared (IR) and Time-of-Flight (ToF) depth images, can be of added value. The combination of these types of imagery yields information about the scene that is rich in color, motion, depth and/or thermal detail. Once such information is registered, i.e., aligned with each other, it can be used to improve detection performance and activity analysis in the scene. Since each type of sensor has its own detection limitations, misdetections in one sensor can be corrected by the other sensors. As such, the combination of multi-sensor information is considered to be a win-win.

In order to combine multi-modal images, it is required that the corresponding objects in the scene are aligned, or registered. The goal of registration is to establish geometric correspondence between the images so that they may be transformed, compared, and analyzed in a common reference frame. Features commonly used for multi-sensor registration are edges, corners, and contours [1]. Since contours representing the region boundaries are preserved in most cases, object silhouettes form the most reliable correspondence between objects in color, thermal and/or depth image pairs [2]. For this reason, they are also used in our multi-modal video surveillance system.

The remainder of this paper is organized as follows.
Section 2 gives a global description of the silhouette-based registration of multi-modal images, which is based on moving object silhouette extraction, contour vector generation, contour mapping and binary correlation. As an example, the registration of visual and long-wave infrared (LWIR) images is shown. Subsequently, Section 3 discusses the silhouette coverage analysis, i.e., the multi-modal merging of the detection results from the visual, thermal and/or depth image sensors. Using two use cases, i.e., a shadow removal and a smoke detection experiment, we show how the coverage analysis of multi-modal images can be used to obtain better object detection results than either sensor alone. Next, in Section 4, we provide details of the experimental setup. Finally, Section 5 ends this paper with the conclusions.

2. SILHOUETTE-BASED REGISTRATION OF MULTI-MODAL IMAGES

The multi-modal image registration (Fig. 1) starts with a moving object silhouette extraction [2] to separate the calibration objects, i.e., the moving foreground, from the static background. Key components are the dynamic background subtraction, automatic thresholding and (iterative) morphological filtering. The dynamic background subtraction [3] extracts the moving foreground (FG) out of the visual and thermal video frames using a visual background estimation, which is updated dynamically. By subtracting everything in the scene that remains constant over time, i.e., the background, from the frames, only the moving part of those images remains. After this background subtraction, the resulting foreground images are thresholded automatically using automatic gamma correction, (adaptive) k-means clustering and morphological filtering with growing structuring elements, which grow iteratively until the resulting silhouette is suitable for multi-modal silhouette matching.
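
The idea behind the background subtraction can be sketched in a few lines. This is a deliberately simplified stand-in, not the method of [3] or of this paper: it assumes a plain running-average background model and a fixed threshold, whereas the text above describes gamma correction, k-means clustering and iterative morphological filtering. The names `update_background` and `extract_silhouette` and the parameters `alpha` and `threshold` are hypothetical.

```python
def update_background(background, frame, alpha=0.95):
    """Blend the new frame into the background estimate (running average)."""
    return [[alpha * b + (1.0 - alpha) * f for b, f in zip(b_row, f_row)]
            for b_row, f_row in zip(background, frame)]


def extract_silhouette(background, frame, threshold=30.0):
    """Return a binary mask: 1 where the frame differs from the background."""
    return [[1 if abs(f - b) > threshold else 0 for b, f in zip(b_row, f_row)]
            for b_row, f_row in zip(background, frame)]
```

Frames and backgrounds are assumed to be lists of rows of gray values. A larger `alpha` makes the background adapt more slowly, so briefly stationary objects stay in the foreground for longer.
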
The combination of all these steps achieves favorable results, as is shown by the visual and the LWIR silhouette extraction in our experiments (Fig. 4). Similar results can be expected for ToF depth silhouette extraction.

After the silhouettes are extracted, registration of both images is performed using a three-step registration algorithm. As in [4], the registration algorithm assumes that the geometric transformation between the multi-sensor images is a rigid transformation, which can be decomposed into a 2D rotation, scaling and translation. To estimate each of these three geometric parameters, the contours and the correlation of the visual and thermal silhouettes are analyzed. First, the rotation is computed using silhouette contour extraction and circular cross correlation [5], which analyzes the translation between the 1D contour centroid distance (CCD) signals of both silhouettes. As such, the 2D silhouette matching problem is converted to a one-dimensional signal matching problem. After rotating, the scale factor between both views is estimated by analyzing the ratio of the thermal and visual aligned CCDs. Since the thermal-visual CCD ratios are not constant and show some kind of disorder, the median ratio is chosen as an adequate scale factor. Finally, the translation vector is estimated using the binary correlation technique proposed by Chen et al. [2], which is based on template matching in the frequency domain. As the registration results in Fig. 2 show, the proposed registration algorithm is able to coarsely map visual and thermal object silhouettes.

Figure 1: Silhouette-based image registration of thermal and visual images.
Figure 2: Experimental results of LWIR-visual registration.

3. SILHOUETTE COVERAGE ANALYSIS

The silhouette coverage analysis (Fig. 3) also starts with the moving object silhouette extraction, which was already discussed in the previous section.
Then, it uses the registration information, i.e., rotation angle, scale factor and translation vector, to map the thermal and visual silhouette images onto each other. As soon as this mapping is finished, the combined LWIR-visual silhouette map is analyzed over time using a temporal coverage analysis algorithm. Depending on the video surveillance application for which the multi-modal analysis is used, this silhouette coverage analysis (SCA) can be performed in different ways. In the following subsections, two exemplary use cases of the SCA are given. In the first use case, the SCA is used for shadow removal in visual images. In the second use case, the SCA is used as a first warning method for smoke detection.

Figure 3: Silhouette coverage analysis.
Figure 4: Experimental results of silhouette coverage analysis for (a) shadow removal and (b) smoke detection.
Figure 5: "Car park fire [8]" test results of SCA-based smoke detection.

3.1. Use Case 1: Shadow Removal

Shadows are a major problem for visual surveillance applications and reduce the accuracy of the system. Since shadows do not occur in thermal or ToF depth images, both types of imagery can be used to discard them in visual images. This is also shown by the first experiment in Fig. 4(a). In this experiment, the multi-modal SCA is used to count the number of people in a room. Due to their shadows, the visual silhouettes of both persons overlap in the visual images. Without the LWIR-visual SCA, a visual people counter could miscount the number of people as 1. By using the LWIR-visual SCA we can correct this mistake. As can be seen in Fig. 4(a), the registered visual and thermal silhouettes do not overlap in the shadow regions, i.e., the gray regions.
As such, by only counting the regions which occur in both thermal and visual images, i.e., the white regions, and by analyzing whether these regions are stable over time, the SCA results in a more robust and efficient people counter. Similar results are expected with visual-ToF depth SCA analysis. The bounding boxes shown in the figure were created by calculating the smallest enclosing rectangle (whose sides are parallel to the x and y axes) around the common, i.e., white, visual-LWIR regions.

3.2. Use Case 2: Smoke Detection

Although smoke is almost transparent in LWIR images, we can make use of its absence to detect it. Since ordinary moving objects, such as people and cars, produce similar silhouettes in background-subtracted visual and thermal IR images, the coverage between these images is quasi-constant. This can also be seen in the coverage graph of experiment 1: the coverage for the moving people stays quasi-constant over all the frames. Smoke, in contrast, will only be detected in the visual images, and as such the coverage will start to decrease (Fig. 4(b)). This decrease can be detected using a sequence- and scene-independent technique based on slope analysis of the linear fit, i.e., the trend line, over the most recent silhouette coverage values. If the slope of this trend line is negative and decreases continuously, a smoke warning is given. Due to its dynamic character, the visual silhouettes of a smoke region will also show a high degree of turbulence [6]. By focusing on both the visible-invisible character of smoke and its visual disorder, a multi-sensor detector can detect smoke very accurately. Compared to the results of any individual detector in [7], the 2-phase multi-sensor smoke detector is able to detect the smoke more accurately, i.e., with fewer missed detections and false alarms. This is also illustrated by the test results of a car park fire [8] in Fig. 5.
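
The coverage and trend-line test described in this subsection could look roughly as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: coverage is taken here as the fraction of visual foreground pixels that are also thermal foreground pixels, and "negative and decreases continuously" is simplified to "the least-squares slope stays below a small negative margin for several consecutive windows". The names and the parameters `window`, `patience` and `min_drop` are hypothetical.

```python
def coverage(visual_mask, thermal_mask):
    """Fraction of visual foreground pixels also present in the thermal mask."""
    overlap = 0
    visual = 0
    for v_row, t_row in zip(visual_mask, thermal_mask):
        for v, t in zip(v_row, t_row):
            visual += v
            overlap += v & t
    # Assumption: with no visual foreground there is nothing left uncovered.
    return overlap / visual if visual else 1.0


def trend_slope(values):
    """Least-squares slope of the values against their frame index."""
    n = len(values)  # needs at least two values
    mean_x = (n - 1) / 2.0
    mean_y = sum(values) / n
    num = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def smoke_warning(coverages, window=5, patience=3, min_drop=1e-3):
    """Warn if the trend slope is clearly negative in the last few windows."""
    if len(coverages) < window + patience - 1:
        return False
    for end in range(len(coverages) - patience + 1, len(coverages) + 1):
        if trend_slope(coverages[end - window:end]) > -min_drop:
            return False
    return True
```

Working on the slope of a fitted line rather than on an absolute coverage value is what keeps the test independent of the sequence and scene, since only the direction of change matters.
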
Due to the low cost of the silhouette coverage analysis and the visual disorder analysis, which is only performed if a smoke warning is given, the algorithm is also less computationally expensive than many of the individual detectors.

4. EXPERIMENTAL SETUP

The multi-modal sequences were acquired by a Xenics Gobi-384 LWIR camera and a CANON MD110 camera, which work in the 8–14 µm spectral range and the visible spectrum, respectively. The Gobi thermal imager has a resolution of 384 × 288 pixels and a frame rate of 28–30 fps. The CANON camera has a resolution of 576 × 720 pixels and a frame rate of 25 fps. In order to cope with the different frame rates and resolutions, and also with the differences in the field of view of the cameras, the multi-modal frames are spatio-temporally registered using temporal frame alignment and the silhouette-based registration proposed in this paper.

5. CONCLUSIONS

Multi-modal video surveillance takes advantage of the different kinds of information represented by thermal, visual and/or depth images in order to accurately detect moving objects. By fusing the different modalities and using the strengths of each medium, object detection can be done more accurately and with fewer false detections, as is shown by two use cases in this paper. Merging information from multiple types of image sensors has, as such, proven to be a win-win.

To detect the presence of objects, the detector analyzes the silhouette coverage of moving objects in multi-modal registered images. In order to register the multi-sensor images, the proposed algorithm analyzes the contours and the correlation of visual and thermal FG silhouettes. First, the rotation is computed using silhouette contour extraction and circular cross correlation. Next, contour scaling is used to estimate the thermal-visual scale factor.
Finally, the translation vector is estimated by maximization of binary correlation.

The geometric parameters found during this registration phase are further used by the detector to coarsely map the silhouette images, and the coverage between them is calculated. Depending on the video surveillance application for which the multi-modal analysis is used, this coverage can then be further used to improve the detection results, as is shown by the people counter and the smoke detection experiment.

Future work will mainly focus on the improvement of the registration results. Currently, only the binary silhouettes of the calibration objects are used to do the registration. We expect that a first improvement can be made by also incorporating their gray-scale information, especially in the translation estimation. As the contour mapping is based on the boundary correspondences, it is not expected that gray-scale information will lead to better results in the rotation and scale estimation. However, further testing is necessary to confirm this. Also, the use of other types of class clustering classifiers will be further investigated.

ACKNOWLEDGMENT

The research activities as described in this paper were funded by Ghent University, the Interdisciplinary Institute for Broadband Technology (IBBT), University College West Flanders, Warrington Fire Ghent, the Institute for the Promotion of Innovation by Science and Technology in Flanders (IWT), the Fund for Scientific Research-Flanders, the Belgian Federal Science Policy Office (BFSPO), and the European Union.

REFERENCES

1. Zitova, B. and J. Flusser, “Image registration methods: A survey,” Image and Vision Computing, Vol. 21, 977–1000, 2003.
2. Chen, H.-M., S. Lee, R. M. Rao, M.-A. Slamani, and P. K. Varshney, “Imaging for concealed weapon detection,” IEEE Signal Processing Magazine, 52–61, March 2005.
3. Toreyin, B. U., Y. Dedeoglu, U. Güdükbay, and A. E. Cetin, “Computer vision based method for real-time fire and flame detection,” Pattern Recognition Letters, Vol. 27, 49–58, 2006.
4. Han, J. and B. Bhanu, “Fusion of color and infrared video for moving human detection,” Pattern Recognition, Vol. 40, 1771–1784, 2007.
5. Hamici, Z., “Real-time pattern recognition using circular cross-correlation: A robot vision system,” International Journal of Robotics and Automation, Vol. 21, 174–183, 2006.
6. Verstockt, S., A. Vanoosthuyse, S. van Hoecke, P. Lambert, and R. van de Walle, “Multi-sensor fire detection by fusing visual and non-visual flame features,” International Conference on Image and Signal Processing (ICISP), 333–341, June 2010.
7. Verstockt, S., B. Merci, B. Sette, P. Lambert, and R. van de Walle, “State of the art in vision-based fire and smoke detection,” 14th International Conference on Automatic Fire Detection, Vol. 2, 285–292, September 2009.
8. Merci, B., “Fire safety and explosion safety in car parks,” http://www.carparkfiresafety.be/.

Silhouette coverage analysis for multi-modal video surveillance

https://biblio.ugent.be/publication/1207419/file/1207432
