This paper presents a novel view-independent
approach to the recognition of human gestures of several
people in low resolution sequences from multiple calibrated
cameras. In contrast to other multi-camera gesture
recognition systems, which classify a fusion of features
extracted from the individual views, our system first fuses
the data into a 3D representation of the scene and then
performs feature extraction and classification. Motion descriptors
introduced by Bobick et al. for 2D data are extended
to 3D, and a set of features based on 3D invariant statistical
moments is computed. Finally, a Bayesian classifier is employed
to perform recognition over a small set of actions. Results
are provided showing the effectiveness of the proposed
algorithm in a SmartRoom scenario.
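The pipeline summarized above (3D data fusion, motion descriptors extended from Bobick et al.'s 2D formulation, 3D invariant statistical moments, and Bayesian classification) could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's exact formulation: the function names are hypothetical, the motion history update rule is the standard 2D one applied per voxel, the scale-normalization exponent follows the usual 3D moment convention, and the classifier assumes diagonal-covariance Gaussians.

```python
import numpy as np

def motion_history_volume(frames, tau=10):
    """Per-voxel extension of the 2D motion history update rule:
    a voxel active in the current frame is set to tau, otherwise
    its value decays by one (floored at zero). Input: a sequence
    of binary voxel occupancy grids (D x H x W)."""
    mhv = np.zeros_like(frames[0], dtype=float)
    for vox in frames:
        mhv = np.where(vox > 0, float(tau), np.maximum(mhv - 1.0, 0.0))
    return mhv / tau  # normalize to [0, 1]

def invariant_moments_3d(vol):
    """Translation- and scale-normalized 3D central moments of
    order 2 as a feature vector (one common invariant-moment
    construction; the paper's exact feature set may differ)."""
    coords = np.indices(vol.shape).astype(float)
    m000 = vol.sum()
    centroid = [(c * vol).sum() / m000 for c in coords]
    feats = []
    for p, q, r in [(2,0,0), (0,2,0), (0,0,2), (1,1,0), (1,0,1), (0,1,1)]:
        mu = ((coords[0] - centroid[0]) ** p *
              (coords[1] - centroid[1]) ** q *
              (coords[2] - centroid[2]) ** r * vol).sum()
        order = p + q + r
        # Scale normalization: divide by m000^(1 + order/3) in 3D.
        feats.append(mu / m000 ** (1.0 + order / 3.0))
    return np.array(feats)

def bayes_classify(x, class_stats, priors):
    """Bayesian decision rule: pick the action class maximizing the
    Gaussian log-likelihood plus log-prior (diagonal covariance
    assumed here for simplicity)."""
    best, best_score = None, -np.inf
    for c, (mean, var) in class_stats.items():
        ll = -0.5 * np.sum((x - mean) ** 2 / var + np.log(2 * np.pi * var))
        score = ll + np.log(priors[c])
        if score > best_score:
            best, best_score = c, score
    return best
```

In this sketch the voxel grids would come from the multi-camera data fusion step (e.g. shape-from-silhouette over the calibrated views), and the per-class means, variances, and priors would be estimated from labeled training sequences.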