
    Multi-set canonical correlation analysis for 3D abnormal gait behaviour recognition based on virtual sample generation

    Small sample datasets and two-dimensional (2D) approaches are challenges for vision-based abnormal gait behaviour recognition (AGBR). The lack of three-dimensional (3D) structure of the human body limits 2D-based methods in abnormal gait virtual sample generation (VSG). In this paper, 3D AGBR based on VSG and multi-set canonical correlation analysis (3D-AGRBMCCA) is proposed. First, unstructured point cloud data of gait are obtained using a structured light sensor. A 3D parametric body model is then deformed to fit the point cloud data in both shape and posture, and the point cloud features are converted to a high-level structured representation of the body. The parametric body model is used for VSG based on the estimated body pose and shape data: symmetry virtual samples, pose-perturbation virtual samples and various body-shape virtual samples with multiple views are generated to extend the training samples. The spatial-temporal features of the abnormal gait behaviour from different views, body pose and shape parameters are then extracted by a convolutional neural network-based Long Short-Term Memory (CNN-LSTM) network. These are projected onto a uniform pattern space using deep-learning-based multi-set canonical correlation analysis. Experiments on four publicly available datasets show the proposed system performs well under various conditions.
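The final projection step above rests on classical multi-set canonical correlation analysis, which this paper implements with deep networks. As a rough illustration of the underlying objective only (not the authors' deep variant), the MAXVAR formulation can be solved as a generalized eigenproblem; the NumPy sketch below uses illustrative names and a small regularizer as assumptions:

```python
import numpy as np

def multiset_cca(views, reg=1e-6):
    """MAXVAR multi-set CCA: one projection vector per view, chosen so
    the projected signals are maximally correlated across all views.
    Solves C w = lam * D w, where C is the covariance of the
    concatenated views and D its block-diagonal (within-view) part.
    views: list of (n, d_k) arrays."""
    Xs = [X - X.mean(axis=0) for X in views]
    Z = np.hstack(Xs)
    n = Z.shape[0]
    C = Z.T @ Z / n
    dims = [X.shape[1] for X in Xs]
    D = np.zeros_like(C)
    i = 0
    for X, d in zip(Xs, dims):
        D[i:i + d, i:i + d] = X.T @ X / n + reg * np.eye(d)
        i += d
    # reduce the generalized problem to an ordinary symmetric one
    evals, evecs = np.linalg.eigh(D)
    D_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
    lam, V = np.linalg.eigh(D_inv_sqrt @ C @ D_inv_sqrt)
    w = D_inv_sqrt @ V[:, -1]          # top generalized eigenvector
    ws, i = [], 0
    for d in dims:
        ws.append(w[i:i + d])
        i += d
    return lam[-1], ws                 # lam[-1] <= number of views
```

For K views, the top eigenvalue approaches K when the projected signals are perfectly correlated, which is one way to read how well a shared gait pattern space has been found.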

    Gait recognition and understanding based on hierarchical temporal memory using 3D gait semantic folding

    Gait recognition and understanding systems have shown a wide-ranging application prospect. However, their use of unstructured data from images and video has limited their performance, e.g., they are easily influenced by multiple views, occlusion, clothing, and object-carrying conditions. This paper addresses these problems using realistic 3-dimensional (3D) human structural data and a sequential pattern learning framework with a top-down attention modulating mechanism based on Hierarchical Temporal Memory (HTM). First, an accurate 2-dimensional (2D) to 3D human body pose and shape semantic parameter estimation method is proposed, which exploits the advantages of an instance-level body parsing model and a virtual dressing method. Second, by using gait semantic folding, the estimated body parameters are encoded using a sparse 2D matrix to construct the structural gait semantic image. To achieve time-based gait recognition, an HTM network is constructed to obtain the sequence-level gait sparse distribution representations (SL-GSDRs). A top-down attention mechanism is introduced to deal with various conditions, including multiple views, by refining the SL-GSDRs according to prior knowledge. The proposed gait learning model not only aids gait recognition tasks in overcoming the difficulties of real application scenarios but also provides structured gait semantic images for visual cognition. Experimental analyses on the CMU MoBo, CASIA B, TUM-IITKGP, and KY4D datasets show a significant performance gain in terms of accuracy and robustness.
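Gait semantic folding encodes body parameters as sparse binary matrices whose similarity can be read off as bit overlap. The toy encoder below is a hypothetical, much-simplified stand-in for such a sparse distributed representation (the paper's actual encoding is far richer): nearby parameter values share many active bits, distant ones share none.

```python
import numpy as np

def scalar_to_sdr(x, lo=0.0, hi=1.0, size=128, active=16):
    """Hypothetical scalar encoder in the spirit of HTM-style SDRs:
    map x in [lo, hi] to a binary vector with a contiguous run of
    `active` ones whose position tracks x. All sizes are illustrative."""
    x = min(max(x, lo), hi)
    start = int(round((x - lo) / (hi - lo) * (size - active)))
    sdr = np.zeros(size, dtype=np.uint8)
    sdr[start:start + active] = 1
    return sdr

def overlap(a, b):
    """Overlap score: count of shared active bits between two SDRs."""
    return int(np.sum(a & b))
```

Because similarity degrades gracefully with parameter distance, representations like this tolerate the noise in estimated pose and shape parameters.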

    Depth Images-based Human Detection, Tracking and Activity Recognition Using Spatiotemporal Features and Modified HMM

    Human activity recognition using depth information is an emerging and challenging technology in computer vision, attracting considerable attention from many practical applications such as smart home/office systems, personal health care and 3D video games. This paper presents a novel framework for 3D human body detection, tracking and recognition from depth video sequences using spatiotemporal features and a modified HMM. To detect the human silhouette, raw depth data is examined by considering spatial continuity and constraints of human motion information, while frame differencing is used to track human movements. The feature extraction mechanism combines spatial depth shape features and temporal joint features to improve classification performance; both are fused together to recognize different activities using the modified hidden Markov model (M-HMM). The proposed approach is evaluated on two challenging depth video datasets. Moreover, our system handles rotation and missing body parts of the subject, a major contribution to human activity recognition.
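The abstract does not detail the modification to the HMM, but the classification step it describes amounts to scoring a feature sequence under per-activity models and picking the most likely. A minimal scaled forward-algorithm likelihood for a plain discrete-observation HMM (a generic sketch, not the paper's M-HMM) might look like:

```python
import numpy as np

def forward_loglik(obs, pi, A, B):
    """Scaled forward algorithm: log-likelihood of a discrete
    observation sequence under an HMM.
    obs: sequence of symbol indices
    pi:  (S,) initial state probabilities
    A:   (S, S) transition matrix, A[i, j] = P(state j | state i)
    B:   (S, V) emission matrix,  B[s, v] = P(symbol v | state s)"""
    alpha = pi * B[:, obs[0]]
    ll = np.log(alpha.sum())
    alpha /= alpha.sum()               # rescale to avoid underflow
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
        s = alpha.sum()
        ll += np.log(s)
        alpha /= s
    return ll

def classify(obs, models):
    """Pick the activity whose HMM assigns the sequence the highest
    log-likelihood. models: dict label -> (pi, A, B)."""
    return max(models, key=lambda k: forward_loglik(obs, *models[k]))
```

In practice the quantized spatiotemporal features would play the role of the observation symbols.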

    View and clothing invariant gait recognition via 3D human semantic folding

    A novel 3-dimensional (3D) human semantic folding is introduced to provide a robust and efficient gait recognition method which is invariant to camera view and clothing style. The proposed gait recognition method comprises three modules: (1) a 3D body pose, shape and viewing data estimation network (3D-BPSVeNet); (2) a gait semantic parameter folding model; and (3) a gait semantic feature refining network. First, 3D-BPSVeNet is constructed based on a convolution gated recurrent unit (ConvGRU) to extract 2-dimensional (2D) to 3D body pose and shape semantic descriptors (2D-3D-BPSDs) from a sequence of gait-parsed RGB images. A 3D gait model with virtual dressing is then constructed by morphing the template of a 3D body model using the estimated 2D-3D-BPSDs and the recognized clothing styles. More accurate 2D-3D-BPSDs without clothes are then obtained by using the silhouette similarity function when updating the 3D body model to fit the 2D gait. Second, the intrinsic 2D-3D-BPSDs without interference from clothes are encoded by sparse distributed representation (SDR) to gain the binary gait semantic image (SD-BGSI) in a topographical semantic space. By averaging the SD-BGSIs in a gait cycle, a gait semantic folding image (GSFI) is obtained to give a high-level representation of gait. Third, a gait semantic feature refining network is trained to refine the semantic feature extracted directly from the GSFI using three types of prior knowledge, i.e., viewing angles, clothing styles and carrying conditions. Experimental analyses on the CMU MoBo, CASIA B, KY4D, OU-MVLP and OU-ISIR datasets show a significant performance gain in gait recognition in terms of accuracy and robustness.

    3-D Human Action Recognition by Shape Analysis of Motion Trajectories on Riemannian Manifold

    Recognizing human actions in 3D video sequences is an important open problem that is currently at the heart of many research domains, including surveillance, natural interfaces and rehabilitation. However, designing action recognition models that are both accurate and efficient is a challenging task due to the variability of human pose, clothing and appearance. In this paper, we propose a new framework to extract a compact representation of a human action captured through a depth sensor and enable accurate action recognition. The proposed solution builds on fitting a human skeleton model to the acquired data so as to represent the 3D coordinates of the joints and their change over time as a trajectory in a suitable action space. Thanks to such a 3D joint-based framework, the proposed solution is capable of capturing both the shape and the dynamics of the human body simultaneously. The action recognition problem is then formulated as computing the similarity between the shapes of trajectories in a Riemannian manifold. Classification using kNN is finally performed on this manifold, taking advantage of Riemannian geometry in the open curve shape space. Experiments are carried out on four representative benchmarks to demonstrate the potential of the proposed solution in terms of accuracy and latency for low-latency action recognition. Comparative results with state-of-the-art methods are reported.
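The core idea of comparing joint-trajectory shapes invariantly to rigid motion, then classifying with kNN, can be approximated in a few lines. The sketch below deliberately substitutes a flat Procrustes-aligned Frobenius distance for the paper's Riemannian open-curve shape-space metric, so it only illustrates the pipeline's structure:

```python
import numpy as np

def procrustes_dist(X, Y):
    """Rotation-, translation- and scale-invariant distance between two
    trajectory matrices (frames-by-3 stacks of joint coordinates):
    center, normalize, align Y to X with the optimal rotation
    (orthogonal Procrustes), then take the residual."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    X = X / np.linalg.norm(X)
    Y = Y / np.linalg.norm(Y)
    U, _, Vt = np.linalg.svd(Y.T @ X)
    R = U @ Vt                        # argmin_R ||Y R - X||_F
    return np.linalg.norm(X - Y @ R)

def knn_classify(query, gallery, labels):
    """1-NN action classification under the Procrustes distance."""
    d = [procrustes_dist(query, G) for G in gallery]
    return labels[int(np.argmin(d))]
```

A Riemannian treatment would replace `procrustes_dist` with a geodesic distance in the curve shape space, which better respects the trajectory geometry.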

    ๋‹จ์ผ ์ด๋ฏธ์ง€๋กœ๋ถ€ํ„ฐ ์—ฌ๋Ÿฌ ์‚ฌ๋žŒ์˜ ํ‘œํ˜„์  ์ „์‹  3D ์ž์„ธ ๋ฐ ํ˜•ํƒœ ์ถ”์ •

    Thesis (Ph.D.) -- Seoul National University Graduate School: College of Engineering, Dept. of Electrical and Computer Engineering, 2021. 2. Kyoung Mu Lee. Humans are the most central and interesting objects in our lives: many human-centric techniques and studies have been proposed by both industry and academia, such as motion capture and human-computer interaction. Recovery of accurate 3D human geometry (i.e., 3D human pose and shape) is a key component of these human-centric techniques and studies. With the rapid spread of cameras, a single RGB image has become a popular input, and many single-RGB-based 3D human pose and shape estimation methods have been proposed. The 3D pose and shape of the whole body, which includes the hands and face, provides expressive and rich information, including human intention and feeling. Unfortunately, recovering the whole-body 3D pose and shape is greatly challenging; thus, it has been attempted by only a few works, called expressive methods. Instead of directly solving expressive 3D pose and shape estimation, the literature has developed methods that recover the 3D pose and shape of each part (i.e., body, hands, and face) separately, called part-specific methods. There are several further simplifications. For example, many works estimate only 3D pose without shape, because additional 3D shape estimation makes the problem much harder. In addition, most works assume a single-person case and do not consider the multi-person case. Therefore, the current literature can be categorized in several ways: 1) part-specific methods vs. expressive methods, 2) 3D human pose estimation methods vs. 3D human pose and shape estimation methods, and 3) methods for a single person vs. methods for multiple persons. The difficulty increases while the outputs become richer when moving from part-specific to expressive, from 3D pose estimation to 3D pose and shape estimation, and from the single-person case to the multi-person case.
This dissertation introduces three approaches towards expressive 3D multi-person pose and shape estimation from a single image; thus, the output can finally provide the richest information. The first approach is for 3D multi-person body pose estimation, the second one is 3D multi-person body pose and shape estimation, and the final one is expressive 3D multi-person pose and shape estimation. Each approach tackles critical limitations of previous state-of-the-art methods, thus bringing the literature closer to the real-world environment. First, a 3D multi-person body pose estimation framework is introduced. In contrast to the single person case, the multi-person case additionally requires camera-relative 3D positions of the persons. Estimating the camera-relative 3D position from a single image involves high depth ambiguity. The proposed framework utilizes a deep image feature with the camera pinhole model to recover the camera-relative 3D position. The proposed framework can be combined with any 3D single person pose and shape estimation methods for 3D multi-person pose and shape. Therefore, the following two approaches focus on the single person case and can be easily extended to the multi-person case by using the framework of the first approach. Second, a 3D multi-person body pose and shape estimation method is introduced. It extends the first approach to additionally predict accurate 3D shape while its accuracy significantly outperforms previous state-of-the-art methods by proposing a new target representation, lixel-based 1D heatmap. Finally, an expressive 3D multi-person pose and shape estimation method is introduced. It integrates the part-specific 3D pose and shape of the above approaches; thus, it can provide expressive 3D human pose and shape. In addition, it boosts the accuracy of the estimated 3D pose and shape by proposing a 3D positional pose-guided 3D rotational pose prediction system. 
    The proposed approaches successfully overcome the limitations of the previous state-of-the-art methods. The extensive experimental results demonstrate the superiority of the proposed approaches in both qualitative and quantitative ways.
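The first approach's key step, recovering a camera-relative 3D root position with the camera pinhole model, rests on two standard relations: back-projecting a pixel at a known depth, and inferring depth from apparent size. The helpers below sketch only those textbook relations (the dissertation's network learns a correction on top of the second); all constants in the usage are illustrative:

```python
import numpy as np

def backproject(u, v, z, fx, fy, cx, cy):
    """Pinhole camera model: pixel (u, v) at depth z maps to the
    camera-frame 3D point (X, Y, Z), given focal lengths (fx, fy)
    and principal point (cx, cy)."""
    return np.array([(u - cx) * z / fx, (v - cy) * z / fy, z])

def depth_from_bbox(fx, fy, area_real, area_bbox_px):
    """Distance estimate from apparent size: an object of known real
    area appears smaller in proportion to its squared distance, so
    z = sqrt(fx * fy * area_real / area_in_pixels)."""
    return np.sqrt(fx * fy * area_real / area_bbox_px)
```

Because a person's true bounding-box area varies with pose, a learned image feature is needed to correct the naive size-based estimate; that is the role the deep feature plays in the proposed framework.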

    ํ™•๋ฅ ์ ์ธ 3์ฐจ์› ์ž์„ธ ๋ณต์›๊ณผ ํ–‰๋™์ธ์‹

    Thesis (Ph.D.) -- Seoul National University Graduate School: Dept. of Electrical and Computer Engineering, 2016. 2. Songhwai Oh. These days, computer vision technology has become popular and plays an important role in intelligent systems such as augmented reality and video and image analysis, to name a few. Although cost-effective depth cameras, like the Microsoft Kinect, have recently been developed, most computer vision algorithms assume that observations are obtained from RGB cameras, which make 2D observations. If, somehow, we can estimate 3D information from 2D observations, it might give better solutions for many computer vision problems. In this dissertation, we focus on estimating 3D information from 2D observations, which is well known as non-rigid structure from motion (NRSfM). More formally, NRSfM finds the three-dimensional structure of an object by analyzing image streams under the assumption that the object lies in a low-dimensional space. However, a human body observed over a long period can exhibit complex shape variations, which makes NRSfM challenging due to the increased degrees of freedom. To handle complex shape variations, we propose a Procrustean normal distribution mixture model (PNDMM) by extending the recently proposed Procrustean normal distribution (PND), which captures the distribution of non-rigid variations of an object by excluding the effects of rigid motion. Unlike existing methods, which use a single model to solve an NRSfM problem, the proposed PNDMM decomposes complex shape variations into a collection of simpler ones, making model learning more tractable and accurate. We perform experiments showing that the proposed method outperforms existing methods on highly complex and long human motion sequences. In addition, we extend the PNDMM to the single-view 3D human pose estimation problem. While recovering the 3D structure of a human body from an image is important, it is a highly ambiguous problem due to the deformation of an articulated human body.
    Moreover, before estimating a 3D human pose from a 2D human pose, it is important to obtain an accurate 2D human pose. To address the inaccuracy of 2D pose estimation on a single image and 3D human pose ambiguities, we estimate multiple 2D and 3D human pose candidates and select the best one that can be explained by a 2D human pose detector and a 3D shape model. We also introduce a model transformation incorporated into the 3D shape prior model, so that the proposed method can be applied to a novel test image. Experimental results show that the proposed method provides good 3D reconstruction results when tested on a novel test image, despite inaccuracies of 2D part detections and 3D shape ambiguities. Finally, we handle the action recognition problem from a video clip. Current studies show that high-level features obtained from estimated 2D human poses enable action recognition performance beyond current state-of-the-art methods using low- and mid-level features based on appearance and motion, despite the inaccuracy of human pose estimation. Based on these findings, we propose an action recognition method using estimated 3D human pose information, since the proposed PNDMM is able to reconstruct 3D shapes from 2D shapes. Experimental results show that 3D pose-based descriptors are better than 2D pose-based descriptors for action recognition, regardless of the classification method.
    Considering the fact that we use simple 3D pose descriptors based on a 3D shape model learned from 2D shapes, the results reported in this dissertation are promising, and obtaining accurate 3D information from 2D observations remains an important research issue for reliable computer vision systems.
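The Procrustean models in this dissertation build on generalized Procrustes analysis (GPA), which removes rigid motion so that only non-rigid shape variation remains to be modeled. A minimal NumPy version of that alignment loop (a sketch of the classical GPA step only, not the EM-GPA or PNDMM algorithms themselves) is:

```python
import numpy as np

def gpa(shapes, iters=20):
    """Generalized Procrustes analysis: iteratively rotate each shape
    to the current mean shape, removing translation, scale and rotation
    so that only non-rigid variation remains (the quantity a
    Procrustean distribution then models).
    shapes: list of (points, dims) landmark arrays."""
    # remove translation and scale
    S = [x - x.mean(axis=0) for x in shapes]
    S = [x / np.linalg.norm(x) for x in S]
    mean = S[0].copy()
    for _ in range(iters):
        aligned = []
        for x in S:
            # optimal rotation of x onto the mean (orthogonal Procrustes)
            U, _, Vt = np.linalg.svd(x.T @ mean)
            aligned.append(x @ (U @ Vt))
        mean = np.mean(aligned, axis=0)
        mean /= np.linalg.norm(mean)
        S = aligned
    return np.array(S), mean
```

After alignment, the residual spread of the aligned shapes around the mean is exactly the non-rigid variation a PND or PNDMM would fit.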

    Reducing Training Demands for 3D Gait Recognition with Deep Koopman Operator Constraints

    Deep learning research has made many biometric recognition solutions viable, but it requires vast training data to achieve real-world generalization. Unlike other biometric traits, such as face and ear, gait samples cannot be easily crawled from the web to form massive unconstrained datasets. As the human body has been extensively studied for different digital applications, one can rely on prior shape knowledge to overcome data scarcity. This work follows the recent trend of fitting a 3D deformable body model to gait videos using deep neural networks to obtain disentangled shape and pose representations for each frame. To enforce temporal consistency in the network, we introduce a new Linear Dynamical Systems (LDS) module and loss based on Koopman operator theory, which provides unsupervised motion regularization for the periodic nature of gait, as well as a predictive capacity for extending gait sequences. We compare LDS to the traditional adversarial training approach and use the USF HumanID and CASIA-B datasets to show that LDS can obtain better accuracy with less training data. Finally, we also show that our 3D modeling approach is much better than other 3D gait approaches at overcoming viewpoint variation under normal, bag-carrying and clothing-change conditions.
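The LDS module's premise, that gait dynamics admit a linear (Koopman-style) model in a suitable representation, can be illustrated outside any network: fit x_{t+1} ≈ A x_t by least squares over a pose-feature sequence, then roll the model forward to extend the sequence. This is a plain dynamic-mode-decomposition-style sketch, not the paper's learned module:

```python
import numpy as np

def fit_linear_dynamics(X):
    """Least-squares fit of x_{t+1} ~= A @ x_t over a sequence X of
    shape (T, d): a finite-dimensional linear (Koopman-style)
    approximation of the dynamics."""
    X0, X1 = X[:-1], X[1:]
    # rows are states, so solve X0 @ W = X1 and transpose
    W, *_ = np.linalg.lstsq(X0, X1, rcond=None)
    return W.T

def rollout(A, x0, steps):
    """Extend a sequence by iterating the learned dynamics from x0."""
    xs = [x0]
    for _ in range(steps):
        xs.append(A @ xs[-1])
    return np.array(xs[1:])
```

For periodic motion like gait, the eigenvalues of the fitted `A` sit near the unit circle, which is what makes the rollout a sensible way to extend a gait cycle.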

    Activity Representation from Video Using Statistical Models on Shape Manifolds

    Activity recognition from video data is a key computer vision problem with applications in surveillance, elderly care, etc. The problem involves modeling a representative shape which contains significant information about the underlying activity. In this dissertation, we present several approaches for view-invariant activity recognition via modeling shapes on various shape spaces and Riemannian manifolds. The first two parts of this dissertation deal with activity modeling and recognition using tracks of landmark feature points. The motion trajectories of points extracted from objects involved in the activity are used to build deformation shape models for each activity, and these models are used for classification and detection of unusual activities. In the first part of the dissertation, these models are represented by the recovered 3D deformation basis shapes corresponding to the activity, using a non-rigid structure from motion formulation. We use a theory for estimating the amount of deformation for these models from the visual data. We study the special case of ground-plane activities in detail because of its importance in video surveillance applications. In the second part of the dissertation, we propose to model the activity by learning an affine-invariant deformation subspace representation that captures the space of possible body poses associated with the activity. These subspaces can be viewed as points on a Grassmann manifold. We propose several statistical classification models on the Grassmann manifold that capture the statistical variations of the shape data while following the intrinsic Riemannian geometry of these manifolds. The last part of this dissertation addresses the problem of recognizing human gestures from silhouette images. We represent a human gesture as a temporal sequence of human poses, each characterized by a contour of the associated human silhouette.
    The shape of a contour is viewed as a point on the shape space of closed curves and, hence, each gesture is characterized and modeled as a trajectory on this shape space. We utilize the Riemannian geometry of this space to propose template-based and graph-based approaches for modeling these trajectories. The two models are designed to account for the different invariance requirements in gesture recognition, and they also capture the statistical variations associated with the contour data.

    Shape/image registration for medical imaging : novel algorithms and applications.

    This dissertation looks at two categories of registration approaches, shape registration and image registration, and considers their applications in the medical imaging field. Shape registration is an important problem in computer vision, computer graphics and medical imaging. It has been handled in different manners in many applications, such as shape-based segmentation, shape recognition, and tracking. Image registration is the process of overlaying two or more images of the same scene taken at different times, from different viewpoints, and/or by different sensors. Many image processing applications, like remote sensing, fusion of medical images, and computer-aided surgery, need image registration. This study deals with two different applications in the field of medical image analysis. The first is shape-based segmentation of the human vertebral bodies (VBs). The vertebra consists of the VB, spinous process, and other anatomical regions. Spinous processes, pedicles, and ribs should not be included in bone mineral density (BMD) measurements. VB segmentation is not an easy task, since the ribs have similar gray-level information. This dissertation investigates two different segmentation approaches, both of which follow variational shape-based segmentation frameworks. The first approach deals with the two-dimensional (2D) case. It starts by obtaining an initial segmentation using intensity/spatial interaction models; the shape model is then registered to the image domain; finally, the optimal segmentation is obtained by optimizing an energy functional that integrates the shape model with the intensity information. The second is a 3D simultaneous segmentation and registration approach, in which the intensity information is handled by embedding a Willmore flow into the level-set segmentation framework.
    The shape variations are then estimated using a new distance probabilistic model. The experimental results show that the segmentation accuracy of the framework is much higher than that of other alternatives. Applications to BMD measurements of the vertebral body are given to illustrate the accuracy of the proposed segmentation approach. The second application is in the field of computer-aided surgery, specifically ankle fusion surgery. The long-term goal of this work is to apply the technique to ankle fusion surgery to determine the proper size and orientation of the screws used for fusing the bones together. In addition, we try to localize the best bone region in which to fix these screws. To achieve these goals, 2D-3D registration is introduced. The role of 2D-3D registration is to enhance the quality of the surgical procedure in terms of time and accuracy, and it would greatly reduce the need for repeated surgeries, thus saving patients time, expense, and trauma.