
    Visual-Semantic Learning

    Visual-semantic learning is an attractive and challenging research direction that aims to understand the complex semantics of heterogeneous data from two domains, i.e., visual signals (i.e., images and videos) and natural language (i.e., captions and questions). It requires both memorizing the rich information within a single modality and jointly comprehending multiple modalities. Artificial intelligence (AI) systems with human-level intelligence are claimed to learn like humans, e.g., by efficiently leveraging brain memory for better comprehension, rationally incorporating common-sense knowledge into reasoning, quickly gaining in-depth understanding from a few samples, and analyzing relationships among abundant and informative events. These capacities are effortless for humans but challenging for machines.

To bridge the discrepancy between human-level intelligence and present-day visual-semantic learning, we start from the basic understanding ability by studying visual question answering (e.g., Image-QA and Video-QA) from the perspectives of memory augmentation and common-sense knowledge incorporation. Furthermore, we extend to a more challenging setting with limited and partially unlabeled training data (i.e., Few-shot Visual-Semantic Learning) to imitate the fast learning ability of humans. Finally, to further enhance visual-semantic performance in natural videos with numerous spatio-temporal dynamics, we investigate exploiting event-correlated information for a comprehensive understanding of cross-modal semantics.

To study the essential visual-semantic understanding ability of the human brain with memory, we first propose a novel Memory Augmented Deep Recurrent Neural Network (MA-DRNN) model for Video-QA, which features a new method for encoding videos and questions, and memory augmentation using the emerging Differentiable Neural Computer (DNC). Specifically, we encode semantic (i.e., question) information before visual (i.e., video) information, which leads to better visual-semantic representations, and we leverage the DNC's external memory to store and retrieve valuable information in questions and videos and to model long-term visual-semantic dependencies.

In addition to basic understanding, to tackle visual-semantic reasoning that requires external knowledge beyond the visible content (e.g., KB-Image-QA), we propose a novel framework that endows the model with the capability of answering more general questions and better exploits external knowledge by generating Multiple Clues for Reasoning with Memory Neural Networks (MCR-MemNN). Specifically, a well-defined detector predicts image-question-related relation phrases, each delivering two complementary clues to retrieve supporting facts from an external knowledge base (KB). These facts are encoded into a continuous embedding space using a content-addressable memory. Afterward, mutual interactions between the visual-semantic representation and the supporting facts stored in memory are captured to distill the most relevant information across the three modalities (i.e., image, question, and KB). Finally, the optimal answer is predicted by choosing the supporting fact with the highest score.
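The content-addressable memory at the heart of this design can be pictured with a minimal sketch, assuming standard dot-product attention over embedded supporting facts; the names, dimensions, and the final argmax-over-scores answer selection below are illustrative assumptions, not the published MCR-MemNN implementation:

```python
# Hypothetical sketch of a content-addressable memory read over KB facts.
import torch
import torch.nn.functional as F

def memory_read(query, memory_keys, memory_values):
    """query: (d,) fused image-question vector; keys/values: (n_facts, d)."""
    scores = memory_keys @ query            # dot-product addressing per fact
    weights = F.softmax(scores, dim=0)      # soft attention over stored facts
    read_vector = weights @ memory_values   # distilled KB evidence
    return read_vector, scores

d, n_facts = 256, 10
query = torch.randn(d)                      # fused visual-semantic representation
keys, values = torch.randn(n_facts, d), torch.randn(n_facts, d)
read_vector, scores = memory_read(query, keys, values)
answer_idx = scores.argmax().item()         # supporting fact with the highest score
```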
Furthermore, to enable fast, in-depth understanding from a small number of samples, especially with heterogeneity in multi-modal scenarios such as image question answering (Image-QA) and image captioning (IC), we study few-shot visual-semantic learning and present the Hierarchical Graph ATtention Network (HGAT). This two-stage network models the intra- and inter-modal relationships with limited image-text samples. The main contributions of HGAT can be summarized as follows: 1) it sheds light on tackling few-shot multi-modal learning problems, focusing primarily, but not exclusively, on the visual and semantic modalities, through better exploitation of the intra-relationship of each modality and an attention-based co-learning framework between modalities using a hierarchical graph-based architecture; 2) it achieves superior performance on both visual question answering and image captioning in the few-shot setting; and 3) it can be easily extended to the semi-supervised setting, where image-text samples are partially unlabeled.

Although various attention mechanisms have been utilized to manage contextualized representations by modeling intra- and inter-modal relationships of the two modalities, one limitation of the predominant visual-semantic methods is the lack of reasoning with event correlation, i.e., sensing and analyzing relationships among the abundant and informative events contained in the video. To this end, we introduce the dense-caption modality as a new auxiliary and distill event-correlated information to infer the correct answer. We propose a novel end-to-end trainable model, Event-Correlated Graph Neural Networks (EC-GNNs), to perform cross-modal reasoning over information from the three modalities (i.e., caption, video, and question). Besides exploiting a new modality, we employ cross-modal reasoning modules to explicitly model inter-modal relationships and aggregate relevant information across modalities, and we propose a question-guided self-adaptive multi-modal fusion module that collects question-oriented and event-correlated evidence through multi-step reasoning.

To evaluate the proposed models, we conduct extensive experiments on the VTW, MSVD-QA, and TGIF-QA datasets for the Video-QA task, the Toronto COCO-QA and Visual Genome-QA datasets for the few-shot Image-QA task, the COCO-FITB dataset for the few-shot IC task, and the FVQA and Visual7W + ConceptNet datasets for the KB-Image-QA task. The experimental results justify these models' effectiveness and superiority over baseline methods.
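A hedged sketch of the question-guided fusion idea follows: the question vector repeatedly attends over per-modality summaries (caption, video, question) and updates its query, a generic form of multi-step reasoning; module names, dimensions, and the update rule are assumptions rather than the EC-GNNs implementation:

```python
# Illustrative question-guided multi-modal fusion with multi-step reasoning.
import torch
import torch.nn as nn
import torch.nn.functional as F

class QuestionGuidedFusion(nn.Module):
    def __init__(self, d, n_steps=2):
        super().__init__()
        self.proj = nn.Linear(d, d)          # projects the current query
        self.update = nn.Linear(2 * d, d)    # folds evidence back into the query
        self.n_steps = n_steps

    def forward(self, question_vec, modality_feats):
        # modality_feats: (n_modalities, d) summaries of caption/video/question
        query = question_vec
        for _ in range(self.n_steps):                    # multi-step reasoning
            scores = modality_feats @ self.proj(query)   # relevance per modality
            weights = F.softmax(scores, dim=0)
            evidence = weights @ modality_feats          # question-oriented evidence
            query = torch.tanh(self.update(torch.cat([query, evidence])))
        return query

fusion = QuestionGuidedFusion(d=128)
q = torch.randn(128)
feats = torch.randn(3, 128)    # caption, video, question summaries
answer_evidence = fusion(q, feats)
```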

    Physics-Based Reconstruction and Analysis of Human Motion in Video

    Doctoral dissertation -- Seoul National University Graduate School: College of Engineering, Department of Computer Science and Engineering, February 2021. Advisor: Jehee Lee.

In computer graphics, simulating and analyzing human movement have been interesting research topics since the 1960s. Still, simulating realistic human movements in a 3D virtual world remains a challenging task. In general, motion capture techniques have been used: although motion capture guarantees realistic results and high-quality data, it requires a lot of equipment, and the process is complicated. Recently, techniques for estimating 3D human pose from 2D video have developed remarkably, and researchers in computer graphics and computer vision have attempted to reconstruct various human motions from video data. However, existing methods cannot robustly estimate dynamic actions and do not work on videos filmed with a moving camera.

In this thesis, we propose methods to reconstruct dynamic human motions from in-the-wild videos and to control those motions. First, we developed a framework to reconstruct motion from videos using prior physics knowledge. For dynamic motions such as a backspin, the poses estimated by a state-of-the-art method are incomplete, with unreliable root trajectories or missing intermediate poses. We designed a reward function for a deep reinforcement learning controller using poses and hints extracted from videos, and learned a policy that simultaneously reconstructs motion and controls a virtual character. Second, we simulated figure skating movements from video. Skating sequences consist of fast and dynamic movements on ice, hindering the acquisition of motion data; thus, we extracted 3D key poses from video and successfully replicated several figure skating movements using trajectory optimization and a deep reinforcement learning controller. Third, we devised an algorithm for gait analysis from videos of patients with movement disorders (such as Parkinson's disease or cerebral palsy). After acquiring the patients' joint positions from 2D video processed by a deep learning network, the 3D absolute coordinates were estimated, and gait parameters such as gait velocity, cadence, and step length were calculated. Additionally, we analyzed the optimization criteria of human walking using a 3D musculoskeletal humanoid model and physics-based simulation: for two criteria, namely the minimization of muscle activation and the minimization of joint torque, we compared simulation data with real human data.

To demonstrate the effectiveness of the first two research topics, we verified the reconstruction of dynamic human motions from 2D videos using physics-based simulations; the last two research topics were evaluated against real human data.
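The reward design in the first framework can be illustrated with a minimal sketch in the style of pose-imitation rewards common in physics-based character control; the specific terms, weights, and scales below are assumptions, not the thesis's exact reward:

```python
# Hypothetical imitation reward: track the video-extracted pose and root hint.
import numpy as np

def imitation_reward(sim_joint_angles, ref_joint_angles, sim_root, ref_root,
                     w_pose=0.7, w_root=0.3):
    """ref_* come from (possibly unreliable) video pose estimates."""
    pose_err = np.sum((sim_joint_angles - ref_joint_angles) ** 2)
    root_err = np.sum((sim_root - ref_root) ** 2)
    r_pose = np.exp(-2.0 * pose_err)   # reward close pose tracking
    r_root = np.exp(-5.0 * root_err)   # reward following the root-trajectory hint
    return w_pose * r_pose + w_root * r_root

# toy usage with a 10-joint character
reward = imitation_reward(np.zeros(10), np.full(10, 0.1), np.zeros(3), np.zeros(3))
```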

    On Action Quality Assessment

    In this dissertation, we tackle the task of quantifying the quality of actions, i.e., assessing how well an action was performed, using computer vision. Existing methods used human-body-pose-based features to express the quality contained in an action sample. Human body pose estimation in actions such as sports actions, like diving and gymnastic vault, is particularly challenging, since the athletes undergo convoluted transformations while performing their routines. Moreover, pose-based features do not take into account visual cues, such as the water splash in diving, which human judges do take into account. In our first work, we show that a visual representation -- spatiotemporal features computed using a 3D convolutional neural network -- is more suitable, as it attends to the appearance and salient motion patterns of the athlete's performance. Along with developing three action quality assessment (AQA) frameworks, we also compile a diving and gymnastic vault dataset. Rather than learning an action-specific model, in our second work we show that learning to assess the quality of multiple actions jointly is more efficient, as it can exploit shared/common elements of quality among different actions. All-action modeling makes better use of the data and shows better generalization and adaptation to unseen/novel action classes. Taking inspiration from the 'learning by teaching' method, we propose a multitask learning (MTL) approach to AQA, unlike existing approaches, which follow the single-task learning (STL) paradigm. In our MTL approach, we force the network to delineate the action sample -- recognizing the action in detail and commentating on the good and bad points of the performance -- in addition to the main task of AQA scoring. Through this better characterization of the action sample, we obtain state-of-the-art results on the task of AQA. To enable our MTL approach, we also released the largest multitask AQA dataset, MTL-AQA.
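A minimal sketch of the multitask setup follows: one spatiotemporal backbone feature feeds three heads covering AQA scoring, detailed action recognition, and commentary generation; the layer shapes and the single-linear caption head are simplifying assumptions, not the released MTL-AQA models:

```python
# Illustrative multitask AQA heads on top of 3D-CNN clip features.
import torch
import torch.nn as nn

class MultitaskAQA(nn.Module):
    def __init__(self, feat_dim=512, n_action_classes=20, vocab_size=5000):
        super().__init__()
        self.score_head = nn.Linear(feat_dim, 1)                  # main task: AQA score
        self.action_head = nn.Linear(feat_dim, n_action_classes)  # delineate the action
        self.caption_head = nn.Linear(feat_dim, vocab_size)       # commentary logits

    def forward(self, clip_features):  # (batch, feat_dim) from a 3D CNN backbone
        return (self.score_head(clip_features),
                self.action_head(clip_features),
                self.caption_head(clip_features))

model = MultitaskAQA()
feats = torch.randn(4, 512)
score, action_logits, caption_logits = model(feats)
```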

    Deep learning that scales: leveraging compute and data

    Deep learning has revolutionized the field of artificial intelligence in the past decade. Although the development of these techniques spans several years, the recent advent of deep learning is explained by an increased availability of data and compute that has unlocked the potential of deep neural networks. They have become ubiquitous in domains such as natural language processing, computer vision, speech processing, and control, where enough training data is available. Recent years have seen continuous progress driven by ever-growing neural networks that benefit from large amounts of data and computing power. This thesis is motivated by the observation that scale is one of the key factors driving progress in deep learning research, and aims at devising deep learning methods that scale gracefully with the available data and compute. We narrow this scope down to two main research directions. The first is concerned with designing hardware-aware methods that make the most of the computing resources in current high-performance computing facilities. The second studies the bottlenecks preventing existing methods from scaling up as more data becomes available, providing solutions that contribute towards enabling the training of more complex models. This dissertation studies these research questions for two different learning paradigms, each with its own algorithmic and computational characteristics. The first part of the thesis studies the paradigm where the model needs to learn from a collection of examples, extracting as much information as possible from the given data. The second part is concerned with training agents that learn by interacting with a simulated environment, which introduces unique challenges such as efficient exploration and simulation.
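As one concrete (assumed, not thesis-specific) example of making the most of limited compute, gradient accumulation emulates a larger batch by delaying the optimizer step, a common ingredient of hardware-aware training recipes:

```python
# Minimal gradient-accumulation training loop sketch.
import torch
import torch.nn.functional as F

def train_epoch(model, optimizer, loader, accum_steps=4):
    optimizer.zero_grad()
    for i, (x, y) in enumerate(loader):
        loss = F.cross_entropy(model(x), y) / accum_steps  # average over the virtual batch
        loss.backward()                                    # gradients accumulate in .grad
        if (i + 1) % accum_steps == 0:
            optimizer.step()                               # one step per effective batch
            optimizer.zero_grad()
```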

    Multi-Modality Human Action Recognition

    Human action recognition is useful in many applications and areas, e.g., video surveillance, human-computer interaction (HCI), video retrieval, gaming, and security, and it has recently become an active research topic in computer vision and pattern recognition. A number of action recognition approaches have been proposed. However, most of them are designed for RGB image sequences, where the action data is collected by an RGB/intensity camera; recognition performance therefore depends on the occlusion, background, and lighting conditions of the image sequences. If more information is provided along with the image sequences, so that data sources other than RGB video can be utilized, human actions can be better represented and recognized by the designed computer vision system.

In this dissertation, multi-modality human action recognition is studied. On one hand, we introduce the study of multi-spectral action recognition, which involves information from spectra beyond the visible, e.g., infrared and near-infrared. Action recognition in the individual spectra is explored and new methods are proposed; cross-spectral action recognition is then also investigated, and novel approaches are proposed in our work. On the other hand, since depth imaging technology has made significant progress recently, with depth information captured simultaneously with RGB video, depth-based human action recognition is also investigated. We first propose a method combining different types of depth data to recognize human actions. Then a thorough evaluation is conducted on spatiotemporal interest point (STIP) based features for depth-based action recognition. Finally, we advocate the study of fusing different features for depth-based action analysis. Moreover, human depression recognition is studied by combining a facial appearance model with a facial dynamics model.
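One simple way to combine modalities such as RGB, depth, and infrared is decision-level (late) fusion of per-modality classifier outputs; the sketch below is an illustrative assumption, not the dissertation's specific fusion method:

```python
# Late fusion: weighted average of per-modality class probabilities.
import numpy as np

def late_fusion(per_modality_probs, weights=None):
    """per_modality_probs: list of (n_classes,) probability vectors."""
    probs = np.stack(per_modality_probs)            # (n_modalities, n_classes)
    if weights is None:
        weights = np.ones(len(probs)) / len(probs)  # uniform weighting by default
    fused = weights @ probs                         # weighted average per class
    return int(fused.argmax()), fused

rgb = np.array([0.6, 0.3, 0.1])
depth = np.array([0.2, 0.7, 0.1])
infrared = np.array([0.3, 0.5, 0.2])
label, fused = late_fusion([rgb, depth, infrared])
```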

    Image Manipulation and Image Synthesis

    Image manipulation is of historic importance. Ever since the advent of photography, pictures have been manipulated for various reasons. Throughout history, rulers have often used image manipulation techniques for self-portrayal or propaganda. In many cases, the goal is to manipulate human behaviour by spreading credible misinformation; photographs, by their nature, portray the real world and as such are more credible to humans. However, image manipulation need not only serve evil purposes. In this thesis, we propose and analyse methods for image manipulation that serve a positive purpose; specifically, we treat image manipulation as a tool for solving other tasks. For this, we model image manipulation as an image-to-image translation (I2I) task, i.e., a system that receives an image as input and outputs a manipulated version of it, and we propose multiple I2I-based methods. First, we demonstrate that I2I-based image manipulation can be used to reduce motion blur in videos. Second, we show that I2I-based image manipulation can be used for domain adaptation and domain extension: we present a method that significantly improves the learning of semantic segmentation from synthetic source data, and the same technique can be applied to learning nighttime semantic segmentation from daylight images. Next, we show that I2I can be used to enable weakly supervised object segmentation. We show that each individual task requires and allows for different levels of supervision during the training of deep models in order to achieve the best performance. We discuss the importance of maintaining control over the output of such methods and show that, with reduced levels of supervision, methods for maintaining stability during training and for establishing control over the output of a system become increasingly important. We propose multiple methods that solve the issues that arise in such systems. Finally, we demonstrate that our proposed mechanisms for control can be adapted to synthesise images from scratch.
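A minimal sketch of the I2I formulation is below, assuming a pix2pix-style objective with a conditional discriminator and an L1 reconstruction term; the networks, signatures, and weighting are placeholders, not the thesis models:

```python
# Illustrative generator objective for image-to-image translation.
import torch
import torch.nn.functional as F

def i2i_generator_loss(generator, discriminator, x, target, lambda_l1=100.0):
    """generator: image -> manipulated image; discriminator scores (input, output) pairs."""
    fake = generator(x)                               # manipulated version of x
    d_logits = discriminator(x, fake)                 # conditional critic, pix2pix-style
    adv = F.binary_cross_entropy_with_logits(         # fool the discriminator
        d_logits, torch.ones_like(d_logits))
    rec = F.l1_loss(fake, target)                     # stay close to the target image
    return adv + lambda_l1 * rec
```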

    Natural Language Processing for Motivational Interviewing Counselling: Addressing Challenges in Resources, Benchmarking and Evaluation

    Motivational interviewing (MI) is a counselling style often used in healthcare to improve patient health and quality of life by promoting positive behaviour changes. Natural language processing (NLP) has been explored for supporting MI use cases such as insight/feedback generation and therapist training, for example by automatically assigning behaviour labels to therapist/client utterances and generating possible therapist responses. Despite this progress, significant challenges remain. The most prominent is the lack of publicly available, annotated MI dialogue corpora due to privacy constraints; consequently, there is also a lack of common benchmarks and poor reproducibility across studies. Furthermore, human evaluation for therapist response generation is expensive and difficult to scale because it depends on MI experts as evaluators.

In this thesis, we address these challenges in four directions: low-resource NLP modelling, MI dialogue dataset creation, benchmark development for real-world applicable tasks, and a human evaluation study comparing laypeople and experts. First, we explore zero-shot binary empathy assessment at the utterance level. We experiment with a supervised approach that trains on heuristically constructed empathy vs. non-empathy contrasts in non-therapy dialogues. While this approach performs better than other models without empathy-aware training, it is still suboptimal and therefore highlights the need for a well-annotated MI dataset. Next, we create AnnoMI, the first publicly available dataset of expert-annotated MI dialogues. It contains MI conversations that demonstrate both high- and low-quality counselling, with extensive annotations by domain experts covering key MI attributes, and we conduct comprehensive analyses of the dataset. Then, we investigate two AnnoMI-based real-world applicable tasks: predicting current-turn therapist/client behaviour given the utterance, and forecasting next-turn therapist behaviour given the dialogue history. We find that language models (LMs) predict therapist behaviours well, with good generalisability to new dialogue topics. However, LMs show suboptimal forecasting performance, which reflects therapists' flexibility: multiple next-turn actions may be optimal. Lastly, we ask both laypeople and experts to evaluate the generation of a crucial type of therapist response -- reflections -- on a key quality aspect: coherence and context-consistency. We find that laypeople are a viable alternative to experts, as laypeople show good agreement with each other and correlation with experts. We also find that a large LM generates mostly coherent and consistent reflections.

Overall, the work of this thesis significantly broadens access to NLP for MI and presents a wide range of findings on related natural language understanding/generation tasks with a real-world focus. Our contributions thus lay the groundwork for the broader NLP community to engage more in research for MI, which will ultimately improve the quality of life of recipients of MI counselling.
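A hedged sketch of the current-turn behaviour-prediction task: fine-tuning or probing a pretrained LM to assign a behaviour label to an utterance. The label set, model choice, and example utterance are illustrative assumptions, not the exact AnnoMI experimental setup:

```python
# Illustrative utterance-level behaviour classification with a pretrained LM.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

labels = ["question", "reflection", "input", "other"]  # assumed label set
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(labels))       # fine-tune on AnnoMI-style data

batch = tok(["How do you feel about cutting down?"],
            return_tensors="pt", padding=True, truncation=True)
with torch.no_grad():
    logits = model(**batch).logits
print(labels[logits.argmax(-1).item()])                # predicted behaviour label
```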

    From Anecdotal Evidence to Quantitative Evaluation Methods: A Systematic Review on Evaluating Explainable AI

    The rising popularity of explainable artificial intelligence (XAI) for understanding high-performing black boxes has also raised the question of how to evaluate explanations of machine learning (ML) models. While interpretability and explainability are often presented as a subjectively validated binary property, we consider them a multi-faceted concept. We identify 12 conceptual properties, such as Compactness and Correctness, that should be evaluated for comprehensively assessing the quality of an explanation. Our so-called Co-12 properties serve as a categorization scheme for systematically reviewing the evaluation practice of more than 300 papers that introduce an XAI method, published in the last 7 years at major AI and ML conferences. We find that 1 in 3 papers evaluates exclusively with anecdotal evidence, and 1 in 5 papers evaluates with users. We also contribute to the call for objective, quantifiable evaluation methods by presenting an extensive overview of quantitative XAI evaluation methods. This systematic collection of evaluation methods provides researchers and practitioners with concrete tools to thoroughly validate, benchmark, and compare new and existing XAI methods. It also opens up opportunities to include quantitative metrics as optimization criteria during model training, in order to optimize for accuracy and interpretability simultaneously. A companion website is available at https://utwente-dmb.github.io/xai-papers.
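As an example of the kind of quantitative method such an overview collects, the sketch below implements incremental deletion, a common Correctness-style test: features are removed in order of attributed importance, and a faithful explanation should make the model's confidence drop quickly. The implementation details are illustrative assumptions, not a metric from the paper:

```python
# Illustrative deletion-based faithfulness check for a feature attribution.
import numpy as np

def deletion_score(predict, x, attribution, n_steps=10, baseline=0.0):
    """predict maps a feature vector to the model's confidence for one class."""
    order = np.argsort(-attribution)         # most important features first
    x = x.copy()
    scores = [predict(x)]
    chunk = max(1, len(order) // n_steps)
    for i in range(0, len(order), chunk):
        x[order[i:i + chunk]] = baseline     # "delete" the next chunk of features
        scores.append(predict(x))
    return float(np.mean(scores))            # lower mean = more faithful explanation

# toy usage: a logistic "model" with its own weighted inputs as attribution
w = np.array([2.0, -1.0, 0.5, 0.0])
x = np.array([1.0, 1.0, 1.0, 1.0])
predict = lambda v: 1.0 / (1.0 + np.exp(-(w @ v)))
print(deletion_score(predict, x, attribution=w * x))
```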

    Latent variable methods for visualization through time
