65 research outputs found

    Deep Learning for Head Pose Estimation: A Survey

    Head pose estimation (HPE) is an active and popular area of research. Over the years, many approaches have been developed, leading to progressive improvements in accuracy; nevertheless, head pose estimation remains an open research topic, especially in unconstrained environments. In this paper, we review the growing number of available datasets and the modern methodologies used to estimate head orientation, with special attention to deep learning techniques. We discuss the evolution of the field by proposing a classification of head pose estimation methods, explaining their advantages and disadvantages, and highlighting the different ways deep learning techniques have been used in the context of HPE. An in-depth performance comparison and discussion is presented at the end of the work. We also highlight the most promising research directions for future investigations on the topic.

    Toward human-centered automated driving: a novel spatial-temporal vision transformer-enabled head tracker

    Accurate dynamic driver head pose tracking is of great importance for driver–automotive collaboration, intelligent copilots, head-up displays (HUD), and other human-centered automated driving applications. To further advance this technology, this article proposes a low-cost and markerless head-tracking system using a deep learning-based dynamic head pose estimation model. The proposed system requires only a red, green, blue (RGB) camera, without other hardware or markers. To enhance the accuracy of the driver's head pose estimation, a spatiotemporal vision transformer (ST-ViT) model, which takes an image pair as the input instead of a single frame, is proposed. Compared to a standard transformer, the ST-ViT contains a spatial–convolutional vision transformer and a temporal transformer, which improves model performance. To handle the error fluctuation of the head pose estimation model, this article proposes an adaptive Kalman filter (AKF). Based on an analysis of the estimation model's error distribution and the user experience of the head tracker, the proposed AKF includes an adaptive observation noise coefficient, which adaptively moderates the smoothness of the output curve. Comprehensive experiments show that the proposed system is feasible and effective, achieving state-of-the-art performance.

    Funding: Agency for Science, Technology and Research (A*STAR); Nanyang Technological University. This work was supported in part by the A*STAR National Robotics Program under grant W1925d0046, the Start-Up Grant, Nanyang Assistant Professorship under grant M4082268.050, Nanyang Technological University, Singapore, and the State Key Laboratory of Automotive Safety and Energy under project KF2021
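
    To make the adaptive-observation-noise idea concrete, below is a minimal sketch of a Kalman filter applied to a single head-pose angle. The random-walk state model and the innovation-based adaptation rule `r = r0 / (1 + k * |innovation|)` are illustrative assumptions, not the authors' exact AKF formulation.

```python
import numpy as np

def adaptive_kalman_smooth(measurements, q=1e-3, r0=4.0, k=0.5):
    """Smooth a 1-D head-pose angle sequence (degrees) with a Kalman
    filter whose observation noise R adapts to the innovation size.

    Sketch only: when a measurement deviates strongly from the
    prediction (large innovation), R is lowered so the filter trusts
    the new observation and follows fast head motion; for small
    innovations R stays high and the curve is smoothed aggressively.
    """
    x, p = measurements[0], 1.0              # state estimate and its variance
    out = [x]
    for z in measurements[1:]:
        p = p + q                            # predict (random-walk model)
        innovation = z - x
        r = r0 / (1.0 + k * abs(innovation))  # adaptive observation noise
        gain = p / (p + r)                   # Kalman gain
        x = x + gain * innovation            # update state with measurement
        p = (1.0 - gain) * p                 # update state variance
        out.append(x)
    return np.asarray(out)

# Example: jittery yaw estimates around 0 degrees, then a quick head turn.
yaw = np.concatenate([np.random.normal(0, 1.5, 50), np.random.normal(25, 1.5, 50)])
smoothed = adaptive_kalman_smooth(yaw)
```

    Scaling R down on large innovations lets the filter track quick head turns, while the large default R during near-static periods suppresses the frame-to-frame jitter of the estimation model.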

    Pixel-Level Deep Multi-Dimensional Embeddings for Homogeneous Multiple Object Tracking

    The goal of Multiple Object Tracking (MOT) is to locate multiple objects and keep track of their individual identities and trajectories given a sequence of (video) frames. A popular approach to MOT is tracking-by-detection, consisting of two processing components: detection (identification of objects of interest in individual frames) and data association (connecting data from multiple frames). This work addresses the detection component by introducing a method based on semantic instance segmentation, i.e., assigning labels to all visible pixels such that they are unique among different instances. Modern tracking methods are often built around Convolutional Neural Networks (CNNs) and additional, explicitly defined post-processing steps. This work introduces two detection methods that incorporate multi-dimensional embeddings. We train deep CNNs to produce easily clusterable embeddings for semantic instance segmentation and to enable object detection through pose estimation. The use of embeddings allows the method to identify per-pixel instance membership for both tasks. Our method specifically targets applications that require long-term tracking of homogeneous targets using a stationary camera. Furthermore, this method was developed and evaluated on a livestock tracking application, which presents exceptional challenges that generalized tracking methods are not equipped to solve, largely because contemporary datasets for multiple object tracking lack properties that are specific to livestock environments. These include a high degree of visual similarity between targets, complex physical interactions, long-term inter-object occlusions, and a fixed-cardinality set of targets. For the reasons stated above, our method is developed and tested with the livestock application in mind; specifically, group-housed pigs are evaluated in this work. On a publicly available dataset, our method reliably detects pigs in a group-housed environment with 99% precision and 95% recall using pose estimation, and achieves 80% accuracy when using semantic instance segmentation at a 50% IoU threshold. Results demonstrate our method's ability to achieve consistent identification and tracking of group-housed livestock, even in cases where the targets are occluded and despite the fact that they lack uniquely identifying features. The pixel-level embeddings used by the proposed method are thoroughly evaluated in order to demonstrate their properties and behaviors when applied to real data. Adviser: Lance C. Pérez
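
    The role of the embeddings can be illustrated with a small sketch: given per-pixel embedding vectors from the trained CNN and a foreground mask, instance masks fall out of a clustering pass. The greedy grouping and the `margin` threshold below are hypothetical stand-ins for the actual clustering procedure used in the work.

```python
import numpy as np

def cluster_pixel_embeddings(emb, fg_mask, margin=0.5):
    """Group foreground pixels into instances by greedily clustering
    their embedding vectors.

    emb:     (H, W, D) per-pixel embeddings from the trained CNN
    fg_mask: (H, W) boolean foreground mask
    margin:  distance threshold separating instances; a pixel within
             `margin` of an existing anchor joins that instance,
             otherwise it starts a new one.
    """
    ys, xs = np.nonzero(fg_mask)
    labels = np.zeros(emb.shape[:2], dtype=np.int32)  # 0 = background
    anchors = []                   # anchor embedding per instance (first pixel assigned)
    next_id = 1
    for y, x in zip(ys, xs):
        v = emb[y, x]
        if anchors:
            d = np.linalg.norm(np.stack(anchors) - v, axis=1)
            j = int(np.argmin(d))
            if d[j] < margin:      # close enough: join instance j+1
                labels[y, x] = j + 1
                continue
        anchors.append(v)          # too far from all anchors: new instance
        labels[y, x] = next_id
        next_id += 1
    return labels
```

    Because the CNN is trained to place pixels of the same animal close together and pixels of different animals at least a margin apart, even this simple pass can separate visually near-identical, touching targets.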

    ๊ตฌ์กฐ ๊ฐ์‘ํ˜• ๋ฐ์ดํ„ฐ ์ฆ๊ฐ• ๊ธฐ๋ฒ•๊ณผ ํ˜ผํ•ฉ ๋ฐ€๋„ ์‹ ๊ฒฝ๋ง์„ ์ด์šฉํ•œ ๋ผ์ด๋‹ค ๊ธฐ๋ฐ˜ 3์ฐจ์› ๊ฐ์ฒด ๊ฒ€์ถœ ๊ฐœ์„ 

    Doctoral dissertation (Ph.D.) -- Seoul National University, Graduate School of Convergence Science and Technology, Department of Convergence Science, February 2023. Adviser: Nojun Kwak. LiDAR (Light Detection And Ranging), which is widely used as a sensing device for autonomous vehicles and robots, emits laser pulses and calculates the return time to sense the surrounding environment in the form of a point cloud.
When recognizing the surrounding environment, the most important task is determining what objects are nearby and where they are located, and 3D object detection methods using point clouds have been actively studied to perform it. Various backbone networks for point cloud-based 3D object detection have been proposed, differing by how the point cloud data is preprocessed. Although advanced backbone networks have made great strides in detection performance, their structures differ so much that they lack compatibility with each other. The problem addressed in this dissertation is: how can the performance of 3D object detectors be improved regardless of their diverse backbone network structures? This dissertation proposes two general methods to improve point cloud-based 3D object detectors. First, we propose a part-aware data augmentation (PA-AUG) method, which maximizes the utilization of the structural information of 3D bounding boxes. Since 3D bounding box labels fit the objects' boundaries and include the orientation value, they contain the structural information of the object in the box. To fully utilize this intra-object structural information, we propose a novel part-aware partitioning method which separates 3D bounding boxes into characteristic sub-parts. PA-AUG applies newly proposed data augmentation methods at the partition level. It makes various types of 3D object detectors robust and brings an effect equivalent to increasing the training data by about 2.5×. Second, we propose mixture-density-based 3D object detection (MD3D). MD3D predicts the distribution of 3D bounding boxes using a Gaussian mixture model (GMM), reformulating the conventional regression methods as a density estimation problem (a code sketch of this formulation follows the table of contents below). Thus, unlike conventional target assignment methods, it can be applied to any 3D object detector regardless of the point cloud preprocessing method. In addition, as it requires significantly fewer hyper-parameters than existing methods, it is easy to optimize for detection performance. MD3D also increases detection speed due to its simple structure. Both PA-AUG and MD3D can be applied to any 3D object detector and show an impressive increase in detection performance. The two proposed methods cover different stages of the object detection pipeline.
Thus, they can be used simultaneously, and the experimental results show they have a synergy effect when applied together.

Table of contents:
1 Introduction
  1.1 Problem Definition
  1.2 Challenges
  1.3 Contributions
    1.3.1 Part-Aware Data Augmentation (PA-AUG)
    1.3.2 Mixture-Density-based 3D Object Detection (MD3D)
    1.3.3 Combination of PA-AUG and MD3D
  1.4 Outline
2 Related Works
  2.1 Data Augmentation for Object Detection
    2.1.1 2D Data Augmentation
    2.1.2 3D Data Augmentation
  2.2 LiDAR-based 3D Object Detection
  2.3 Mixture Density Networks in Computer Vision
  2.4 Datasets
    2.4.1 KITTI Dataset
    2.4.2 Waymo Open Dataset
  2.5 Evaluation Metrics
    2.5.1 Average Precision (AP)
    2.5.2 Average Orientation Similarity (AOS)
    2.5.3 Average Precision weighted by Heading (APH)
3 Part-Aware Data Augmentation (PA-AUG)
  3.1 Introduction
  3.2 Methods
    3.2.1 Part-Aware Partitioning
    3.2.2 Part-Aware Data Augmentation
  3.3 Experiments
    3.3.1 Results on the KITTI Dataset
    3.3.2 Robustness Test
    3.3.3 Data Efficiency Test
    3.3.4 Ablation Study
  3.4 Discussion
  3.5 Conclusion
4 Mixture-Density-based 3D Object Detection (MD3D)
  4.1 Introduction
  4.2 Methods
    4.2.1 Modeling Point-cloud-based 3D Object Detection with a Mixture Density Network
    4.2.2 Network Architecture
    4.2.3 Loss Function
  4.3 Experiments
    4.3.1 Datasets
    4.3.2 Experiment Settings
    4.3.3 Results on the KITTI Dataset
    4.3.4 Latency of Each Module
    4.3.5 Results on the Waymo Open Dataset
    4.3.6 Analyzing Recall by Object Size
    4.3.7 Ablation Study
    4.3.8 Discussion
  4.4 Conclusion
5 Combination of PA-AUG and MD3D
  5.1 Methods
  5.2 Experiments
    5.2.1 Settings
    5.2.2 Results on the KITTI Dataset
  5.3 Discussion
6 Conclusion
  6.1 Summary
  6.2 Limitations and Future Works
    6.2.1 Hyper-parameter-free PA-AUG
    6.2.2 Redefinition of Part-Aware Partitioning
    6.2.3 Application to Other Tasks
Abstract (In Korean)
Acknowledgments
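
    A minimal sketch of the mixture-density formulation behind MD3D: the detection head outputs the parameters of a Gaussian mixture over box parameters, and training minimizes the negative log-likelihood of the ground-truth boxes, with no hand-crafted target assignment. The diagonal covariance and the 7-D (x, y, z, w, l, h, yaw) box parameterization are common LiDAR-detector conventions assumed here, not necessarily the dissertation's exact head.

```python
import numpy as np

def gmm_nll(pi, mu, sigma, gt_boxes):
    """Negative log-likelihood of ground-truth 3D boxes under a
    predicted Gaussian mixture -- the core loss of a mixture-density head.

    pi:       (K,) mixture weights, softmax-normalized
    mu:       (K, 7) predicted box means (x, y, z, w, l, h, yaw)
    sigma:    (K, 7) predicted per-dimension std devs (positive)
    gt_boxes: (N, 7) ground-truth boxes
    """
    diff = gt_boxes[:, None, :] - mu[None, :, :]            # (N, K, 7)
    log_norm = (-0.5 * np.sum((diff / sigma) ** 2, axis=2)  # (N, K) log N(gt | mu_k, sigma_k)
                - np.sum(np.log(sigma), axis=1)
                - 0.5 * 7 * np.log(2 * np.pi))
    log_mix = np.log(pi)[None, :] + log_norm                # add log mixture weights
    # Log-sum-exp over the K components, then average the NLL over boxes.
    m = log_mix.max(axis=1, keepdims=True)
    log_lik = m[:, 0] + np.log(np.exp(log_mix - m).sum(axis=1))
    return -log_lik.mean()
```

    Because the loss only asks that the predicted density place probability mass on the ground-truth boxes, it is indifferent to whether the backbone produces voxel, pillar, or point features, which is what makes the method backbone-agnostic.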

    An integrated framework for multi-state driver monitoring using heterogeneous loss and attention-based feature decoupling

    Multi-state driver monitoring is a key technique in building human-centric intelligent driving systems. This paper presents an integrated vision-based multi-state driver monitoring framework that incorporates head rotation, gaze, blinking, and yawning. To address the challenges of head pose and gaze estimation, this paper proposes a unified network architecture that tackles both estimations as soft classification tasks. A feature decoupling module was developed to decouple the extracted features along different axis domains. Furthermore, a cascade cross-entropy loss was designed to restrict large deviations during the training phase; it was combined with the other loss terms to form a heterogeneous loss function. In addition, gaze consistency was used to optimize gaze estimation, which also informed the model architecture design of the gaze estimation task. Finally, the proposed method was verified on several widely used benchmark datasets. Comprehensive experiments showed that the proposed method achieves state-of-the-art performance compared to other methods.
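
    As an illustration of the soft classification formulation, the sketch below recovers a continuous angle from per-bin logits by taking the expectation over bin centers. The 3-degree bins and the [-99, 99] range are illustrative choices; the paper's cascade cross-entropy and feature decoupling modules are not reproduced here.

```python
import numpy as np

def angle_from_soft_bins(logits, bin_edges_deg):
    """Recover a continuous angle from per-bin classification logits.

    Treating pose/gaze estimation as soft classification: the network
    predicts a score for each angular bin, and the continuous angle is
    the probability-weighted mean of the bin centers.
    """
    probs = np.exp(logits - logits.max())            # numerically stable softmax
    probs /= probs.sum()
    centers = 0.5 * (bin_edges_deg[:-1] + bin_edges_deg[1:])
    return float(np.sum(probs * centers))            # expectation over bins

edges = np.arange(-99, 100, 3, dtype=float)          # 66 bins of 3 degrees
# Stand-in for a network head: Gaussian-shaped logits peaked near 12 degrees.
logits = -0.5 * ((0.5 * (edges[:-1] + edges[1:]) - 12.0) / 6.0) ** 2
print(angle_from_soft_bins(logits, edges))           # ~12.0
```

    Compared with direct regression, the binned formulation gives the training signal a cross-entropy form, which is what allows constraints such as the cascade loss to penalize predictions that land many bins away from the ground truth.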