63 research outputs found

    DeepSignals: Predicting Intent of Drivers Through Visual Signals

    Detecting the intention of drivers is an essential task in self-driving, necessary to anticipate sudden events like lane changes and stops. Turn signals and emergency flashers communicate such intentions, providing seconds of potentially critical reaction time. In this paper, we propose to detect these signals in video sequences by using a deep neural network that reasons about both spatial and temporal information. Our experiments on more than a million frames show high per-frame accuracy in very challenging scenarios. Comment: To be presented at the IEEE International Conference on Robotics and Automation (ICRA), 201
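    The abstract describes a single network that reasons over both spatial and temporal information to read per-frame signal states from video. As a rough, hypothetical sketch of that general pattern (not the DeepSignals architecture itself), a per-frame convolutional encoder can feed a recurrent layer that emits a turn-signal state for every frame; all layer sizes and the four-state output below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SpatioTemporalSignalClassifier(nn.Module):
    """Illustrative per-frame signal classifier: CNN features + LSTM.
    This is NOT the DeepSignals model, only the generic spatial+temporal
    pattern the abstract describes."""

    def __init__(self, num_states: int = 4, feat_dim: int = 128):
        super().__init__()
        # Small convolutional encoder applied to every frame independently.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim), nn.ReLU(),
        )
        # Recurrent layer accumulates evidence across frames (e.g. blinking).
        self.temporal = nn.LSTM(feat_dim, feat_dim, batch_first=True)
        self.head = nn.Linear(feat_dim, num_states)  # e.g. left/right/hazard/off

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (batch, time, 3, H, W) -> per-frame logits (batch, time, num_states)
        b, t, c, h, w = clip.shape
        feats = self.encoder(clip.reshape(b * t, c, h, w)).reshape(b, t, -1)
        out, _ = self.temporal(feats)
        return self.head(out)

if __name__ == "__main__":
    logits = SpatioTemporalSignalClassifier()(torch.randn(2, 8, 3, 64, 64))
    print(logits.shape)  # torch.Size([2, 8, 4])
```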

    LSTA: Long Short-Term Attention for Egocentric Action Recognition

    Egocentric activity recognition is one of the most challenging tasks in video analysis. It requires a fine-grained discrimination of small objects and their manipulation. While some methods rely on strong supervision and attention mechanisms, they are either annotation-intensive or do not take spatio-temporal patterns into account. In this paper we propose LSTA as a mechanism to focus on features from spatially relevant parts while attention is being tracked smoothly across the video sequence. We demonstrate the effectiveness of LSTA on egocentric activity recognition with an end-to-end trainable two-stream architecture, achieving state-of-the-art performance on four standard benchmarks. Comment: Accepted to CVPR 201
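    The core idea stated above is spatial attention that is carried smoothly across the video. Below is a minimal sketch of that general idea, assuming per-frame CNN feature maps as input: an attention map is computed from the current frame and the previous hidden state, then used to pool features for a recurrent update. It is not the authors' LSTA cell; the attention and recurrence details are simplified placeholders.

```python
import torch
import torch.nn as nn

class RecurrentSpatialAttention(nn.Module):
    """Sketch of attention tracked across frames: a spatial attention map is
    computed from the current features and the previous hidden state, then
    used to pool features for an LSTM update. Not the LSTA cell itself."""

    def __init__(self, in_ch: int = 256, hid: int = 256):
        super().__init__()
        self.att = nn.Conv2d(in_ch + hid, 1, kernel_size=1)  # attention logits
        self.rnn = nn.LSTMCell(in_ch, hid)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, time, C, H, W) per-frame CNN feature maps
        b, t, c, h, w = feats.shape
        hx = feats.new_zeros(b, self.rnn.hidden_size)
        cx = feats.new_zeros(b, self.rnn.hidden_size)
        outputs = []
        for step in range(t):
            x = feats[:, step]                               # (b, C, H, W)
            h_map = hx[:, :, None, None].expand(b, -1, h, w)
            a = self.att(torch.cat([x, h_map], dim=1))       # (b, 1, H, W)
            a = torch.softmax(a.flatten(2), dim=-1).reshape(b, 1, h, w)
            pooled = (x * a).sum(dim=(2, 3))                 # attended pooling
            hx, cx = self.rnn(pooled, (hx, cx))
            outputs.append(hx)
        return torch.stack(outputs, dim=1)                   # (b, time, hid)
```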

    Deep Learning Techniques for Video Instance Segmentation: A Survey

    Video instance segmentation, also known as multi-object tracking and segmentation, is an emerging computer vision research area introduced in 2019, aiming at detecting, segmenting, and tracking instances in videos simultaneously. By tackling the video instance segmentation task through effective analysis and utilization of visual information in videos, a range of computer vision-enabled applications (e.g., human action recognition, medical image processing, autonomous vehicle navigation, surveillance) can be implemented. As deep-learning techniques take a dominant role in various computer vision areas, a plethora of deep-learning-based video instance segmentation schemes have been proposed. This survey offers a multifaceted view of deep-learning schemes for video instance segmentation, covering various architectural paradigms, along with comparisons of functional performance, model complexity, and computational overheads. In addition to the common architectural designs, auxiliary techniques for improving the performance of deep-learning models for video instance segmentation are compiled and discussed. Finally, we discuss a range of major challenges and directions for further investigation to help advance this promising research field.

    A Deep Learning Approach to Object Affordance Segmentation

    Learning to understand and infer object functionalities is an important step towards robust visual intelligence. Significant research efforts have recently focused on segmenting the object parts that enable specific types of human-object interaction, the so-called "object affordances". However, most works treat it as a static semantic segmentation problem, focusing solely on object appearance and relying on strong supervision and object detection. In this paper, we propose a novel approach that exploits the spatio-temporal nature of human-object interaction for affordance segmentation. In particular, we design an autoencoder that is trained using ground-truth labels of only the last frame of the sequence, and is able to infer pixel-wise affordance labels in both videos and static images. Our model surpasses the need for object labels and bounding boxes by using a soft-attention mechanism that enables the implicit localization of the interaction hotspot. For evaluation purposes, we introduce the SOR3D-AFF corpus, which consists of human-object interaction sequences and supports 9 types of affordances in terms of pixel-wise annotation, covering typical manipulations of tool-like objects. We show that our model achieves competitive results compared to strongly supervised methods on SOR3D-AFF, while being able to predict affordances for similar unseen objects in two affordance image-only datasets. Comment: 5 pages, 4 figures, ICASSP 202
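    The method is described as an encoder-decoder over an interaction sequence, supervised with ground-truth labels only on the last frame and using soft attention to localize the interaction hotspot implicitly. The sketch below illustrates that training setup under assumed channel sizes and a simplified attention form; it is not the paper's architecture, and the 9-class output simply mirrors the affordance count mentioned above.

```python
import torch
import torch.nn as nn

class AffordanceSeqAutoencoder(nn.Module):
    """Illustrative encoder-decoder over an interaction sequence, supervised
    only on the final frame. Channel sizes and the soft-attention form are
    assumptions for the sketch, not the paper's design."""

    def __init__(self, num_affordances: int = 9, hid: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, hid, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(hid, hid, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.attention = nn.Conv2d(hid, 1, 1)          # soft spatial attention
        self.temporal = nn.GRU(hid, hid, batch_first=True)
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(hid, hid, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(hid, num_affordances, 4, stride=2, padding=1),
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (b, t, 3, H, W) -> affordance logits for the last frame
        b, t, c, h, w = frames.shape
        f = self.encoder(frames.reshape(b * t, c, h, w))           # (b*t, hid, h', w')
        f = f * torch.sigmoid(self.attention(f))                   # implicit hotspot
        _, hid, hp, wp = f.shape
        seq = f.reshape(b, t, hid, hp, wp).flatten(3).mean(-1)     # (b, t, hid)
        ctx, _ = self.temporal(seq)
        # broadcast temporal context back onto the last frame's feature map
        last = f.reshape(b, t, hid, hp, wp)[:, -1] + ctx[:, -1, :, None, None]
        return self.decoder(last)                                  # (b, 9, H, W)

# Training would supervise only the last frame's ground-truth affordance mask:
# loss = nn.functional.cross_entropy(model(frames), last_frame_labels)
```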

    Pedestrian Attribute Recognition: A Survey

    Recognizing pedestrian attributes is an important task in the computer vision community because it plays a key role in video surveillance. Many algorithms have been proposed to handle this task. The goal of this paper is to review existing works, whether based on traditional methods or on deep neural networks. Firstly, we introduce the background of pedestrian attribute recognition (PAR, for short), including the fundamental concepts of pedestrian attributes and the corresponding challenges. Secondly, we introduce existing benchmarks, including popular datasets and evaluation criteria. Thirdly, we analyse the concepts of multi-task learning and multi-label learning, and explain the relations between these two learning paradigms and pedestrian attribute recognition. We also review some popular network architectures which have been widely applied in the deep learning community. Fourthly, we analyse popular solutions for this task, such as attribute grouping and part-based methods. Fifthly, we show some applications which take pedestrian attributes into consideration and achieve better performance. Finally, we summarize the paper and give several possible research directions for pedestrian attribute recognition. The project page of this paper can be found at: https://sites.google.com/view/ahu-pedestrianattributes/. Comment: Check our project page for a high-resolution version of this survey: https://sites.google.com/view/ahu-pedestrianattributes

    Utilizing Synthetic Data for 3D Hand Pose Recognition

    Doctoral dissertation -- Seoul National University Graduate School: Graduate School of Convergence Science and Technology, Department of Convergence Science (Intelligent Convergence Systems Program), August 2021. Yang Han-yeol. 3D hand pose estimation (HPE) based on RGB images has been studied for a long time. Relevant methods have focused mainly on optimization of neural frameworks for graphically connected finger joints. RGB-based HPE models have not been easy to train because of the scarcity of RGB hand pose datasets; unlike human body pose datasets, the finger joints that span hand postures are structured delicately and exquisitely. Such structure makes accurately annotating each joint with unique 3D world coordinates difficult, which is why many conventional methods rely on synthetic data samples to cover large variations of hand postures. A synthetic dataset provides very precise ground-truth annotations and further allows control over the variety of data samples, so a learning model can be trained over a large pose space. Most of the studies, however, have performed frame-by-frame estimation based on independent static images. Synthetic visual data can provide practically infinite diversity and rich labels, while avoiding ethical issues with privacy and bias. However, for many tasks, current models trained on synthetic data generalize poorly to real data. The task of 3D human hand pose estimation is a particularly interesting example of this synthetic-to-real problem, because learning-based approaches perform reasonably well given real training data, yet labeled 3D poses are extremely difficult to obtain in the wild, limiting scalability. In this dissertation, we attempt not only to consider the appearance of a hand but also to incorporate the temporal movement information of a hand in motion into the learning framework for better 3D hand pose estimation performance, which leads to the necessity of a large-scale dataset with sequential RGB hand images. We propose a novel method that generates a synthetic dataset mimicking natural human hand movements by re-engineering annotations of an extant static hand pose dataset into pose-flows. With the generated dataset, we train a newly proposed recurrent framework, exploiting visuo-temporal features from sequential images of synthetic hands in motion and emphasizing temporal smoothness of estimations with a temporal consistency constraint. Our novel training strategy of detaching the recurrent layer of the framework during domain fine-tuning from synthetic to real allows preservation of the visuo-temporal features learned from sequential synthetic hand images. Hand poses that are sequentially estimated consequently produce natural and smooth hand movements, which leads to more robust estimations. We show that utilizing temporal information for 3D hand pose estimation significantly enhances general pose estimation, outperforming state-of-the-art methods in experiments on hand pose estimation benchmarks. Since a fixed dataset provides a finite distribution of data samples, the generalization of a learned pose estimation network is limited in terms of pose, RGB, and viewpoint spaces. We further propose to augment the data automatically such that augmented pose sampling is performed in favor of the pose estimator's generalization performance. Such auto-augmentation of poses is performed within a learned feature space in order to avoid the computational burden of generating synthetic samples at every update iteration.
The proposed effort can be considered as generating and utilizing synthetic samples for network training directly in the feature space. This improves training efficiency by requiring fewer real data samples, and it enhances both generalization across multiple dataset domains and estimation performance through efficient augmentation. Korean abstract: Research on recognizing and reconstructing the shape and pose of a human hand from 2D images aims to detect the 3D position of each finger joint. A hand pose consists of the finger joints, the anatomical elements that make up the human hand from the wrist joint to the MCP, PIP, and DIP joints. Hand pose information can be exploited in a variety of fields, and it serves as an excellent input feature for hand gesture recognition. Applying hand pose estimation to real systems requires high accuracy, real-time operation, and models light enough to run on diverse devices, and training the neural networks that make this possible requires large amounts of data. However, devices that measure hand poses are fairly unstable, and images captured while wearing such devices differ greatly from the appearance of bare hand skin, making them unsuitable for training. For this reason, this dissertation re-engineers and augments synthetically generated data for training in order to achieve better learning outcomes. Although synthetic hand images resemble real skin color, their detailed texture differs considerably, so a model trained on synthetic data performs markedly worse on real hand data. To reduce the gap between these two domains, we first re-engineer hand motions so that the structure of the hand and its movement can be learned, then fine-tune only the non-temporal parts of the network on real hand images while preserving the temporal information learned from synthetic motion, which proved highly effective; in doing so, we present a methodology for mimicking real human hand motion. Second, we align data from the two domains in the network's feature space. Moreover, instead of augmenting synthetic poses with a fixed set of data, we propose a structure that treats pose augmentation as a probabilistic model and samples from it so that poses the network has rarely seen are generated. This dissertation thus proposes methods that use synthetic data more effectively, without the labor of collecting hard-to-annotate real data, and that improve pose estimation by exploiting local and temporal features. We also propose an automatic data augmentation scheme in which the network itself finds and learns from the data it needs. Combining these proposed methods further improves hand pose estimation performance. Contents: 1. Introduction; 2. Related Works; 3. Preliminaries: 3D Hand Mesh Model; 4. SeqHAND: RGB-sequence-based 3D Hand Pose and Shape Estimation; 5. Hand Pose Auto-Augment; 6. Conclusion; Abstract (Korean); Acknowledgements.
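    Two ingredients of the training strategy described above lend themselves to a short illustration: a temporal-consistency term over sequential pose estimates, and freezing ("detaching") the recurrent layer while fine-tuning from synthetic to real data. The sketch below shows generic versions of both under assumed tensor shapes and a hypothetical `model.recurrent` attribute; the dissertation's exact formulations may differ.

```python
import torch
import torch.nn as nn

def temporal_consistency_loss(pose_seq: torch.Tensor) -> torch.Tensor:
    """Penalize large frame-to-frame jumps in estimated 3D joints.
    pose_seq: (batch, time, joints, 3). A generic smoothness term in the
    spirit of the 'temporal consistency constraint' mentioned above; the
    exact formulation in the dissertation may differ."""
    return (pose_seq[:, 1:] - pose_seq[:, :-1]).pow(2).mean()

def freeze_recurrent_for_finetuning(model: nn.Module) -> None:
    """During synthetic-to-real fine-tuning, keep the recurrent layer fixed so
    visuo-temporal features learned from synthetic motion are preserved, and
    update only the remaining (appearance) parameters. Assumes the model
    exposes its recurrent layer as `model.recurrent` (hypothetical name)."""
    for p in model.recurrent.parameters():
        p.requires_grad = False
```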

    STMT: A Spatial-Temporal Mesh Transformer for MoCap-Based Action Recognition

    We study the problem of human action recognition using motion capture (MoCap) sequences. Unlike existing techniques that take multiple manual steps to derive standardized skeleton representations as model input, we propose a novel Spatial-Temporal Mesh Transformer (STMT) to directly model the mesh sequences. The model uses a hierarchical transformer with intra-frame offset attention and inter-frame self-attention. The attention mechanism allows the model to freely attend between any two vertex patches to learn non-local relationships in the spatial-temporal domain. Masked vertex modeling and future frame prediction are used as two self-supervised tasks to fully activate the bi-directional and auto-regressive attention in our hierarchical transformer. The proposed method achieves state-of-the-art performance compared to skeleton-based and point-cloud-based models on common MoCap benchmarks. Code is available at https://github.com/zgzxy001/STMT. Comment: CVPR 202
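    Two components named in the abstract, intra-frame offset attention and masked vertex modeling, can be illustrated generically. The sketch below shows an offset-attention block (the residual is built from the difference between the input and the attention output, in the style popularized by point-cloud transformers) and a helper that masks random vertex-patch tokens; dimensions and the masking ratio are assumptions, and this is not the released STMT code (see the repository linked above for that).

```python
import torch
import torch.nn as nn

class OffsetAttention(nn.Module):
    """Generic offset-attention block over vertex-patch tokens of one frame.
    Shown only to make 'intra-frame offset attention' concrete; sizes are
    illustrative assumptions."""

    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_vertex_patches, dim) for a single frame
        att, _ = self.attn(x, x, x)
        return self.norm(x + self.mlp(x - att))  # residual from the offset (x - att)

def mask_vertex_patches(tokens: torch.Tensor, ratio: float = 0.4):
    """Masked-vertex-modeling helper: zero out a random subset of patch tokens
    and return the boolean mask so a decoder can be trained to reconstruct them."""
    b, n, _ = tokens.shape
    mask = torch.rand(b, n, device=tokens.device) < ratio  # True = masked
    return tokens.masked_fill(mask.unsqueeze(-1), 0.0), mask
```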

    Deep Learning for Video Object Segmentation: A Review

    As one of the fundamental problems in the field of video understanding, video object segmentation aims at segmenting objects of interest throughout the given video sequence. Recently, with the advancement of deep learning techniques, deep neural networks have shown outstanding performance improvements in many computer vision applications, with video object segmentation being among the most advocated and intensively investigated. In this paper, we present a systematic review of the deep learning-based video segmentation literature, highlighting the pros and cons of each category of approaches. Concretely, we start by introducing the definition, background concepts and basic ideas of algorithms in this field. Subsequently, we summarise the datasets for training and testing a video object segmentation algorithm, as well as common challenges and evaluation metrics. Next, previous works are grouped and reviewed based on how they extract and use spatial and temporal features, where their architectures, contributions and the differences among them are elaborated. Finally, the quantitative and qualitative results of several representative methods on a dataset with many remaining challenges are provided and analysed, followed by further discussion of future research directions. This article is expected to serve as a tutorial and source of reference for learners who intend to quickly grasp the current progress in this research area and for practitioners interested in applying video object segmentation methods to their problems. A public website has been built to collect and track the related works in this field: https://github.com/gaomingqi/VOS-Review