Motion Segmentation from a Moving Monocular Camera
Identifying and segmenting moving objects from a moving monocular camera is
difficult in the presence of unknown camera motion, diverse object motions,
and complex scene structures. To tackle these challenges, we take
advantage of two popular branches of monocular motion segmentation approaches:
point trajectory based and optical flow based methods, by synergistically
fusing these two highly complementary motion cues at object level. By doing
this, we are able to model various complex object motions in different scene
structures at once, which has not been achieved by existing methods. We first
obtain object-specific point trajectories and an optical flow mask for each
common object in the video by leveraging recent foundation models for object
recognition, segmentation, and tracking. We then construct two robust affinity
matrices representing the pairwise object motion affinities throughout the
whole video using epipolar geometry and the motion information provided by
optical flow. Finally, co-regularized multi-view spectral clustering is used to
fuse the two affinity matrices and obtain the final clustering. Our method
shows state-of-the-art performance on the KT3DMoSeg dataset, which contains
complex motions and scene structures. Being able to identify moving objects
allows us to remove them for map building when using visual SLAM or SfM.
Comment: Accepted by IROS 2023 Workshop on Robotic Perception And Mapping:
Frontier Vision and Learning Technique
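To make the fusion step concrete, the following is a minimal sketch of pairwise co-regularized multi-view spectral clustering over two object-level affinity matrices, in the spirit of the approach described above; the function names, the co-regularization weight lam, and the iteration count are illustrative assumptions, not the authors' implementation.

    import numpy as np
    from scipy.linalg import eigh
    from sklearn.cluster import KMeans

    def normalized_affinity(W):
        # Symmetrically normalised affinity D^{-1/2} W D^{-1/2}.
        d = np.clip(W.sum(axis=1), 1e-12, None)
        s = 1.0 / np.sqrt(d)
        return W * s[:, None] * s[None, :]

    def top_eigenvectors(M, k):
        # k eigenvectors of a symmetric matrix with the largest eigenvalues.
        n = M.shape[0]
        _, U = eigh(M, subset_by_index=[n - k, n - 1])
        return U

    def coregularized_fusion(W_epi, W_flow, k, lam=0.5, iters=10):
        # Alternate between the two views; each view's affinity is augmented
        # with the other view's current subspace, pulling the two spectral
        # embeddings into agreement (co-regularization).
        K1, K2 = normalized_affinity(W_epi), normalized_affinity(W_flow)
        U1, U2 = top_eigenvectors(K1, k), top_eigenvectors(K2, k)
        for _ in range(iters):
            U1 = top_eigenvectors(K1 + lam * (U2 @ U2.T), k)
            U2 = top_eigenvectors(K2 + lam * (U1 @ U1.T), k)
        # Cluster the joint embedding to get one motion label per object.
        Z = np.hstack([U1, U2])
        return KMeans(n_clusters=k, n_init=10).fit_predict(Z)

Here W_epi and W_flow stand for the epipolar-geometry and optical-flow affinities; in practice the number of motions k may itself need to be estimated rather than given.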
Learning Monocular Depth in Dynamic Scenes via Instance-Aware Projection Consistency
We present an end-to-end joint training framework that explicitly models
6-DoF motion of multiple dynamic objects, ego-motion and depth in a monocular
camera setup without supervision. Our technical contributions are three-fold.
First, we highlight the fundamental difference between inverse and forward
projection while modeling the individual motion of each rigid object, and
propose a geometrically correct projection pipeline using a neural forward
projection module. Second, we design a unified instance-aware photometric and
geometric consistency loss that holistically imposes self-supervisory signals
for every background and object region. Lastly, we introduce a general-purpose
auto-annotation scheme using any off-the-shelf instance segmentation and
optical flow models to produce video instance segmentation maps that will be
utilized as input to our training pipeline. These proposed elements are
validated in a detailed ablation study. Through extensive experiments conducted
on the KITTI and Cityscapes datasets, our framework is shown to outperform the
state-of-the-art depth and motion estimation methods. Our code, dataset, and
models are available at https://github.com/SeokjuLee/Insta-DM.
Comment: Accepted to AAAI 2021. Code/dataset/models are available at
https://github.com/SeokjuLee/Insta-DM. arXiv admin note: substantial text
overlap with arXiv:1912.0935
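To illustrate the projection distinction the paper builds on, below is a hedged sketch of conventional inverse warping for view synthesis in PyTorch; the tensor shapes and helper names are assumptions, not the Insta-DM code. Inverse projection samples the source image where each target pixel lands, so it leaves no holes but presumes target-frame depth; forward projection scatters source pixels into the target view, which is the geometrically correct direction when each object moves independently but creates holes, motivating a learned forward projection module.

    import torch
    import torch.nn.functional as F

    def backproject(depth, K_inv):
        # Lift every pixel (u, v) to a 3D point X = depth * K^{-1} [u, v, 1]^T.
        b, _, h, w = depth.shape
        v, u = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
        pix = torch.stack([u, v, torch.ones_like(u)], dim=0).float().reshape(1, 3, -1)
        return depth.reshape(b, 1, -1) * (K_inv @ pix)      # (b, 3, h*w)

    def inverse_warp(src_img, tgt_depth, T_tgt2src, K, K_inv):
        # For every *target* pixel, find where it lands in the source image
        # and bilinearly sample there.
        b, _, h, w = src_img.shape
        X = backproject(tgt_depth, K_inv)                    # target-frame points
        X = T_tgt2src[:, :3, :3] @ X + T_tgt2src[:, :3, 3:]  # move into source frame
        p = K @ X
        p = p[:, :2] / p[:, 2:].clamp(min=1e-6)              # perspective divide
        grid = p.reshape(b, 2, h, w).permute(0, 2, 3, 1).clone()
        grid[..., 0] = 2.0 * grid[..., 0] / (w - 1) - 1.0    # normalise to [-1, 1]
        grid[..., 1] = 2.0 * grid[..., 1] / (h - 1) - 1.0
        return F.grid_sample(src_img, grid, align_corners=True)
    # A forward projection would instead scatter source pixels to target
    # coordinates, leaving holes that a neural module can learn to resolve.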
EMR-MSF: Self-Supervised Recurrent Monocular Scene Flow Exploiting Ego-Motion Rigidity
Self-supervised monocular scene flow estimation, aiming to understand both 3D
structures and 3D motions from two temporally consecutive monocular images, has
received increasing attention for its simple and economical sensor setup.
However, the accuracy of current methods is bottlenecked by less efficient
network architectures and a lack of motion-rigidity regularization. In this
paper, we propose a superior model named EMR-MSF that borrows the strengths
of network architecture design from supervised learning. We further impose
explicit and robust geometric
constraints with an elaborately constructed ego-motion aggregation module where
a rigidity soft mask is proposed to filter out dynamic regions for stable
ego-motion estimation using static regions. Moreover, we propose a motion
consistency loss along with a mask regularization loss to fully exploit static
regions. Several efficient training strategies are integrated including a
gradient detachment technique and an enhanced view synthesis process for better
performance. Our proposed method outperforms the previous self-supervised works
by a large margin and catches up to the performance of supervised methods. On
the KITTI scene flow benchmark, our approach improves the SF-all metric of the
state-of-the-art self-supervised monocular method by 44% and demonstrates
superior performance across sub-tasks including depth and visual odometry,
amongst other self-supervised single-task or multi-task methods.
Comment: To appear at ICCV 2023
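As a loose illustration of the rigidity soft mask and the motion consistency loss described above (the exact residual, the sharpness alpha and the loss weights are assumptions, not the EMR-MSF formulation):

    import torch

    def rigidity_soft_mask(flow_pred, flow_rigid, alpha=10.0):
        # Pixels whose predicted flow agrees with the ego-motion-induced
        # rigid flow are likely static; map the residual to a (0, 1] mask.
        residual = (flow_pred - flow_rigid).norm(dim=1, keepdim=True)
        return torch.exp(-alpha * residual)

    def motion_consistency_loss(flow_pred, flow_rigid, mask, w_reg=0.01, eps=1e-6):
        # Pull the predicted flow towards the rigid flow where the mask says
        # the scene is static; the -log term keeps the mask from collapsing
        # to zero everywhere (mask regularization).
        consistency = (mask * (flow_pred - flow_rigid).abs()).mean()
        mask_reg = -torch.log(mask + eps).mean()
        return consistency + w_reg * mask_reg

A mask of this kind can also weight per-pixel contributions inside an ego-motion aggregation step, so that dynamic regions do not corrupt the pose estimate.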
Automatic vehicle detection and tracking in aerial video
This thesis is concerned with the challenging tasks of automatic, real-time vehicle detection and tracking in aerial video. The aim of this thesis is to build an automatic system that can accurately localise any vehicles appearing in aerial video frames and track the target vehicles.
Vehicle detection and tracking have many applications and have been an active area of research in recent years; however, certain realistic environments remain a challenge. This thesis develops vehicle detection and tracking algorithms that enhance robustness beyond existing approaches. The vehicle detection system proposed in this thesis builds on different object categorisation approaches, using colour and texture features in both point and area-template forms. The thesis also proposes a novel Self-Learning Tracking and Detection approach, an extension of the existing Tracking-Learning-Detection (TLD) algorithm. There are a number of challenges in vehicle detection and tracking. The most difficult challenge in detection is distinguishing and separating the target vehicle from background objects and noise. Under certain conditions, the images captured from Unmanned Aerial Vehicles (UAVs) are also blurred; for example, turbulence may make the vehicle shake during flight. This thesis tackles these challenges by applying integrated multiple feature descriptors for real-time processing.
In this thesis, three vehicle detection approaches are proposed: the HSV-GLCM feature approach, the ISM-SIFT feature approach and the FAST-HoG approach. The general vehicle detection approaches used have highly flexible implicit shape representations. They are trained on both positive and negative sample sets and use updated classifiers to distinguish the targets. It has been found that the detection results attained using HSV-GLCM texture features can be affected by blurring; the proposed detection algorithms can further segment the edges of the vehicles from the background. Using point descriptor features can solve the blurring problem; however, the large amount of information contained in point descriptors can lead to processing times that are too long for real-time applications. The FAST-HoG approach, combining the point feature and the shape feature, is therefore proposed; it speeds up processing enough to attain real-time performance. Finally, a detection approach using HoG with the FAST feature is also proposed. The HoG descriptor is widely used in object recognition because of its strong ability to represent the shape of an object. However, the original HoG feature is sensitive to the orientation of the target; this method improves the algorithm by incorporating the direction vectors of the targets.
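As a rough sketch of how a FAST-HoG style detector front end might pair the two cues (OpenCV; the FAST threshold, patch size and HOG geometry are placeholder choices, not the thesis settings):

    import cv2
    import numpy as np

    def fast_hog_features(gray, patch=64):
        # Detect FAST corners quickly, then describe a fixed-size patch
        # around each corner with a HOG vector (point cue + shape cue).
        fast = cv2.FastFeatureDetector_create(threshold=40)
        hog = cv2.HOGDescriptor((patch, patch), (16, 16), (8, 8), (8, 8), 9)
        half, feats = patch // 2, []
        for kp in fast.detect(gray, None):
            x, y = int(kp.pt[0]), int(kp.pt[1])
            # Skip keypoints whose patch would fall outside the frame.
            if half <= x < gray.shape[1] - half and half <= y < gray.shape[0] - half:
                roi = gray[y - half:y + half, x - half:x + half]
                feats.append(hog.compute(roi).ravel())
        return np.array(feats)   # one HOG descriptor per usable FAST keypoint

The resulting descriptors would then be scored by a classifier trained on the positive and negative sample sets mentioned above.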
For the tracking process, a novel approach is proposed, an extension of the TLD algorithm, in order to track multiple targets. The extended approach upgrades the original system, which can only track a single target that must be selected before the detection and tracking process begins. The greatest challenge in vehicle tracking is long-term tracking: the target object can change its appearance during the process, and illumination and scale changes can also occur. The original TLD assumed that the tracker can make errors during tracking, and that the accumulation of these errors could cause tracking failure; it therefore introduced a learning stage between tracking and detection, adding a pair of inspectors (positive and negative) to constantly estimate errors. This thesis extends the TLD approach with a new detection method in order to achieve multiple-target tracking. A Forward and Backward Tracking approach is proposed to eliminate tracking errors and handle problems such as occlusion. The main purpose of the proposed tracking system is to learn the features of the targets during tracking and to re-train the detection classifier for further processing.
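A minimal sketch of the forward-backward consistency check that underpins such tracking, using pyramidal Lucas-Kanade in OpenCV (the window size and error threshold are illustrative assumptions):

    import cv2
    import numpy as np

    def forward_backward_filter(prev_gray, next_gray, pts, fb_thresh=1.0):
        # pts: (n, 1, 2) float32 point locations in the previous frame.
        lk = dict(winSize=(15, 15), maxLevel=3)
        fwd, st1, _ = cv2.calcOpticalFlowPyrLK(prev_gray, next_gray, pts, None, **lk)
        bwd, st2, _ = cv2.calcOpticalFlowPyrLK(next_gray, prev_gray, fwd, None, **lk)
        # A point is trusted only if tracking it forward and then backward
        # returns (almost) to where it started.
        fb_err = np.linalg.norm(pts - bwd, axis=-1).ravel()
        ok = (st1.ravel() == 1) & (st2.ravel() == 1) & (fb_err < fb_thresh)
        return fwd[ok], ok

Points that fail the check are discarded before the object motion is estimated, which is one way to keep a tracker robust to drift and partial occlusion.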
This thesis puts particular emphasis on vehicle detection and tracking in extreme scenarios such as crowded highways, blurred images and changes in the appearance of the targets. Compared with existing detection and tracking approaches, the proposed approaches demonstrate a robust increase in accuracy in each scenario.
Holoscopic 3D perception for autonomous vehicles
This thesis was submitted for the award of Doctor of Philosophy and was awarded by Brunel University London.
Autonomous mobile platforms will be a huge part of future transportation, and autonomous navigation is a critical capability of such platforms. An autonomous mobile platform navigates by perceiving the environment through sensors mounted on the vehicle and acting on the data received from those sensors, making sense of the environment and its surroundings. Autonomous navigation therefore consists of localisation (positioning) and path planning, both of which require very accurate sensor measurements. In terms of accuracy, sensors can generally be divided into two groups: (a) high-accuracy sensors, such as the state of the art in LiDAR and vision sensors, e.g. the Mobileye sensor; and (b) low-accuracy sensors, such as GPS (accurate to within 2-10 metres) and IMUs (which suffer from drift), which can be fused to improve positioning. High-accuracy approaches are expensive because of offline map creation, while with low-accuracy sensors researchers normally resort to very complex models, which in turn run into performance, reliability and consistency issues. Furthermore, it is commonly believed that perception and situational awareness are essential for safe autonomous navigation, and there has been a great deal of research on AI-enabled perception, such as Mobileye and Tesla cars, which use 2D cameras for perception. In this research, an innovative method is proposed that uses a rich vision sensor, the holoscopic 3D camera, for environment perception, together with artificial intelligence algorithms that observe road objects and learn their 3D behaviour for reliable detection and recognition. The sensor provides rich information: 3D cubic visual information about the environment, including the very valuable depth information that captures the third coordinate of the real world. To learn the objects, different AI algorithms are studied; in particular, a deep learning model is proposed that provides reasonably good results. To evaluate the holoscopic 3D sensor, we applied it to a face recognition challenge under different facial expressions, where 2D images are considered to fail; the holoscopic 3D sensor delivered outstanding performance, recognising faces under different expressions after training only on the neutral face with a simple AI algorithm. We then designed and developed a holoscopic perception database of 200,000 frames for autonomous cars. The experimental results are promising: AI algorithms, particularly deep learning algorithms, learn more effectively from holoscopic 3D content than from traditional 2D images, even with DL models designed for 2D visual features, and holoscopic 3D images additionally contain motion data that remains to be exploited.
USegScene: Unsupervised Learning of Depth, Optical Flow and Ego-Motion with Semantic Guidance and Coupled Networks
In this paper we propose USegScene, a framework for semantically guided
unsupervised learning of depth, optical flow and ego-motion estimation for
stereo camera images using convolutional neural networks. Our framework
leverages semantic information for improved regularization of depth and optical
flow maps, multimodal fusion and occlusion filling considering dynamic rigid
object motions as independent SE(3) transformations. Furthermore, complementary
to pure photometric matching, we propose matching of semantic features,
pixel-wise classes and object instance borders between the consecutive images.
In contrast to previous methods, we propose a network architecture that jointly
predicts all outputs using shared encoders and allows passing information
across the task-domains, e.g., the prediction of optical flow can benefit from
the prediction of the depth. Furthermore, we explicitly learn the depth and
optical flow occlusion maps inside the network, which are leveraged in order to
improve the predictions in the respective regions. We present results on the
popular KITTI dataset and show that our approach outperforms other methods by a
large margin.
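To make the independent SE(3) object motions concrete, here is a hedged sketch of applying per-instance rigid transforms to back-projected points (PyTorch; the shapes and names are assumptions, not the USegScene implementation):

    import torch

    def apply_instance_motions(points, inst_mask, T_ego, T_obj):
        # points:    (b, 3, N) camera-frame 3D points
        # inst_mask: (b, n, N) binary masks, one per dynamic object
        # T_ego:     (b, 4, 4) ego-motion; T_obj: (b, n, 4, 4) object motions
        def transform(T, X):
            return T[..., :3, :3] @ X + T[..., :3, 3:]
        moved = transform(T_ego, points)            # static background moves
        for i in range(inst_mask.shape[1]):         # with the camera only
            m = inst_mask[:, i:i + 1].bool()        # (b, 1, N)
            moved = torch.where(m, transform(T_obj[:, i], points), moved)
        return moved

Each object region is thus warped by its own rigid transform before photometric and semantic matching terms are evaluated.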
Review of constraints on vision-based gesture recognition for human–computer interaction
The ability of computers to recognise hand gestures visually is essential for progress in human-computer interaction. Gesture recognition has applications ranging from sign language to medical assistance to virtual reality. However, gesture recognition is extremely challenging not only because of its diverse contexts, multiple interpretations, and spatio-temporal variations but also because of the complex non-rigid properties of the hand. This study surveys major constraints on vision-based gesture recognition occurring in detection and pre-processing, representation and feature extraction, and recognition. Current challenges are explored in detail.