SimCol3D -- 3D Reconstruction during Colonoscopy Challenge
Colorectal cancer is one of the most common cancers in the world. While
colonoscopy is an effective screening technique, navigating an endoscope
through the colon to detect polyps is challenging. A 3D map of the observed
surfaces could enhance the identification of unscreened colon tissue and serve
as a training platform. However, reconstructing the colon from video footage
remains unsolved due to numerous factors such as self-occlusion, reflective
surfaces, lack of texture, and tissue deformation that limit feature-based
methods. Learning-based approaches hold promise as robust alternatives, but
necessitate extensive datasets. By establishing a benchmark, the 2022 EndoVis
sub-challenge SimCol3D aimed to facilitate data-driven depth and pose
prediction during colonoscopy. The challenge was hosted as part of MICCAI 2022
in Singapore. Six teams from around the world and representatives from academia
and industry participated in the three sub-challenges: synthetic depth
prediction, synthetic pose prediction, and real pose prediction. This paper
describes the challenge, the submitted methods, and their results. We show that
depth prediction in virtual colonoscopy is robustly solvable, while pose
estimation remains an open research question.
Improved deep depth estimation for environments with sparse visual cues
Most deep learning-based depth estimation models that learn scene structure self-supervised from monocular video base their estimation on visual cues such as vanishing points. In established depth estimation benchmarks depicting, for example, street navigation or indoor offices, these cues occur consistently, which enables neural networks to predict depth maps from single images. In this work, we address the challenge of depth estimation from a real-world bird’s-eye perspective in an industrial environment which, owing to its special geometry, contains only minimal visual cues and hence requires incorporating the temporal domain for structure-from-motion estimation. To enable the system to infer structure from motion from pixel translation when facing context-sparse, i.e., visual-cue-sparse, scenery, we propose a novel architecture built upon the structure-from-motion learner, which uses temporal pairs of jointly unrotated and stacked images for depth prediction. To increase overall performance and to avoid blurred depth edges lying between the edges of the two input images, we integrate a geometric consistency loss into our pipeline. We assess the model’s ability to learn structure from motion by introducing a novel industry dataset whose perspective, orthogonal to the floor, contains only minimal visual cues. Through evaluation against ground-truth depth, we show that our proposed method outperforms the state of the art in difficult context-sparse environments.
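The "jointly unrotated" image pairs described above can be illustrated with a minimal sketch: if the camera rotation between two frames is known, rotating one frame's pixel coordinates back into the other's orientation leaves only depth-dependent pixel translation. The rotation angle, point, and image center below are hypothetical, not values from the paper.

```python
import math

def unrotate_coords(points, angle_rad, center):
    """Rotate pixel coordinates by -angle_rad about the image center,
    removing a known camera roll so that the residual pixel translation
    between two stacked frames reflects depth-dependent parallax only."""
    cx, cy = center
    c, s = math.cos(-angle_rad), math.sin(-angle_rad)
    out = []
    for x, y in points:
        dx, dy = x - cx, y - cy
        out.append((cx + c * dx - s * dy, cy + s * dx + c * dy))
    return out

# Hypothetical example: undo a 90-degree roll about the center of a 100x100 image.
pts = unrotate_coords([(60.0, 50.0)], math.pi / 2, (50.0, 50.0))
```

In the paper's pipeline this alignment happens before the two frames are stacked and fed to the depth network; the sketch only captures the coordinate transform itself.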
Fusion-GRU: A Deep Learning Model for Future Bounding Box Prediction of Traffic Agents in Risky Driving Videos
To ensure the safe and efficient navigation of autonomous vehicles and
advanced driving assistance systems in complex traffic scenarios, predicting
the future bounding boxes of surrounding traffic agents is crucial. However,
simultaneously predicting the future location and scale of target traffic
agents from the egocentric view poses challenges due to the vehicle's egomotion
causing considerable field-of-view changes. Moreover, in anomalous or risky
situations, tracking loss or abrupt motion changes limit the available
observation time, requiring learning of cues within a short time window.
Existing methods typically use a simple concatenation operation to combine
different cues, overlooking their dynamics over time. To address this, this
paper introduces the Fusion-Gated Recurrent Unit (Fusion-GRU) network, a novel
encoder-decoder architecture for future bounding box localization. Unlike
traditional GRUs, Fusion-GRU accounts for mutual and complex interactions among
input features. Moreover, an intermediary estimator coupled with a
self-attention aggregation layer is also introduced to learn sequential
dependencies for long-range prediction. Finally, a GRU decoder is employed to
predict the future bounding boxes. The proposed method is evaluated on two
publicly available datasets, ROL and HEV-I. The experimental results showcase
the promising performance of the Fusion-GRU, demonstrating its effectiveness in
predicting future bounding boxes of traffic agents.
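The gated fusion of input cues that distinguishes Fusion-GRU from a plain concatenation can be sketched in a heavily simplified scalar form. All weights below are hypothetical scalars; the actual model learns matrix-valued gates over feature vectors.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def fused_gru_step(h, cue_a, cue_b, w_fuse, w_z, w_r, w_h):
    """One scalar GRU step over a gated fusion of two input cues,
    so their relative weighting can change over time instead of
    being fixed by a one-off concatenation."""
    g = sigmoid(w_fuse * (cue_a - cue_b))   # fusion gate weighs the two cues
    x = g * cue_a + (1.0 - g) * cue_b       # fused input
    z = sigmoid(w_z * (x + h))              # update gate
    r = sigmoid(w_r * (x + h))              # reset gate
    h_tilde = math.tanh(w_h * (x + r * h))  # candidate hidden state
    return (1.0 - z) * h + z * h_tilde

# Hypothetical two-step sequence with all weights set to 1.0.
h = 0.0
for cue_a, cue_b in [(0.5, 0.1), (0.6, 0.2)]:
    h = fused_gru_step(h, cue_a, cue_b, 1.0, 1.0, 1.0, 1.0)
```

The point of the sketch is only the ordering: the fusion gate acts on the inputs before the standard GRU gates, letting the recurrence track how the cues' importance evolves.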
DeepSLAM: A Robust Monocular SLAM System with Unsupervised Deep Learning
In this paper, we propose DeepSLAM, a novel unsupervised deep learning-based visual Simultaneous Localization and Mapping (SLAM) system. DeepSLAM training is fully unsupervised, since it requires only stereo imagery rather than annotated ground-truth poses. At test time it takes a monocular image sequence as input; it is therefore a monocular SLAM paradigm. DeepSLAM consists of several essential components: Mapping-Net, Tracking-Net, Loop-Net, and a graph optimization unit. Specifically, Mapping-Net is an encoder-decoder architecture that describes the 3D structure of the environment, while Tracking-Net is a Recurrent Convolutional Neural Network (RCNN) architecture that captures the camera motion. Loop-Net is a pre-trained binary classifier for detecting loop closures. DeepSLAM can simultaneously generate pose estimates, depth maps, and outlier rejection masks. We evaluate its performance on various datasets and find that DeepSLAM achieves good pose estimation accuracy and remains robust in some challenging scenes.
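The division of labor between a Tracking-Net-style odometry front end and the graph optimization unit can be sketched with planar pose composition: relative motions are chained frame to frame, and any residual at a detected loop closure is the drift the graph optimizer would correct. The square trajectory below is a hypothetical illustration, not data from the paper.

```python
import math

def compose(pose, delta):
    """Compose an SE(2) pose (x, y, theta) with a relative motion
    (dx, dy, dtheta) expressed in the current body frame, as an
    odometry front end accumulates per-frame motion estimates."""
    x, y, th = pose
    dx, dy, dth = delta
    return (x + math.cos(th) * dx - math.sin(th) * dy,
            y + math.sin(th) * dx + math.cos(th) * dy,
            th + dth)

# Hypothetical square loop: four unit forward moves with 90-degree turns
# should return to the start; with noisy per-frame estimates the residual
# at loop closure is exactly what graph optimization redistributes.
pose = (0.0, 0.0, 0.0)
for _ in range(4):
    pose = compose(pose, (1.0, 0.0, math.pi / 2))
drift = math.hypot(pose[0], pose[1])
```

With exact motions the loop closes (drift is numerically zero); DeepSLAM's Loop-Net supplies the loop-closure detections that tell the graph unit where such constraints apply.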
Unsupervised Event-based Learning of Optical Flow, Depth, and Egomotion
In this work, we propose a novel framework for unsupervised learning for
event cameras that learns motion information from only the event stream. In
particular, we propose an input representation of the events in the form of a
discretized volume that maintains the temporal distribution of the events,
which we pass through a neural network to predict the motion of the events.
This motion is used to attempt to remove any motion blur in the event image. We
then propose a loss function applied to the motion compensated event image that
measures the motion blur in this image. We train two networks with this
framework, one to predict optical flow, and one to predict egomotion and
depths, and evaluate these networks on the Multi Vehicle Stereo Event Camera
dataset, along with qualitative results from a variety of different scenes.
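The discretized event volume described above can be sketched directly: each event splits its polarity between the two nearest time bins by linear interpolation, which is what preserves the temporal distribution. Sparse storage and the example event below are illustrative choices, not the paper's implementation.

```python
def event_volume(events, num_bins, t0, t1):
    """Accumulate events (t, x, y, polarity) into a temporally
    discretized volume: each event's polarity is shared between the
    two nearest time bins with linear-interpolation weights."""
    volume = {}  # (bin, x, y) -> accumulated polarity
    scale = (num_bins - 1) / (t1 - t0)
    for t, x, y, p in events:
        tb = (t - t0) * scale          # fractional bin index
        b0 = int(tb)
        w1 = tb - b0                   # weight for the upper bin
        volume[(b0, x, y)] = volume.get((b0, x, y), 0.0) + p * (1.0 - w1)
        if w1 > 0.0:
            volume[(b0 + 1, x, y)] = volume.get((b0 + 1, x, y), 0.0) + p * w1
    return volume

# A positive event halfway through the window lands half in each time bin.
vol = event_volume([(0.5, 3, 4, +1)], num_bins=2, t0=0.0, t1=1.0)
```

A hard binning would instead round each event to one bin, discarding sub-bin timing; the interpolated variant keeps that timing differentiable for the downstream motion network.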
Learning Optical Flow, Depth, and Scene Flow without Real-World Labels
Self-supervised monocular depth estimation enables robots to learn 3D
perception from raw video streams. This scalable approach leverages projective
geometry and ego-motion to learn via view synthesis, assuming the world is
mostly static. Dynamic scenes, which are common in autonomous driving and
human-robot interaction, violate this assumption. Therefore, they require
modeling dynamic objects explicitly, for instance via estimating pixel-wise 3D
motion, i.e. scene flow. However, the simultaneous self-supervised learning of
depth and scene flow is ill-posed, as there are infinitely many combinations
that result in the same 3D point. In this paper we propose DRAFT, a new method
capable of jointly learning depth, optical flow, and scene flow by combining
synthetic data with geometric self-supervision. Building upon the RAFT
architecture, we learn optical flow as an intermediate task to bootstrap depth
and scene flow learning via triangulation. Our algorithm also leverages
temporal and geometric consistency losses across tasks to improve multi-task
learning. Our DRAFT architecture simultaneously establishes a new state of the
art in all three tasks in the self-supervised monocular setting on the standard
KITTI benchmark. Project page: https://sites.google.com/tri.global/draft. Comment: Accepted to RA-L + ICRA 202
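The view-synthesis supervision underlying this line of work can be sketched for a single pixel: back-project with the predicted depth, apply the estimated ego-motion, and reproject into the next frame; the photometric error at the warped location is the training signal. The intrinsics, depth, and purely translational motion below are hypothetical simplifications (rotation is omitted).

```python
def warp_pixel(u, v, depth, K, translation):
    """Back-project a pixel using its predicted depth, move it by a
    purely translational ego-motion, and reproject it into the next
    view -- the core warp of self-supervised depth via view synthesis."""
    fx, fy, cx, cy = K
    # back-project to a 3D point in the camera frame
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    z = depth
    # apply ego-motion (rotation omitted for brevity in this sketch)
    tx, ty, tz = translation
    x, y, z = x + tx, y + ty, z + tz
    # reproject into the second view with the same pinhole intrinsics
    return (fx * x / z + cx, fy * y / z + cy)

# Hypothetical intrinsics (fx, fy, cx, cy); the camera moves 1 unit forward,
# so an off-axis point at depth 10 shifts outward in the next image.
u2, v2 = warp_pixel(120.0, 100.0, 10.0, (100.0, 100.0, 100.0, 100.0), (0.0, 0.0, -1.0))
```

Dynamic scenes break this warp because a moving point violates the static-world assumption, which is exactly the gap DRAFT fills by estimating scene flow alongside depth and optical flow.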