72,191 research outputs found
Partially Supervised Multi-Task Network for Single-View Dietary Assessment
Food volume estimation is an essential step in the pipeline of dietary
assessment and demands the precise depth estimation of the food surface and
table plane. Existing methods based on computer vision require either
multi-image input or additional depth maps, reducing convenience of
implementation and practical significance. Despite the recent advances in
unsupervised depth estimation from a single image, the achieved performance in
the case of large texture-less areas needs to be improved. In this paper, we
propose a network architecture that jointly performs geometric understanding
(i.e., depth prediction and 3D plane estimation) and semantic prediction on a
single food image, enabling a robust and accurate food volume estimation
regardless of the texture characteristics of the target plane. For the training
of the network, only monocular videos with semantic ground truth are required,
while the depth map and 3D plane ground truth are no longer needed.
Experimental results on two separate food image databases demonstrate that our
method performs robustly on texture-less scenarios and is superior to
unsupervised networks and structure from motion based approaches, while it
achieves comparable performance to fully-supervised methods
Ego-motion and Surrounding Vehicle State Estimation Using a Monocular Camera
Understanding ego-motion and surrounding vehicle state is essential to enable
automated driving and advanced driving assistance technologies. Typical
approaches to solve this problem use fusion of multiple sensors such as LiDAR,
camera, and radar to recognize surrounding vehicle state, including position,
velocity, and orientation. Such sensing modalities are overly complex and
costly for production of personal use vehicles. In this paper, we propose a
novel machine learning method to estimate ego-motion and surrounding vehicle
state using a single monocular camera. Our approach is based on a combination
of three deep neural networks to estimate the 3D vehicle bounding box, depth,
and optical flow from a sequence of images. The main contribution of this paper
is a new framework and algorithm that integrates these three networks in order
to estimate the ego-motion and surrounding vehicle state. To realize more
accurate 3D position estimation, we address ground plane correction in
real-time. The efficacy of the proposed method is demonstrated through
experimental evaluations that compare our results to ground truth data
available from other sensors including Can-Bus and LiDAR
Pop-up SLAM: Semantic Monocular Plane SLAM for Low-texture Environments
Existing simultaneous localization and mapping (SLAM) algorithms are not
robust in challenging low-texture environments because there are only few
salient features. The resulting sparse or semi-dense map also conveys little
information for motion planning. Though some work utilize plane or scene layout
for dense map regularization, they require decent state estimation from other
sources. In this paper, we propose real-time monocular plane SLAM to
demonstrate that scene understanding could improve both state estimation and
dense mapping especially in low-texture environments. The plane measurements
come from a pop-up 3D plane model applied to each single image. We also combine
planes with point based SLAM to improve robustness. On a public TUM dataset,
our algorithm generates a dense semantic 3D model with pixel depth error of 6.2
cm while existing SLAM algorithms fail. On a 60 m long dataset with loops, our
method creates a much better 3D model with state estimation error of 0.67%.Comment: International Conference on Intelligent Robots and Systems (IROS)
201
3D Visual Perception for Self-Driving Cars using a Multi-Camera System: Calibration, Mapping, Localization, and Obstacle Detection
Cameras are a crucial exteroceptive sensor for self-driving cars as they are
low-cost and small, provide appearance information about the environment, and
work in various weather conditions. They can be used for multiple purposes such
as visual navigation and obstacle detection. We can use a surround multi-camera
system to cover the full 360-degree field-of-view around the car. In this way,
we avoid blind spots which can otherwise lead to accidents. To minimize the
number of cameras needed for surround perception, we utilize fisheye cameras.
Consequently, standard vision pipelines for 3D mapping, visual localization,
obstacle detection, etc. need to be adapted to take full advantage of the
availability of multiple cameras rather than treat each camera individually. In
addition, processing of fisheye images has to be supported. In this paper, we
describe the camera calibration and subsequent processing pipeline for
multi-fisheye-camera systems developed as part of the V-Charge project. This
project seeks to enable automated valet parking for self-driving cars. Our
pipeline is able to precisely calibrate multi-camera systems, build sparse 3D
maps for visual navigation, visually localize the car with respect to these
maps, generate accurate dense maps, as well as detect obstacles based on
real-time depth map extraction
- …