Object-based 2D-to-3D video conversion for effective stereoscopic content generation in 3D-TV applications
Three-dimensional television (3D-TV) has gained increasing popularity in the broadcasting domain, as it enables enhanced viewing experiences compared with conventional two-dimensional (2D) TV. However, its application has been constrained by the lack of essential content, i.e., stereoscopic videos. To alleviate this content shortage, an economical and practical solution is to reuse the huge media resources available in monoscopic 2D and convert them to stereoscopic 3D. Although stereoscopic video can be generated from monoscopic sequences using depth measurements extracted from cues such as focus blur, motion, and size, the quality of the resulting video may be poor, as such measurements are usually arbitrarily defined and inconsistent with the real scenes. To help solve this problem, a novel method for object-based stereoscopic video generation is proposed which features i) optical-flow based occlusion reasoning in determining depth ordinal, ii) object segmentation using improved region-growing from masks of determined depth layers, and iii) a hybrid depth estimation scheme using content-based matching (inside a small library of true stereo image pairs) and depth-ordinal based regularization. Comprehensive experiments have validated the effectiveness of the proposed 2D-to-3D conversion method in generating stereoscopic videos with consistent depth measurements for 3D-TV applications.
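The final step of any 2D-to-3D pipeline like the one above is to render a second view from the estimated per-pixel depth. The abstract does not give the rendering details, so the following is only a minimal sketch of generic depth-image-based rendering (DIBR): pixels are shifted horizontally by a disparity proportional to depth, and disocclusion holes are filled naively from the left. Function names, the disparity scale, and the hole-filling rule are all illustrative assumptions, not the paper's method.

```python
import numpy as np

def synthesize_right_view(frame, depth, max_disparity=8):
    """Sketch of depth-image-based rendering (DIBR), not the paper's method.

    frame: (H, W) grayscale image; depth: (H, W) in [0, 1], with 1 = near.
    Nearer pixels receive a larger horizontal disparity; holes left by
    disocclusion are filled by propagating the nearest pixel from the left.
    """
    h, w = frame.shape
    right = np.zeros_like(frame)
    filled = np.zeros((h, w), dtype=bool)
    disparity = np.round(depth * max_disparity).astype(int)
    for y in range(h):
        for x in range(w):
            xr = x - disparity[y, x]  # shift pixel left by its disparity
            if 0 <= xr < w:
                right[y, xr] = frame[y, x]
                filled[y, xr] = True
    # naive hole filling: copy the nearest filled pixel from the left
    for y in range(h):
        for x in range(1, w):
            if not filled[y, x]:
                right[y, x] = right[y, x - 1]
    return right
```

A depth map that is spatially consistent (the point of the paper's depth-ordinal regularization) matters here because inconsistent depth produces jittering disparities between frames.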
Attention Gated Networks: Learning to Leverage Salient Regions in Medical Images
We propose a novel attention gate (AG) model for medical image analysis that
automatically learns to focus on target structures of varying shapes and sizes.
Models trained with AGs implicitly learn to suppress irrelevant regions in an
input image while highlighting salient features useful for a specific task.
This enables us to eliminate the necessity of using explicit external
tissue/organ localisation modules when using convolutional neural networks
(CNNs). AGs can be easily integrated into standard CNN models such as VGG or
U-Net architectures with minimal computational overhead while increasing the
model sensitivity and prediction accuracy. The proposed AG models are evaluated
on a variety of tasks, including medical image classification and segmentation.
For classification, we demonstrate the use case of AGs in scan plane detection
for fetal ultrasound screening. We show that the proposed attention mechanism
can provide efficient object localisation while improving the overall
prediction performance by reducing false positives. For segmentation, the
proposed architecture is evaluated on two large 3D CT abdominal datasets with
manual annotations for multiple organs. Experimental results show that AG
models consistently improve the prediction performance of the base
architectures across different datasets and training sizes while preserving
computational efficiency. Moreover, AGs guide the model activations to be
focused around salient regions, which provides better insights into how model
predictions are made. The source code for the proposed AG models is publicly
available.
Comment: Accepted for Medical Image Analysis (Special Issue on Medical Imaging with Deep Learning). arXiv admin note: substantial text overlap with arXiv:1804.03999, arXiv:1804.0533
GANerated Hands for Real-time 3D Hand Tracking from Monocular RGB
We address the highly challenging problem of real-time 3D hand tracking based
on a monocular RGB-only sequence. Our tracking method combines a convolutional
neural network with a kinematic 3D hand model, such that it generalizes well to
unseen data, is robust to occlusions and varying camera viewpoints, and leads
to anatomically plausible as well as temporally smooth hand motions. For
training our CNN we propose a novel approach for the synthetic generation of
training data that is based on a geometrically consistent image-to-image
translation network. To be more specific, we use a neural network that
translates synthetic images to "real" images, such that the so-generated images
follow the same statistical distribution as real-world hand images. For
training this translation network we combine an adversarial loss and a
cycle-consistency loss with a geometric consistency loss in order to preserve
geometric properties (such as hand pose) during translation. We demonstrate
that our hand tracking system outperforms the current state-of-the-art on
challenging RGB-only footage.
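The translation network above is trained with three loss terms: adversarial, cycle-consistency, and geometric consistency. The sketch below shows only how such terms are typically weighted and summed; the loss forms, the use of a silhouette mask as the geometric quantity, and the weights are assumptions for illustration, not the paper's exact objective.

```python
import numpy as np

def cycle_loss(real, reconstructed):
    """L1 cycle-consistency: mapping to the other domain and back
    should recover the input image."""
    return np.mean(np.abs(real - reconstructed))

def geometry_loss(sil_source, sil_translated):
    """Geometric consistency, here reduced to matching a hand silhouette
    mask so the translated image keeps the source's hand pose."""
    return np.mean((sil_source - sil_translated) ** 2)

def generator_objective(d_fake_scores, real, recon, sil_src, sil_out,
                        lam_cyc=10.0, lam_geo=1.0):
    """Total generator loss: non-saturating adversarial term plus weighted
    cycle and geometry terms (weights are illustrative)."""
    adv = -np.mean(np.log(d_fake_scores + 1e-8))  # push D to score fakes as real
    return (adv
            + lam_cyc * cycle_loss(real, recon)
            + lam_geo * geometry_loss(sil_src, sil_out))
```

The geometric term is what distinguishes this setup from plain CycleGAN training: without it, the translator is free to move or deform the hand while still satisfying the adversarial and cycle losses.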
Boundary-semantic collaborative guidance network with dual-stream feedback mechanism for salient object detection in optical remote sensing imagery
With the increasing application of deep learning in various domains, salient
object detection in optical remote sensing images (ORSI-SOD) has attracted
significant attention. However, most existing ORSI-SOD methods predominantly
rely on local information from low-level features to infer salient boundary
cues and supervise them using boundary ground truth, but fail to sufficiently
optimize and protect the local information, and almost all approaches ignore
the potential advantages offered by the last layer of the decoder to maintain
the integrity of saliency maps. To address these issues, we propose a novel
method named boundary-semantic collaborative guidance network (BSCGNet) with
dual-stream feedback mechanism. First, we propose a boundary protection
calibration (BPC) module, which effectively reduces the loss of edge position
information during forward propagation and suppresses noise in low-level
features without relying on boundary ground truth. Second, based on the BPC
module, a dual feature feedback complementary (DFFC) module is proposed, which
aggregates boundary-semantic dual features and provides effective feedback to
coordinate features across different layers, thereby enhancing cross-scale
knowledge communication. Finally, to obtain more complete saliency maps, we
consider the uniqueness of the last layer of the decoder for the first time and
propose the adaptive feedback refinement (AFR) module, which further refines
feature representation and eliminates differences between features through a
unique feedback mechanism. Extensive experiments on three benchmark datasets
demonstrate that BSCGNet exhibits distinct advantages in challenging scenarios
and outperforms 17 state-of-the-art (SOTA) approaches proposed in recent
years. Codes and results have been released on GitHub:
https://github.com/YUHsss/BSCGNet.
Comment: Accepted by TGR
Cognitive fusion of thermal and visible imagery for effective detection and tracking of pedestrians in videos
BACKGROUND: In this paper, we present an efficient framework to cognitively detect and track salient objects in videos. In general, color visible images in red-green-blue (RGB) offer better distinguishability to human visual perception, yet they suffer from illumination noise and shadows. In contrast, thermal images are less sensitive to these noise effects, although their distinguishability varies with environmental settings. Cognitive fusion of these two modalities therefore provides an effective solution to this problem. METHODS: First, a background model is extracted, followed by two-stage background subtraction for foreground detection in the visible and thermal images. To deal with cases of occlusion or overlap, knowledge-based forward tracking and backward tracking are employed to identify separate objects even when foreground detection fails. RESULTS: To evaluate the proposed method, the publicly available color-thermal benchmark dataset OTCBVS is employed. For foreground detection, objective and subjective analyses against several state-of-the-art methods have been conducted on our manually segmented ground truth; for object tracking, comprehensive qualitative experiments have been conducted on all video sequences. CONCLUSIONS: Promising results show that the proposed fusion-based approach can successfully detect and track multiple human subjects in most scenes, regardless of lighting changes or occlusion.
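The methods section above describes two-stage background subtraction per modality followed by fusion. The sketch below is one plausible reading of that pipeline, kept deliberately simple: a coarse threshold finds candidate foreground pixels, a finer threshold is kept only near coarse detections, and fusion is reduced to a per-pixel logical OR of the two modality masks. Thresholds, the neighborhood rule, and the fusion operator are all assumptions, not the paper's exact algorithm.

```python
import numpy as np

def foreground_mask(frame, background, coarse_t=30, fine_t=15):
    """Two-stage background subtraction (simplified sketch): a coarse
    threshold seeds candidate regions, then a finer threshold keeps
    additional pixels only adjacent to a coarse detection."""
    diff = np.abs(frame.astype(int) - background.astype(int))
    coarse = diff > coarse_t
    fine = diff > fine_t
    near = np.zeros_like(coarse)  # dilate coarse detections by one pixel
    for y, x in zip(*np.nonzero(coarse)):
        near[max(y - 1, 0):y + 2, max(x - 1, 0):x + 2] = True
    return fine & near

def fuse(mask_visible, mask_thermal):
    """Fusion reduced to a logical OR: a pixel is foreground if either
    modality detects it, so illumination noise in the visible channel and
    weak contrast in the thermal channel can compensate for each other."""
    return mask_visible | mask_thermal
```

An OR fusion favors recall, which matches the paper's motivation: objects missed under poor lighting in RGB are typically still visible in the thermal channel, and vice versa.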
Deep Learning Techniques for Video Instance Segmentation: A Survey
Video instance segmentation, also known as multi-object tracking and
segmentation, is an emerging computer vision research area introduced in 2019,
aiming at detecting, segmenting, and tracking instances in videos
simultaneously. By tackling the video instance segmentation tasks through
effective analysis and utilization of visual information in videos, a range of
computer vision-enabled applications (e.g., human action recognition, medical
image processing, autonomous vehicle navigation, surveillance) can be
implemented. As deep-learning techniques take a dominant role in various
computer vision areas, a plethora of deep-learning-based video instance
segmentation schemes have been proposed. This survey offers a multifaceted view
of deep-learning schemes for video instance segmentation, covering various
architectural paradigms, along with comparisons of functional performance,
model complexity, and computational overheads. In addition to the common
architectural designs, auxiliary techniques for improving the performance of
deep-learning models for video instance segmentation are compiled and
discussed. Finally, we discuss a range of major challenges and directions for
further investigations to help advance this promising research field.