Confidence Propagation through CNNs for Guided Sparse Depth Regression
Generally, convolutional neural networks (CNNs) process data on a regular
grid, e.g. data generated by ordinary cameras. Designing CNNs for sparse and
irregularly spaced input data is still an open research problem with numerous
applications in autonomous driving, robotics, and surveillance. In this paper,
we propose an algebraically-constrained normalized convolution layer for CNNs
with highly sparse input that requires fewer network parameters than related
work. We propose novel strategies for determining the
confidence from the convolution operation and propagating it to consecutive
layers. We also propose an objective function that simultaneously minimizes the
data error while maximizing the output confidence. To integrate structural
information, we also investigate fusion strategies to combine depth and RGB
information in our normalized convolution network framework. In addition, we
introduce the use of output confidence as auxiliary information to improve
the results. The capabilities of our normalized convolution network framework
are demonstrated for the problem of scene depth completion. Comprehensive
experiments are performed on the KITTI-Depth and the NYU-Depth-v2 datasets. The
results clearly demonstrate that the proposed approach achieves superior
performance while requiring only about 1-5% of the parameters of
state-of-the-art methods.
Comment: 14 pages, 14 figures
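As a rough illustration of the normalized-convolution idea described above, the sketch below implements a confidence-weighted convolution whose output confidence is propagated to the next layer. The softplus non-negativity constraint, the epsilon stabilizer, and the broadcasting of a single-channel confidence map are assumptions made for this example, not the paper's exact construction.

import torch
import torch.nn as nn
import torch.nn.functional as F

class NormalizedConv2d(nn.Module):
    """Minimal sketch of a confidence-aware (normalized) convolution layer."""

    def __init__(self, in_channels, out_channels, kernel_size, eps=1e-8):
        super().__init__()
        self.weight = nn.Parameter(
            0.1 * torch.randn(out_channels, in_channels, kernel_size, kernel_size))
        self.bias = nn.Parameter(torch.zeros(out_channels))
        self.padding = kernel_size // 2
        self.eps = eps

    def forward(self, x, conf):
        # Broadcast a single-channel confidence map to all input channels.
        if conf.shape[1] == 1 and x.shape[1] > 1:
            conf = conf.expand_as(x)
        w = F.softplus(self.weight)                        # non-negative applicability
        num = F.conv2d(x * conf, w, padding=self.padding)  # confidence-weighted data term
        den = F.conv2d(conf, w, padding=self.padding)      # local normalizer
        out = num / (den + self.eps) + self.bias.view(1, -1, 1, 1)
        # Propagate confidence, rescaled by the filter mass so it stays in [0, 1].
        conf_out = den / (w.sum(dim=(1, 2, 3)).view(1, -1, 1, 1) + self.eps)
        return out, conf_out

# Example: sparse depth input with a binary validity map as the initial confidence.
layer = NormalizedConv2d(1, 16, kernel_size=3)
depth = torch.zeros(1, 1, 64, 64)
conf = torch.zeros(1, 1, 64, 64)   # 1 where a depth measurement exists, 0 elsewhere
dense, conf_next = layer(depth, conf)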
Propagating Confidences through CNNs for Sparse Data Regression
In most computer vision applications, convolutional neural networks (CNNs)
operate on dense image data generated by ordinary cameras. Designing CNNs for
sparse and irregularly spaced input data is still an open problem with numerous
applications in autonomous driving, robotics, and surveillance. To tackle this
challenging problem, we introduce an algebraically-constrained convolution
layer for CNNs with sparse input and demonstrate its capabilities for the scene
depth completion task. We propose novel strategies for determining the
confidence from the convolution operation and propagating it to consecutive
layers. Furthermore, we propose an objective function that simultaneously
minimizes the data error while maximizing the output confidence. Comprehensive
experiments are performed on the KITTI depth benchmark and the results clearly
demonstrate that the proposed approach achieves superior performance while
requiring three times fewer parameters than the state-of-the-art methods.
Moreover, our approach produces a continuous pixel-wise confidence map enabling
information fusion, state inference, and decision support.
Comment: To appear in the British Machine Vision Conference (BMVC 2018)
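A loss of the kind described above, which penalizes the data error while rewarding high output confidence, could be sketched as follows; the L1 data term, the negative mean-confidence term, and the weight lam are illustrative assumptions rather than the paper's exact objective.

import torch

def depth_completion_loss(pred, conf_out, target, valid_mask, lam=0.1):
    """Composite objective: fit the data where ground truth exists while
    encouraging high output confidence everywhere (illustrative sketch)."""
    # Data term: error only on pixels that carry ground-truth depth.
    data_err = (pred - target).abs()[valid_mask].mean()
    # Maximizing mean output confidence == minimizing its negative.
    return data_err - lam * conf_out.mean()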
Deep Motion Features for Visual Tracking
Robust visual tracking is a challenging computer vision problem, with many
real-world applications. Most existing approaches employ hand-crafted
appearance features, such as HOG or Color Names. Recently, deep RGB features
extracted from convolutional neural networks have been successfully applied for
tracking. Despite their success, these features only capture appearance
information. On the other hand, motion cues provide discriminative and
complementary information that can improve tracking performance. In contrast
to visual tracking, deep motion features have been successfully applied to
action recognition and video classification. Typically, the motion features are
learned by training a CNN on optical flow images extracted from large amounts
of labeled videos.
This paper presents an investigation of the impact of deep motion features in
a tracking-by-detection framework. We further show that hand-crafted, deep RGB,
and deep motion features contain complementary information. To the best of our
knowledge, we are the first to propose fusing appearance information with deep
motion features for visual tracking. Comprehensive experiments clearly suggest
that our fusion approach with deep motion features outperforms standard methods
relying on appearance information alone.
Comment: ICPR 2016. Best paper award in the "Computer Vision and Robot Vision" track
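As a minimal picture of how the complementary cues might be combined, the sketch below simply resamples hand-crafted, deep RGB, and deep motion feature maps to a common grid and concatenates them along the channel axis; the tracker in the paper is built on a correlation-filter (tracking-by-detection) framework, so this only illustrates the fusion step and the map sizes are assumptions.

import torch
import torch.nn.functional as F

def fuse_tracking_features(hog_feat, rgb_deep_feat, flow_deep_feat, size=(31, 31)):
    """Each input is a (1, C_i, H_i, W_i) feature map: hand-crafted HOG,
    deep appearance (RGB CNN), and deep motion (optical-flow CNN)."""
    maps = [hog_feat, rgb_deep_feat, flow_deep_feat]
    maps = [F.interpolate(m, size=size, mode="bilinear", align_corners=False)
            for m in maps]
    return torch.cat(maps, dim=1)   # shape (1, C_hog + C_rgb + C_flow, *size)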
Discriminative Scale Space Tracking
Accurate scale estimation of a target is a challenging research problem in
visual object tracking. Most state-of-the-art methods employ an exhaustive
scale search to estimate the target size. The exhaustive search strategy is
computationally expensive and struggles in the presence of large scale
variations. This paper investigates the problem of accurate and robust scale
estimation in a tracking-by-detection framework. We propose a novel scale
adaptive tracking approach by learning separate discriminative correlation
filters for translation and scale estimation. The explicit scale filter is
learned online using the target appearance sampled at a set of different
scales. Unlike standard approaches, our method directly learns the
appearance change induced by variations in the target scale. Additionally, we
investigate strategies to reduce the computational cost of our approach.
Extensive experiments are performed on the OTB and the VOT2014 datasets.
Compared to the standard exhaustive scale search, our approach achieves a gain
of 2.5% in average overlap precision on the OTB dataset. Additionally, our
method is computationally efficient, operating at a 50% higher frame rate
compared to the exhaustive scale search. Our method obtains the top rank in
performance by outperforming 19 state-of-the-art trackers on OTB and 37
state-of-the-art trackers on VOT2014.
Comment: To appear in TPAMI. This is the journal extension of the VOT2014-winning DSST tracking method
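As a worked illustration of the separate scale filter described above, the sketch below learns a one-dimensional discriminative correlation filter over a set of scale samples in the Fourier domain and applies it to a new frame. The Gaussian label width sigma, the regularization lam, and the omission of feature extraction, windowing, and online updates are simplifications for this example rather than the published DSST recipe.

import numpy as np

def learn_scale_filter(scale_feats, sigma=1.0, lam=1e-2):
    """scale_feats: (d, S) matrix of d-dimensional descriptors of the target
    resampled at S scales. Returns the per-dimension numerator and shared
    denominator of a 1-D correlation filter trained against a Gaussian label
    centred on the current scale."""
    d, S = scale_feats.shape
    g = np.exp(-0.5 * ((np.arange(S) - S // 2) / sigma) ** 2)  # desired response
    G = np.fft.fft(g)
    Fs = np.fft.fft(scale_feats, axis=1)
    num = G[None, :] * np.conj(Fs)                 # numerator, one row per dimension
    den = np.sum(Fs * np.conj(Fs), axis=0) + lam   # shared regularized denominator
    return num, den

def estimate_scale(num, den, new_scale_feats):
    """Correlate the filter with features from the new frame and return the
    index of the scale sample with the strongest response."""
    Zs = np.fft.fft(new_scale_feats, axis=1)
    response = np.real(np.fft.ifft(np.sum(num * Zs, axis=0) / den))
    return int(np.argmax(response))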
Language Guided Domain Generalized Medical Image Segmentation
Single source domain generalization (SDG) holds promise for more reliable and
consistent image segmentation across real-world clinical settings,
particularly in the medical domain, where data privacy and acquisition cost
constraints often limit the availability of diverse datasets. Relying solely
on visual features hampers the model's capacity to adapt effectively to
various domains,
primarily because of the presence of spurious correlations and domain-specific
characteristics embedded within the image features. Incorporating text features
alongside visual features is a potential solution to enhance the model's
understanding of the data, as it goes beyond pixel-level information to provide
valuable context. Textual cues describing the anatomical structures, their
appearances, and variations across various imaging modalities can guide the
model in domain adaptation, ultimately contributing to more robust and
consistent segmentation. In this paper, we propose an approach that explicitly
leverages textual information by incorporating a contrastive learning mechanism
guided by the text encoder features to learn a more robust feature
representation. We assess the effectiveness of our text-guided contrastive
feature alignment technique in various scenarios, including cross-modality,
cross-sequence, and cross-site settings for different segmentation tasks. Our
approach achieves favorable performance compared to existing methods in the literature.
Our code and model weights are available at
https://github.com/ShahinaKK/LG_SDG.git.
Comment: Accepted at ISBI 2024
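One way to picture the text-guided contrastive alignment described above is the InfoNCE-style sketch below, in which pooled visual features are aligned with frozen text-encoder embeddings of class descriptions; the pooling, the temperature tau, and the cross-entropy form are assumptions for this example, not necessarily the paper's exact formulation.

import torch
import torch.nn.functional as F

def text_guided_contrastive_loss(vis_feats, text_feats, labels, tau=0.07):
    """Pull each visual feature towards the text embedding of its class and
    push it away from the other class embeddings (illustrative sketch)."""
    v = F.normalize(vis_feats, dim=-1)      # (N, D) pooled visual features
    t = F.normalize(text_feats, dim=-1)     # (K, D) text-encoder class embeddings
    logits = v @ t.t() / tau                # cosine similarities as logits
    return F.cross_entropy(logits, labels)  # labels: (N,) class indices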
Enhancing Novel Object Detection via Cooperative Foundational Models
In this work, we address the challenging and emergent problem of novel object
detection (NOD), focusing on the accurate detection of both known and novel
object categories during inference. Traditional object detection algorithms are
inherently closed-set, limiting their capability to handle NOD. We present a
novel approach to transform existing closed-set detectors into open-set
detectors. This transformation is achieved by leveraging the complementary
strengths of pre-trained foundational models, specifically CLIP and SAM,
through our cooperative mechanism. Furthermore, by integrating this mechanism
with state-of-the-art open-set detectors such as GDINO, we establish new
benchmarks in object detection performance. Our method achieves 17.42 mAP in
novel object detection and 42.08 mAP for known objects on the challenging LVIS
dataset. Adapting our approach to the COCO OVD split, we surpass the current
state-of-the-art by a margin of 7.2 points for novel classes. Our code is
available at https://github.com/rohit901/cooperative-foundational-models.
Comment: Code: https://github.com/rohit901/cooperative-foundational-models
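The cooperation between foundation models can be pictured with the rough sketch below, in which class-agnostic region proposals are classified against an open vocabulary with CLIP. The prompt template, the ViT-B/32 backbone, and the simple per-crop scoring are assumptions for illustration only and omit the paper's actual cooperative mechanism between the detector, SAM, and CLIP.

import torch
import clip   # OpenAI CLIP (https://github.com/openai/CLIP)

@torch.no_grad()
def label_proposals_with_clip(image, boxes, class_names, device="cpu"):
    """image: a PIL.Image; boxes: list of (x0, y0, x1, y1) class-agnostic
    proposals (e.g. from SAM or a detector run class-agnostically);
    class_names: open vocabulary covering known + novel categories.
    Returns one (label, score) pair per box."""
    model, preprocess = clip.load("ViT-B/32", device=device)
    prompts = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)
    text_feats = model.encode_text(prompts)
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)

    results = []
    for box in boxes:
        crop = preprocess(image.crop(box)).unsqueeze(0).to(device)
        img_feat = model.encode_image(crop)
        img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
        sims = (img_feat @ text_feats.T).squeeze(0)   # cosine similarity per class
        results.append((class_names[int(sims.argmax())], float(sims.max())))
    return results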