Learning to Adapt CLIP for Few-Shot Monocular Depth Estimation
Pre-trained Vision-Language Models (VLMs), such as CLIP, have shown enhanced
performance across a range of tasks that involve the integration of visual and
linguistic modalities. When CLIP is applied to depth estimation, patches divided from the input image are compared with a series of semantic descriptions of depth to obtain similarity scores. A coarse depth estimate is then produced by weighting and summing the depth values, called depth bins, that correspond to the predefined semantic descriptions. This zero-shot approach circumvents the computational and time-intensive nature of traditional fully supervised depth estimation methods.
However, because it relies on fixed depth bins, this method may not generalize well: images from different scenes can exhibit distinct depth distributions. To address this challenge, we propose a few-shot method that learns to adapt VLMs for monocular depth estimation, balancing training cost against generalization capability. Specifically, it assigns different depth bins to different scenes, among which the model selects during inference. Additionally, we incorporate learnable prompts that preprocess the input text, converting human-readable descriptions into vectors the model can interpret, further enhancing performance. With only one image per scene for training, extensive experiments on the NYU V2 and KITTI datasets demonstrate that our method outperforms the previous state-of-the-art method by up to 10.6% in terms of MARE (mean absolute relative error).
Comment: Accepted by WACV 202
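As a concrete illustration of the depth-bin weighting described above, the following is a minimal Python sketch built on OpenAI's open-source clip package. The prompt wording, bin values, and single-patch interface are illustrative assumptions rather than the paper's exact configuration; the proposed few-shot method additionally learns scene-specific bins and prompts.

import torch
import clip  # https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Hypothetical semantic descriptions of depth and their depth bins (meters).
descriptions = [
    "This object is very close.",
    "This object is close.",
    "This object is a bit far away.",
    "This object is far away.",
    "This object is very far away.",
]
depth_bins = torch.tensor([1.0, 2.0, 4.0, 8.0, 16.0], device=device)
text_tokens = clip.tokenize(descriptions).to(device)

@torch.no_grad()
def patch_depth(patch_image):
    """Coarse depth estimate for one image patch (a PIL image)."""
    image_feat = model.encode_image(preprocess(patch_image).unsqueeze(0).to(device))
    text_feat = model.encode_text(text_tokens)
    # Cosine similarity between the patch and each depth description.
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    weights = (image_feat @ text_feat.T).squeeze(0).softmax(dim=-1)
    # Coarse depth: similarity-weighted sum of the predefined depth bins.
    return (weights * depth_bins).sum().item()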
Multi-Domain Adaptation for Image Classification, Depth Estimation, and Semantic Segmentation
The appearance of scenes may change for many reasons, including the viewpoint, the time of day, the weather, and the seasons. Traditionally, deep neural networks are trained and evaluated using images from the same scene and domain to avoid the domain gap. Recent advances in domain adaptation have led to a new type of method that bridges such domain gaps and learns from multiple domains.
This dissertation proposes methods for multi-domain adaptation for various computer vision tasks, including image classification, depth estimation, and semantic segmentation. The first work focuses on semi-supervised domain adaptation. I address this semi-supervised setting and propose to use dynamic feature alignment to address both inter- and intra-domain discrepancies. The second work addresses the task of monocular depth estimation in the multi-domain setting. I propose to address this task with a unified approach that includes adversarial knowledge distillation and uncertainty-guided self-supervised reconstruction. The third work considers the problem of semantic segmentation for aerial imagery with diverse environments and viewing geometries. I present CrossSeg: a novel framework that learns a semantic segmentation network that can generalize well in a cross-scene setting with only a few labeled samples. I believe this line of work can be applicable to many domain adaptation scenarios and aerial applications.
Mobile AR Depth Estimation: Challenges & Prospects -- Extended Version
Metric depth estimation plays an important role in mobile augmented reality
(AR). With accurate metric depth, we can achieve more realistic user
interactions such as object placement and occlusion detection. While
specialized hardware like LiDAR demonstrates promise, its restricted availability (only on select high-end mobile devices) and performance limitations, such as limited range and sensitivity to the environment, make it less
ideal. Monocular depth estimation, on the other hand, relies solely on mobile
cameras, which are ubiquitous, making it a promising alternative for mobile AR.
In this paper, we investigate the challenges and opportunities of achieving
accurate metric depth estimation in mobile AR. We tested four different
state-of-the-art monocular depth estimation models on a newly introduced
dataset (ARKitScenes) and identified three types of challenges: hardware-, data-, and model-related challenges. Furthermore, our research provides
promising future directions to explore and solve those challenges. These
directions include (i) using more hardware-related information from the mobile
device's camera and other available sensors, (ii) capturing high-quality data
to reflect real-world AR scenarios, and (iii) designing a model architecture to
utilize the new information.
3D Object Detection Using Scale Invariant and Feature Reweighting Networks
3D object detection plays an important role in a large number of real-world
applications. It requires estimating the locations and orientations
of 3D objects in real scenes. In this paper, we present a new network
architecture which focuses on utilizing the front view images and frustum point
clouds to generate 3D detection results. On the one hand, a PointSIFT module is used to improve the performance of 3D segmentation: it captures information from different orientations in space and is robust to shapes at different scales. On the other hand, a SENet module lets our network retain useful features and suppress less informative ones.
This module reweights channel features and estimates the 3D bounding boxes more
effectively. Our method is evaluated on both KITTI dataset for outdoor scenes
and SUN-RGBD dataset for indoor scenes. The experimental results illustrate
that our method achieves better performance than state-of-the-art methods, especially when point clouds are highly sparse.
Comment: The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19)
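For readers unfamiliar with the SENet-style reweighting mentioned above, the sketch below shows a squeeze-and-excitation block in PyTorch, written here for point-cloud feature maps. The reduction ratio and tensor layout are illustrative assumptions, not the paper's exact design.

import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-excitation: reweight channels by learned importance."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, n_points) per-point features.
        squeezed = x.mean(dim=-1)          # squeeze: global average per channel
        weights = self.fc(squeezed)        # excitation: per-channel gates in (0, 1)
        return x * weights.unsqueeze(-1)   # boost informative channels, damp the rest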
All in Tokens: Unifying Output Space of Visual Tasks via Soft Token
Unlike language tasks, where the output space is usually limited to a set of
tokens, the output space of visual tasks is more complicated, making it
difficult to build a unified visual model for various visual tasks. In this
paper, we seek to unify the output space of visual tasks, so that we can also
build a unified model for visual tasks. To this end, we demonstrate a single
unified model that simultaneously handles two typical visual tasks of instance
segmentation and depth estimation, which have discrete/fixed-length and
continuous/varied-length outputs, respectively. We propose several new
techniques that take into account the particularity of visual tasks: 1) Soft
token. We employ a soft token to represent the task output. Unlike hard tokens in a standard VQ-VAE, which are assigned one-hot to discrete codebooks/vocabularies, a soft token is assigned softly across the codebook embeddings. Soft tokens improve the accuracy of both next-token inference and decoding of the task output; 2) Mask augmentation. Many visual tasks have corrupted, undefined, or invalid values in their label annotations, e.g., occluded areas of depth maps. We show that a mask augmentation technique can
greatly benefit these tasks. With these new techniques and other designs, we
show that the proposed general-purpose task-solver can perform both instance
segmentation and depth estimation well. Particularly, we achieve 0.279 RMSE on
the specific task of NYUv2 depth estimation, setting a new record on this
benchmark. The general-purpose task-solver, dubbed AiT, is available at https://github.com/SwinTransformer/AiT.
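A minimal sketch of the soft-token idea follows, contrasting it with the hard one-hot assignment of a standard VQ-VAE. The squared-Euclidean distance, temperature, and tensor shapes are illustrative assumptions rather than AiT's exact formulation.

import torch
import torch.nn.functional as F

def assign_tokens(features, codebook, soft=True, temperature=1.0):
    """features: (n, d) continuous outputs; codebook: (k, d) embeddings."""
    # Squared distance from each feature to each codebook entry.
    dists = torch.cdist(features, codebook) ** 2
    if soft:
        # Soft token: a probability distribution over codebook embeddings.
        weights = F.softmax(-dists / temperature, dim=-1)  # (n, k)
    else:
        # Hard token: one-hot assignment to the nearest entry, as in VQ-VAE.
        weights = F.one_hot(dists.argmin(dim=-1), codebook.size(0)).float()
    # Decode by mixing codebook embeddings with the assignment weights.
    return weights @ codebook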