VoLTA: Vision-Language Transformer with Weakly-Supervised Local-Feature Alignment
Vision-language pre-training (VLP) has recently proven highly effective for
various uni- and multi-modal downstream applications. However, most existing
end-to-end VLP methods use high-resolution image-text box data to perform well
on fine-grained region-level tasks, such as object detection, segmentation, and
referring expression comprehension. Unfortunately, such high-resolution images
with accurate bounding box annotations are expensive to collect and use for
supervision at scale. In this work, we propose VoLTA (Vision-Language
Transformer with weakly-supervised local-feature Alignment), a new VLP paradigm
that only utilizes image-caption data but achieves fine-grained region-level
image understanding, eliminating the use of expensive box annotations. VoLTA
adopts graph optimal transport-based weakly-supervised alignment on local image
patches and text tokens to derive an explicit, self-normalized, and
interpretable low-level matching criterion. In addition, VoLTA pushes
multi-modal fusion deep into the uni-modal backbones during pre-training and
removes fusion-specific transformer layers, further reducing memory
requirements. Extensive experiments on a wide range of vision- and
vision-language downstream tasks demonstrate the effectiveness of VoLTA on
fine-grained applications without compromising the coarse-grained downstream
performance, often outperforming methods using significantly more caption and
box annotations.
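The abstract describes a graph optimal-transport-based alignment between local image patches and text tokens. A common way to realize such a soft alignment is entropic optimal transport solved with Sinkhorn iterations; the sketch below is an illustrative toy version on raw feature matrices, not VoLTA's actual implementation — the function name, uniform marginals, and hyperparameters are all assumptions.

```python
import numpy as np

def sinkhorn_alignment(patches, tokens, n_iters=50, eps=0.1):
    """Soft-align image patch features to text token features with
    entropic optimal transport (Sinkhorn iterations).

    patches: (P, D) array of patch features; tokens: (T, D) array of
    token features (L2-normalized inside). Returns a (P, T) transport
    plan with uniform marginals, plus the resulting alignment cost.
    """
    patches = patches / np.linalg.norm(patches, axis=1, keepdims=True)
    tokens = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    cost = 1.0 - patches @ tokens.T            # cosine-distance cost matrix
    K = np.exp(-cost / eps)                    # Gibbs kernel
    a = np.full(patches.shape[0], 1.0 / patches.shape[0])  # uniform patch marginal
    b = np.full(tokens.shape[0], 1.0 / tokens.shape[0])    # uniform token marginal
    u = np.ones_like(a)
    for _ in range(n_iters):                   # Sinkhorn fixed-point updates
        v = b / (K.T @ u)
        u = a / (K @ v)
    plan = np.diag(u) @ K @ np.diag(v)         # transport plan (soft alignment)
    return plan, float((plan * cost).sum())
```

The transport plan is row- and column-normalized by construction, which is one way to obtain the "self-normalized, interpretable" matching the abstract refers to: each entry can be read as the mass a patch assigns to a token.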
Temporal Sentence Grounding in Videos: A Survey and Future Directions
Temporal sentence grounding in videos (TSGV), a.k.a. natural language video
localization (NLVL) or video moment retrieval (VMR), aims to retrieve a
temporal moment that semantically corresponds to a language query from an
untrimmed video. Connecting computer vision and natural language, TSGV has
drawn significant attention from researchers in both communities. This survey
attempts to provide a summary of fundamental concepts in TSGV and current
research status, as well as future research directions. As the background, we
present a common structure of functional components in TSGV, in a tutorial
style: from feature extraction from raw video and language query, to answer
prediction of the target moment. Then we review the techniques for multimodal
understanding and interaction, which is the key focus of TSGV for effective
alignment between the two modalities. We construct a taxonomy of TSGV
techniques and elaborate the methods in different categories with their
strengths and weaknesses. Lastly, we discuss issues with the current TSGV
research and share our insights about promising research directions.
Comment: 29 pages, 32 figures, 9 tables
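The pipeline the survey describes (feature extraction, multimodal interaction, answer prediction) can be illustrated with a deliberately simple proposal-based baseline: score every candidate temporal span by its similarity to the query embedding and return the best one. Everything here — the function, mean-pooling, and cosine scoring — is an illustrative assumption, not any specific surveyed method.

```python
import numpy as np

def retrieve_moment(clip_feats, query_feat, max_len=4):
    """Score every candidate temporal span by cosine similarity between
    the mean-pooled clip features inside the span and the query
    embedding; return the best (start, end) span (end exclusive).

    clip_feats: (N, D) per-clip video features; query_feat: (D,) sentence
    feature. max_len caps the candidate span length in clips.
    """
    q = query_feat / np.linalg.norm(query_feat)
    best, best_score = (0, 1), -np.inf
    n = clip_feats.shape[0]
    for s in range(n):                                   # enumerate proposals
        for e in range(s + 1, min(s + max_len, n) + 1):
            seg = clip_feats[s:e].mean(axis=0)           # pool span features
            score = float(seg @ q / np.linalg.norm(seg)) # cosine similarity
            if score > best_score:
                best, best_score = (s, e), score
    return best, best_score
```

Real TSGV models replace both the pooling and the scoring with learned cross-modal interaction modules, which is exactly the design space the survey's taxonomy organizes.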
Weakly-supervised Part-Attention and Mentored Networks for Vehicle Re-Identification
Vehicle re-identification (Re-ID) aims to retrieve images with the same
vehicle ID across different cameras. Current part-level feature learning
methods typically detect vehicle parts via uniform division, outside tools, or
attention modeling. However, such part features often require expensive
additional annotations and cause sub-optimal performance in case of unreliable
part mask predictions. In this paper, we propose a weakly-supervised
Part-Attention Network (PANet) and Part-Mentored Network (PMNet) for Vehicle
Re-ID. Firstly, PANet localizes vehicle parts via part-relevant channel
recalibration and cluster-based mask generation without vehicle part
supervisory information. Secondly, PMNet leverages teacher-student guided
learning to distill vehicle part-specific features from PANet and performs
multi-scale global-part feature extraction. During inference, PMNet can
adaptively extract discriminative part features without part localization by
PANet, preventing unstable part mask predictions. We address this Re-ID issue
as a multi-task problem and adopt homoscedastic uncertainty to learn the
optimal weighting of ID losses. Experiments are conducted on two public
benchmarks, showing that our approach, which requires no extra annotations,
outperforms recent methods by an average increase of 3.0% in CMC@5 on
VehicleID and over 1.4% in mAP on VeRi776. Moreover, our method extends to the occluded
vehicle Re-ID task and exhibits good generalization ability.
Comment: This work has been submitted to the IEEE for possible publication.
Copyright may be transferred without notice, after which this version may no
longer be accessible.
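The abstract's homoscedastic-uncertainty weighting of the ID losses is commonly written, following Kendall et al., as L = Σᵢ exp(−sᵢ)·Lᵢ + sᵢ, where sᵢ = log σᵢ² is a learnable scalar per task. The sketch below shows only the combination rule with plain floats standing in for learnable parameters; it is not PMNet's code, and the names are assumptions.

```python
import math

def uncertainty_weighted_loss(task_losses, log_vars):
    """Combine per-task losses with homoscedastic-uncertainty weighting.

    task_losses: list of scalar task losses L_i.
    log_vars: list of scalars s_i = log(sigma_i^2); in a real model these
    would be learnable parameters updated by the optimizer.
    """
    total = 0.0
    for loss_i, s_i in zip(task_losses, log_vars):
        # exp(-s_i) down-weights noisy tasks; +s_i penalizes inflating sigma.
        total += math.exp(-s_i) * loss_i + s_i
    return total
```

Because the sᵢ are trained jointly with the network, the relative task weights are learned rather than hand-tuned, which is the appeal of this scheme for multi-loss Re-ID training.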
Deep learning in remote sensing: a review
Standing at the paradigm shift towards data-intensive science, machine
learning techniques are becoming increasingly important. In particular, as a
major breakthrough in the field, deep learning has proven to be an extremely
powerful tool in many fields. Shall we embrace deep learning as the key to
everything? Or should we resist a 'black-box' solution? There are controversial opinions
in the remote sensing community. In this article, we analyze the challenges of
using deep learning for remote sensing data analysis, review the recent
advances, and provide resources to make deep learning in remote sensing
ridiculously simple to start with. More importantly, we advocate that remote
sensing scientists bring their expertise into deep learning and use it as an
implicit general model to tackle unprecedented, large-scale, influential
challenges such as climate change and urbanization.
Comment: Accepted for publication in IEEE Geoscience and Remote Sensing Magazine