Feature pyramid transformer
Feature interactions across space and scales underpin modern visual
recognition systems because they introduce beneficial visual contexts.
Conventionally, spatial contexts are passively hidden in the CNN's increasing
receptive fields or actively encoded by non-local convolution. Yet, the
non-local spatial interactions are not across scales, and thus they fail to
capture the non-local contexts of objects (or parts) residing in different
scales. To this end, we propose a fully active feature interaction across both
space and scales, called Feature Pyramid Transformer (FPT). It transforms any
feature pyramid into another feature pyramid of the same size but with richer
contexts, by using three specially designed transformers in self-level,
top-down, and bottom-up interaction fashion. FPT serves as a generic visual
backbone with fair computational overhead. We conduct extensive experiments in
both instance-level (i.e., object detection and instance segmentation) and
pixel-level segmentation tasks, using various backbones and head networks, and
observe consistent improvement over all the baselines and the state-of-the-art
methods. Comment: Published at the European Conference on Computer Vision, 202
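The cross-scale interactions described above can be illustrated with a minimal NumPy sketch. Note this is a toy illustration only: the single-head attention, the toy pyramid shapes, and the absence of learned query/key/value projections are simplifying assumptions, not the paper's actual self-level, top-down, and bottom-up transformers.

```python
import numpy as np

rng = np.random.default_rng(0)

def attention(q, k, v):
    # scaled dot-product attention: q is (Nq, d); k and v are (Nk, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v  # (Nq, d)

# toy two-level feature pyramid, each level flattened to (tokens, channels)
d = 8
fine = rng.standard_normal((16 * 16, d))   # high-resolution level
coarse = rng.standard_normal((8 * 8, d))   # low-resolution level

# three interaction directions: queries come from one level while
# keys/values come from the same, a coarser, or a finer level
self_level = attention(fine, fine, fine)      # non-local within one level
top_down = attention(fine, coarse, coarse)    # fine tokens query coarse context
bottom_up = attention(coarse, fine, fine)     # coarse tokens query fine detail
```

The key point of the sketch is that each output level keeps its own resolution (the query count) while gathering context from another scale (the key/value count), which is how a pyramid can be transformed into a same-sized but context-richer pyramid.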
Advances in Object and Activity Detection in Remote Sensing Imagery
The recent revolution in deep learning has enabled considerable development in the fields of object and activity detection. Visual object detection aims to find objects of target classes with precise localisation in an image and to assign each object instance a corresponding class label. Activity recognition, in turn, aims to determine the actions or activities of an agent or group of agents from sensor or video observation data. Detecting, identifying, tracking, and understanding the behaviour of objects through images and videos taken by various cameras is an important and challenging problem. Together, object and activity recognition in imaging data captured by remote sensing platforms is a highly dynamic and challenging research topic. During the last decade, there has been significant growth in the number of publications in the field of object and activity recognition. In particular, many researchers have proposed application domains to identify objects and their specific behaviours from airborne and spaceborne imagery. This Special Issue includes papers that explore novel and challenging topics for object and activity detection in remote sensing images and videos acquired by diverse platforms.
Vectorizing Planar Roof Structure From Very High Resolution Remote Sensing Images Using Transformers
Grasping the roof structure of a building is a key part of building reconstruction. Directly predicting the geometric structure of the roof from a raster image as a vectorized representation, however, remains challenging. This paper introduces an efficient and accurate parsing method based upon a vision Transformer, which we dub Roof-Former. Our method consists of three steps: 1) image encoding and edge node initialization, 2) image feature fusion with an enhanced segmentation refinement branch, and 3) edge filtering and structural reasoning. The vertex and edge heat map F1-scores increase by 2.0% and 1.9% on the VWB dataset when compared to HEAT. Additionally, qualitative evaluations suggest that our method is superior to the current state-of-the-art, indicating its effectiveness in extracting global image information and maintaining the consistency and topological validity of the roof structure.
SwinV2DNet: Pyramid and Self-Supervision Compounded Feature Learning for Remote Sensing Images Change Detection
Among current mainstream change detection networks, the transformer is
deficient in its ability to capture accurate low-level details, while the
convolutional neural network (CNN) lacks the capacity to understand global
information and establish long-range spatial relationships. Meanwhile, neither
of the widely used early fusion and late fusion frameworks can fully learn
complete change features. Therefore, based on Swin Transformer V2 (Swin
V2) and VGG16, we propose an end-to-end compounded dense network SwinV2DNet to
inherit the advantages of both transformer and CNN and overcome the
shortcomings of existing networks in feature learning. Firstly, it captures the
change relationship features through the densely connected Swin V2 backbone,
and provides the low-level pre-changed and post-changed features through a CNN
branch. From these three change features, we obtain accurate change detection
results. Secondly, combining transformer and CNN, we propose a mixed feature
pyramid (MFP), which provides inter-layer interaction information and
intra-layer multi-scale information for complete feature learning. MFP is a
plug-and-play module that is experimentally proven to be effective in other
change detection networks as well. Furthermore, we impose a self-supervision
strategy to guide a new CNN branch, which solves the untrainable problem of the
CNN branch and provides semantic change information for the encoder features.
Compared with other advanced methods on four commonly used public remote
sensing datasets, we obtain state-of-the-art (SOTA) change detection scores and
fine-grained change maps. The code is available at
https://github.com/DalongZ/SwinV2DNet
FETNet: Feature exchange transformer network for RGB-D object detection
In RGB-D object detection, due to the inherent difference between the RGB and
Depth modalities, it remains challenging to simultaneously leverage sensed photometric and depth information. In this paper, to address this issue, we propose a Feature
Exchange Transformer Network (FETNet), which consists of two well-designed components: the Feature Exchange Module (FEM), and the Multi-modal Vision Transformer
(MViT). Specifically, we propose the FEM to exchange part of the channels between RGB
and depth features at each backbone stage, which facilitates the information flow and
bridges the gap between the two modalities. Inspired by the success of the Vision Transformer (ViT), we develop the variant MViT to effectively fuse multi-modal features and exploit the attention between the RGB and depth features. Unlike previous methods developed from a specific RGB detection algorithm, our proposal is generic. Extensive experiments prove that, when the proposed modules are integrated into mainstream RGB object detection methods, their RGB-D counterparts obtain significant performance gains. Moreover, our FETNet surpasses state-of-the-art RGB-D detectors by 7.0% mAP on SUN RGB-D and 1.7% mAP on NYU Depth v2, which further demonstrates
the effectiveness of the proposed method.
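The channel-exchange idea behind the FEM can be sketched in a few lines of NumPy. The function name, the (tokens, channels) layout, and the exchange ratio are illustrative assumptions; the paper's module operates on CNN feature maps at each backbone stage.

```python
import numpy as np

def feature_exchange(rgb, depth, ratio=0.25):
    # swap the first `ratio` fraction of channels between two feature
    # maps of shape (tokens, channels), leaving the rest untouched
    channels = rgb.shape[1]
    k = int(channels * ratio)
    rgb_out, depth_out = rgb.copy(), depth.copy()
    rgb_out[:, :k], depth_out[:, :k] = depth[:, :k], rgb[:, :k]
    return rgb_out, depth_out
```

After the exchange, each stream carries a slice of the other modality's features, so subsequent layers in both branches see cross-modal information without any learned fusion parameters at this stage.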
Event transformer FlowNet for optical flow estimation
Event cameras are bioinspired sensors that produce asynchronous and sparse streams of events at image locations where an intensity change is detected. They can detect fast motion with low latency, high dynamic range, and low power consumption. Over the past decade, efforts have been devoted to developing solutions with event cameras for robotics applications. In this work, we address their use for fast and robust computation of optical flow. We present ET-FlowNet, a hybrid RNN-ViT architecture for optical flow estimation. Vision transformers (ViTs) are ideal candidates for learning global context in visual tasks, and we argue that rigid body motion is a prime case for the use of ViTs, since long-range dependencies in the image hold during rigid body motion. We perform end-to-end training with a self-supervised learning method. Our results show performance comparable to, and in some cases exceeding, state-of-the-art coarse-to-fine event-based optical flow estimation. This work was supported by projects EBSLAM DPI2017-89564-P and EBCON PID2020-119244GB-I00 funded by CIN/AEI/10.13039/501100011033 and by an FI AGAUR PhD grant to Yi Tian.
Spatial-Spectral Transformer for Hyperspectral Image Denoising
Hyperspectral image (HSI) denoising is a crucial preprocessing procedure for
the subsequent HSI applications. Unfortunately, despite the development of
deep learning in the HSI denoising area, existing convolution-based methods
face a trade-off between computational efficiency and the capability to model
the non-local characteristics of HSI. In this paper, we propose a
Spatial-Spectral Transformer (SST) to alleviate this problem. To fully explore
the intrinsic similarity characteristics in both the spatial and spectral
dimensions, we conduct non-local spatial self-attention and global spectral
self-attention with a Transformer architecture. The window-based spatial
self-attention focuses on spatial similarity beyond the neighboring region,
while spectral self-attention exploits the long-range dependencies between
highly correlated bands. Experimental results show that our proposed method
outperforms state-of-the-art HSI denoising methods in both quantitative
quality and visual results.
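Global spectral self-attention can be sketched by treating each band as one token whose feature vector is its flattened spatial map, so every band attends to every other band. This is a toy NumPy illustration with identity query/key/value projections; the actual SST uses learned projections and a window-based spatial branch as well.

```python
import numpy as np

def spectral_self_attention(hsi):
    # hsi: (bands, pixels); each spectral band is one token, so the
    # attention is global along the spectral axis
    q = k = v = hsi  # identity projections, for the sketch only
    scores = q @ k.T / np.sqrt(hsi.shape[1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v  # each band becomes a weighted mix of correlated bands
```

Because the token count is the number of bands (tens) rather than the number of pixels (tens of thousands), attention along the spectral axis stays cheap while still being fully global, which is the efficiency argument the abstract makes.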
RGB-X Object Detection via Scene-Specific Fusion Modules
Multimodal deep sensor fusion has the potential to enable autonomous vehicles
to visually understand their surrounding environments in all weather
conditions. However, existing deep sensor fusion methods usually employ
convoluted architectures with intermingled multimodal features, requiring large
coregistered multimodal datasets for training. In this work, we present an
efficient and modular RGB-X fusion network that can leverage and fuse
pretrained single-modal models via scene-specific fusion modules, thereby
enabling joint input-adaptive network architectures to be created using small,
coregistered multimodal datasets. Our experiments demonstrate the superiority
of our method compared to existing works on RGB-thermal and RGB-gated datasets,
performing fusion using only a small number of additional parameters. Our code
is available at https://github.com/dsriaditya999/RGBXFusion. Comment: Accepted
to the 2024 IEEE/CVF Winter Conference on Applications of Computer Vision
(WACV 2024).
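One way to read "input-adaptive fusion of frozen single-modal features" is a small gating module that pools both feature streams into a scene descriptor and predicts per-modality mixing weights from it. The function below is a hypothetical sketch of that idea, not the paper's actual module; the gating matrix `gate_w` and the mean-pool descriptor are assumptions introduced for illustration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def scene_adaptive_fusion(feat_rgb, feat_x, gate_w):
    # feat_rgb, feat_x: (tokens, channels) features from two frozen,
    # pretrained single-modal backbones; gate_w: (2, 2 * channels)
    desc = np.concatenate([feat_rgb.mean(axis=0), feat_x.mean(axis=0)])
    weights = softmax(gate_w @ desc)  # (2,) input-adaptive mixing weights
    fused = weights[0] * feat_rgb + weights[1] * feat_x
    return fused, weights
```

Because only the tiny gate is trained while both backbones stay frozen, such a design needs far fewer coregistered multimodal samples than training an intermingled fusion network end to end, which matches the abstract's motivation.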