97 research outputs found
LIGHT: Joint Individual Building Extraction and Height Estimation from Satellite Images through a Unified Multitask Learning Network
Building extraction and height estimation are two important basic tasks in
remote sensing image interpretation, which are widely used in urban planning,
real-world 3D construction, and other fields. Most of the existing research
regards the two tasks as independent studies; therefore, the height information
cannot be fully used to improve the accuracy of building extraction and vice
versa. In this work, we combine the individuaL buIlding extraction and heiGHt
estimation through a unified multiTask learning network (LIGHT) for the first
time, which simultaneously outputs a height map, bounding boxes, and a
segmentation mask map of buildings. Specifically, LIGHT consists of an instance
segmentation branch and a height estimation branch. In particular, to
effectively unify multi-scale features across branches and alleviate the feature
gap between them, we propose a Gated Cross Task Interaction (GCTI) module that
can efficiently perform feature interaction between branches. Experiments on
the DFC2023 dataset show that our LIGHT can achieve superior performance, and
our GCTI module with ResNet101 as the backbone can significantly improve the
performance of multitask learning by 2.8% AP50 and 6.5% delta1, respectively.
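The abstract does not detail how GCTI exchanges information between the two branches; below is a minimal, hypothetical sketch of a gated cross-branch interaction in PyTorch (module name, gating scheme, and channel sizes are assumptions for illustration, not the paper's implementation).

```python
# Hypothetical sketch of a gated cross-task feature interaction block
# (names and structure are assumptions; the paper's GCTI may differ).
import torch
import torch.nn as nn

class GatedCrossTaskInteraction(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # One gate per direction, computed from the concatenated branch features.
        self.gate_seg = nn.Sequential(nn.Conv2d(2 * channels, channels, 1), nn.Sigmoid())
        self.gate_hgt = nn.Sequential(nn.Conv2d(2 * channels, channels, 1), nn.Sigmoid())

    def forward(self, f_seg: torch.Tensor, f_hgt: torch.Tensor):
        fused = torch.cat([f_seg, f_hgt], dim=1)
        # Each branch receives a gated portion of the other branch's features.
        f_seg_out = f_seg + self.gate_seg(fused) * f_hgt
        f_hgt_out = f_hgt + self.gate_hgt(fused) * f_seg
        return f_seg_out, f_hgt_out

# Example: exchange features between the instance-segmentation and height branches.
gcti = GatedCrossTaskInteraction(channels=256)
f_seg = torch.randn(2, 256, 64, 64)
f_hgt = torch.randn(2, 256, 64, 64)
f_seg, f_hgt = gcti(f_seg, f_hgt)
```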
Self-guided Few-shot Semantic Segmentation for Remote Sensing Imagery Based on Large Vision Models
The Segment Anything Model (SAM) exhibits remarkable versatility and
zero-shot learning abilities, owing largely to its extensive training data
(SA-1B). Recognizing SAM's dependency on manual guidance given its
category-agnostic nature, we identified unexplored potential within few-shot
semantic segmentation tasks for remote sensing imagery. This research
introduces a structured framework designed for the automation of few-shot
semantic segmentation. It utilizes the SAM model and facilitates a more
efficient generation of semantically discernible segmentation outcomes. Central
to our methodology is a novel automatic prompt learning approach, leveraging
prior guided masks to produce coarse pixel-wise prompts for SAM. Extensive
experiments on the DLRSD datasets underline the superiority of our approach,
outperforming other available few-shot methodologies.
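As an illustration of the general idea of turning a prior-guided coarse mask into prompts, here is a hedged sketch that samples foreground and background point prompts from a prior mask; the sampling strategy and thresholds are assumptions, not the paper's exact prompt-learning approach.

```python
# Minimal sketch: derive point prompts for SAM from a coarse prior mask
# (the sampling strategy here is an illustration, not the paper's exact method).
import numpy as np

def mask_to_point_prompts(prior_mask: np.ndarray, n_fg: int = 5, n_bg: int = 5, seed: int = 0):
    """prior_mask: (H, W) float array of foreground probabilities in [0, 1]."""
    rng = np.random.default_rng(seed)
    fg = np.argwhere(prior_mask > 0.5)   # coarse foreground pixels (y, x)
    bg = np.argwhere(prior_mask <= 0.5)  # coarse background pixels (y, x)
    fg_pts = fg[rng.choice(len(fg), size=min(n_fg, len(fg)), replace=False)]
    bg_pts = bg[rng.choice(len(bg), size=min(n_bg, len(bg)), replace=False)]
    # SAM expects (x, y) coordinates and labels 1 (foreground) / 0 (background).
    coords = np.concatenate([fg_pts[:, ::-1], bg_pts[:, ::-1]], axis=0).astype(np.float32)
    labels = np.concatenate([np.ones(len(fg_pts)), np.zeros(len(bg_pts))]).astype(np.int32)
    return coords, labels

prior = np.zeros((128, 128))
prior[40:80, 50:90] = 0.9
point_coords, point_labels = mask_to_point_prompts(prior)
# These arrays could then be passed to a SAM predictor as point prompts.
```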
Not Just Learning from Others but Relying on Yourself: A New Perspective on Few-Shot Segmentation in Remote Sensing
Few-shot segmentation (FSS) is proposed to segment unknown class targets with
just a few annotated samples. Most current FSS methods follow the paradigm of
mining the semantics from the support images to guide the query image
segmentation. However, such a pattern of 'learning from others' struggles to
handle the extreme intra-class variation, preventing FSS from being directly
generalized to remote sensing scenes. To bridge the gap of intra-class
variance, we develop a Dual-Mining network named DMNet for cross-image mining
and self-mining, meaning that it no longer focuses solely on support images but
pays more attention to the query image itself. Specifically, we propose a
Class-public Region Mining (CPRM) module to effectively suppress irrelevant
feature pollution by capturing the common semantics between the support-query
image pair. The Class-specific Region Mining (CSRM) module is then proposed to
continuously mine the class-specific semantics of the query image itself in a
'filtering' and 'purifying' manner. In addition, to prevent the co-existence of
multiple classes in remote sensing scenes from exacerbating the collapse of FSS
generalization, we also propose a new Known-class Meta Suppressor (KMS) module
to suppress the activation of known-class objects in the sample. Extensive
experiments on the iSAID and LoveDA remote sensing datasets have demonstrated
that our method sets the state-of-the-art with a minimum number of model
parameters. Notably, our model with a ResNet-50 backbone achieves
the mIoU of 49.58% and 51.34% on iSAID under 1-shot and 5-shot settings,
outperforming the state-of-the-art method by 1.8% and 1.12%, respectively. The
code is publicly available at https://github.com/HanboBizl/DMNet.
Comment: accepted to IEEE TGR
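For intuition, the following sketch shows one common way to mine a class-public region from a support-query pair via masked average pooling and cosine similarity; it is an assumption about CPRM's general idea, not the released DMNet code.

```python
# Illustrative sketch of mining a class-public region via support-query
# feature correlation (an assumption about the general idea, not CPRM's exact design).
import torch
import torch.nn.functional as F

def class_public_activation(query_feat, support_feat, support_mask):
    """
    query_feat:   (B, C, H, W) query feature map
    support_feat: (B, C, H, W) support feature map
    support_mask: (B, 1, H, W) binary mask of the support object
    Returns a (B, 1, H, W) map highlighting query regions similar to the support object.
    """
    # Masked average pooling gives a support prototype of the target class.
    proto = (support_feat * support_mask).sum(dim=(2, 3)) / (support_mask.sum(dim=(2, 3)) + 1e-6)
    # Cosine similarity between every query location and the prototype.
    sim = F.cosine_similarity(query_feat, proto[:, :, None, None].expand_as(query_feat), dim=1)
    return sim.unsqueeze(1).clamp(min=0)  # keep positive correlations only

q = torch.randn(1, 256, 32, 32)
s = torch.randn(1, 256, 32, 32)
m = (torch.rand(1, 1, 32, 32) > 0.7).float()
common = class_public_activation(q, s, m)
suppressed_query = q * common  # down-weight features unrelated to the common semantics
```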
Semantic Segmentation for Point Cloud Scenes via Dilated Graph Feature Aggregation and Pyramid Decoders
Semantic segmentation of point clouds generates comprehensive understanding
of scenes by densely predicting the category of each point. Because each point
has a single, fixed receptive field, expressing multi-receptive-field features
remains challenging, which leads to the misclassification of instances with
similar spatial structures. In
this paper, we propose a graph convolutional network DGFA-Net rooted in dilated
graph feature aggregation (DGFA), guided by multi-basis aggregation loss
(MALoss) calculated through Pyramid Decoders. To configure multi-receptive
field features, DGFA which takes the proposed dilated graph convolution
(DGConv) as its basic building block, is designed to aggregate multi-scale
feature representation by capturing dilated graphs with various receptive
regions. To diversify the receptive-field bases, we further introduce Pyramid
Decoders driven by MALoss, which penalize the receptive-field information using
point sets of different resolutions as calculation bases. Combining these two
aspects, DGFA-Net significantly improves the
segmentation performance of instances with similar spatial structures.
Experiments on S3DIS, ShapeNetPart and Toronto-3D show that DGFA-Net
outperforms the baseline approach, achieving a new state-of-the-art
segmentation performance.
Comment: accepted to AAAI Workshop 202
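To illustrate the core idea behind a dilated graph convolution, the sketch below builds a dilated k-NN neighborhood: it searches k x d nearest neighbors and keeps every d-th one, widening the receptive region without increasing k. This is a simplified illustration, not the DGConv implementation.

```python
# Sketch of a dilated k-NN neighborhood, the core idea behind a dilated graph
# convolution: search k*d nearest neighbors, then keep every d-th one to widen
# the receptive region without increasing k (an illustration, not DGConv itself).
import torch

def dilated_knn(points: torch.Tensor, k: int, dilation: int):
    """points: (N, 3) coordinates. Returns (N, k) neighbor indices."""
    dists = torch.cdist(points, points)                     # (N, N) pairwise distances
    idx = dists.topk(k * dilation, largest=False).indices   # k*d nearest neighbors
    return idx[:, ::dilation]                               # keep every d-th neighbor

pts = torch.rand(1024, 3)
neighbors_d1 = dilated_knn(pts, k=16, dilation=1)  # ordinary k-NN graph
neighbors_d4 = dilated_knn(pts, k=16, dilation=4)  # wider receptive region, same k
```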
Breaking Immutable: Information-Coupled Prototype Elaboration for Few-Shot Object Detection
Few-shot object detection, which expects detectors to detect novel classes from
only a few instances, has made conspicuous progress. However, the prototypes extracted
by existing meta-learning based methods still suffer from insufficient
representative information and lack awareness of query images, so they cannot be
adaptively tailored to different query images. Firstly, only the support images
are involved for extracting prototypes, resulting in scarce perceptual
information of query images. Secondly, all pixels of all support images are
treated equally when aggregating features into prototype vectors, thus the
salient objects are overwhelmed by the cluttered background. In this paper, we
propose an Information-Coupled Prototype Elaboration (ICPE) method to generate
specific and representative prototypes for each query image. Concretely, a
conditional information coupling module is introduced to couple information
from the query branch to the support branch, strengthening the query-perceptual
information in support features. Besides, we design a prototype dynamic
aggregation module that dynamically adjusts intra-image and inter-image
aggregation weights to highlight the salient information useful for detecting
query images. Experimental results on both Pascal VOC and MS COCO demonstrate
that our method achieves state-of-the-art performance in almost all settings.
Comment: Accepted by AAAI202
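The contrast between uniform prototype averaging and a query-aware, dynamically weighted aggregation can be sketched as follows; the weighting scheme shown is illustrative and not the paper's prototype dynamic aggregation module.

```python
# Minimal sketch contrasting uniform averaging with a dynamic, query-aware
# prototype aggregation (the weights here are illustrative, not the paper's module).
import torch

def uniform_prototype(support_feats: torch.Tensor) -> torch.Tensor:
    """support_feats: (S, C, H, W) features from S support images."""
    return support_feats.mean(dim=(0, 2, 3))  # all pixels and images weighted equally

def dynamic_prototype(support_feats: torch.Tensor, query_feat: torch.Tensor) -> torch.Tensor:
    """Weight each support pixel by its similarity to the query's global descriptor."""
    S, C, H, W = support_feats.shape
    q = query_feat.mean(dim=(1, 2))                              # (C,) query descriptor
    pixels = support_feats.permute(0, 2, 3, 1).reshape(-1, C)    # (S*H*W, C)
    weights = torch.softmax(pixels @ q / C ** 0.5, dim=0)        # (S*H*W,)
    return (weights[:, None] * pixels).sum(dim=0)                # (C,) query-aware prototype

support = torch.randn(3, 256, 16, 16)
query = torch.randn(256, 16, 16)
p_uniform = uniform_prototype(support)
p_dynamic = dynamic_prototype(support, query)
```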
Learning to Evaluate Performance of Multi-modal Semantic Localization
Semantic localization (SeLo) refers to the task of obtaining the most
relevant locations in large-scale remote sensing (RS) images using semantic
information such as text. As an emerging task based on cross-modal retrieval,
SeLo achieves semantic-level retrieval with only caption-level annotation,
which demonstrates its great potential in unifying downstream tasks. Although
SeLo has been explored in successive works, no existing work systematically
analyzes this emerging direction. In this paper, we
thoroughly study this field and provide a complete benchmark in terms of
metrics and test data to advance the SeLo task. Firstly, based on the
characteristics of this task, we propose multiple discriminative evaluation
metrics to quantify the performance of the SeLo task. The devised significant
area proportion, attention shift distance, and discrete attention distance are
utilized to evaluate the generated SeLo map at both the pixel and region levels.
Next, to provide standard evaluation data for the SeLo task, we contribute a
diverse, multi-semantic, multi-objective Semantic Localization Testset
(AIR-SLT). AIR-SLT consists of 22 large-scale RS images and 59 test cases with
different semantics, and aims to provide a comprehensive evaluation of
retrieval models. Finally, we analyze the SeLo performance of RS cross-modal
retrieval models in detail, explore the impact of different variables on this
task, and provide a complete benchmark for the SeLo task. We have also
established a new paradigm for RS referring expression comprehension, and
demonstrated the great advantage of SeLo in semantics through combining it with
tasks such as detection and road extraction. The proposed evaluation metrics,
semantic localization testsets, and corresponding scripts have been open to
access at github.com/xiaoyuan1996/SemanticLocalizationMetrics.
Comment: 19 pages, 11 figure
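As a hedged example of what a region-coverage style metric might look like, the sketch below computes the fraction of a SeLo map's attention mass that falls inside annotated target regions; the paper's formal definition of significant area proportion may differ.

```python
# Hedged sketch of a significant-area-proportion style score: the fraction of the
# SeLo attention map's mass that falls inside the annotated target regions
# (the exact definition in the paper may differ; this is only an illustration).
import numpy as np

def significant_area_proportion(selo_map: np.ndarray, gt_mask: np.ndarray) -> float:
    """
    selo_map: (H, W) non-negative semantic localization map.
    gt_mask:  (H, W) binary mask of the annotated target regions.
    """
    total = selo_map.sum()
    if total == 0:
        return 0.0
    return float((selo_map * gt_mask).sum() / total)

selo = np.random.rand(256, 256)
gt = np.zeros((256, 256))
gt[100:180, 60:200] = 1
print(significant_area_proportion(selo, gt))
```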
Elevation Estimation-Driven Building 3D Reconstruction from Single-View Remote Sensing Imagery
Building 3D reconstruction from remote sensing images has a wide range of
applications in smart cities, photogrammetry and other fields. Methods for
automatic 3D urban building modeling typically employ multi-view images as
input to algorithms to recover point clouds and 3D models of buildings.
However, such models rely heavily on multi-view images of buildings, which are
time-intensive and limit the applicability and practicality of the models. To
solve these issues, we focus on designing an efficient DSM estimation-driven
reconstruction framework (Building3D), which aims to reconstruct 3D building
models from the input single-view remote sensing image. First, we propose a
Semantic Flow Field-guided DSM Estimation (SFFDE) network, which utilizes the
proposed concept of elevation semantic flow to achieve the registration of
local and global features. Specifically, in order to make the network semantics
globally aware, we propose an Elevation Semantic Globalization (ESG) module to
realize the semantic globalization of instances. Further, in order to alleviate
the semantic gap between global features and the original local features, we propose a
Local-to-Global Elevation Semantic Registration (L2G-ESR) module based on
elevation semantic flow. Our Building3D is rooted in the SFFDE network for
building elevation prediction, synchronized with a building extraction network
for building masks, and then sequentially performs point cloud reconstruction,
surface reconstruction (or CityGML model reconstruction). On this basis, our
Building3D can optionally generate CityGML models or surface mesh models of the
buildings. Extensive experiments on ISPRS Vaihingen and DFC2019 datasets on the
DSM estimation task show that our SFFDE significantly improves upon
the state of the art. Furthermore, our Building3D achieves impressive results in
the 3D point cloud and 3D model reconstruction process.
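A simplified sketch of the mechanism that typically underlies semantic-flow registration, warping a feature map by a predicted per-pixel flow field, is given below; SFFDE and L2G-ESR details are not reproduced, and the function shown is only an illustration.

```python
# Sketch of flow-field-guided feature warping, the general mechanism behind
# semantic-flow registration of local and global features (a simplified
# illustration; the paper's SFFDE/L2G-ESR modules are not reproduced here).
import torch
import torch.nn.functional as F

def warp_by_flow(feat: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """
    feat: (B, C, H, W) feature map to be aligned.
    flow: (B, 2, H, W) per-pixel offsets (in pixels) predicted by the network.
    """
    B, _, H, W = feat.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    base = torch.stack([xs, ys], dim=0).float().unsqueeze(0).to(feat)  # (1, 2, H, W)
    coords = base + flow
    # Normalize sampling coordinates to [-1, 1] for grid_sample.
    coords[:, 0] = 2 * coords[:, 0] / max(W - 1, 1) - 1
    coords[:, 1] = 2 * coords[:, 1] / max(H - 1, 1) - 1
    grid = coords.permute(0, 2, 3, 1)  # (B, H, W, 2)
    return F.grid_sample(feat, grid, align_corners=True)

global_feat = torch.randn(1, 64, 32, 32)
flow = torch.randn(1, 2, 32, 32)  # would be predicted from local and global features
aligned = warp_by_flow(global_feat, flow)
```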
Rethinking the Open-Loop Evaluation of End-to-End Autonomous Driving in nuScenes
Modern autonomous driving systems are typically divided into three main
tasks: perception, prediction, and planning. The planning task involves
predicting the trajectory of the ego vehicle based on inputs from both internal
intention and the external environment, and manipulating the vehicle
accordingly. Most existing works evaluate their performance on the nuScenes
dataset using the L2 error and collision rate between the predicted
trajectories and the ground truth. In this paper, we reevaluate these existing
evaluation metrics and explore whether they accurately measure the superiority
of different methods. Specifically, we design an MLP-based method that takes
raw sensor data (e.g., past trajectory, velocity, etc.) as input and directly
outputs the future trajectory of the ego vehicle, without using any perception
or prediction information such as camera images or LiDAR. Our simple method
achieves end-to-end planning performance on the nuScenes dataset comparable to
other perception-based methods, while reducing the average L2 error by about 20%.
Meanwhile, the perception-based methods have an advantage in terms of collision
rate. We further conduct in-depth analysis and provide new insights into the
factors that are critical for the success of the planning task on the nuScenes
dataset. Our observations also indicate that we need to rethink the current
open-loop evaluation scheme of end-to-end autonomous driving in nuScenes. Codes
are available at https://github.com/E2E-AD/AD-MLP.
Comment: Technical report. Code is availabl
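A hedged sketch of an MLP planning baseline of the kind described above, taking an ego-state history and outputting future waypoints evaluated with an average L2 error, is shown below; layer sizes and the input encoding are assumptions, not the released AD-MLP.

```python
# Hedged sketch of an MLP planning baseline: ego-state history in, future ego
# trajectory out, evaluated with an L2 error against ground truth
# (layer sizes and input encoding are assumptions, not the released AD-MLP).
import torch
import torch.nn as nn

class PlanningMLP(nn.Module):
    def __init__(self, hist_steps=4, fut_steps=6, state_dim=5, hidden=256):
        super().__init__()
        # state_dim could hold, e.g., x, y, heading, speed, acceleration per step.
        self.net = nn.Sequential(
            nn.Linear(hist_steps * state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, fut_steps * 2),  # future (x, y) waypoints
        )
        self.fut_steps = fut_steps

    def forward(self, ego_history: torch.Tensor) -> torch.Tensor:
        out = self.net(ego_history.flatten(1))
        return out.view(-1, self.fut_steps, 2)

def average_l2_error(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    # Mean Euclidean distance between predicted and ground-truth waypoints.
    return (pred - gt).norm(dim=-1).mean()

model = PlanningMLP()
history = torch.randn(8, 4, 5)   # batch of 8 ego-state histories
pred = model(history)
gt = torch.randn(8, 6, 2)
print(average_l2_error(pred, gt).item())
```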