80 research outputs found
BuilDiff: 3D Building Shape Generation using Single-Image Conditional Point Cloud Diffusion Models
3D building generation with low data acquisition costs, such as single
image-to-3D, is becoming increasingly important. However, most existing
single image-to-3D building creation works are restricted to images with
specific viewing angles, so they are difficult to scale to the general-view
images that commonly appear in practice. To fill this gap, we propose a
novel 3D building shape generation method that exploits point cloud diffusion
models with image conditioning schemes and is flexible with respect to the
input images. By combining two conditional diffusion models and introducing a
regularization strategy during the denoising process, our method is able to
synthesize building roofs while maintaining the overall structure. We validate
our framework on two newly built datasets, and extensive experiments show that
our method outperforms previous works in terms of building generation quality.
Comment: 10 pages, 6 figures, accepted to ICCVW202
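As a rough illustration of image-conditioned point cloud diffusion (not the authors' code; the class and function names are ours, and the noise predictor is a placeholder MLP rather than the paper's network), the sketch below shows one reverse-diffusion step in which an image embedding conditions the denoiser:

```python
import torch
import torch.nn as nn

class ConditionalDenoiser(nn.Module):
    """Predicts the noise in a point cloud, conditioned on an image embedding."""
    def __init__(self, point_dim=3, cond_dim=256, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(point_dim + cond_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, point_dim),
        )

    def forward(self, x_t, t, img_emb):
        # x_t: (B, N, 3) noisy points; img_emb: (B, cond_dim); t: int timestep
        B, N, _ = x_t.shape
        cond = img_emb.unsqueeze(1).expand(-1, N, -1)
        t_feat = torch.full((B, N, 1), float(t), device=x_t.device)
        return self.net(torch.cat([x_t, cond, t_feat], dim=-1))

@torch.no_grad()
def reverse_step(model, x_t, t, img_emb, betas):
    # One standard DDPM reverse step; a structural regularizer of the
    # kind the abstract mentions could be applied to the mean here.
    beta = betas[t]
    alpha_bar = torch.prod(1.0 - betas[: t + 1])
    eps = model(x_t, t, img_emb)
    mean = (x_t - beta / torch.sqrt(1.0 - alpha_bar) * eps) / torch.sqrt(1.0 - beta)
    noise = torch.randn_like(x_t) if t > 0 else torch.zeros_like(x_t)
    return mean + torch.sqrt(beta) * noise
```

Running `reverse_step` from `t = T-1` down to `0`, starting from Gaussian noise, yields a point cloud conditioned on the input image embedding.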
HRVQA: A Visual Question Answering Benchmark for High-Resolution Aerial Images
Visual question answering (VQA) is an important and challenging multimodal
task in computer vision. Recently, a few efforts have been made to bring the
VQA task to aerial images, due to its potential real-world applications in
disaster monitoring, urban planning, and digital earth product generation.
However, both the huge variation in the appearance, scale, and orientation of
the concepts in aerial images and the scarcity of well-annotated datasets
restrict the development of VQA in this domain. In this paper, we introduce a
new dataset, HRVQA, which provides 53,512 aerial images of 1024x1024 pixels
and 1,070,240 semi-automatically generated QA pairs. To benchmark the
understanding capability of VQA models for aerial images, we evaluate the
relevant methods on HRVQA. Moreover, we propose a novel model, GFTransformer,
with gated attention modules and a mutual fusion module. The experiments show
that the proposed dataset is quite challenging, especially for the
attribute-related questions. Our method achieves superior performance in
comparison to the previous state-of-the-art approaches. The dataset and the
source code will be released at https://hrvqa.nl/
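To make the gated-attention idea concrete, here is a minimal sketch (our own illustrative module, not the paper's GFTransformer) in which a sigmoid gate modulates cross-attended visual features before they are fused with the question features:

```python
import torch
import torch.nn as nn

class GatedAttentionFusion(nn.Module):
    """Question tokens attend to image regions; a learned gate filters the result."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.out = nn.Linear(2 * dim, dim)

    def forward(self, q_feats, v_feats):
        # q_feats: (B, Lq, D) question tokens; v_feats: (B, Lv, D) image regions
        attended, _ = self.cross_attn(q_feats, v_feats, v_feats)
        g = self.gate(torch.cat([q_feats, attended], dim=-1))   # in [0, 1]
        fused = torch.cat([q_feats, g * attended], dim=-1)
        return self.out(fused)  # (B, Lq, D) gated multimodal features
```

The gate lets the model suppress visual evidence that is irrelevant to a given question token, which is one plausible reading of how gating helps with the large appearance and scale variation in aerial scenes.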
UAVPal: A New Dataset for Semantic Segmentation in Complex Urban Landscape with Efficient Multiscale Segmentation
Semantic segmentation has recently emerged as a prominent area of interest in Earth observation. Several semantic segmentation datasets already exist, facilitating comparisons among different methods in complex urban scenes. However, most open high-resolution urban datasets are geographically skewed toward Europe and North America, while coverage of Southeast Asia is very limited. The considerable variation in city designs worldwide presents an obstacle to the applicability of computer vision models, especially when the training dataset lacks significant diversity. On the other hand, naively applying computationally expensive models leads to inefficiencies and sometimes poor performance. To tackle the lack of data diversity, we introduce the new UAVPal dataset of complex urban scenes from the city of Bhopal, India. We complement this by introducing a novel dense predictor head and demonstrate that a well-designed head can efficiently take advantage of multiscale features to enhance the benefits of a strong feature extractor backbone. We design our segmentation head to learn the importance of features at various scales for each individual class and refine the final dense prediction accordingly. We tested the proposed head with a state-of-the-art backbone on multiple UAV datasets and a high-resolution satellite image dataset for LULC classification. We observed improved intersection over union (IoU) in various classes and up to 2% better mean IoU. Apart from the performance improvements, we also observed a nearly 50% reduction in the computing operations required when using the proposed head compared to a traditional segmentation head.
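One way to realize a head that "learns the importance of features at various scales for each individual class" is a softmax-normalized per-(scale, class) weighting over multiscale logits. The sketch below is our own minimal interpretation, not the UAVPal head; the feature dimensions assume a generic four-level pyramid backbone:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PerClassScaleHead(nn.Module):
    """Fuses multiscale logits with learned per-(scale, class) weights."""
    def __init__(self, in_dims=(96, 192, 384, 768), num_classes=8, dim=128):
        super().__init__()
        self.proj = nn.ModuleList(nn.Conv2d(d, dim, 1) for d in in_dims)
        self.cls = nn.ModuleList(nn.Conv2d(dim, num_classes, 1) for _ in in_dims)
        # One learnable logit per (scale, class), softmax-normalized over scales
        self.scale_logits = nn.Parameter(torch.zeros(len(in_dims), num_classes))

    def forward(self, feats):
        # feats: list of (B, C_i, H_i, W_i) backbone maps, coarse to fine
        size = feats[-1].shape[-2:]          # predict at the finest resolution
        logits = []
        for f, p, c in zip(feats, self.proj, self.cls):
            x = c(F.relu(p(f)))
            logits.append(F.interpolate(x, size=size, mode="bilinear",
                                        align_corners=False))
        w = torch.softmax(self.scale_logits, dim=0)  # (S, K), sums to 1 per class
        stacked = torch.stack(logits)                # (S, B, K, H, W)
        return (w.view(w.shape[0], 1, w.shape[1], 1, 1) * stacked).sum(dim=0)
```

Because the fusion is a weighted sum of cheap 1x1-convolution logits rather than a heavy decoder, a head of this style is also consistent with the reported reduction in computing operations.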
Flow-based GAN for 3D Point Cloud Generation from a Single Image
Generating a 3D point cloud from a single 2D image is of great importance for
3D scene understanding applications. To reconstruct the whole 3D shape of the
object shown in the image, existing deep learning based approaches use
either explicit or implicit generative modeling of point clouds, both of
which, however, suffer from limited quality. In this work, we aim to alleviate
this issue by introducing a hybrid explicit-implicit generative modeling
scheme, which inherits flow-based explicit generative models for sampling
point clouds with arbitrary resolutions while improving the detailed 3D
structures of the point clouds by leveraging implicit generative adversarial
networks (GANs). We evaluate our method on the large-scale synthetic dataset
ShapeNet, with the experimental results demonstrating its superior
performance. In addition, the generalization ability of our method is
demonstrated by experiments on cross-category synthetic images as well as by
testing on real images from the PASCAL3D+ dataset.
Comment: 13 pages, 5 figures, accepted to BMVC202
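The hybrid explicit-implicit idea can be sketched with assumed components (our names, not the paper's models): a coupling-style flow maps as many latent points as requested to shape points, giving arbitrary-resolution sampling, while a point-set discriminator supplies the adversarial signal that sharpens details. Only the sampling direction of the flow is shown:

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """One image-conditioned coupling layer (sampling direction only)."""
    def __init__(self, dim=3, cond_dim=128, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(1 + cond_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (dim - 1)),
        )

    def forward(self, x, cond):
        # Keep the first coordinate; affinely transform the rest given it
        a, b = x[..., :1], x[..., 1:]
        s, t = self.net(torch.cat([a, cond], dim=-1)).chunk(2, dim=-1)
        return torch.cat([a, b * torch.exp(torch.tanh(s)) + t], dim=-1)

class PointDiscriminator(nn.Module):
    """Scores a whole point set via a max-pooled per-point embedding."""
    def __init__(self, dim=3, hidden=256):
        super().__init__()
        self.point = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU())
        self.head = nn.Linear(hidden, 1)

    def forward(self, pts):                                   # pts: (B, N, 3)
        return self.head(self.point(pts).max(dim=1).values)  # (B, 1)

def sample(flow, img_emb, n_points):
    # Arbitrary resolution: draw as many latent points as requested
    B, cond_dim = img_emb.shape
    z = torch.randn(B, n_points, 3)
    cond = img_emb.unsqueeze(1).expand(-1, n_points, -1)
    return flow(z, cond)
```

In training, samples from `sample(...)` would be scored by the discriminator against real point clouds, combining the flow's likelihood-based objective with a GAN loss.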
Unsupervised Domain Adaptation for Multispectral Pedestrian Detection
Multimodal information (e.g., visible and thermal) can produce robust
pedestrian detections that facilitate around-the-clock computer vision
applications such as autonomous driving and video surveillance. However, it
remains a crucial challenge to train a reliable detector that works well
across different multispectral pedestrian datasets without manual annotations.
In this paper, we propose a novel unsupervised domain adaptation framework for
multispectral pedestrian detection that iteratively generates pseudo
annotations and updates the parameters of our designed multispectral
pedestrian detector on the target domain. Pseudo annotations are first
generated using the detector trained on the source domain, and then updated by
fixing the parameters of the detector and minimizing a cross-entropy loss
without back-propagation. Training labels are generated from the pseudo
annotations by considering the similarity and complementarity between
well-aligned visible and infrared image pairs. The parameters of the detector
are updated using the generated labels by minimizing our defined
multi-detection loss function with back-propagation. The optimal detector
parameters are obtained after iteratively updating the pseudo annotations and
the parameters. Experimental results show that our proposed unsupervised
multimodal domain adaptation method achieves significantly higher detection
performance than the approach without domain adaptation, and is competitive
with supervised multispectral pedestrian detectors.
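The iterative structure of the framework can be summarized in a short sketch. All names here are placeholders, not the authors' API, and `fuse_modalities` is a crude stand-in for the paper's similarity/complementarity-based label update:

```python
import torch

def fuse_modalities(det_rgb, det_ir):
    # Hypothetical stand-in: the paper scores detections by similarity
    # (agreement between modalities) and complementarity (one modality
    # recovering what the other missed); here we simply pool both sets.
    return torch.cat([det_rgb, det_ir], dim=0)

def adapt(detector, target_loader, optimizer, criterion, num_rounds=5):
    for _ in range(num_rounds):
        # Step 1: freeze the detector and refresh pseudo annotations
        # on the unlabeled target domain.
        detector.eval()
        pseudo = []
        with torch.no_grad():
            for rgb, ir in target_loader:
                pseudo.append(fuse_modalities(detector(rgb), detector(ir)))
        # Step 2: update the detector on the pseudo labels, minimizing
        # the multi-detection loss (passed in as `criterion`) with
        # back-propagation.
        detector.train()
        for (rgb, ir), labels in zip(target_loader, pseudo):
            loss = criterion(detector(rgb), detector(ir), labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return detector
```

Alternating the two steps matches the abstract's description of iteratively updating pseudo annotations and detector parameters until the labels stabilize.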
- …