71 research outputs found
CrossDiff: Exploring Self-Supervised Representation of Pansharpening via Cross-Predictive Diffusion Model
Fusion of a panchromatic (PAN) image and a corresponding multispectral (MS) image is also known as pansharpening, which aims to combine the abundant spatial details of the PAN image with the spectral information of the MS image. Due to the absence of
high-resolution MS images, available deep-learning-based methods usually follow
the paradigm of training at reduced resolution and testing at both reduced and
full resolution. When taking original MS and PAN images as inputs, they always
obtain sub-optimal results due to the scale variation. In this paper, we
propose to explore the self-supervised representation of pansharpening by
designing a cross-predictive diffusion model, named CrossDiff. It has two-stage
training. In the first stage, we introduce a cross-predictive pretext task to
pre-train the UNet structure based on conditional DDPM, while in the second
stage, the encoders of the UNets are frozen to directly extract spatial and
spectral features from PAN and MS, and only the fusion head is trained to adapt to the pansharpening task. Extensive experiments show the effectiveness and
superiority of the proposed model compared with state-of-the-art supervised and
unsupervised methods. Besides, the cross-sensor experiments also verify the generalization ability of the proposed self-supervised representation learners on other satellites' datasets. We will release our code for reproducibility.
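To make the second training stage concrete, the sketch below illustrates the general idea of freezing pre-trained encoders and training only a fusion head. The encoder and head architectures, channel counts, and the L1 objective are illustrative assumptions, not CrossDiff's actual configuration.

```python
# Hedged sketch of stage 2 of a two-stage scheme: frozen encoders + trainable fusion head.
import torch
import torch.nn as nn

class ConvEncoder(nn.Module):
    def __init__(self, in_ch, feat_ch=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, feat_ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat_ch, feat_ch, 3, padding=1), nn.ReLU(),
        )
    def forward(self, x):
        return self.net(x)

class FusionHead(nn.Module):
    def __init__(self, feat_ch=64, out_ch=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2 * feat_ch, feat_ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat_ch, out_ch, 3, padding=1),
        )
    def forward(self, f_pan, f_ms):
        return self.net(torch.cat([f_pan, f_ms], dim=1))

pan_encoder, ms_encoder = ConvEncoder(1), ConvEncoder(4)
for p in list(pan_encoder.parameters()) + list(ms_encoder.parameters()):
    p.requires_grad = False  # stage 2: the spatial/spectral encoders stay frozen

head = FusionHead()
optimizer = torch.optim.Adam(head.parameters(), lr=1e-4)  # only the head is updated

pan = torch.randn(2, 1, 64, 64)      # panchromatic input (hypothetical shapes)
ms = torch.randn(2, 4, 64, 64)       # (upsampled) multispectral input
target = torch.randn(2, 4, 64, 64)   # reduced-resolution supervision

pred = head(pan_encoder(pan), ms_encoder(ms))
loss = nn.functional.l1_loss(pred, target)  # illustrative loss, not the paper's
loss.backward()
optimizer.step()
```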
DMAT: A Dynamic Mask-Aware Transformer for Human De-occlusion
Human de-occlusion, which aims to infer the appearance of invisible human
parts from an occluded image, has great value in many human-related tasks, such as person re-identification and intention inference. To address this task, this paper
proposes a dynamic mask-aware transformer (DMAT), which dynamically augments
information from human regions and weakens that from occlusion. First, to
enhance token representation, we design an expanded convolution head with
enlarged kernels, which captures more local valid context and mitigates the
influence of surrounding occlusion. To concentrate on the visible human parts,
we propose a novel dynamic multi-head human-mask guided attention mechanism
through integrating multiple masks, which can prevent the de-occluded regions from assimilating into the background. Besides, a region upsampling strategy is
utilized to alleviate the impact of occlusion on interpolated images. During
model learning, an amodal loss is developed to further emphasize the recovery
effect of human regions, which also improves the model's convergence. Extensive experiments on the AHP dataset demonstrate its superior performance compared to recent state-of-the-art methods.
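As a rough illustration of mask-guided attention in general (not the exact DMAT formulation, which dynamically integrates multiple masks), the sketch below suppresses attention toward tokens marked as occluded so that queries focus on visible human regions; all shapes and the mask itself are hypothetical.

```python
# Minimal sketch: multi-head attention with a visibility-mask bias on the keys.
import torch
import torch.nn.functional as F

def mask_guided_attention(q, k, v, visible_mask, num_heads=4):
    """q, k, v: (B, N, C); visible_mask: (B, N) with 1 for visible tokens."""
    B, N, C = q.shape
    d = C // num_heads
    def split(x):  # (B, N, C) -> (B, heads, N, d)
        return x.view(B, N, num_heads, d).transpose(1, 2)
    q, k, v = split(q), split(k), split(v)
    logits = (q @ k.transpose(-2, -1)) / d ** 0.5            # (B, heads, N, N)
    bias = (1.0 - visible_mask)[:, None, None, :] * -1e4     # push occluded keys down
    attn = F.softmax(logits + bias, dim=-1)
    out = attn @ v                                            # (B, heads, N, d)
    return out.transpose(1, 2).reshape(B, N, C)

tokens = torch.randn(2, 16, 32)
mask = (torch.rand(2, 16) > 0.3).float()   # hypothetical visibility mask
out = mask_guided_attention(tokens, tokens, tokens, mask)
print(out.shape)  # torch.Size([2, 16, 32])
```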
MS-DETR: Multispectral Pedestrian Detection Transformer with Loosely Coupled Fusion and Modality-Balanced Optimization
Multispectral pedestrian detection is an important task for many
around-the-clock applications, since the visible and thermal modalities can
provide complementary information, especially under low-light conditions. Most available multispectral pedestrian detectors are based on non-end-to-end frameworks. In this paper, we propose the MultiSpectral pedestrian DEtection TRansformer (MS-DETR), an end-to-end multispectral pedestrian detector that extends DETR into the field of multi-modal detection. MS-DETR consists of two modality-specific backbones and Transformer encoders, followed by a multi-modal Transformer decoder in which the visible and thermal features are fused. To resist the misalignment between
multi-modal images, we design a loosely coupled fusion strategy by sparsely
sampling some keypoints from multi-modal features independently and fusing them
with adaptively learned attention weights. Moreover, based on the insight that not only different modalities but also different pedestrian instances tend to have different confidence scores in the final detection, we further propose an
instance-aware modality-balanced optimization strategy, which preserves visible
and thermal decoder branches and aligns their predicted slots through an
instance-wise dynamic loss. Our end-to-end MS-DETR shows superior performance
on the challenging KAIST, CVC-14 and LLVIP benchmark datasets. The source code
is available at https://github.com/YinghuiXing/MS-DETR.
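The following sketch illustrates one plausible form of loosely coupled fusion: each query predicts a few sampling locations per modality, bilinearly samples the visible and thermal feature maps independently, and blends the samples with learned attention weights. The linear parameterization of offsets and weights is an assumption for illustration, not MS-DETR's exact design.

```python
# Hedged sketch of sparse keypoint sampling + weighted fusion across two modalities.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LooselyCoupledFusion(nn.Module):
    def __init__(self, dim=64, num_points=4):
        super().__init__()
        self.num_points = num_points
        # per-modality sampling locations and per-point fusion weights (illustrative)
        self.offsets = nn.Linear(dim, 2 * 2 * num_points)   # 2 modalities * (x, y)
        self.weights = nn.Linear(dim, 2 * num_points)
        self.proj = nn.Linear(dim, dim)

    def sample(self, feat, locs):
        # feat: (B, C, H, W); locs: (B, Q, P, 2) in [-1, 1]
        sampled = F.grid_sample(feat, locs, align_corners=False)  # (B, C, Q, P)
        return sampled.permute(0, 2, 3, 1)                        # (B, Q, P, C)

    def forward(self, queries, vis_feat, thm_feat):
        B, Q, C = queries.shape
        locs = torch.tanh(self.offsets(queries)).view(B, Q, 2, self.num_points, 2)
        w = torch.softmax(self.weights(queries), dim=-1).view(B, Q, 2, self.num_points, 1)
        vis = self.sample(vis_feat, locs[:, :, 0])   # sample each modality independently
        thm = self.sample(thm_feat, locs[:, :, 1])
        fused = (torch.stack([vis, thm], dim=2) * w).sum(dim=(2, 3))  # (B, Q, C)
        return self.proj(fused)

fusion = LooselyCoupledFusion()
queries = torch.randn(2, 10, 64)
vis_feat, thm_feat = torch.randn(2, 64, 32, 32), torch.randn(2, 64, 32, 32)
print(fusion(queries, vis_feat, thm_feat).shape)  # torch.Size([2, 10, 64])
```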
(E)-1-[4-(Dimethylamino)benzylidene]thiosemicarbazide
In the title molecule, C10H14N4S, the thiourea plane and benzene ring form a dihedral angle of 16.0 (3)°. In the crystal structure, intermolecular N—H⋯S hydrogen bonds link the molecules into ribbons extended in the [100] direction; these incorporate inversion dimers.
Pre-train, Adapt and Detect: Multi-Task Adapter Tuning for Camouflaged Object Detection
Camouflaged object detection (COD), aiming to segment camouflaged objects that exhibit patterns similar to the background, is a challenging task. Most existing works are dedicated to establishing specialized modules to identify camouflaged objects with complete and fine details, while the boundary cannot be well located due to the lack of object-related semantics. In this paper, we propose a novel "pre-train, adapt and detect" paradigm to detect camouflaged
objects. By introducing a large pre-trained model, abundant knowledge learned
from massive multi-modal data can be directly transferred to COD. A lightweight
parallel adapter is inserted to adjust the features so that they are suitable for the downstream COD task. Extensive experiments on four challenging benchmark datasets
demonstrate that our method outperforms existing state-of-the-art COD models by
large margins. Moreover, we design a multi-task learning scheme for tuning the
adapter to exploit the shareable knowledge across different semantic classes.
Comprehensive experimental results show that the generalization ability of our model can be substantially improved with multi-task adapter initialization on source tasks and multi-task adaptation on target tasks.
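As a minimal sketch of the "pre-train, adapt and detect" recipe, the code below freezes a stand-in pre-trained block and trains only a lightweight bottleneck adapter placed in parallel; the adapter width and placement are illustrative assumptions, not the paper's exact module.

```python
# Minimal sketch: frozen pre-trained block with a trainable parallel adapter.
import torch
import torch.nn as nn

class ParallelAdapter(nn.Module):
    def __init__(self, dim, bottleneck=16):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, x):
        return self.up(self.act(self.down(x)))

class AdaptedBlock(nn.Module):
    """A frozen pre-trained block with a lightweight adapter added in parallel."""
    def __init__(self, frozen_block, dim):
        super().__init__()
        self.block = frozen_block
        for p in self.block.parameters():
            p.requires_grad = False        # large pre-trained weights stay fixed
        self.adapter = ParallelAdapter(dim)

    def forward(self, x):
        return self.block(x) + self.adapter(x)

frozen = nn.Sequential(nn.Linear(128, 128), nn.GELU(), nn.Linear(128, 128))
block = AdaptedBlock(frozen, dim=128)
trainable = [n for n, p in block.named_parameters() if p.requires_grad]
print(trainable)  # only adapter.* parameters are updated during adaptation
```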
Ground-to-Aerial Person Search: Benchmark Dataset and Approach
In this work, we construct a large-scale dataset for Ground-to-Aerial Person
Search, named G2APS, which contains 31,770 images with 260,559 annotated bounding boxes for 2,644 identities appearing in both UAV and ground surveillance cameras. To our knowledge, this is the first dataset for
cross-platform intelligent surveillance applications, where the UAVs could work
as a powerful complement for the ground surveillance cameras. To more
realistically simulate the actual cross-platform Ground-to-Aerial surveillance
scenarios, the surveillance cameras are fixed about 2 meters above the ground,
while the UAVs capture videos of persons at different locations, with a variety of view angles, flight attitudes and flight modes. Therefore, the dataset has
the following unique characteristics: 1) drastic view-angle changes between
query and gallery person images from cross-platform cameras; 2) diverse
resolutions, poses and views of the person images under 9 rich real-world
scenarios. On the basis of the G2APS benchmark dataset, we present a detailed analysis of current two-step and end-to-end person search methods, and further propose a simple yet effective knowledge distillation scheme on the head of the ReID network, which achieves state-of-the-art performance on both G2APS and the two previous public person search datasets, i.e., PRW and CUHK-SYSU. The dataset and source code are available at https://github.com/yqc123456/HKD_for_person_search. Comment: Accepted by ACM MM 202
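The abstract does not spell out the distillation objective, so the sketch below shows a generic head-level knowledge distillation loss (temperature-softened KL plus cross-entropy) as one plausible instantiation; the teacher/student roles, class count, and loss weighting are assumptions rather than the paper's scheme.

```python
# Hedged sketch of generic head-level knowledge distillation for a ReID classifier head.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    # temperature-softened KL between student and teacher distributions
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # standard identity-classification loss on the student
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce

student = torch.randn(8, 751, requires_grad=True)   # hypothetical ID logits
teacher = torch.randn(8, 751)
labels = torch.randint(0, 751, (8,))
loss = distillation_loss(student, teacher, labels)
loss.backward()
```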
Dual Modality Prompt Tuning for Vision-Language Pre-Trained Model
With the emergence of large pre-trained vision-language models like CLIP,
transferable representations can be adapted to a wide range of downstream tasks
via prompt tuning. Prompt tuning tries to probe the beneficial information for
downstream tasks from the general knowledge stored in the pre-trained model. A
recently proposed method named Context Optimization (CoOp) introduces a set of
learnable vectors as text prompt from the language side. However, tuning the
text prompt alone can only adjust the synthesized "classifier", while the computed visual features of the image encoder cannot be affected, thus leading to sub-optimal solutions. In this paper, we propose a novel
Dual-modality Prompt Tuning (DPT) paradigm through learning text and visual
prompts simultaneously. To make the final image feature concentrate more on the
target visual concept, a Class-Aware Visual Prompt Tuning (CAVPT) scheme is
further proposed in our DPT, where the class-aware visual prompt is generated
dynamically by performing the cross attention between text prompts features and
image patch token embeddings to encode both the downstream task-related
information and visual instance information. Extensive experimental results on
11 datasets demonstrate the effectiveness and generalization ability of the
proposed method. Our code is available at https://github.com/fanrena/DPT. Comment: 12 pages, 7 figures
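A minimal sketch of the class-aware visual prompt idea: text prompt features act as queries and image patch tokens as keys/values in a cross-attention layer, producing prompt tokens that carry both task-related and instance-specific information. The single attention layer and all dimensions are illustrative assumptions, not the exact CAVPT module.

```python
# Minimal sketch: cross attention from text prompt features to image patch tokens.
import torch
import torch.nn as nn

class ClassAwareVisualPrompt(nn.Module):
    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, text_prompts, patch_tokens):
        # text_prompts: (B, n_cls, dim); patch_tokens: (B, n_patches, dim)
        prompts, _ = self.cross_attn(text_prompts, patch_tokens, patch_tokens)
        return prompts  # would be appended to the visual token sequence downstream

cavpt = ClassAwareVisualPrompt()
text_prompts = torch.randn(2, 10, 512)     # one prompt feature per class (hypothetical)
patch_tokens = torch.randn(2, 196, 512)    # ViT patch embeddings (hypothetical)
visual_prompts = cavpt(text_prompts, patch_tokens)
print(visual_prompts.shape)  # torch.Size([2, 10, 512])
```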
SaliencyGAN: Deep Learning Semisupervised Salient Object Detection in the Fog of IoT
In the modern Internet of Things (IoT), visual analysis and predictions are often performed by deep learning models. Salient object detection (SOD) is a fundamental preprocessing step for these applications. Executing SOD on fog devices is a challenging task due to the diversity of data and fog devices. To adopt convolutional neural networks (CNNs) on fog-cloud infrastructures for SOD-based applications, we introduce a semisupervised adversarial learning method in this article. The proposed model, named SaliencyGAN, is empowered by a novel concatenated generative adversarial network (GAN) framework with partially shared parameters. The backbone CNN can be chosen flexibly based on the specific devices and applications. Meanwhile, our method uses both labeled and unlabeled data from different problem domains for training. Using multiple popular benchmark datasets, we compared SaliencyGAN, trained with 10-100% labeled data, against state-of-the-art baseline methods. SaliencyGAN gained performance comparable to the supervised baselines when the percentage of labeled data reached 30%, and outperformed the weakly supervised and unsupervised baselines. Furthermore, our ablation study shows that SaliencyGAN was more robust to the common “mode missing” (or “mode collapse”) issue compared to the selected popular GAN models. The visualized ablation results show that SaliencyGAN learned a better estimation of the data distributions. To the best of our knowledge, this is the first IoT-oriented semisupervised SOD method.
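To illustrate the semisupervised adversarial setup in the simplest possible terms, the sketch below combines a supervised loss on labeled images with an adversarial term on unlabeled images; the tiny generator and discriminator are stand-ins, and SaliencyGAN's concatenated architecture with partially shared parameters is not reproduced here.

```python
# Very compact sketch of semisupervised adversarial training for saliency prediction.
import torch
import torch.nn as nn

gen = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                    nn.Conv2d(16, 1, 3, padding=1), nn.Sigmoid())
disc = nn.Sequential(nn.Conv2d(4, 16, 3, padding=1), nn.ReLU(),
                     nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 1))
bce = nn.BCEWithLogitsLoss()

img_l, sal_l = torch.randn(2, 3, 64, 64), torch.rand(2, 1, 64, 64)  # labeled pair
img_u = torch.randn(2, 3, 64, 64)                                   # unlabeled image

pred_l, pred_u = gen(img_l), gen(img_u)
sup = nn.functional.binary_cross_entropy(pred_l, sal_l)   # supervised term (labeled data)
adv = bce(disc(torch.cat([img_u, pred_u], dim=1)),
          torch.ones(2, 1))                                # adversarial term (unlabeled data)
g_loss = sup + 0.1 * adv                                   # illustrative weighting
g_loss.backward()
```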
Efficient Bilateral Cross-Modality Cluster Matching for Unsupervised Visible-Infrared Person ReID
Unsupervised visible-infrared person re-identification (USL-VI-ReID) aims to
match pedestrian images of the same identity from different modalities without
annotations. Existing works mainly focus on alleviating the modality gap by
aligning instance-level features of the unlabeled samples. However, the
relationships between cross-modality clusters are not well explored. To this
end, we propose a novel bilateral cluster matching-based learning framework to
reduce the modality gap by matching cross-modality clusters. Specifically, we
design a Many-to-many Bilateral Cross-Modality Cluster Matching (MBCCM)
algorithm through optimizing the maximum matching problem in a bipartite graph.
Then, the matched pairwise clusters utilize shared visible and infrared pseudo-labels during model training. Under such a supervisory signal, a Modality-Specific and Modality-Agnostic (MSMA) contrastive learning framework is proposed to align features jointly at the cluster level. Meanwhile, a cross-modality Consistency Constraint (CC) is proposed to explicitly reduce the large modality discrepancy. Extensive experiments on the public SYSU-MM01 and RegDB datasets demonstrate the effectiveness of the proposed method, surpassing state-of-the-art approaches by a large margin of 8.76% mAP on average.
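As a simplified view of the cluster-matching step, the sketch below matches visible and infrared cluster centroids in a bipartite graph by minimizing pairwise distance with the Hungarian algorithm; note this is one-to-one matching on hypothetical centroids, whereas the paper's MBCCM algorithm is many-to-many and optimizes its own objective.

```python
# Hedged sketch: bipartite matching of cross-modality cluster centroids.
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
vis_centroids = rng.normal(size=(5, 128))   # hypothetical visible cluster centers
ir_centroids = rng.normal(size=(5, 128))    # hypothetical infrared cluster centers

# pairwise Euclidean cost between cross-modality clusters
cost = np.linalg.norm(vis_centroids[:, None] - ir_centroids[None, :], axis=-1)
row, col = linear_sum_assignment(cost)      # minimum-cost one-to-one matching

# matched pairs would then share pseudo-labels during model training
for v, r in zip(row, col):
    print(f"visible cluster {v} <-> infrared cluster {r}")
```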