58 research outputs found
MS-DETR: Multispectral Pedestrian Detection Transformer with Loosely Coupled Fusion and Modality-Balanced Optimization
Multispectral pedestrian detection is an important task for many
around-the-clock applications, since the visible and thermal modalities can
provide complementary information especially under low light conditions. Most
of the available multispectral pedestrian detectors are based on non-end-to-end
detectors, while in this paper, we propose MultiSpectral pedestrian DEtection
TRansformer (MS-DETR), an end-to-end multispectral pedestrian detector, which
extends DETR into the field of multi-modal detection. MS-DETR consists of two
modality-specific backbones and Transformer encoders, followed by a multi-modal
Transformer decoder, and the visible and thermal features are fused in the
multi-modal Transformer decoder. To resist the misalignment between
multi-modal images, we design a loosely coupled fusion strategy by sparsely
sampling some keypoints from multi-modal features independently and fusing them
with adaptively learned attention weights. Moreover, based on the insight that
not only different modalities, but also different pedestrian instances tend to
have different confidence scores to final detection, we further propose an
instance-aware modality-balanced optimization strategy, which preserves visible
and thermal decoder branches and aligns their predicted slots through an
instance-wise dynamic loss. Our end-to-end MS-DETR shows superior performance
on the challenging KAIST, CVC-14 and LLVIP benchmark datasets. The source code
is available at https://github.com/YinghuiXing/MS-DETR
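
The fusion step described above (keypoints sampled independently from each modality, then combined with adaptively learned attention weights) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names are hypothetical, and the attention scores that a real model would predict are passed in as fixed numbers.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_keypoints(visible_feats, thermal_feats, scores):
    """Fuse keypoint features from two modalities by a weighted sum.

    visible_feats / thermal_feats: lists of equal-length feature vectors,
    sampled independently per modality (so no pixel alignment is assumed).
    scores: one unnormalized attention score per keypoint, visible first,
    then thermal. In a trained model these would be learned; here they
    are plain inputs (an assumption made for illustration).
    """
    feats = visible_feats + thermal_feats
    if len(scores) != len(feats):
        raise ValueError("need exactly one score per sampled keypoint")
    weights = softmax(scores)  # normalized jointly across both modalities
    dim = len(feats[0])
    return [sum(w * f[d] for w, f in zip(weights, feats)) for d in range(dim)]
```

Because the weights are normalized jointly over both modalities, a low score on one modality's keypoints shifts the fused feature toward the other modality.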
(E)-1-[4-(Dimethylamino)benzylidene]thiosemicarbazide
In the title molecule, C10H14N4S, the thiourea plane and benzene ring form a dihedral angle of 16.0 (3)°. In the crystal structure, intermolecular N—H⋯S hydrogen bonds link the molecules into ribbons extended in the [100] direction; these incorporate inversion dimers
Pre-train, Adapt and Detect: Multi-Task Adapter Tuning for Camouflaged Object Detection
Camouflaged object detection (COD), aiming to segment camouflaged objects
which exhibit similar patterns with the background, is a challenging task. Most
existing works are dedicated to establishing specialized modules to identify
camouflaged objects with complete and fine details, while the boundary cannot
be well located owing to the lack of object-related semantics. In this paper, we
propose a novel "pre-train, adapt and detect" paradigm to detect camouflaged
objects. By introducing a large pre-trained model, abundant knowledge learned
from massive multi-modal data can be directly transferred to COD. A lightweight
parallel adapter is inserted to adjust the features suitable for the downstream
COD task. Extensive experiments on four challenging benchmark datasets
demonstrate that our method outperforms existing state-of-the-art COD models by
large margins. Moreover, we design a multi-task learning scheme for tuning the
adapter to exploit the shareable knowledge across different semantic classes.
Comprehensive experimental results showed that the generalization ability of
our model can be substantially improved with multi-task adapter initialization
on source tasks and multi-task adaptation on target tasks
Ground-to-Aerial Person Search: Benchmark Dataset and Approach
In this work, we construct a large-scale dataset for Ground-to-Aerial Person
Search, named G2APS, which contains 31,770 images of 260,559 annotated bounding
boxes for 2,644 identities appearing in both UAV and ground
surveillance cameras. To our knowledge, this is the first dataset for
cross-platform intelligent surveillance applications, where the UAVs could work
as a powerful complement for the ground surveillance cameras. To more
realistically simulate the actual cross-platform Ground-to-Aerial surveillance
scenarios, the surveillance cameras are fixed about 2 meters above the ground,
while the UAVs capture videos of persons at different locations, with a variety
of view-angles, flight attitudes and flight modes. Therefore, the dataset has
the following unique characteristics: 1) drastic view-angle changes between
query and gallery person images from cross-platform cameras; 2) diverse
resolutions, poses and views of the person images under 9 rich real-world
scenarios. On the basis of the G2APS benchmark dataset, we present a detailed
analysis of current two-step and end-to-end person search methods, and
further propose a simple yet effective knowledge distillation scheme on the
head of the ReID network, which achieves state-of-the-art performance on both
G2APS and the two previous public person search datasets, i.e., PRW and
CUHK-SYSU. The dataset and source code are available at
https://github.com/yqc123456/HKD_for_person_search. Comment: Accepted by ACM MM 202
Dual Modality Prompt Tuning for Vision-Language Pre-Trained Model
With the emergence of large pre-trained vision-language models like CLIP,
transferable representations can be adapted to a wide range of downstream tasks
via prompt tuning. Prompt tuning tries to probe the beneficial information for
downstream tasks from the general knowledge stored in the pre-trained model. A
recently proposed method named Context Optimization (CoOp) introduces a set of
learnable vectors as text prompt from the language side. However, tuning the
text prompt alone can only adjust the synthesized "classifier", while the
computed visual features of the image encoder cannot be affected, thus
leading to sub-optimal solutions. In this paper, we propose a novel
Dual-modality Prompt Tuning (DPT) paradigm through learning text and visual
prompts simultaneously. To make the final image feature concentrate more on the
target visual concept, a Class-Aware Visual Prompt Tuning (CAVPT) scheme is
further proposed in our DPT, where the class-aware visual prompt is generated
dynamically by performing cross attention between text prompt features and
image patch token embeddings to encode both the downstream task-related
information and visual instance information. Extensive experimental results on
11 datasets demonstrate the effectiveness and generalization ability of the
proposed method. Our code is available at https://github.com/fanrena/DPT. Comment: 12 pages, 7 figure
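
The cross attention between text prompt features and image patch token embeddings mentioned above can be sketched as plain scaled dot-product attention. This is a minimal illustration under stated assumptions, not the CAVPT module itself: it uses a single head, omits the learned projection matrices, and the function name is hypothetical.

```python
import math


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross attention.

    queries: text prompt feature vectors (one output row per query).
    keys / values: image patch token embeddings; keys are compared to the
    queries, values are mixed according to the resulting weights.
    Learned query/key/value projections are omitted (an assumption made
    to keep the sketch short).
    """
    dim = len(keys[0])
    out_dim = len(values[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every patch, scaled by sqrt(dim).
        logits = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        peak = max(logits)
        exps = [math.exp(l - peak) for l in logits]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted mix of patch values for this query.
        outputs.append(
            [sum(w * v[i] for w, v in zip(weights, values)) for i in range(out_dim)]
        )
    return outputs
```

Each output row mixes patch information according to how strongly the patches match one text prompt feature, which is the sense in which the resulting prompt carries both class and instance information.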
SaliencyGAN: Deep Learning Semisupervised Salient Object Detection in the Fog of IoT
In the modern Internet of Things (IoT), visual analysis and predictions are often performed by deep learning models. Salient object detection (SOD) is a fundamental preprocessing step for these applications. Executing SOD on fog devices is a challenging task due to the diversity of data and fog devices. To adopt convolutional neural networks (CNN) on fog-cloud infrastructures for SOD-based applications, we introduce a semisupervised adversarial learning method in this article. The proposed model, named SaliencyGAN, is empowered by a novel concatenated generative adversarial network (GAN) framework with partially shared parameters. The backbone CNN can be chosen flexibly based on the specific devices and applications. Meanwhile, our method uses both labeled and unlabeled data from different problem domains for training. Using multiple popular benchmark datasets, we compared state-of-the-art baseline methods to our SaliencyGAN obtained with 10-100% labeled training data. SaliencyGAN gained performance comparable to the supervised baselines when the percentage of labeled data reached 30%, and outperformed the weakly supervised and unsupervised baselines. Furthermore, our ablation study shows that SaliencyGAN was more robust to the common “mode missing” (or “mode collapse”) issue compared to the selected popular GAN models. The visualized ablation results have proved that SaliencyGAN learned a better estimation of data distributions. To the best of our knowledge, this is the first IoT-oriented semisupervised SOD method
Efficient Bilateral Cross-Modality Cluster Matching for Unsupervised Visible-Infrared Person ReID
Unsupervised visible-infrared person re-identification (USL-VI-ReID) aims to
match pedestrian images of the same identity from different modalities without
annotations. Existing works mainly focus on alleviating the modality gap by
aligning instance-level features of the unlabeled samples. However, the
relationships between cross-modality clusters are not well explored. To this
end, we propose a novel bilateral cluster matching-based learning framework to
reduce the modality gap by matching cross-modality clusters. Specifically, we
design a Many-to-many Bilateral Cross-Modality Cluster Matching (MBCCM)
algorithm through optimizing the maximum matching problem in a bipartite graph.
Then, the matched pairwise clusters utilize shared visible and infrared
pseudo-labels during the model training. Under such a supervisory signal, a
Modality-Specific and Modality-Agnostic (MSMA) contrastive learning framework
is proposed to align features jointly at a cluster-level. Meanwhile, the
cross-modality Consistency Constraint (CC) is proposed to explicitly reduce the
large modality discrepancy. Extensive experiments on the public SYSU-MM01 and
RegDB datasets demonstrate the effectiveness of the proposed method, surpassing
state-of-the-art approaches by a large margin of 8.76% mAP on average
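
The abstract above frames cluster matching as a maximum matching problem in a bipartite graph. As a simplified illustration of that underlying problem only, the sketch below runs the standard augmenting-path method for one-to-one maximum bipartite matching. It is not the many-to-many MBCCM algorithm; the similarity threshold and function names are assumptions introduced here.

```python
def build_graph(similarity, threshold):
    """Connect visible cluster i to infrared cluster j when their
    similarity reaches the threshold (a hypothetical edge criterion)."""
    return [[j for j, s in enumerate(row) if s >= threshold] for row in similarity]


def max_bipartite_matching(adj, n_right):
    """One-to-one maximum matching via augmenting paths.

    adj[u] lists the right-side nodes that left node u may be matched to.
    Returns a dict mapping each matched left node to its right node.
    """
    match_right = [-1] * n_right  # match_right[v] = left node matched to v

    def try_assign(u, visited):
        for v in adj[u]:
            if v in visited:
                continue
            visited.add(v)
            # Take v if it is free, or if its current partner can move elsewhere.
            if match_right[v] == -1 or try_assign(match_right[v], visited):
                match_right[v] = u
                return True
        return False

    for u in range(len(adj)):
        try_assign(u, set())
    return {u: v for v, u in enumerate(match_right) if u != -1}
```

For example, with similarities `[[0.9, 0.8], [0.85, 0.1]]` and a threshold of 0.5, the first visible cluster is reassigned to the second infrared cluster so that both visible clusters end up matched.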
Text-based Person Search in Full Images via Semantic-Driven Proposal Generation
Finding target persons in full scene images with a query of text description
has important practical applications in intelligent video surveillance. However,
different from the real-world scenarios where the bounding boxes are not
available, existing text-based person retrieval methods mainly focus on the
cross modal matching between the query text descriptions and the gallery of
cropped pedestrian images. To close the gap, we study the problem of text-based
person search in full images by proposing a new end-to-end learning framework
that jointly optimizes the pedestrian detection, identification and
visual-semantic feature embedding tasks. To take full advantage of the query
text, the semantic features are leveraged to instruct the Region Proposal
Network to pay more attention to the text-described proposals. Besides, a
cross-scale visual-semantic embedding mechanism is utilized to improve the
performance. To validate the proposed method, we collect and annotate two
large-scale benchmark datasets based on the widely adopted image-based person
search datasets CUHK-SYSU and PRW. Comprehensive experiments are conducted on
the two datasets; compared with the baseline methods, our method achieves
state-of-the-art performance
1-Hexadecyl-3-methylimidazolium bromide monohydrate
In the crystal structure of the title compound, C20H39N2+·Br−·H2O, the 1-hexadecyl-3-methylimidazolium cations are stacked along the b axis, forming channels parallel to [100] which are occupied by the bromide anions and water molecules. The crystal is stabilized by O—H⋯Br, C—H⋯O and C—H⋯Br hydrogen-bonding interactions, generating a two-dimensional network