Generative Action Description Prompts for Skeleton-based Action Recognition
Skeleton-based action recognition has recently received considerable
attention. Current approaches to skeleton-based action recognition are
typically formulated as one-hot classification tasks and do not fully exploit
the semantic relations between actions. For example, "make victory sign" and
"thumb up" are two hand-gesture actions whose major difference lies in the
movement of the hands. This information is absent from the categorical one-hot
encoding of action classes but can be uncovered from the action description.
Therefore, utilizing action descriptions in training could potentially benefit
representation learning. In this work, we propose a Generative
Action-description Prompts (GAP) approach for skeleton-based action
recognition. Specifically, we employ a pre-trained large-scale language
model as a knowledge engine to automatically generate text descriptions for
the body-part movements of actions, and we propose a multi-modal training
scheme that uses the text encoder to generate feature vectors for different
body parts and supervises the skeleton encoder for action representation learning.
Experiments show that our proposed GAP method achieves noticeable improvements
over various baseline models without extra computational cost at inference. GAP
achieves new state-of-the-art results on popular skeleton-based action recognition
benchmarks, including NTU RGB+D, NTU RGB+D 120, and NW-UCLA. The source code is
available at https://github.com/MartinXM/GAP.
Comment: Accepted by ICCV2
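As a hedged illustration of the training scheme sketched in this abstract, the snippet below combines the usual one-hot classification loss with a cosine-alignment term between per-part skeleton features and the text features of the ground-truth class. The tensor shapes, the `gap_loss` name, and the loss weighting are assumptions for illustration, not the authors' exact formulation.

```python
# Minimal sketch of GAP-style multi-modal supervision (assumed interface:
# the skeleton encoder exposes per-part features; text features are
# precomputed from LLM-generated part descriptions with a frozen text
# encoder; all names here are illustrative).
import torch.nn.functional as F

def gap_loss(part_feats, text_feats, logits, labels, alpha=0.5):
    """part_feats: (B, P, D) skeleton features for P body parts.
    text_feats:  (C, P, D) text features per class and body part.
    logits:      (B, C)   classification logits from the skeleton encoder.
    labels:      (B,)     ground-truth action labels."""
    # Standard one-hot classification loss.
    cls_loss = F.cross_entropy(logits, labels)

    # Align each body part's skeleton feature with the text feature of
    # its ground-truth class via cosine similarity.
    tgt = F.normalize(text_feats[labels], dim=-1)   # (B, P, D)
    src = F.normalize(part_feats, dim=-1)
    align_loss = 1.0 - (src * tgt).sum(-1).mean()

    return cls_loss + alpha * align_loss
```

Because the text encoder only supplies training-time supervision, the skeleton encoder and classifier run alone at test time, consistent with the abstract's claim of no extra inference cost.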
Anomaly Detection by Adapting a pre-trained Vision Language Model
Recently, large vision-language models have shown success when adapted to
many downstream tasks. In this paper, we present a unified
framework named CLIP-ADA for Anomaly Detection by Adapting a pre-trained CLIP
model. To this end, we make two important improvements: 1) To achieve unified
anomaly detection across industrial images of multiple categories, we introduce
a learnable prompt and propose to associate it with abnormal patterns through
self-supervised learning. 2) To fully exploit the representation power of CLIP,
we introduce an anomaly region refinement strategy to improve the localization
quality. During testing, the anomalies are localized by directly calculating
the similarity between the representation of the learnable prompt and the
image. Comprehensive experiments demonstrate the superiority of our framework,
e.g., we achieve state-of-the-art results of 97.5/55.6 on MVTec-AD and
89.3/33.1 on VisA for anomaly detection and localization, respectively. In
addition, the proposed method also achieves encouraging performance with only
marginal training data, a more challenging setting.
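To make the localization step concrete, here is a minimal sketch in which a learnable prompt vector is scored against CLIP patch features by cosine similarity to yield a per-patch anomaly map. The `PromptLocalizer` class and its interface are illustrative assumptions; the paper's actual prompt design and self-supervised objective may differ.

```python
# Hedged sketch of prompt-similarity anomaly localization (assumes a
# frozen CLIP-like image encoder that returns patch tokens; names are
# illustrative, not the authors' API).
import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptLocalizer(nn.Module):
    def __init__(self, embed_dim=512):
        super().__init__()
        # Learnable "anomaly" prompt, trained (self-supervised) to match
        # abnormal patterns in the shared CLIP embedding space.
        self.prompt = nn.Parameter(torch.randn(embed_dim))

    def forward(self, patch_tokens):
        """patch_tokens: (B, N, D) patch features from the image encoder.
        Returns a (B, N) anomaly score map (cosine similarity per patch)."""
        p = F.normalize(self.prompt, dim=-1)
        t = F.normalize(patch_tokens, dim=-1)
        return t @ p
```

Reshaping the (B, N) scores back onto the patch grid and upsampling would give a pixel-level anomaly map.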
Highly reversible transition metal migration in superstructure-free Li-rich oxide boosting voltage stability and redox symmetry
The further practical application of Li-rich layered oxides is impeded by voltage decay and redox asymmetry, which are closely related to structural degradation involving irreversible transition metal migration. It has been demonstrated that superstructure ordering in O2-type materials can effectively suppress voltage decay and redox asymmetry. Herein, we show that a Ru-based O2-type oxide lacking this superstructure ordering can still support highly reversible transition metal migration. We demonstrate that Ru in a superstructure-free O2-type structure follows a migration path quite different from that of Mn in the most commonly studied cases. The highly reversible migration of Ru helps the cathode maintain its structural robustness, realizing excellent capacity retention with negligible voltage decay and suppressed oxygen redox asymmetry. These results dispel the notion that the absence of superstructure ordering precludes a high-performance Li-rich layered oxide cathode with suppressed voltage decay and redox asymmetry.
SARDet-100K: Towards Open-Source Benchmark and ToolKit for Large-Scale SAR Object Detection
Synthetic Aperture Radar (SAR) object detection has gained significant
attention recently due to its irreplaceable all-weather imaging capabilities.
However, this research field suffers from both limited public datasets (mostly
comprising <2K images with only mono-category objects) and inaccessible source
code. To tackle these challenges, we establish a new benchmark dataset and an
open-source method for large-scale SAR object detection. Our dataset,
SARDet-100K, is the result of intensively surveying, collecting, and
standardizing 10 existing SAR detection datasets, providing a large-scale and diverse resource
for research purposes. To the best of our knowledge, SARDet-100K is the first
COCO-level large-scale multi-class SAR object detection dataset ever created.
With this high-quality dataset, we conducted comprehensive experiments and
uncovered a crucial challenge in SAR object detection: the substantial
disparities between pretraining on RGB datasets and fine-tuning on SAR
datasets in terms of both data domain and model structure. To bridge these
gaps, we propose a novel Multi-Stage with Filter Augmentation (MSFA)
pretraining framework that tackles the problems from the perspective of data
input, domain transition, and model migration. The proposed MSFA method
significantly enhances the performance of SAR object detection models while
demonstrating exceptional generalizability and flexibility across diverse
models. This work aims to pave the way for further advancements in SAR object
detection. The dataset and code are available at
https://github.com/zcablii/SARDet_100K.
Comment: 22 Pages, 10 Figures, 9 Tables
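To make the filter-augmentation idea concrete, the sketch below expands a single-channel SAR image into a three-channel input built from handcrafted filters, one plausible way to narrow the gap to RGB-pretrained backbones. The specific filters (gradient magnitude and a box filter) and the `filter_augment` name are illustrative stand-ins, not the paper's exact filter bank.

```python
# Illustrative filter augmentation: single-channel SAR -> pseudo-RGB.
import numpy as np

def filter_augment(sar, k=5):
    """sar: (H, W) float array in [0, 1]; returns an (H, W, 3) input."""
    # Channel 2: gradient magnitude as an edge-like response.
    gy, gx = np.gradient(sar)
    grad = np.sqrt(gx ** 2 + gy ** 2)
    grad /= grad.max() + 1e-8

    # Channel 3: box-filtered image as a speckle-smoothed response.
    pad = k // 2
    padded = np.pad(sar, pad, mode="reflect")
    h, w = sar.shape
    mean = np.stack([padded[i:i + h, j:j + w]
                     for i in range(k) for j in range(k)]).mean(0)

    return np.stack([sar, grad, mean], axis=-1)
```

A three-channel input lets an RGB-pretrained backbone be fine-tuned without altering its first convolution, which is the kind of data-input bridging the abstract describes.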
Diff-ID: An Explainable Identity Difference Quantification Framework for DeepFake Detection
Although DeepFake forgery detection algorithms have achieved impressive
performance on known manipulations, they often suffer drastic performance
degradation when generalized to unseen manipulations. Some recent
works show improvement in generalization but rely on features fragile to image
distortions such as compression. To this end, we propose Diff-ID, a concise and
effective approach that explains and measures the identity loss induced by
facial manipulations. When testing on an image of a specific person, Diff-ID
utilizes an authentic image of that person as a reference and aligns them to
the same identity-insensitive attribute feature space by applying a
face-swapping generator. We then visualize the identity loss between the test
and the reference image from the image differences of the aligned pairs, and
design a custom metric to quantify the identity loss. This metric proves
effective in distinguishing forged images from real ones.
Extensive experiments show that our approach achieves high detection
performance on DeepFake images and state-of-the-art generalization ability to
unknown forgery methods, while also being robust to image distortions.
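The following conceptual sketch illustrates the quantification step, assuming a pretrained face-swapping generator `swap(src, tgt)` that renders the source identity with the target's attributes; the function name and the mean-absolute-residual metric are illustrative placeholders, not the authors' released implementation.

```python
# Conceptual Diff-ID-style scoring: align a test image and an authentic
# reference into the same attribute space via face swapping, then measure
# the residual as an identity-loss proxy (all names are hypothetical).
import torch

def diff_id_score(test_img: torch.Tensor, ref_img: torch.Tensor, swap) -> torch.Tensor:
    """Returns a scalar score; a larger value suggests a forged identity."""
    # Both outputs share the reference's attributes (pose, lighting, ...),
    # so any residual between them is attributable to identity.
    test_aligned = swap(test_img, ref_img)  # test identity, ref attributes
    ref_aligned = swap(ref_img, ref_img)    # ref identity,  ref attributes

    diff = (test_aligned - ref_aligned).abs()
    # Stand-in metric: mean absolute residual; the paper designs its own
    # metric over these aligned differences.
    return diff.mean()
```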