151 research outputs found
A Novel BiLevel Paradigm for Image-to-Image Translation
Image-to-image (I2I) translation is a pixel-level mapping that requires a
large number of paired training data and often suffers from the problems of
high diversity and strong category bias in image scenes. In order to tackle
these problems, we propose a novel BiLevel (BiL) learning paradigm that
alternates the learning of two models, respectively at an instance-specific
(IS) and a general-purpose (GP) level. In each scene, the IS model learns to
maintain the specific scene attributes. It is initialized by the GP model that
learns from all the scenes to obtain the generalizable translation knowledge.
This GP initialization gives the IS model an efficient starting point, thus
enabling its fast adaptation to the new scene with scarce training data. We
conduct extensive I2I translation experiments on human face and street view
datasets. Quantitative results validate that our approach can significantly
boost the performance of classical I2I translation models, such as PG2 and
Pix2Pix. Our visualization results show both higher image quality and more
appropriate instance-specific details, e.g., the translated image of a person
looks more like that person in terms of identity
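The GP-then-IS initialization described in this abstract can be illustrated with a toy sketch. The linear model, the synthetic scenes, and the helper names `fit_gp` and `adapt_is` are assumptions made for illustration only. This is not the authors' BiL implementation, and it omits the alternation between the two levels.

```python
import numpy as np

def mse(w, X, y):
    """Mean squared error of the linear model X @ w against y."""
    return float(np.mean((X @ w - y) ** 2))

def fit_gp(scenes):
    """General-purpose (GP) weights: least squares over all scenes pooled (assumed stand-in)."""
    X_all = np.vstack([X_s for X_s, _ in scenes])
    y_all = np.concatenate([y_s for _, y_s in scenes])
    w, *_ = np.linalg.lstsq(X_all, y_all, rcond=None)
    return w

def adapt_is(w_gp, X, y, steps=5):
    """Instance-specific (IS) weights: a few gradient steps starting from the GP init."""
    w = w_gp.copy()
    hessian = 2.0 * X.T @ X / len(y)
    # Step size 1/L (L = largest Hessian eigenvalue) cannot increase a quadratic loss.
    lr = 1.0 / np.linalg.eigvalsh(hessian).max()
    for _ in range(steps):
        grad = 2.0 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

# Synthetic "scenes": each has its own weights near a shared vector (hypothetical data).
rng = np.random.default_rng(0)
shared = np.array([1.0, -2.0, 0.5])
scenes = []
for _ in range(4):
    w_scene = shared + 0.3 * rng.normal(size=3)
    X = rng.normal(size=(8, 3))
    scenes.append((X, X @ w_scene))

w_gp = fit_gp(scenes)
X0, y0 = scenes[0]
w_is = adapt_is(w_gp, X0, y0)
gp_loss = mse(w_gp, X0, y0)
is_loss = mse(w_is, X0, y0)
```

In this toy setting, the IS weights start from the pooled solution and need only a handful of updates on eight samples, which loosely mirrors the "fast adaptation with scarce training data" idea.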
Video Anomaly Detection and Explanation via Large Language Models
Video Anomaly Detection (VAD) aims to localize abnormal events on the
timeline of long-range surveillance videos. Anomaly-scoring-based methods have
been prevailing for years but suffer from the high complexity of thresholding
and low explainability of detection results. In this paper, we conduct pioneering
research on integrating video-based large language models (VLLMs) into the
framework of VAD, making the VAD model free from thresholds and able to explain
the reasons for the detected anomalies. We introduce a novel network module,
Long-Term Context (LTC), to mitigate the limitations of VLLMs in long-range
context modeling. We design a three-phase training method to improve the
efficiency of fine-tuning VLLMs by substantially reducing the requirements
for VAD data and lowering the costs of annotating instruction-tuning data. Our
trained model achieves top performance on the anomaly videos of the
UCF-Crime and TAD benchmarks, with AUC improvements of +3.86% and +4.96%,
respectively. More impressively, our approach can provide textual explanations
for detected anomalies. Comment: 9 pages, 6 figure
Non-Visible Light Data Synthesis and Application: A Case Study for Synthetic Aperture Radar Imagery
We explore the "hidden" ability of large-scale pre-trained image generation
models, such as Stable Diffusion and Imagen, in non-visible light domains,
taking Synthetic Aperture Radar (SAR) data for a case study. Due to the
inherent challenges in capturing satellite data, acquiring ample SAR training
samples is infeasible. For instance, for a particular category of ship in the
open sea, we can collect only a few SAR images, which are too limited to
derive effective ship recognition models. If large-scale models pre-trained
with regular images could be adapted to generate novel SAR images, this problem
would be solved. In a preliminary study, we found that fine-tuning these models with
few-shot SAR images does not work, as the models cannot capture the two
primary differences between SAR and regular images: structure and modality. To
address this, we propose a 2-stage low-rank adaptation method, and we call it
2LoRA. In the first stage, the model is adapted using aerial-view regular image
data (whose structure matches SAR), followed by the second stage where the base
model from the first stage is further adapted using SAR modality data.
Particularly in the second stage, we introduce a novel prototype LoRA (pLoRA),
as an improved version of 2LoRA, to resolve the class imbalance problem in SAR
datasets. For evaluation, we employ the resulting generation model to
synthesize additional SAR data. This augmentation, when integrated into the
training process of SAR classification as well as segmentation models, yields
notably improved performance for minor classe
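The two-stage adaptation in this abstract can be sketched with plain low-rank weight updates. The shapes, the random placeholder values, and the choice to merge the stage-1 adapter into the base weight are assumptions for illustration. The sketch does not cover pLoRA and is not the authors' 2LoRA code.

```python
import numpy as np

def lora_delta(A, B, alpha=1.0):
    """Low-rank weight update alpha * B @ A, with rank at most A.shape[0]."""
    return alpha * (B @ A)

rng = np.random.default_rng(0)
d_out, d_in, rank = 16, 12, 2  # hypothetical layer and adapter sizes

W_pretrained = rng.normal(size=(d_out, d_in))  # frozen pre-trained weight

# Stage 1: adapter for aerial-view regular images (random placeholders, not trained).
A1 = rng.normal(size=(rank, d_in))
B1 = rng.normal(size=(d_out, rank))
delta1 = lora_delta(A1, B1)
W_stage1 = W_pretrained + delta1  # assumed: merged, then used as the stage-2 base

# Stage 2: a second adapter on top of the stage-1 base, intended for SAR data.
A2 = rng.normal(size=(rank, d_in))
B2 = np.zeros((d_out, rank))  # usual LoRA init: B = 0, so the update starts at zero
W_stage2 = W_stage1 + lora_delta(A2, B2)
```

Before any stage-2 training, `W_stage2` equals `W_stage1`, so the second stage starts exactly from the structure-adapted model and only its low-rank factors would change.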
Make the U in UDA Matter: Invariant Consistency Learning for Unsupervised Domain Adaptation
Domain Adaptation (DA) is always challenged by the spurious correlation
between domain-invariant features (e.g., class identity) and domain-specific
features (e.g., environment) that does not generalize to the target domain.
Unfortunately, even enriched with additional unsupervised target domains,
existing Unsupervised DA (UDA) methods still suffer from it. This is because
the source domain supervision only considers the target domain samples as
auxiliary data (e.g., by pseudo-labeling), yet the inherent distribution in the
target domain -- where the valuable de-correlation clues hide -- is
disregarded. We propose to make the U in UDA matter by giving equal status to
the two domains. Specifically, we learn an invariant classifier whose
prediction is simultaneously consistent with the labels in the source domain
and clusters in the target domain, hence the spurious correlation inconsistent
in the target domain is removed. We dub our approach "Invariant CONsistency
learning" (ICON). Extensive experiments show that ICON achieves
state-of-the-art performance on the classic UDA benchmarks Office-Home and
VisDA-2017, and outperforms all the conventional methods on the challenging
WILDS 2.0 benchmark. Code is available at https://github.com/yue-zhongqi/ICON. Comment: Accepted by NeurIPS 202
Learning a Disentangled Embedding for Monocular 3D Shape Retrieval and Pose Estimation
We propose a novel approach to jointly perform 3D shape retrieval and pose
estimation from monocular images. In order to make the method robust to
real-world image variations, e.g., complex textures and backgrounds, we learn an
embedding space from 3D data that only includes the relevant information,
namely the shape and pose. Our approach explicitly disentangles a shape vector
and a pose vector, which alleviates both pose bias for 3D shape retrieval and
categorical bias for pose estimation. We then train a CNN to map the images to
this embedding space, and then retrieve the closest 3D shape from the database
and estimate the 6D pose of the object. Our method achieves 10.3 median error
for pose estimation and 0.592 top-1-accuracy for category agnostic 3D object
retrieval on the Pascal3D+ dataset, outperforming the previous state-of-the-art
methods on both tasks
Class-Incremental Exemplar Compression for Class-Incremental Learning
Exemplar-based class-incremental learning (CIL) finetunes the model with all
samples of new classes but few-shot exemplars of old classes in each
incremental phase, where the "few-shot" abides by the limited memory budget. In
this paper, we break this "few-shot" limit based on a simple yet surprisingly
effective idea: compressing exemplars by downsampling non-discriminative pixels
and saving "many-shot" compressed exemplars in the memory. Without needing any
manual annotation, we achieve this compression by generating 0-1 masks on
discriminative pixels from class activation maps (CAM). We propose an adaptive
mask generation model called class-incremental masking (CIM) to explicitly
resolve two difficulties of using CAM: 1) transforming the heatmaps of CAM to
0-1 masks with an arbitrary threshold leads to a trade-off between the coverage
on discriminative pixels and the quantity of exemplars, as the total memory is
fixed; and 2) optimal thresholds vary for different object classes, which is
particularly obvious in the dynamic environment of CIL. We optimize the CIM
model alternately with the conventional CIL model through a bilevel
optimization problem. We conduct extensive experiments on high-resolution CIL
benchmarks including Food-101, ImageNet-100, and ImageNet-1000, and show that
using the compressed exemplars by CIM can achieve a new state-of-the-art CIL
accuracy, e.g., 4.8 percentage points higher than FOSTER on 10-Phase
ImageNet-1000. Our code is available at https://github.com/xfflzl/CIM-CIL. Comment: Accepted to CVPR 202
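The threshold trade-off named in this abstract can be shown with a toy sketch. The synthetic heatmap, the block-average downsampling, and the helper names `cam_to_mask`, `compress_exemplar`, and `stored_values` are all assumptions for illustration. A fixed threshold is used here, whereas CIM learns the mask adaptively, so this is not the authors' method.

```python
import numpy as np

def cam_to_mask(heatmap, threshold):
    """Binarize a CAM heatmap: 1 marks discriminative pixels, 0 the rest."""
    return (heatmap >= threshold).astype(np.uint8)

def compress_exemplar(image, mask, factor=4):
    """Keep masked pixels at full resolution; replace the rest with block averages."""
    h, w = image.shape
    assert h % factor == 0 and w % factor == 0, "toy sketch: sizes must divide evenly"
    blocks = image.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))
    coarse = np.repeat(np.repeat(blocks, factor, axis=0), factor, axis=1)
    return np.where(mask == 1, image, coarse)

def stored_values(mask, factor=4):
    """Rough storage cost: full-resolution masked pixels plus one value per coarse block."""
    return int(mask.sum()) + mask.size // (factor * factor)

rng = np.random.default_rng(0)
image = rng.random((32, 32))  # grayscale stand-in for an exemplar
yy, xx = np.mgrid[0:32, 0:32]
heatmap = np.exp(-((yy - 16) ** 2 + (xx - 16) ** 2) / 50.0)  # synthetic CAM peak

low = cam_to_mask(heatmap, 0.2)   # wide coverage, costlier exemplar
high = cam_to_mask(heatmap, 0.6)  # narrow coverage, cheaper exemplar
compressed = compress_exemplar(image, high)
```

Raising the threshold shrinks the mask, so each exemplar costs less and more of them fit in a fixed memory budget, at the price of covering fewer discriminative pixels.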