Dissecting Arbitrary-scale Super-resolution Capability from Pre-trained Diffusion Generative Models
Diffusion-based Generative Models (DGMs) have achieved unparalleled
performance in synthesizing high-quality visual content, opening up the
opportunity to improve image super-resolution (SR) tasks. Recent solutions for
these tasks often train architecture-specific DGMs from scratch or require
iterative fine-tuning and distillation of pre-trained DGMs, both of which demand
considerable time and hardware investment. More seriously, since these DGMs are
established with a discrete pre-defined upsampling scale, they cannot well
match the emerging requirements of arbitrary-scale super-resolution (ASSR),
where a unified model adapts to arbitrary upsampling scales, instead of
preparing a series of distinct models, one for each scale. These limitations
raise an intriguing question: can we uncover the ASSR capability of existing
pre-trained DGMs without any distillation or fine-tuning? In this
paper, we take a step towards answering this question by proposing Diff-SR, the
first ASSR approach based solely on pre-trained DGMs, without additional
training efforts. It is motivated by an exciting finding that a simple
methodology, which first injects a specific amount of noise into the
low-resolution images before invoking a DGM's backward diffusion process,
outperforms current leading solutions. The key insight lies in determining a
suitable amount of noise to inject: too little leads to poor low-level
fidelity, while too much degrades the high-level signature. Through a
fine-grained theoretical analysis, we propose the Perceptual Recoverable
Field (PRF), a metric that captures the optimal trade-off between these two
factors. Extensive experiments verify the effectiveness, flexibility, and
adaptability of Diff-SR, demonstrating performance superior to state-of-the-art
solutions across diverse ASSR settings.
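As a concrete illustration of the inject-then-denoise idea described above, the
following sketch upsamples a low-resolution image to an arbitrary target size,
perturbs it with noise matching a chosen timestep, and then runs a pre-trained
DDPM's backward process from that timestep down to zero. This is a minimal
sketch under several assumptions: eps_model stands in for any pre-trained noise
predictor with signature eps_model(x, t), the linear beta schedule is the
standard DDPM one, and the fixed t_star is a hypothetical placeholder for the
PRF-based choice of injection level.

import torch
import torch.nn.functional as F

def diffusion_sr(eps_model, lr_image, scale, t_star=300, num_steps=1000):
    # Standard linear DDPM noise schedule (assumed, not taken from the paper).
    betas = torch.linspace(1e-4, 0.02, num_steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    # 1) Upsample the LR image to the (arbitrary) target resolution.
    x = F.interpolate(lr_image, scale_factor=scale, mode="bicubic",
                      align_corners=False)

    # 2) Inject noise corresponding to timestep t_star: too little hurts
    #    low-level fidelity, too much destroys the high-level signature.
    noise = torch.randn_like(x)
    x = alpha_bars[t_star].sqrt() * x + (1 - alpha_bars[t_star]).sqrt() * noise

    # 3) Run the backward diffusion process from t_star down to 0.
    for t in range(t_star, -1, -1):
        t_batch = torch.full((x.size(0),), t, dtype=torch.long)
        eps = eps_model(x, t_batch)
        coef = (1 - alphas[t]) / (1 - alpha_bars[t]).sqrt()
        mean = (x - coef * eps) / alphas[t].sqrt()
        if t > 0:
            x = mean + betas[t].sqrt() * torch.randn_like(x)
        else:
            x = mean
    return x

Choosing t_star embodies exactly the trade-off the abstract describes; in the
paper this choice is guided by the PRF metric rather than fixed by hand.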
Viewpoint-Aware Loss with Angular Regularization for Person Re-Identification
Although supervised person re-identification (Re-ID) has made great progress
recently, viewpoint variation still makes Re-ID a challenging visual task.
Most existing viewpoint-based person Re-ID methods
project images from each viewpoint into separated and unrelated sub-feature
spaces. They only model the identity-level distribution inside an individual
viewpoint but ignore the underlying relationship between different viewpoints.
To address this problem, we propose a novel approach, called
\textit{Viewpoint-Aware Loss with Angular Regularization} (\textbf{VA-reID}).
Instead of one subspace per viewpoint, our method projects features from
different viewpoints onto a unified hypersphere and effectively models the
feature distribution at both the identity level and the viewpoint level. In
addition, rather than modeling different viewpoints as hard labels used for
conventional viewpoint classification, we introduce viewpoint-aware adaptive
label smoothing regularization (VALSR), which assigns adaptive soft labels to
feature representations. VALSR effectively resolves the ambiguity of viewpoint
cluster label assignment. Extensive experiments on the Market1501 and
DukeMTMC-reID datasets demonstrate that our method outperforms
state-of-the-art supervised Re-ID methods.
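To make the soft-label idea above concrete, the sketch below spreads part of a
sample's probability mass from its observed (identity, viewpoint) class over
the same identity's other viewpoint classes, and pairs this with a cosine
classifier on the unit hypersphere. It is an illustrative sketch only: the
class layout of num_ids x num_views logits, the smoothing weight epsilon, the
scale factor, and all names are assumptions rather than the exact VALSR
formulation.

import torch
import torch.nn.functional as F

def viewpoint_soft_labels(id_labels, view_labels, num_ids, num_views,
                          epsilon=0.3):
    # Build soft targets over num_ids * num_views classes (assumed layout).
    targets = torch.zeros(id_labels.size(0), num_ids * num_views)
    for i in range(id_labels.size(0)):
        base = id_labels[i] * num_views
        # Keep most of the mass on the observed (identity, viewpoint) class.
        targets[i, base + view_labels[i]] = 1.0 - epsilon
        # Smooth the remainder over the same identity's other viewpoints.
        for v in range(num_views):
            if v != view_labels[i]:
                targets[i, base + v] = epsilon / (num_views - 1)
    return targets

def va_loss(features, class_weights, id_labels, view_labels,
            num_ids, num_views, scale=30.0):
    # Project features and class weights onto the unit hypersphere.
    logits = scale * F.normalize(features) @ F.normalize(class_weights).t()
    targets = viewpoint_soft_labels(id_labels, view_labels,
                                    num_ids, num_views).to(logits.device)
    # Cross-entropy against the adaptive soft labels.
    return -(targets * F.log_softmax(logits, dim=1)).sum(dim=1).mean()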
Rethinking Temporal Fusion for Video-based Person Re-identification on Semantic and Time Aspect
Recently, the research interest of person re-identification (ReID) has
gradually turned to video-based methods, which acquire a person representation
by aggregating frame features of an entire video. However, existing video-based
ReID methods do not consider the semantic differences among the outputs of
different network stages, which potentially compromises the richness of the
person features. Furthermore, traditional methods ignore important
relationships among frames, which causes information redundancy when fusing
along the time axis. To address these issues, we propose a novel general
temporal fusion framework that aggregates frame features along both the
semantic and time aspects. For the semantic aspect, a multi-stage fusion
network is explored to fuse richer frame features at multiple semantic levels,
which effectively reduces the information loss caused by traditional
single-stage fusion. For the time aspect, the existing intra-frame attention
method is improved by adding a novel inter-frame attention module, which
effectively reduces information redundancy in temporal fusion by taking the
relationships among frames into consideration. Experimental results show that
our approach effectively improves video-based re-identification accuracy,
achieving state-of-the-art performance.
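The sketch below shows one way to realize inter-frame attention for temporal
fusion: frame features attend to one another so that redundant frames receive
lower weights before pooling along the time axis. The use of standard
multi-head attention, the per-frame scoring head, and the dimensions are
assumptions for illustration, not the paper's exact module.

import torch
import torch.nn as nn

class InterFrameFusion(nn.Module):
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.score = nn.Linear(dim, 1)  # per-frame importance after attention

    def forward(self, frame_feats):  # frame_feats: (B, T, D) per-frame features
        # Inter-frame attention: each frame is refined by the other frames.
        refined, _ = self.attn(frame_feats, frame_feats, frame_feats)
        # Softmax-normalized importance weights along the time axis.
        weights = self.score(refined).softmax(dim=1)   # (B, T, 1)
        return (weights * refined).sum(dim=1)          # (B, D) video feature

Example usage: fuse = InterFrameFusion(dim=2048) followed by
video_feat = fuse(torch.randn(8, 6, 2048)) fuses six frame features per clip
into one video-level representation.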
- …