Permutation-invariant Feature Restructuring for Correlation-aware Image Set-based Recognition
We consider the problem of comparing the similarity of image sets with a
variable number of unordered, heterogeneous images of varying quality. We use
feature restructuring to exploit the correlations of both inner-set and
inter-set images. Specifically, residual self-attention effectively
restructures each feature using the other features within the set to emphasize
the discriminative images and eliminate redundancy. A sparse/collaborative
learning-based, dependency-guided representation scheme then reconstructs the
probe features conditioned on the gallery features in order to adaptively
align the two sets. This enables our framework to be compatible with both verification
and open-set identification. We show that the parametric self-attention network
and non-parametric dictionary learning can be trained end-to-end by a unified
alternating optimization scheme, and that the full framework is
permutation-invariant. In the numerical experiments we conducted, our method
achieves top performance on competitive image set/video-based face recognition
and person re-identification benchmarks.
Comment: Accepted to ICCV 201
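As a rough illustration of the restructuring idea above, the PyTorch sketch
below applies residual self-attention over a set of features so that each
feature is rewritten in terms of the others; the dimensions (512-d features,
4 heads) are illustrative assumptions, not the paper's configuration, and the
final check only verifies that the restructuring step is permutation-equivariant.

```python
import torch
import torch.nn as nn

class ResidualSelfAttentionRestructuring(nn.Module):
    """Minimal sketch: restructure each feature in a set via attention over the
    other features in the same set, with a residual connection. Dimensions and
    head count are illustrative assumptions, not the paper's values."""
    def __init__(self, dim=512, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, feats):                      # feats: (batch, set_size, dim)
        attended, _ = self.attn(feats, feats, feats)
        return self.norm(feats + attended)         # residual keeps the original cues

# The restructuring is permutation-equivariant: permuting the input set
# permutes the restructured features in exactly the same way.
model = ResidualSelfAttentionRestructuring().eval()
x, perm = torch.randn(1, 6, 512), torch.randperm(6)
with torch.no_grad():
    assert torch.allclose(model(x)[:, perm], model(x[:, perm]), atol=1e-5)
```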
AUTO3D: Novel view synthesis through unsupervisely learned variational viewpoint and global 3D representation
This paper targets learning-based novel view synthesis from a single or a
limited number of 2D images without pose supervision. In viewer-centered
coordinates, we construct an end-to-end trainable conditional variational
framework that disentangles an unsupervisedly learned relative pose/rotation
from an implicit global 3D representation (shape, texture and the origin of
viewer-centered coordinates, etc.). The global appearance of the 3D object is
given by several appearance-describing images taken from any number of
viewpoints. Our spatial correlation module extracts a global 3D representation
from the appearance-describing images in a permutation-invariant manner. Our
system achieves implicit 3D understanding without explicit 3D reconstruction.
With the unsupervisedly learned viewer-centered relative-pose/rotation code,
the decoder can hallucinate novel views continuously by sampling the relative
pose from a prior distribution. In various
applications, we demonstrate that our model can achieve comparable or even
better results than pose/3D model-supervised learning-based novel view
synthesis (NVS) methods with any number of input views.
Comment: ECCV 202
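As a loose sketch of extracting a global representation from the
appearance-describing images in a permutation-invariant way (this is not
AUTO3D's actual spatial correlation module), the snippet below encodes each
view independently and fuses the per-view codes with a symmetric mean pooling,
so the global code does not depend on the order or number of views; the tiny
encoder and the code size are assumptions.

```python
import torch
import torch.nn as nn

class GlobalRepresentationPooling(nn.Module):
    """Minimal sketch: per-view encoding followed by symmetric (mean) pooling,
    so the fused global code is invariant to view order and count. The encoder
    and code size are illustrative assumptions."""
    def __init__(self, code_dim=256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, code_dim),
        )

    def forward(self, views):               # views: (num_views, 3, H, W)
        codes = self.encoder(views)         # per-view codes: (num_views, code_dim)
        return codes.mean(dim=0)            # symmetric pooling -> permutation invariant

views = torch.randn(5, 3, 64, 64)           # any number of viewpoints works
global_code = GlobalRepresentationPooling()(views)
print(global_code.shape)                     # torch.Size([256])
```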
Tree Structure-Aware Few-Shot Image Classification via Hierarchical Aggregation
In this paper, we focus on how to learn additional feature representations
for few-shot image classification through pretext tasks (e.g., rotation or
color permutation). This additional knowledge
generated by pretext tasks can further improve the performance of few-shot
learning (FSL) as it differs from human-annotated supervision (i.e., class
labels of FSL tasks). To solve this problem, we present a plug-in Hierarchical
Tree Structure-aware (HTS) method, which not only learns the relationship between
FSL and pretext tasks, but more importantly, can adaptively select and
aggregate feature representations generated by pretext tasks to maximize the
performance of FSL tasks. A hierarchical tree constructing component and a
gated selection aggregating component are introduced to construct the tree
structure and find richer transferable knowledge that can rapidly adapt to
novel classes with a few labeled images. Extensive experiments show that our
HTS can significantly enhance multiple few-shot methods to achieve new
state-of-the-art performance on four benchmark datasets. The code is available
at https://github.com/remiMZ/HTS-ECCV22.
Comment: 22 pages, 9 figures, and 4 tables. Accepted by ECCV 2022
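The gated selection-and-aggregation idea can be illustrated by scoring each
pretext-task feature against the FSL (class-supervised) feature and fusing
them with a softmax-gated weighted sum; this is a minimal sketch under assumed
dimensions and a single-layer gate, not HTS's hierarchical tree components.

```python
import torch
import torch.nn as nn

class GatedSelectAggregate(nn.Module):
    """Minimal sketch: score each pretext-task feature against the FSL feature
    and fuse with a softmax-gated weighted sum. The dimensions and the
    one-layer gate are illustrative assumptions."""
    def __init__(self, dim=128):
        super().__init__()
        self.gate = nn.Linear(2 * dim, 1)

    def forward(self, fsl_feat, pretext_feats):
        # fsl_feat: (dim,); pretext_feats: (num_pretext, dim), one row per
        # pretext branch such as rotation or color permutation.
        pairs = torch.cat([fsl_feat.expand_as(pretext_feats), pretext_feats], dim=-1)
        weights = torch.softmax(self.gate(pairs).squeeze(-1), dim=0)  # adaptive selection
        return fsl_feat + (weights.unsqueeze(-1) * pretext_feats).sum(dim=0)

fsl = torch.randn(128)              # feature from the FSL (class-label) branch
pretext = torch.randn(3, 128)       # features from three pretext-task branches
fused = GatedSelectAggregate()(fsl, pretext)
print(fused.shape)                  # torch.Size([128])
```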
Cluster and Aggregate: Face Recognition with Large Probe Set
Feature fusion plays a crucial role in unconstrained face recognition where
inputs (probes) comprise a set of low-quality images whose individual
qualities vary. Advances in attention and recurrent modules have led to feature
fusion that can model the relationship among the images in the input set.
However, attention mechanisms cannot scale to large set sizes due to their
quadratic complexity, and recurrent modules suffer from input-order
sensitivity. We propose a two-stage feature fusion paradigm, Cluster and
Aggregate, that can both scale to large set sizes and maintain the ability to
perform sequential inference with order invariance. Specifically, the Cluster
stage is a linear assignment of the inputs to global cluster centers, and the
Aggregation stage is a fusion over the clustered features. The clustered
features play an
integral role when the inputs are sequential as they can serve as a
summarization of past features. By leveraging the order-invariance of
the incremental averaging operation, we design an update rule that achieves
batch-order invariance, which guarantees that the contributions of early
images in the sequence do not diminish as time steps increase. Experiments on IJB-B
and IJB-S benchmark datasets show the superiority of the proposed two-stage
paradigm in unconstrained face recognition. Code and pretrained models are
available at https://github.com/mk-minchul/caface
Comment: To appear in NeurIPS 202
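The batch-order-invariant update described above can be illustrated with a
running weighted average that carries a count, so that processing the probe
sequence in any order or batching yields the same fused feature and early
images never lose weight; this is a minimal sketch of the incremental-averaging
idea, not the paper's exact update rule.

```python
import torch

def order_invariant_update(state, new_feats):
    """Minimal sketch of a batch-order-invariant sequential update via
    incremental averaging: a running mean plus a count, so early images keep
    their full contribution as more batches arrive."""
    mean, count = state
    n = new_feats.shape[0]
    new_mean = (mean * count + new_feats.sum(dim=0)) / (count + n)
    return new_mean, count + n

feats = torch.randn(10, 512)                       # 10 per-image features in a probe

state_a = (torch.zeros(512), 0)
for batch in feats.split(4):                       # sequential inference, original order
    state_a = order_invariant_update(state_a, batch)

state_b = (torch.zeros(512), 0)
for batch in feats[torch.randperm(10)].split(3):   # different order and batch boundaries
    state_b = order_invariant_update(state_b, batch)

# Same fused feature regardless of arrival order or batching.
assert torch.allclose(state_a[0], state_b[0], atol=1e-5)
```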
CoNAN: Conditional Neural Aggregation Network For Unconstrained Face Feature Fusion
Face recognition from image sets acquired under unregulated and uncontrolled
settings, such as at large distances, low resolutions, varying viewpoints,
illumination, pose, and atmospheric conditions, is challenging. Face feature
aggregation, which involves aggregating a set of N feature representations
present in a template into a single global representation, plays a pivotal role
in such recognition systems. Existing works in traditional face feature
aggregation either utilize metadata or high-dimensional intermediate feature
representations to estimate feature quality for aggregation. However,
generating high-quality metadata or style information is not feasible for
extremely low-resolution faces captured in long-range and high altitude
settings. To overcome these limitations, we propose a feature distribution
conditioning approach called CoNAN for template aggregation. Specifically, our
method aims to learn a context vector conditioned on the distribution
information of the incoming feature set, which is utilized to weight the
features based on their estimated informativeness. The proposed method produces
state-of-the-art results on long-range unconstrained face recognition datasets
such as BTS and DroneSURF, validating the advantages of such an aggregation
strategy.
Comment: Paper accepted at IJCB 202
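As a rough sketch of distribution-conditioned aggregation (not CoNAN's actual
network), the snippet below summarizes the incoming feature set by its mean
and standard deviation, maps that summary to a context vector, and uses the
context to score and weight each feature before averaging; all layer sizes are
illustrative assumptions.

```python
import torch
import torch.nn as nn

class DistributionConditionedFusion(nn.Module):
    """Minimal sketch: derive a context vector from set-level statistics and
    use it to weight each feature by estimated informativeness. Layer sizes
    are illustrative assumptions, not CoNAN's actual configuration."""
    def __init__(self, dim=512, ctx_dim=128):
        super().__init__()
        self.context = nn.Sequential(nn.Linear(2 * dim, ctx_dim), nn.ReLU())
        self.proj = nn.Linear(dim, ctx_dim)

    def forward(self, feats):                               # feats: (N, dim), one template
        stats = torch.cat([feats.mean(dim=0), feats.std(dim=0)])  # distribution summary
        ctx = self.context(stats)                            # context vector
        logits = self.proj(feats) @ ctx                      # per-feature informativeness
        weights = torch.softmax(logits, dim=0)
        return (weights.unsqueeze(-1) * feats).sum(dim=0)    # single global representation

template = torch.randn(20, 512)     # N = 20 feature vectors in one template
fused = DistributionConditionedFusion()(template)
print(fused.shape)                  # torch.Size([512])
```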