SCA-PVNet: Self-and-Cross Attention Based Aggregation of Point Cloud and Multi-View for 3D Object Retrieval
To address 3D object retrieval, substantial efforts have been made to
generate highly discriminative descriptors of 3D objects represented by a
single modality, e.g., voxels, point clouds or multi-view images. It is
promising to leverage the complementary information from multi-modality
representations of 3D objects to further improve retrieval performance.
However, multi-modality 3D object retrieval is rarely developed and analyzed on
large-scale datasets. In this paper, we propose self-and-cross attention based
aggregation of point cloud and multi-view images (SCA-PVNet) for 3D object
retrieval. With deep features extracted from point clouds and multi-view
images, we design two types of feature aggregation modules, namely the
In-Modality Aggregation Module (IMAM) and the Cross-Modality Aggregation Module
(CMAM), for effective feature fusion. IMAM leverages a self-attention mechanism
to aggregate multi-view features while CMAM exploits a cross-attention
mechanism to interact point cloud features with multi-view features. The final
descriptor of a 3D object for object retrieval can be obtained via
concatenating the aggregated features from both modules. Extensive experiments
and analysis are conducted on three datasets, ranging from small to large
scale, to show the superiority of the proposed SCA-PVNet over the
state-of-the-art methods.
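The aggregation described in this abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the learned projections are omitted, the mean pooling after self-attention is an assumption, and the names `attention`, `aggregate`, `view_feats`, and `point_feat` are hypothetical. It only shows self-attention over multi-view features, cross-attention from a point cloud feature to the view features, and concatenation of the two results.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(queries, keys, values):
    # Scaled dot-product attention without learned projections (a simplification).
    scale = np.sqrt(queries.shape[-1])
    weights = softmax(queries @ keys.T / scale, axis=-1)
    return weights @ values

def aggregate(view_feats, point_feat):
    # In-modality step: self-attention across views, then mean pooling
    # (the pooling choice is an assumption, not taken from the abstract).
    in_modality = attention(view_feats, view_feats, view_feats).mean(axis=0)
    # Cross-modality step: the point cloud feature queries the view features.
    cross_modality = attention(point_feat[None, :], view_feats, view_feats)[0]
    # Final descriptor: concatenation of both aggregated features.
    return np.concatenate([in_modality, cross_modality])

rng = np.random.default_rng(0)
view_feats = rng.normal(size=(12, 64))   # 12 hypothetical views, 64-d features
point_feat = rng.normal(size=64)         # one hypothetical point cloud feature
descriptor = aggregate(view_feats, point_feat)
```

With 64-dimensional inputs, the concatenated descriptor in this sketch has 128 dimensions.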
LATFormer: Locality-Aware Point-View Fusion Transformer for 3D Shape Recognition
Recently, 3D shape understanding has achieved significant progress due to the
advances of deep learning models on various data formats like images, voxels,
and point clouds. Among them, point clouds and multi-view images are two
complementary modalities of 3D objects and learning representations by fusing
both of them has been proven to be fairly effective. While prior works
typically focus on exploiting global features of the two modalities, herein we
argue that more discriminative features can be derived by modeling "where to
fuse". To investigate this, we propose a novel Locality-Aware Point-View
Fusion Transformer (LATFormer) for 3D shape retrieval and classification. The
core component of LATFormer is a module named Locality-Aware Fusion (LAF) which
integrates the local features of correlated regions across the two modalities
based on the co-occurrence scores. We further propose to filter out scores with
low values to obtain salient local co-occurring regions, which reduces
redundancy for the fusion process. In our LATFormer, we utilize the LAF module
to fuse the multi-scale features of the two modalities both bidirectionally and
hierarchically to obtain more informative features. Comprehensive experiments
on four popular 3D shape benchmarks covering 3D object retrieval and
classification validate its effectiveness.
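The idea of fusing local features by co-occurrence score and discarding low scores can be sketched as follows. This is an illustration under stated assumptions, not the LAF module itself: cosine similarity stands in for the co-occurrence score, a per-row top-k rule stands in for the filtering step, and `locality_aware_fuse`, `point_local`, `view_local`, and `keep` are hypothetical names.

```python
import numpy as np

def locality_aware_fuse(point_local, view_local, keep=4):
    # Assumed co-occurrence score: cosine similarity between local features.
    p = point_local / np.linalg.norm(point_local, axis=1, keepdims=True)
    v = view_local / np.linalg.norm(view_local, axis=1, keepdims=True)
    scores = p @ v.T                                  # shape (Np, Nv)
    # Filter out low scores: keep only the `keep` largest scores per point region.
    kth = np.sort(scores, axis=1)[:, -keep][:, None]
    masked = np.where(scores >= kth, scores, -np.inf)
    # Softmax over the surviving scores; filtered entries get zero weight.
    masked = masked - masked.max(axis=1, keepdims=True)
    weights = np.exp(masked)
    weights = weights / weights.sum(axis=1, keepdims=True)
    # Fuse: add the weighted view features to each point region (residual form assumed).
    return point_local + weights @ view_local, weights

rng = np.random.default_rng(1)
point_local = rng.normal(size=(16, 32))   # 16 hypothetical point regions
view_local = rng.normal(size=(20, 32))    # 20 hypothetical image regions
fused, weights = locality_aware_fuse(point_local, view_local, keep=4)
```

The sketch fuses in one direction only; the abstract describes applying the fusion bidirectionally and at multiple scales.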
Reducing Training Demands for 3D Gait Recognition with Deep Koopman Operator Constraints
Deep learning research has made many biometric recognition solutions viable,
but it requires vast training data to achieve real-world generalization. Unlike
other biometric traits, such as face and ear, gait samples cannot be easily
crawled from the web to form massive unconstrained datasets. As the human body
has been extensively studied for different digital applications, one can rely
on prior shape knowledge to overcome data scarcity. This work follows the
recent trend of fitting a 3D deformable body model into gait videos using deep
neural networks to obtain disentangled shape and pose representations for each
frame. To enforce temporal consistency in the network, we introduce a new
Linear Dynamical Systems (LDS) module and loss based on Koopman operator
theory, which provides an unsupervised motion regularization for the periodic
nature of gait, as well as a predictive capacity for extending gait sequences.
We compare LDS to the traditional adversarial training approach and use the USF
HumanID and CASIA-B datasets to show that LDS can obtain better accuracy with
less training data. Finally, we also show that our 3D modeling approach is much
better than other 3D gait approaches in overcoming viewpoint variation under
normal, bag-carrying and clothing change conditions.
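The role of a linear dynamical model for periodic motion can be illustrated with a small NumPy sketch. It is not the paper's LDS module or loss: here a single linear operator is fitted by least squares to a toy periodic sequence, and `fit_linear_operator`, `consistency_loss`, and `extend` are hypothetical names. The sketch shows the two properties the abstract mentions, a temporal-consistency penalty and the ability to extend a sequence.

```python
import numpy as np

def fit_linear_operator(latents):
    # Least-squares fit of K such that latents[t] @ K approximates latents[t + 1].
    past, future = latents[:-1], latents[1:]
    K, *_ = np.linalg.lstsq(past, future, rcond=None)
    return K

def consistency_loss(latents, K):
    # Mean squared one-step prediction error under the linear model.
    return float(np.mean((latents[:-1] @ K - latents[1:]) ** 2))

def extend(latents, K, steps):
    # Roll the linear model forward from the last observed state.
    state, out = latents[-1], []
    for _ in range(steps):
        state = state @ K
        out.append(state)
    return np.stack(out)

# Toy periodic "pose" sequence: a point moving on a circle with period 20 frames.
theta = 2 * np.pi / 20
t = np.arange(40)
latents = np.stack([np.cos(theta * t), np.sin(theta * t)], axis=1)

K = fit_linear_operator(latents)
loss = consistency_loss(latents, K)
future = extend(latents, K, steps=20)   # one full period beyond the data
```

For this toy sequence the fitted operator is a rotation, so rolling it forward by one full period returns to the last observed state.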
Looking Beyond Appearances: Synthetic Training Data for Deep CNNs in Re-identification
Re-identification is generally carried out by encoding the appearance of a
subject in terms of outfit, suggesting scenarios where people do not change
their attire. In this paper we overcome this restriction, by proposing a
framework based on a deep convolutional neural network, SOMAnet, that
additionally models other discriminative aspects, namely, structural attributes
of the human figure (e.g. height, obesity, gender). Our method is unique in
many respects. First, SOMAnet is based on the Inception architecture, departing
from the usual siamese framework. This spares expensive data preparation
(pairing images across cameras) and allows the understanding of what the
network learned. Second, and most notably, the training data consists of a
synthetic 100K instance dataset, SOMAset, created by photorealistic human body
generation software. Synthetic data represents a good compromise between
realistic imagery, usually not required in re-identification since surveillance
cameras capture low-resolution silhouettes, and complete control of the
samples, which is useful in order to customize the data w.r.t. the surveillance
scenario at hand, e.g. ethnicity. SOMAnet, trained on SOMAset and fine-tuned on
recent re-identification benchmarks, outperforms all competitors, matching
subjects even with different apparel. The combination of synthetic data with
Inception architectures opens up new research avenues in re-identification.
DiffVein: A Unified Diffusion Network for Finger Vein Segmentation and Authentication
Finger vein authentication, recognized for its high security and specificity,
has become a focal point in biometric research. Traditional methods
predominantly concentrate on vein feature extraction for discriminative
modeling, with a limited exploration of generative approaches. Suffering from
verification failure, existing methods often fail to obtain authentic vein
patterns by segmentation. To fill this gap, we introduce DiffVein, a unified
diffusion model-based framework which simultaneously addresses vein
segmentation and authentication tasks. DiffVein is composed of two dedicated
branches: one for segmentation and the other for denoising. For better feature
interaction between these two branches, we introduce two specialized modules to
improve their collective performance. The first, a mask condition module,
incorporates the semantic information of vein patterns from the segmentation
branch into the denoising process. We also propose a Semantic
Difference Transformer (SD-Former), which employs Fourier-space self-attention
and cross-attention modules to extract category embedding before feeding it to
the segmentation task. In this way, our framework allows for a dynamic
interplay between diffusion and segmentation embeddings, thus vein segmentation
and authentication tasks can inform and enhance each other in the joint
training. To further optimize our model, we introduce a Fourier-space
Structural Similarity (FourierSIM) loss function, which is tailored to improve
the denoising network's learning efficacy. Extensive experiments on the USM and
THU-MVFV3V datasets substantiate DiffVein's superior performance, setting new
benchmarks in both vein segmentation and authentication tasks.