Semi-supervised Deep Multi-view Stereo
Significant progress has been witnessed in learning-based Multi-view Stereo
(MVS) under both supervised and unsupervised settings. To combine their
respective merits in accuracy and completeness while reducing the demand for
expensive labeled data, this paper explores learning-based MVS in a
semi-supervised setting, where only a small fraction of the MVS data carries
dense depth ground truth. However, the large variation of scenarios and the
flexible view settings in MVS may break the basic assumption of classic
semi-supervised learning, namely that unlabeled and labeled data share the
same label space and data distribution; we call this the semi-supervised
distribution-gap ambiguity of the MVS problem. To handle these issues, we propose a novel
semi-supervised distribution-augmented MVS framework, namely SDA-MVS. For the
simple case where the basic assumption holds for the MVS data, consistency
regularization encourages the model predictions to be consistent between the
original sample and a randomly augmented sample. For the more troublesome case
where the basic assumption is violated, we propose a novel style consistency
loss to alleviate the negative effect of the distribution gap: the visual
style of an unlabeled sample is transferred to a labeled sample to shrink the
gap, and the model prediction for the generated sample is supervised with the
label of the original labeled sample. The experimental results
in semi-supervised settings on multiple MVS datasets show the superior
performance of the proposed method. With the same backbone network settings,
our proposed SDA-MVS outperforms its fully-supervised and unsupervised
baselines.
Comment: This paper is accepted in ACMMM-2023. The code is released at:
https://github.com/ToughStoneX/Semi-MV
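
As a rough illustration of the style-consistency loss described above, the
following PyTorch sketch re-styles a labeled image toward an unlabeled one and
keeps the original depth label as supervision. The AdaIN-style statistic swap,
the smooth-L1 loss, and the `model` interface are assumptions made for
illustration, not the authors' implementation.

    import torch
    import torch.nn.functional as F

    def swap_style(labeled, unlabeled, eps=1e-5):
        # Transfer per-channel mean/std of the unlabeled image onto the labeled
        # one (an AdaIN-style stand-in for the paper's style transfer).
        mu_l = labeled.mean(dim=(2, 3), keepdim=True)
        std_l = labeled.std(dim=(2, 3), keepdim=True) + eps
        mu_u = unlabeled.mean(dim=(2, 3), keepdim=True)
        std_u = unlabeled.std(dim=(2, 3), keepdim=True) + eps
        return (labeled - mu_l) / std_l * std_u + mu_u

    def style_consistency_loss(model, labeled_img, unlabeled_img, gt_depth):
        # Re-style the labeled sample toward the unlabeled domain, then keep
        # supervising the prediction with the *original* depth label.
        stylized = swap_style(labeled_img, unlabeled_img)
        pred = model(stylized)
        return F.smooth_l1_loss(pred, gt_depth)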
TransFace: Calibrating Transformer Training for Face Recognition from a Data-Centric Perspective
Vision Transformers (ViTs) have demonstrated powerful representation ability
in various visual tasks, owing in part to their intrinsic data-hungry nature.
However, we unexpectedly find that ViTs perform poorly when applied to face
recognition (FR) scenarios with extremely large datasets. We investigate the
reasons for this phenomenon and discover that existing data augmentation
approaches and hard sample mining strategies are incompatible with ViT-based
FR backbones, as they lack tailored consideration for preserving facial
structural information and for leveraging the information in each local token.
To remedy
these problems, this paper proposes a superior FR model called TransFace, which
employs a patch-level data augmentation strategy named DPAP and a hard sample
mining strategy named EHSM. Specifically, DPAP randomly perturbs the amplitude
information of dominant patches to expand sample diversity, which effectively
alleviates the overfitting problem in ViTs. EHSM utilizes the information
entropy in the local tokens to dynamically adjust the importance weight of easy
and hard samples during training, leading to a more stable prediction.
Experiments on several benchmarks demonstrate the superiority of our TransFace.
Code and models are available at https://github.com/DanJun6737/TransFace.
Comment: Accepted by ICCV 202
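
A minimal sketch of the two ingredients named above, under the assumption that
DPAP operates via Fourier amplitude mixing and EHSM via token entropy; the
mixing rule, the reference-patch choice, and the weighting scheme are
illustrative guesses, not the TransFace code.

    import torch

    def perturb_patch_amplitude(patch, ref_patch, alpha=0.2):
        # Mix the Fourier amplitude of a dominant patch with that of a
        # reference patch while keeping the phase, so structure is preserved.
        f = torch.fft.fft2(patch)
        f_ref = torch.fft.fft2(ref_patch)
        amp = (1 - alpha) * f.abs() + alpha * f_ref.abs()  # perturbed amplitude
        phase = torch.angle(f)                             # structure lives in the phase
        return torch.fft.ifft2(amp * torch.exp(1j * phase)).real

    def entropy_weight(token_logits):
        # EHSM-style cue: mean information entropy of the local tokens as a
        # proxy for sample difficulty (the exact mapping is an assumption).
        p = torch.softmax(token_logits, dim=-1)
        ent = -(p * p.clamp_min(1e-8).log()).sum(dim=-1)   # entropy per token
        return ent.mean(dim=1)                             # per-sample weight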
Align Yourself: Self-supervised Pre-training for Fine-grained Recognition via Saliency Alignment
Self-supervised contrastive learning has demonstrated great potential in
learning visual representations. Despite its success on various downstream
tasks such as image classification and object detection, self-supervised
pre-training for fine-grained scenarios has not been fully explored. In this
paper, we first point out that current contrastive methods are prone to
memorizing background/foreground texture and therefore struggle to localize
the foreground object. Our analysis suggests that learning to extract
discriminative texture information and learning to localize are equally
crucial for self-supervised
pre-training in fine-grained scenarios. Based on our findings, we introduce
cross-view saliency alignment (CVSA), a contrastive learning framework that
first crops and swaps saliency regions across images as a novel form of view
generation, and then guides the model to focus on the foreground object via a
cross-view alignment loss. Extensive experiments on four popular fine-grained
classification benchmarks show that CVSA significantly improves the learned
representation.
Comment: The second version of CVSA. 10 pages, 4 figure
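
To make the crop-and-swap view generation concrete, here is a toy PyTorch
sketch that swaps fixed-size crops centered on each image's saliency peak.
The precomputed saliency maps, the fixed crop size, and the peak-based
placement are assumptions; CVSA's actual cropping strategy may differ.

    import torch

    def peak_crop(sal, size):
        # Top-left corner of a size x size crop centred on the saliency peak,
        # clamped so the crop stays inside the image.
        h, w = sal.shape[-2:]
        idx = torch.argmax(sal.reshape(-1))
        cy, cx = (idx // w).item(), (idx % w).item()
        y0 = max(0, min(cy - size // 2, h - size))
        x0 = max(0, min(cx - size // 2, w - size))
        return y0, x0

    def swap_salient_crops(img_a, img_b, sal_a, sal_b, size=64):
        # Cut the most salient crop out of each image and swap them, producing
        # two novel views whose foregrounds appear in a new context.
        ya, xa = peak_crop(sal_a, size)
        yb, xb = peak_crop(sal_b, size)
        out_a, out_b = img_a.clone(), img_b.clone()
        out_a[..., ya:ya + size, xa:xa + size] = img_b[..., yb:yb + size, xb:xb + size]
        out_b[..., yb:yb + size, xb:xb + size] = img_a[..., ya:ya + size, xa:xa + size]
        return out_a, out_b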
Architecture-Agnostic Masked Image Modeling -- From ViT back to CNN
Masked image modeling (MIM), an emerging self-supervised pre-training method,
has shown impressive success across numerous downstream vision tasks with
Vision Transformers (ViTs). Its underlying idea is simple: a portion of the
input image is randomly masked out and then reconstructed via a pretext task.
However, the working principle behind MIM is not well explained, and previous
studies claim that MIM primarily works for the Transformer family but is
incompatible with CNNs. In this paper, we first study the interactions among
patches to understand what knowledge is learned and how it is acquired via the
MIM task. We observe that MIM essentially teaches the model to learn better
middle-order interactions among patches and extract more generalized features.
Based on this fact, we propose an Architecture-Agnostic Masked Image Modeling
framework (AMIM), which is compatible with both Transformers and CNNs in a
unified way. Extensive experiments on popular benchmarks show that our AMIM
learns better representations without explicit design and endows the backbone
model with a stronger capability to transfer to various downstream tasks for
both Transformers and CNNs.
Comment: Preprint under review (update reversion). The source code will be
released at https://github.com/Westlake-AI/openmixu
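
The architecture-agnostic flavor of the pretext task can be sketched by
masking in pixel space, so any encoder-decoder (CNN or ViT) fits. The
mean-value fill, the mask ratio, and the L1 reconstruction loss below are
illustrative assumptions, not AMIM's exact design.

    import torch
    import torch.nn.functional as F

    def random_patch_mask(img, patch=32, ratio=0.6):
        # Mask a random fraction of non-overlapping patches directly in pixel
        # space, so the same pretext task feeds a ViT or a CNN encoder.
        b, c, h, w = img.shape
        gh, gw = h // patch, w // patch
        visible = (torch.rand(b, 1, gh, gw, device=img.device) > ratio).float()
        mask = visible.repeat_interleave(patch, dim=2).repeat_interleave(patch, dim=3)
        fill = img.mean(dim=(2, 3), keepdim=True)      # per-channel mean fill
        return img * mask + fill * (1 - mask), mask

    def mim_loss(model, img):
        # Reconstruct the image and score the loss only on the masked regions.
        masked, mask = random_patch_mask(img)
        recon = model(masked)
        return (F.l1_loss(recon, img, reduction="none") * (1 - mask)).mean()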
Understanding and Addressing the Pitfalls of Bisimulation-based Representations in Offline Reinforcement Learning
While bisimulation-based approaches hold promise for learning robust state
representations for Reinforcement Learning (RL) tasks, their efficacy in
offline RL tasks has not been up to par. In some instances, they have even
significantly underperformed alternative methods. We aim to understand why
bisimulation methods succeed in online settings but falter in offline
tasks. Our analysis reveals that missing transitions in the dataset are
particularly harmful to the bisimulation principle, leading to ineffective
estimation. We also shed light on the critical role of reward scaling in
bounding the scale of bisimulation measurements and of the value error they
induce. Based on these findings, we propose to apply the expectile operator for
representation learning to our offline RL setting, which helps to prevent
overfitting to incomplete data. Meanwhile, by introducing an appropriate reward
scaling strategy, we avoid the risk of feature collapse in representation
space. We implement these recommendations on two state-of-the-art
bisimulation-based algorithms, MICo and SimSR, and demonstrate performance
gains on two benchmark suites: D4RL and Visual D4RL. Codes are provided at
\url{https://github.com/zanghyu/Offline_Bisimulation}.
Comment: NeurIPS 202
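
For reference, the expectile operator mentioned above is typically realized as
an asymmetric squared loss; the sketch below follows that standard form, while
how the residual `diff` is formed from bisimulation distances is left abstract
here and is an assumption, not the paper's exact objective.

    import torch

    def expectile_loss(diff, tau=0.7):
        # Asymmetric squared loss |tau - 1(diff < 0)| * diff^2: with tau > 0.5
        # one side of the residual is penalized more, tempering targets that
        # depend on transitions missing from the offline dataset.
        weight = torch.abs(tau - (diff < 0).float())
        return (weight * diff.pow(2)).mean()

    # Illustrative usage: diff could be the gap between an estimated
    # bisimulation distance and its bootstrapped target.
    diff = torch.randn(256)
    loss = expectile_loss(diff, tau=0.7)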
EVNet: An Explainable Deep Network for Dimension Reduction
Dimension reduction (DR) is commonly utilized to capture the intrinsic
structure and transform high-dimensional data into low-dimensional space while
retaining meaningful properties of the original data. It is used in various
applications, such as image recognition, single-cell sequencing analysis, and
biomarker discovery. However, contemporary non-parametric and parametric DR
techniques suffer from several significant shortcomings, such as the inability
to preserve both global and local features and poor generalization
performance. Regarding explainability, it is crucial to comprehend the
embedding process, in particular the contribution of each component and each
input feature to the embedding results, so that critical components can be
identified and the embedding process can be diagnosed. To address these
problems, we have developed a deep neural network method called EVNet, which
provides not only excellent structure preservation but also explainability of
the DR process. EVNet starts with
data augmentation and a manifold-based loss function to improve embedding
performance. The explanations are based on saliency maps and examine the
trained EVNet parameters and the contributions of its components during the
embedding process. The proposed techniques are integrated with a visual
interface that helps users adjust EVNet to achieve better DR performance and
explainability.
The interactive visual interface makes it easier to illustrate the data
features, compare different DR techniques, and investigate DR. An in-depth
experimental comparison shows that EVNet consistently outperforms
state-of-the-art methods in both performance measures and explainability.
Comment: 18 pages, 15 figures, accepted by TVC
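
As a hedged sketch of what a manifold-based DR loss can look like (EVNet's
actual objective and augmentation pipeline are not specified in the abstract),
the following PyTorch snippet matches neighborhood affinities between the
input space and the embedding space; the Gaussian kernel and bandwidth are
assumptions.

    import torch
    import torch.nn.functional as F

    def row_affinities(x, sigma=1.0):
        # Row-normalized Gaussian affinities over pairwise squared distances,
        # with self-similarity excluded.
        d2 = torch.cdist(x, x).pow(2)
        d2.fill_diagonal_(float("inf"))
        return torch.softmax(-d2 / (2 * sigma ** 2), dim=1)

    def manifold_loss(encoder, batch):
        # Match neighborhood structure between input and embedding spaces.
        p = row_affinities(batch.flatten(1))   # high-dimensional affinities
        q = row_affinities(encoder(batch))     # low-dimensional affinities
        return F.kl_div(q.clamp_min(1e-8).log(), p, reduction="batchmean")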
The 3rd Anti-UAV Workshop & Challenge: Methods and Results
The 3rd Anti-UAV Workshop & Challenge aims to encourage research in
developing novel and accurate methods for multi-scale object tracking. The
Anti-UAV dataset used for the Anti-UAV Challenge has been publicly released.
There are two main differences between this year's competition and the previous
two. First, we have expanded the existing dataset, and for the first time,
released a training set so that participants can focus on improving their
models. Second, we set up two tracks for the first time, i.e., Anti-UAV
Tracking and Anti-UAV Detection & Tracking. Around 76 participating teams
from around the globe competed in the 3rd Anti-UAV Challenge. In this paper,
we provide a
brief summary of the 3rd Anti-UAV Workshop & Challenge including brief
introductions to the top three methods in each track. The submission
leaderboard will be reopened for researchers who are interested in the
Anti-UAV challenge. The benchmark dataset and other information can be found
at: https://anti-uav.github.io/.
Comment: Technical report for the 3rd Anti-UAV Workshop and Challenge. arXiv
admin note: text overlap with arXiv:2108.0990