27 research outputs found
Two-in-One Depth: Bridging the Gap Between Monocular and Binocular Self-supervised Depth Estimation
Monocular and binocular self-supervised depth estimations are two important
and related tasks in computer vision, which aim to predict scene depths from
single images and stereo image pairs, respectively. In the literature, the two
tasks are usually tackled separately by two different kinds of models: binocular
models generally fail to predict depth from single images, while the prediction
accuracy of monocular models is generally inferior to that of binocular models. In this
paper, we propose a Two-in-One self-supervised depth estimation network, called
TiO-Depth, which not only handles both tasks in a compatible manner but also
improves the prediction accuracy. TiO-Depth employs a Siamese architecture, and
each of its sub-networks can be used as a monocular depth estimation model. For
binocular depth estimation, a Monocular Feature Matching module is proposed for
incorporating the stereo knowledge between the two images, and the full
TiO-Depth is used to predict depths. We also design a multi-stage
joint-training strategy to improve the performance of TiO-Depth on both
tasks by combining their relative advantages. Experimental results on the
KITTI, Cityscapes, and DDAD datasets demonstrate that TiO-Depth outperforms
both the monocular and binocular state-of-the-art methods in most cases, and
further verify the feasibility of a two-in-one network for monocular and
binocular depth estimation. The code is available at
https://github.com/ZM-Zhou/TiO-Depth_pytorch.
Comment: Accepted to ICCV 202
Complementary Frequency-Varying Awareness Network for Open-Set Fine-Grained Image Recognition
Open-set image recognition is a challenging topic in computer vision. Most
existing works in the literature focus on learning more discriminative features
from the input images. However, they are usually insensitive to the high- or
low-frequency components in features, which degrades their performance on
fine-grained image recognition. To address this problem, we propose a
Complementary Frequency-varying Awareness Network that could better capture
both high-frequency and low-frequency information, called CFAN. The proposed
CFAN consists of three sequential modules: (i) a feature extraction module is
introduced for learning preliminary features from the input images; (ii) a
frequency-varying filtering module is designed to separate out both high- and
low-frequency components from the preliminary features in the frequency domain
via a frequency-adjustable filter; (iii) a complementary temporal aggregation
module is designed for aggregating the high- and low-frequency components via
two Long Short-Term Memory networks into discriminative features. Based on
CFAN, we further propose an open-set fine-grained image recognition method,
called CFAN-OSFGR, which learns image features via CFAN and classifies them via
a linear classifier. Experimental results on 3 fine-grained datasets and 2
coarse-grained datasets demonstrate that CFAN-OSFGR performs significantly
better than 9 state-of-the-art methods in most cases.
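
The frequency-domain separation described in module (ii) can be illustrated with a small sketch. This is not the CFAN filter itself, whose exact form is not given in the abstract; it assumes a hard cutoff on a 1-D feature vector, and the names `split_by_frequency` and `cutoff` are hypothetical. It only shows the general idea of splitting a feature into low- and high-frequency parts whose sum recovers the original.

```python
import cmath


def dft(signal):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def inverse_dft(spectrum):
    """Inverse DFT; returns the real part, which is exact for symmetric spectra."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)).real / n
        for t in range(n)
    ]


def split_by_frequency(feature, cutoff):
    """Split a 1-D feature into low- and high-frequency components.

    `cutoff` is a hypothetical adjustable parameter: DFT bins whose distance
    from the zero-frequency bin is <= cutoff count as low frequency.
    """
    n = len(feature)
    spectrum = dft(feature)
    low_spec, high_spec = [], []
    for k, coeff in enumerate(spectrum):
        freq = min(k, n - k)  # treat bins k and n - k alike so outputs stay real
        if freq <= cutoff:
            low_spec.append(coeff)
            high_spec.append(0j)
        else:
            low_spec.append(0j)
            high_spec.append(coeff)
    return inverse_dft(low_spec), inverse_dft(high_spec)


feature = [1.0, 2.0, 4.0, 3.0, 0.0, -1.0, 2.0, 5.0]
low, high = split_by_frequency(feature, cutoff=1)
print("low :", [round(v, 3) for v in low])
print("high:", [round(v, 3) for v in high])
```

Raising or lowering `cutoff` moves content between the two components, which is the sense in which such a filter is frequency-adjustable. In the abstract, the two components are then aggregated by two LSTM networks, a step this sketch does not cover.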
Recursive Counterfactual Deconfounding for Object Recognition
Image recognition is a classic and common task in the computer vision field,
which has been widely applied in the past decade. Most existing methods in
the literature aim to learn discriminative features from labeled images for
classification. However, they generally neglect confounders that infiltrate
the learned features, resulting in poor performance when discriminating
test images. To address this problem, we propose a Recursive Counterfactual
Deconfounding model for object recognition in both closed-set and open-set
scenarios based on counterfactual analysis, called RCD. The proposed model
consists of a factual graph and a counterfactual graph, where the relationships
among image features, model predictions, and confounders are built and updated
recursively for learning more discriminative features. The model operates
recursively so that subtler counterfactual features can be learned and
eliminated progressively, improving both the discriminability and the
generalization of the proposed model. In addition, a negative
correlation constraint is designed for alleviating the negative effects of the
counterfactual features further during model training. Extensive
experimental results on both closed-set and open-set recognition tasks
demonstrate that the proposed RCD model performs significantly better than
11 state-of-the-art baselines in most cases.
Comparison of IT Neural Response Statistics with Simulations
Lehky et al. (2011) provided a statistical analysis of the responses of 674 recorded neurons to 806 image stimuli in the anterior inferotemporal (AIT) cortex of two monkeys. In terms of kurtosis and Pareto tail index, they observed that the population sparseness of both unnormalized and normalized responses is always larger than the single-neuron selectivity, and hence concluded that the critical features for individual neurons in primate AIT cortex are not very complex, but that there is an indefinitely large number of them. In this work, we explore an “inverse problem” by simulation: we simulate neurons that each respond to only a very limited number of stimuli, among a very large number of neurons and stimuli, to assess whether the population sparseness is always larger than the single-neuron selectivity. Our simulation results show that the population sparseness exceeds the single-neuron selectivity in most cases, even when the numbers of neurons and stimuli are much larger than several hundred, which confirms the observations in Lehky et al. (2011). In addition, we found that the variances of the computed kurtosis and Pareto tail index are quite large in some cases, which reveals some limitations of these two criteria when used for neuron response evaluation.
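
The kurtosis-based comparison above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the simulation used in the work: responses are binary (a neuron either responds to a stimulus or not), the sizes and the number of preferred stimuli are arbitrary, and every function name is hypothetical. Single-neuron selectivity is taken as the excess kurtosis of one neuron's responses across stimuli, and population sparseness as the excess kurtosis of the population's responses to one stimulus, each averaged over neurons or stimuli.

```python
import random


def excess_kurtosis(values):
    """Population excess kurtosis (m4 / m2**2 - 3); None if the variance is zero."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    if m2 == 0:
        return None
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / m2 ** 2 - 3.0


def simulate_responses(n_neurons, n_stimuli, n_preferred, rng):
    """Toy model (an assumption): each neuron responds with 1.0 to
    n_preferred randomly chosen stimuli and with 0.0 to all others."""
    responses = []
    for _ in range(n_neurons):
        preferred = set(rng.sample(range(n_stimuli), n_preferred))
        responses.append([1.0 if s in preferred else 0.0 for s in range(n_stimuli)])
    return responses


def mean_defined(values):
    """Average the entries that are not None; None if there are none."""
    defined = [v for v in values if v is not None]
    return sum(defined) / len(defined) if defined else None


def selectivity_and_sparseness(responses):
    """Rows are neurons, columns are stimuli."""
    selectivity = mean_defined([excess_kurtosis(row) for row in responses])
    sparseness = mean_defined([excess_kurtosis(list(col)) for col in zip(*responses)])
    return selectivity, sparseness


rng = random.Random(0)
responses = simulate_responses(n_neurons=200, n_stimuli=300, n_preferred=5, rng=rng)
selectivity, sparseness = selectivity_and_sparseness(responses)
print("mean single-neuron selectivity:", selectivity)
print("mean population sparseness:", sparseness)
```

Repeating the run with different seeds and sizes gives a rough feel for how much the two averages vary, which is the kind of variance the last sentence of the abstract refers to.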
Zero-Shot Learning from Adversarial Feature Residual to Compact Visual Feature
Recently, many zero-shot learning (ZSL) methods have focused on learning
discriminative object features in an embedding feature space. However, the
distributions of the unseen-class features learned by these methods are prone
to partial overlap, resulting in inaccurate object recognition. To address
this problem, we propose a novel adversarial network to synthesize compact
semantic visual features for ZSL, consisting of a residual generator, a
prototype predictor, and a discriminator. The residual generator generates
the visual feature residual, which is integrated with a visual prototype
predicted by the prototype predictor to synthesize the visual feature. The
discriminator distinguishes the synthetic visual features from the real
ones extracted from an existing categorization CNN. Since the generated
residuals are generally numerically much smaller than the distances among all
the prototypes, the distributions of the unseen-class features synthesized by
the proposed network are less overlapped. In addition, considering that the
visual features from categorization CNNs are generally inconsistent with their
semantic features, a simple feature selection strategy is introduced for
extracting more compact semantic visual features. Extensive experimental
results on six benchmark datasets demonstrate that our method significantly
outperforms existing state-of-the-art methods, by 1.2-13.2% in most cases.
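
The argument that small residuals keep class distributions apart can be checked with a toy sketch. It assumes that "integrated" means element-wise addition of prototype and residual, which the abstract does not state, and it uses random bounded vectors in place of a trained generator. The prototypes and all names are hypothetical. If every residual norm stays below half the distance between two prototypes, each synthesized feature is closer to its own prototype than to the other one.

```python
import math
import random


def synthesize(prototype, residual):
    """Assumed integration step: feature = prototype + residual."""
    return [p + r for p, r in zip(prototype, residual)]


def bounded_residual(rng, max_norm, dim):
    """Random stand-in for a generated residual, with Euclidean norm <= max_norm."""
    bound = max_norm / math.sqrt(dim)
    return [rng.uniform(-bound, bound) for _ in range(dim)]


def nearest_prototype(feature, prototypes):
    """Name of the prototype closest to the feature in Euclidean distance."""
    return min(prototypes, key=lambda name: math.dist(feature, prototypes[name]))


# Hypothetical 2-D prototypes, 4.0 apart.
prototypes = {"class_a": [0.0, 0.0], "class_b": [4.0, 0.0]}
rng = random.Random(0)

correct = 0
total = 0
for name, prototype in prototypes.items():
    for _ in range(100):
        # Residual norm <= 1.0, below half the prototype distance (2.0).
        residual = bounded_residual(rng, max_norm=1.0, dim=2)
        feature = synthesize(prototype, residual)
        correct += nearest_prototype(feature, prototypes) == name
        total += 1

print(f"{correct} of {total} synthesized features are nearest to their own prototype")
```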
Spatial-Temporal Attention Network for Open-Set Fine-Grained Image Recognition
Triggered by the success of transformers in various visual tasks, the spatial
self-attention mechanism has recently attracted more and more attention in the
computer vision community. However, we empirically found that a typical vision
transformer with the spatial self-attention mechanism could not learn accurate
attention maps for distinguishing different categories of fine-grained images.
To address this problem, motivated by the temporal attention mechanism in
brains, we propose a spatial-temporal attention network for learning
fine-grained feature representations, called STAN, where the features learnt by
implementing a sequence of spatial self-attention operations corresponding to
multiple moments are aggregated progressively. The proposed STAN consists of
four modules: a self-attention backbone module for learning a sequence of
features with self-attention operations, a spatial feature self-organizing
module for facilitating the model training, a spatial-temporal feature learning
module for aggregating the re-organized features via a Long Short-Term Memory
network, and a context-aware module that is implemented as the forget block of
the spatial-temporal feature learning module for preserving/forgetting the
long-term memory by utilizing contextual information. Then, we propose a
STAN-based method for open-set fine-grained recognition by integrating the
proposed STAN network with a linear classifier, called STAN-OSFGR. Extensive
experimental results on 3 fine-grained datasets and 2 coarse-grained datasets
demonstrate that the proposed STAN-OSFGR significantly outperforms 9
state-of-the-art open-set recognition methods in most cases.