Pseudo-keypoint RKHS Learning for Self-supervised 6DoF Pose Estimation
This paper addresses the simulation-to-real domain gap in 6DoF pose estimation (PE), and
proposes a novel self-supervised keypoint radial voting-based 6DoF PE
framework, effectively narrowing this gap using a learnable kernel in RKHS. We
formulate this domain gap as a distance in high-dimensional feature space,
distinct from previous iterative matching methods. We propose an adapter
network that evolves the network parameters from the source domain, trained
extensively on synthetic data with synthetic poses, to the target domain,
trained on real data. Importantly, the real-data training only
uses pseudo-poses estimated by pseudo-keypoints, and thereby requires no real
groundtruth data annotations. RKHSPose achieves state-of-the-art performance on
three commonly used 6DoF PE datasets including LINEMOD (+4.2%), Occlusion
LINEMOD (+2%), and YCB-Video (+3%). It also compares favorably to fully
supervised methods on all six applicable BOP core datasets, remaining within
0.3% to 10.8% of the top fully supervised results.
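The notion of a domain gap as a distance in a high-dimensional feature space can be illustrated with a maximum mean discrepancy (MMD) computation under a fixed Gaussian kernel; this is a generic sketch, not RKHSPose's learnable kernel, and the function names are hypothetical.

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    # RBF kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2)), pairwise.
    d = x[:, None, :] - y[None, :, :]
    return np.exp(-np.sum(d * d, axis=-1) / (2.0 * sigma ** 2))

def mmd2(source, target, sigma=1.0):
    # (Biased) squared Maximum Mean Discrepancy between two feature sets:
    # the distance between their mean embeddings in the kernel's RKHS.
    k_ss = gaussian_kernel(source, source, sigma).mean()
    k_tt = gaussian_kernel(target, target, sigma).mean()
    k_st = gaussian_kernel(source, target, sigma).mean()
    return k_ss + k_tt - 2.0 * k_st
```

A learnable kernel, as in the paper, would replace the fixed Gaussian with a parameterized function trained alongside the network.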
Learning Better Keypoints for Multi-Object 6DoF Pose Estimation
We investigate the impact of pre-defined keypoints for pose estimation, and
find that accuracy and efficiency can be improved by training a graph network
to select a set of dispersed keypoints with similarly distributed votes. These
votes, learned by a regression network to accumulate evidence for the keypoint
locations, can be regressed more accurately compared to previous heuristic
keypoint algorithms. The proposed KeyGNet, supervised by a combined loss
measuring both Wasserstein distance and dispersion, learns the color and
geometry features of the target objects to estimate optimal keypoint locations.
Experiments demonstrate that the keypoints selected by KeyGNet improve the
accuracy on all evaluation metrics across all seven datasets tested, for three
keypoint voting methods. On the challenging Occlusion LINEMOD dataset, ADD(S)
notably improved by +16.4% with PVN3D, and all core BOP datasets showed an AR improvement
for all objects, of between +1% and +21.5%. There was also a notable increase
in performance when transitioning from single object to multiple object
training using KeyGNet keypoints, essentially eliminating the SISO-MIMO gap for
Occlusion LINEMOD.
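A combined objective of this flavor, pairing a 1-D Wasserstein term over votes with a dispersion term over keypoints, can be sketched as below; the exact formulation in KeyGNet may differ, and the function names and the margin value are illustrative assumptions.

```python
import numpy as np

def wasserstein_1d(a, b):
    # 1-D Wasserstein-1 distance between two equal-size empirical samples:
    # the mean absolute difference of their sorted values.
    return np.mean(np.abs(np.sort(a) - np.sort(b)))

def dispersion_penalty(keypoints, margin=0.5):
    # Hinge penalty on pairwise keypoint distances below `margin`,
    # encouraging a dispersed keypoint set.
    n = len(keypoints)
    pen = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            d = np.linalg.norm(keypoints[i] - keypoints[j])
            pen += max(0.0, margin - d)
    return pen / (n * (n - 1) / 2)

def keypoint_loss(votes, targets, keypoints, alpha=1.0):
    # Hypothetical combined objective: vote-distribution mismatch + crowding.
    return wasserstein_1d(votes, targets) + alpha * dispersion_penalty(keypoints)
```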
Context-aware Pedestrian Trajectory Prediction with Multimodal Transformer
We propose a novel solution for predicting future trajectories of
pedestrians. Our method uses a multimodal encoder-decoder transformer
architecture, which takes as input both pedestrian locations and ego-vehicle
speeds. Notably, our decoder predicts the entire future trajectory in a
single-pass and does not perform one-step-ahead prediction, which makes the
method effective for embedded edge deployment. We perform detailed experiments
and evaluate our method on two popular datasets, PIE and JAAD. Quantitative
results demonstrate the superiority of our proposed model over the current
state-of-the-art, consistently achieving the lowest error for the three time
horizons of 0.5, 1.0, and 1.5 seconds. Moreover, the proposed method is
significantly faster than the state-of-the-art on both PIE and
JAAD. Lastly, ablation experiments demonstrate the impact of the key multimodal
configuration of our method.
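The single-pass decoding idea, in contrast to one-step-ahead prediction, can be sketched with a toy network that maps the fused inputs to the full horizon in one forward pass; this stand-in uses a tiny MLP rather than the paper's transformer, and all shapes and names are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def predict_trajectory(history_xy, ego_speed, horizon, w1, w2):
    # Single-pass decoding: one forward pass maps the multimodal input
    # (past pedestrian positions + ego-vehicle speed) to ALL future steps
    # at once -- no one-step-ahead feedback loop at inference time.
    x = np.concatenate([history_xy.ravel(), [ego_speed]])  # fuse modalities
    h = np.tanh(w1 @ x)                                    # "encoder"
    out = w2 @ h                                           # "decoder" head
    return out.reshape(horizon, 2)                         # (T_future, 2)

# Toy shapes: 8 past frames in, 3 future frames out.
hist = rng.normal(size=(8, 2))
w1 = rng.normal(size=(32, 8 * 2 + 1)) * 0.1
w2 = rng.normal(size=(3 * 2, 32)) * 0.1
pred = predict_trajectory(hist, ego_speed=5.0, horizon=3, w1=w1, w2=w2)
```

Because every future step comes out of one forward pass, latency does not grow with the horizon, which is what makes the approach attractive for embedded edge deployment.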
Can Continual Learning Improve Long-Tailed Recognition? Toward a Unified Framework
The Long-Tailed Recognition (LTR) problem emerges in the context of learning
from highly imbalanced datasets, in which the number of samples among different
classes is heavily skewed. LTR methods aim to accurately learn a dataset
comprising both a larger Head set and a smaller Tail set. We prove a theorem
showing that, under the assumption of strong convexity of the loss function,
the weights of a learner trained on the full dataset lie within a bounded
distance of the weights of the same learner trained strictly on the Head. Next, we assert
that by treating the learning of the Head and Tail as two separate and
sequential steps, Continual Learning (CL) methods can effectively update the
weights of the learner to learn the Tail without forgetting the Head. First, we
validate our theoretical findings with various experiments on the toy MNIST-LT
dataset. We then evaluate the efficacy of several CL strategies on multiple
imbalanced variations of two standard LTR benchmarks (CIFAR100-LT and
CIFAR10-LT), and show that standard CL methods achieve strong performance gains
in comparison to baselines and approach solutions that have been tailor-made
for LTR. We also assess the applicability of CL techniques on real-world data
by exploring CL on the naturally imbalanced Caltech256 dataset and demonstrate
its superiority over state-of-the-art classifiers. Our work not only unifies
LTR and CL but also paves the way for leveraging advances in CL methods to
tackle the LTR challenge more effectively.
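The Head-then-Tail recipe can be sketched with a logistic learner and a quadratic pull toward the Head weights (a minimal continual-learning regularizer, i.e. EWC with an identity Fisher matrix); this toy setup is illustrative only and is not the paper's experimental protocol.

```python
import numpy as np

def train(X, y, w, steps=200, lr=0.1, anchor=None, lam=0.0):
    # Gradient descent on logistic loss; the optional quadratic pull toward
    # `anchor` is a minimal continual-learning regularizer that resists
    # forgetting the previously learned weights.
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        grad = X.T @ (p - y) / len(y)
        if anchor is not None:
            grad += lam * (w - anchor)
        w = w - lr * grad
    return w

rng = np.random.default_rng(1)
X_head = rng.normal(size=(200, 4)); y_head = (X_head[:, 0] > 0).astype(float)
X_tail = rng.normal(size=(20, 4));  y_tail = (X_tail[:, 1] > 0).astype(float)

w_head = train(X_head, y_head, np.zeros(4))                   # step 1: learn the Head
w_cl = train(X_tail, y_tail, w_head, anchor=w_head, lam=1.0)  # step 2: Tail, anchored
w_ft = train(X_tail, y_tail, w_head)                          # plain fine-tuning, for contrast
```

The anchored run learns the Tail while staying close to the Head solution, whereas plain fine-tuning drifts freely, which is the forgetting behavior CL methods are designed to prevent.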
ObjectBox: From Centers to Boxes for Anchor-Free Object Detection
We present ObjectBox, a novel single-stage anchor-free and highly
generalizable object detection approach. As opposed to both existing
anchor-based and anchor-free detectors, which are more biased toward specific
object scales in their label assignments, we use only object center locations
as positive samples and treat all objects equally in different feature levels
regardless of the objects' sizes or shapes. Specifically, our label assignment
strategy considers the object center locations as shape- and size-agnostic
anchors in an anchor-free fashion, and allows learning to occur at all scales
for every object. To support this, we define new regression targets as the
distances from two corners of the center cell location to the four sides of the
bounding box. Moreover, to handle scale-variant objects, we propose a tailored
IoU loss to deal with boxes with different sizes. As a result, our proposed
object detector does not need any dataset-dependent hyperparameters to be tuned
across datasets. We evaluate our method on MS-COCO 2017 and PASCAL VOC 2012
datasets, and compare our results to state-of-the-art methods. We observe that
ObjectBox performs favorably in comparison to prior works. Furthermore, we
perform rigorous ablation experiments to evaluate different components of our
method. Our code is available at: https://github.com/MohsenZand/ObjectBox (ECCV 2022 Oral).
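One plausible reading of these regression targets, distances from the center cell's two corners to the four box sides, can be sketched as follows; the exact coordinate conventions in ObjectBox may differ, and the function name is hypothetical.

```python
def objectbox_targets(box, cell, stride):
    # box = (x1, y1, x2, y2) in image coordinates; cell = (col, row) is the
    # grid cell containing the object center at this feature level.
    # Targets: distances from the cell's top-left corner to the left/top box
    # sides, and from its bottom-right corner to the right/bottom box sides.
    x1, y1, x2, y2 = box
    cx0, cy0 = cell[0] * stride, cell[1] * stride  # cell top-left corner
    cx1, cy1 = cx0 + stride, cy0 + stride          # cell bottom-right corner
    return (cx0 - x1, cy0 - y1,
            x2 - cx1, y2 - cy1)
```

Because the targets are plain distances from a fixed-size cell, the same definition applies at every feature level, which is what makes the assignment shape- and size-agnostic.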
Continual Learning for Out-of-Distribution Pedestrian Detection
A continual learning solution is proposed to address the out-of-distribution
generalization problem for pedestrian detection. While recent pedestrian
detection models have achieved impressive performance on various datasets, they
remain sensitive to shifts in the distribution of the inference data. Our
method adopts and modifies Elastic Weight Consolidation for a backbone object
detection network, in order to penalize changes in the model weights based
on their importance towards the initially learned task. We show that when
trained with one dataset and fine-tuned on another, our solution learns the new
distribution and maintains its performance on the previous one, avoiding
catastrophic forgetting. We use two popular datasets, CrowdHuman and
CityPersons for our cross-dataset experiments, and show considerable
improvements over standard fine-tuning, with 9% and 18% reductions in miss
rate on the CrowdHuman and CityPersons datasets, respectively.
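The EWC idea the method builds on, penalizing weight changes in proportion to their importance for the earlier task, reduces to a short penalty term; this is the standard formulation, with illustrative values, not the paper's detection-specific adaptation.

```python
import numpy as np

def ewc_penalty(w, w_old, fisher, lam=100.0):
    # Elastic Weight Consolidation: quadratic penalty on deviations from the
    # previously learned weights, scaled per-parameter by the diagonal Fisher
    # information (each parameter's importance to the earlier task).
    return 0.5 * lam * np.sum(fisher * (w - w_old) ** 2)
```

Parameters with near-zero Fisher values can move freely to fit the new distribution, while important ones are held near their old values, which is what avoids catastrophic forgetting.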
Flow-based Autoregressive Structured Prediction of Human Motion
A new method is proposed for human motion prediction by learning temporal and
spatial dependencies in an end-to-end deep neural network. The joint
connectivity is explicitly modeled using a novel autoregressive structured
prediction representation based on flow-based generative models. We learn a
latent space of complex body poses in consecutive frames which is conditioned
on the high-dimensional structured input sequence. To construct each latent
variable, the global and local smoothness of the joint positions are
considered in a generative process using conditional normalizing flows. As a
result, all frame-level and joint-level continuities in the sequence are
preserved in the model. This enables us to parameterize the inter-frame and
intra-frame relationships and joint connectivity for robust long-term
predictions as well as short-term prediction. Our experiments on two
challenging benchmark datasets of Human3.6M and AMASS demonstrate that our
proposed method is able to effectively model the sequence information for
motion prediction and outperform other techniques in 42 of the 48 total
experiment scenarios to set a new state-of-the-art.
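A single conditional affine flow step, the building block behind such conditional normalizing flows, can be sketched as follows; the parameterization here is a minimal stand-in for the paper's model, and all names are hypothetical.

```python
import numpy as np

def affine_flow_logprob(x, context, w_mu, w_log_sigma):
    # One conditional affine flow step: the conditioning sequence predicts a
    # per-dimension shift and log-scale; x is mapped to a latent z scored
    # under a standard normal, with the transform's log-determinant added
    # (change of variables).
    mu = w_mu @ context
    log_sigma = w_log_sigma @ context
    z = (x - mu) * np.exp(-log_sigma)
    log_det = -np.sum(log_sigma)  # log |dz/dx|
    log_pz = -0.5 * np.sum(z ** 2) - 0.5 * len(z) * np.log(2.0 * np.pi)
    return log_pz + log_det
```

In the paper's setting the context would encode the past pose sequence, so the flow's density is conditioned on motion history; stacking several such steps yields the expressive latent space described above.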