1,977 research outputs found
D2-Net: A Trainable CNN for Joint Detection and Description of Local Features
In this work we address the problem of finding reliable pixel-level
correspondences under difficult imaging conditions. We propose an approach
where a single convolutional neural network plays a dual role: It is
simultaneously a dense feature descriptor and a feature detector. By postponing
the detection to a later stage, the obtained keypoints are more stable than
their traditional counterparts based on early detection of low-level
structures. We show that this model can be trained using pixel correspondences
extracted from readily available large-scale SfM reconstructions, without any
further annotations. The proposed method obtains state-of-the-art performance
on both the difficult Aachen Day-Night localization dataset and the InLoc
indoor localization benchmark, as well as competitive performance on other
benchmarks for image matching and 3D reconstruction.Comment: Accepted at CVPR 201
Detect-and-Track: Efficient Pose Estimation in Videos
This paper addresses the problem of estimating and tracking human body
keypoints in complex, multi-person video. We propose an extremely lightweight
yet highly effective approach that builds upon the latest advancements in human
detection and video understanding. Our method operates in two-stages: keypoint
estimation in frames or short clips, followed by lightweight tracking to
generate keypoint predictions linked over the entire video. For frame-level
pose estimation we experiment with Mask R-CNN, as well as our own proposed 3D
extension of this model, which leverages temporal information over small clips
to generate more robust frame predictions. We conduct extensive ablative
experiments on the newly released multi-person video pose estimation benchmark,
PoseTrack, to validate various design choices of our model. Our approach
achieves an accuracy of 55.2% on the validation and 51.8% on the test set using
the Multi-Object Tracking Accuracy (MOTA) metric, and achieves state of the art
performance on the ICCV 2017 PoseTrack keypoint tracking challenge.Comment: In CVPR 2018. Ranked first in ICCV 2017 PoseTrack challenge (keypoint
tracking in videos). Code: https://github.com/facebookresearch/DetectAndTrack
and webpage: https://rohitgirdhar.github.io/DetectAndTrack
Video Logo Retrieval based on local Features
Estimation of the frequency and duration of logos in videos is important and
challenging in the advertisement industry as a way of estimating the impact of
ad purchases. Since logos occupy only a small area in the videos, the popular
methods of image retrieval could fail. This paper develops an algorithm called
Video Logo Retrieval (VLR), which is an image-to-video retrieval algorithm
based on the spatial distribution of local image descriptors that measure the
distance between the query image (the logo) and a collection of video images.
VLR uses local features to overcome the weakness of global feature-based models
such as convolutional neural networks (CNN). Meanwhile, VLR is flexible and
does not require training after setting some hyper-parameters. The performance
of VLR is evaluated on two challenging open benchmark tasks (SoccerNet and
Standford I2V), and compared with other state-of-the-art logo retrieval or
detection algorithms. Overall, VLR shows significantly higher accuracy compared
with the existing methods.Comment: Accepted by ICIP 20. Contact author: Bochen Guan ([email protected]
Comparator Networks
The objective of this work is set-based verification, e.g. to decide if two
sets of images of a face are of the same person or not. The traditional
approach to this problem is to learn to generate a feature vector per image,
aggregate them into one vector to represent the set, and then compute the
cosine similarity between sets. Instead, we design a neural network
architecture that can directly learn set-wise verification. Our contributions
are: (i) We propose a Deep Comparator Network (DCN) that can ingest a pair of
sets (each may contain a variable number of images) as inputs, and compute a
similarity between the pair--this involves attending to multiple discriminative
local regions (landmarks), and comparing local descriptors between pairs of
faces; (ii) To encourage high-quality representations for each set, internal
competition is introduced for recalibration based on the landmark score; (iii)
Inspired by image retrieval, a novel hard sample mining regime is proposed to
control the sampling process, such that the DCN is complementary to the
standard image classification models. Evaluations on the IARPA Janus face
recognition benchmarks show that the comparator networks outperform the
previous state-of-the-art results by a large margin.Comment: To appear in ECCV 201
- …