We develop a deep architecture to learn to find good correspondences for
wide-baseline stereo. Given a set of putative sparse matches and the camera
intrinsics, we train our network in an end-to-end fashion to label the
correspondences as inliers or outliers, while simultaneously using them to
recover the relative pose, as encoded by the essential matrix. Our architecture
is based on a multi-layer perceptron operating on pixel coordinates rather than
directly on the image, and is thus simple and small. We introduce a novel
normalization technique, called Context Normalization, which allows us to
process each data point separately while imbuing it with global information,
and also makes the network invariant to the order of the correspondences. Our
experiments on multiple challenging datasets demonstrate that our method is
able to drastically improve the state of the art with little training data.Comment: CVPR 2018 (Oral