On Modeling Long-Range Dependencies for Visual Perception

Abstract

One of the ultimate goals of computer vision is to extract useful information from visual inputs. An example is to recognize and segment objects in natural images. Recently, deep networks have enabled us to perform a wide range of these tasks better than ever. These results are mostly achieved with convolutional neural networks, which model pixel relations within a small convolution kernel. Despite the success of convolution, its local-window approximation makes it difficult to capture long-range relations. This limitation leads to problems such as unsatisfactory generalization and poor robustness to out-of-distribution examples. In this dissertation, I aim to model long-range dependencies in the context of natural image perception. The first part of the dissertation focuses on designing neural architectures that are flexible enough to capture long-range relations. We start by improving convolutional networks with dynamic scaling policies. Then, we explore an alternative solution that completely replaces convolution with global self-attention to capture more context. The attention mechanism is further extended to model relations between pixels and objects with a transformer, enabling panoptic segmentation in an end-to-end manner. These flexible long-range models usually require a large amount of labeled data to train. To address this issue, the second part of the dissertation discusses self-supervised techniques that learn representations effectively without human annotation. We regularize the contrastive learning framework with a consistency term that refines self-supervision signals. We also study a more general pretext task, masked image modeling, and train transformers to learn better representations with an online semantic tokenizer.
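To make the contrast between local convolution and global self-attention concrete, the following is a minimal, illustrative sketch in PyTorch; it is not the dissertation's implementation, and the shapes, projection layers, and variable names are assumptions chosen for exposition only.

```python
# Minimal sketch (assumed shapes/names, not the dissertation's method):
# a 3x3 convolution mixes information only within a local window, while
# global self-attention lets every pixel attend to every other pixel.
import torch
import torch.nn as nn
import torch.nn.functional as F

B, C, H, W = 2, 64, 32, 32
x = torch.randn(B, C, H, W)

# Local model: each output pixel sees only a 3x3 neighborhood.
local_out = nn.Conv2d(C, C, kernel_size=3, padding=1)(x)

# Global self-attention: flatten the H*W pixels into a sequence so the
# attention matrix connects every pair of spatial positions.
q_proj, k_proj, v_proj = (nn.Linear(C, C) for _ in range(3))
tokens = x.flatten(2).transpose(1, 2)                          # (B, H*W, C)
q, k, v = q_proj(tokens), k_proj(tokens), v_proj(tokens)
attn = F.softmax(q @ k.transpose(-2, -1) / C ** 0.5, dim=-1)   # (B, H*W, H*W)
global_out = (attn @ v).transpose(1, 2).reshape(B, C, H, W)

print(local_out.shape, global_out.shape)  # both (B, C, H, W)
```

The attention matrix is quadratic in the number of pixels, which is why the architectures in the first part of the dissertation must balance this global receptive field against computational cost.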
