Multi-task Self-Supervised Visual Learning
We investigate methods for combining multiple self-supervised tasks--i.e.,
supervised tasks where data can be collected without manual labeling--in order
to train a single visual representation. First, we provide an apples-to-apples
comparison of four different self-supervised tasks using the very deep
ResNet-101 architecture. We then combine tasks to jointly train a network. We
also explore lasso regularization to encourage the network to factorize the
information in its representation, and methods for "harmonizing" network inputs
in order to learn a more unified representation. We evaluate all methods on
ImageNet classification, PASCAL VOC detection, and NYU depth prediction. Our
results show that deeper networks work better, and that combining tasks--even
via a naive multi-head architecture--always improves performance. Our best
joint network nearly matches the PASCAL performance of a model pre-trained on
ImageNet classification, and matches the ImageNet network on NYU depth
prediction.
Comment: Published at ICCV 201
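The "naive multi-head architecture" mentioned above can be sketched as a shared trunk with one lightweight head per self-supervised task. The following is a minimal, hypothetical PyTorch example, not the paper's model: a tiny trunk stands in for ResNet-101, and the task heads and output sizes are invented for illustration.

```python
import torch
import torch.nn as nn

class MultiHeadSelfSupervised(nn.Module):
    """Shared trunk with one small head per self-supervised task."""
    def __init__(self, feat_dim=64, task_out_dims=(4, 8)):
        super().__init__()
        # Toy shared trunk (a stand-in for the ResNet-101 used in the paper).
        self.trunk = nn.Sequential(
            nn.Conv2d(3, feat_dim, 3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        # One linear head per task, e.g. a 4-way rotation head (dims invented).
        self.heads = nn.ModuleList(nn.Linear(feat_dim, d) for d in task_out_dims)

    def forward(self, x):
        feats = self.trunk(x)                     # shared representation
        return [head(feats) for head in self.heads]

model = MultiHeadSelfSupervised()
logits = model(torch.randn(2, 3, 32, 32))
# Naive multi-task objective: simply sum the per-task losses
# (dummy all-zero targets, just to show the joint training signal).
targets = torch.zeros(2, dtype=torch.long)
total_loss = sum(nn.functional.cross_entropy(l, targets) for l in logits)
```

Each task contributes a gradient to the shared trunk, which is all the "naive" combination requires; the paper's lasso regularization and input harmonization are refinements on top of this.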
Exemplar-based Video Colorization with Long-term Spatiotemporal Dependency
Exemplar-based video colorization is an essential technique for applications
like old movie restoration. Although recent methods perform well in still
scenes or scenes with regular movement, they often lack robustness in scenes
with large motion because of their limited ability to model long-term spatial
and temporal dependencies, which leads to color fading, color discontinuity,
and other artifacts. To solve this problem, we propose an exemplar-based video
colorization framework with long-term spatiotemporal dependency. To enhance the
long-term spatial dependency, we design a parallelized CNN-Transformer block
and a double-head non-local operation. The CNN-Transformer block better
integrates long-term spatial dependency with local texture and structural
features, and the double-head non-local operation further exploits the
augmented features. To enhance long-term temporal dependency, we introduce a
novel linkage subnet, which propagates motion information across adjacent
frame blocks and helps maintain temporal continuity. Experiments demonstrate
that our model outperforms recent
state-of-the-art methods both quantitatively and qualitatively. Our model also
generates more colorful, realistic, and stable results, especially for scenes
where objects change greatly and irregularly.
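The parallelized CNN-Transformer idea can be sketched as a convolution branch (local texture) and a self-attention branch (long-range spatial dependency) applied in parallel and fused by addition. This is a hypothetical reconstruction, not the paper's actual block: the layer sizes, normalization placement, and fusion rule are all assumptions.

```python
import torch
import torch.nn as nn

class ParallelCNNTransformerBlock(nn.Module):
    """Hypothetical sketch: conv branch for local features, attention branch
    for long-range spatial dependency, fused by a residual sum."""
    def __init__(self, dim=32, heads=4):
        super().__init__()
        self.conv = nn.Conv2d(dim, dim, 3, padding=1)          # local branch
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                       # x: (B, C, H, W)
        b, c, h, w = x.shape
        local = self.conv(x)
        tokens = self.norm(x.flatten(2).transpose(1, 2))       # (B, H*W, C)
        global_, _ = self.attn(tokens, tokens, tokens)         # self-attention
        global_ = global_.transpose(1, 2).reshape(b, c, h, w)
        return x + local + global_              # parallel fusion (assumed: sum)

x = torch.randn(1, 32, 8, 8)
y = ParallelCNNTransformerBlock()(x)
```

Because every spatial token attends to every other, the attention branch can relate distant regions of a frame that a stack of small convolutions would connect only slowly, which matches the abstract's motivation for adding a Transformer path alongside the CNN path.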
SVCNet: Scribble-based Video Colorization Network with Temporal Aggregation
In this paper, we propose a scribble-based video colorization network with
temporal aggregation called SVCNet. It can colorize monochrome videos based on
different user-given color scribbles. It addresses three common issues in the
scribble-based video colorization area: colorization vividness, temporal
consistency, and color bleeding. To improve the colorization quality and
strengthen the temporal consistency, we adopt two sequential sub-networks in
SVCNet for precise colorization and temporal smoothing, respectively. The first
stage includes a pyramid feature encoder to incorporate color scribbles with a
grayscale frame, and a semantic feature encoder to extract semantics. The
second stage refines the output of the first stage by aggregating the
information of neighboring colorized frames (as short-range connections) and
the first colorized frame (as a long-range connection). To alleviate the color
bleeding artifacts, we learn video colorization and segmentation
simultaneously. Furthermore, we perform most operations at a fixed small
resolution and use a Super-resolution Module at the tail of SVCNet to recover
the original size, which allows SVCNet to handle different image resolutions
at inference. Finally, we evaluate the proposed SVCNet on DAVIS and Videvo
benchmarks. The experimental results demonstrate that SVCNet produces both
higher-quality and more temporally consistent videos than other well-known
video colorization approaches. The codes and models can be found at
https://github.com/zhaoyuzhi/SVCNet.
Comment: accepted by IEEE Transactions on Image Processing (TIP)
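The second-stage aggregation idea, combining the current frame with its neighbors (short-range) and the first colorized frame (long-range), can be illustrated with a toy NumPy sketch. SVCNet learns this aggregation; the fixed weighted blend and its weights below are invented purely for illustration.

```python
import numpy as np

def temporal_aggregate(current, neighbors, first,
                       w_cur=0.6, w_nbr=0.3, w_first=0.1):
    """Toy stand-in for SVCNet's second stage: blend the current colorized
    frame with neighboring frames (short-range connections) and the first
    colorized frame (long-range connection).  Weights are arbitrary."""
    nbr_mean = np.mean(neighbors, axis=0)   # pool short-range evidence
    return w_cur * current + w_nbr * nbr_mean + w_first * first

# Toy 4x4 RGB frames as float arrays.
cur = np.full((4, 4, 3), 0.5)
nbrs = np.stack([np.full((4, 4, 3), 0.4), np.full((4, 4, 3), 0.6)])
first = np.full((4, 4, 3), 0.5)
out = temporal_aggregate(cur, nbrs, first)
```

The long-range term anchors every frame to the same first-frame colors, which is one simple way to see why such a connection fights color drift over long clips.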
A critical analysis of self-supervision, or what we can learn from a single image
We look critically at popular self-supervision techniques for learning deep
convolutional neural networks without manual labels. We show that three
different and representative methods, BiGAN, RotNet and DeepCluster, can learn
the first few layers of a convolutional network from a single image as well as
using millions of images and manual labels, provided that strong data
augmentation is used. However, for deeper layers the gap with manual
supervision cannot be closed even if millions of unlabelled images are used for
training. We conclude that: (1) the weights of the early layers of deep
networks contain limited information about the statistics of natural images,
that (2) such low-level statistics can be learned through self-supervision just
as well as through strong supervision, and that (3) the low-level statistics
can be captured via synthetic transformations instead of using a large image
dataset.
Comment: Accepted paper at the International Conference on Learning Representations (ICLR) 202
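The single-image setting hinges on manufacturing a large, varied training set from one image through strong augmentation. A minimal sketch with random crops and horizontal flips is shown below; the paper's augmentation pipeline is considerably stronger, and the sizes and counts here are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment_single_image(img, n_crops=8, size=16):
    """Build a tiny self-supervision 'dataset' from a single image using
    random crops and horizontal flips (a stand-in for the strong data
    augmentation the paper shows is essential)."""
    h, w, _ = img.shape
    crops = []
    for _ in range(n_crops):
        y = rng.integers(0, h - size + 1)            # random crop origin
        x = rng.integers(0, w - size + 1)
        crop = img[y:y + size, x:x + size]
        if rng.random() < 0.5:
            crop = crop[:, ::-1]                     # horizontal flip
        crops.append(crop)
    return np.stack(crops)

batch = augment_single_image(rng.random((64, 64, 3)))
```

Each augmented view is a distinct training sample, so even one source image yields enough low-level variation (edges, textures, color statistics) to train early convolutional layers, which is the paper's central observation.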
Pixelated Semantic Colorization
While many image colorization algorithms have recently shown the capability
of producing plausible color versions from gray-scale photographs, they still
suffer from limited semantic understanding. To address this shortcoming, we
propose to exploit pixelated object semantics to guide image colorization. The
rationale is that human beings perceive and distinguish colors based on the
semantic categories of objects. Starting from an autoregressive model, we
generate image color distributions, from which diverse colored results are
sampled. We propose two ways to incorporate object semantics into the
colorization model: through a pixelated semantic embedding and a pixelated
semantic generator. Specifically, the proposed convolutional neural network
includes two branches. One branch learns what the object is, while the other
branch learns the object colors. The network jointly optimizes a color
embedding loss, a semantic segmentation loss and a color generation loss, in an
end-to-end fashion. Experiments on PASCAL VOC2012 and COCO-stuff reveal that
our network, when trained with semantic segmentation labels, produces more
realistic and finer results compared to the colorization state-of-the-art.
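The two-branch design with a jointly optimized objective can be sketched as follows. This is a hypothetical illustration: the layer shapes, class count, and color-bin count are placeholders, and the paper's color embedding loss is omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoBranchColorizer(nn.Module):
    """Hypothetical sketch of the two-branch idea: one head predicts
    per-pixel semantics ('what the object is'), the other predicts a
    per-pixel color distribution ('what color it takes')."""
    def __init__(self, feat=16, n_classes=5, n_color_bins=32):
        super().__init__()
        self.backbone = nn.Conv2d(1, feat, 3, padding=1)    # grayscale input
        self.sem_head = nn.Conv2d(feat, n_classes, 1)       # semantics branch
        self.color_head = nn.Conv2d(feat, n_color_bins, 1)  # color branch

    def forward(self, gray):
        f = F.relu(self.backbone(gray))
        return self.sem_head(f), self.color_head(f)

model = TwoBranchColorizer()
sem_logits, color_logits = model(torch.randn(2, 1, 8, 8))
# Dummy per-pixel targets, just to show the joint end-to-end objective:
# segmentation loss + color generation loss (embedding loss omitted).
sem_t = torch.zeros(2, 8, 8, dtype=torch.long)
col_t = torch.zeros(2, 8, 8, dtype=torch.long)
loss = F.cross_entropy(sem_logits, sem_t) + F.cross_entropy(color_logits, col_t)
```

Because both heads share the backbone, gradients from the segmentation loss shape the same features used for color prediction, which is how semantic knowledge can guide colorization in this design.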
SPColor: Semantic Prior Guided Exemplar-based Image Colorization
Exemplar-based image colorization aims to colorize a target grayscale image
based on a color reference image, and the key is to establish accurate
pixel-level semantic correspondence between these two images. Previous methods
search for correspondence across the entire reference image, and this kind of
global matching is prone to mismatches. We summarize the difficulties in two
aspects: (1) when the reference image contains only some of the objects
related to the target image, improper correspondences are established in the
unrelated regions; (2) mismatches easily occur in regions where the shape or
texture of an object is easily confused. To overcome these issues, we propose SPColor,
a semantic prior guided exemplar-based image colorization framework. Different
from previous methods, SPColor first coarsely classifies the pixels of the
reference and target images into several pseudo-classes under the guidance of
a semantic prior; correspondences are then established only locally, between
pixels in the same class, via the newly designed semantic prior guided
correspondence network. In this way, improper correspondence between different
semantic classes is explicitly excluded and mismatches are markedly reduced.
In addition, to better preserve the colors of the reference, a similarity
masked perceptual loss is designed. Note that SPColor uses the semantic prior
provided by an unsupervised segmentation model, so it requires no additional
manual semantic annotation. Experiments demonstrate that our model outperforms
recent state-of-the-art methods both quantitatively and qualitatively on
public datasets.
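The core idea of restricting correspondence to same-pseudo-class pixels can be illustrated with a small NumPy sketch. The dot-product similarity and hard class mask below are simplifications of the paper's learned correspondence network, and all feature values are invented.

```python
import numpy as np

def class_restricted_correspondence(tgt_feat, ref_feat, tgt_cls, ref_cls):
    """Toy sketch of semantic-prior-guided matching: each target pixel may
    only match reference pixels in the same pseudo-class."""
    sim = tgt_feat @ ref_feat.T                       # all-pairs similarity
    same_class = tgt_cls[:, None] == ref_cls[None, :]
    # Exclude cross-class pairs so mismatches between classes are impossible.
    # (If a pseudo-class has no reference pixels, the row is all -inf and
    # argmax falls back to index 0; a real system would handle this case.)
    sim = np.where(same_class, sim, -np.inf)
    return sim.argmax(axis=1)   # best same-class reference pixel per target

# One target pixel, two reference pixels (features and classes invented).
tgt_feat = np.array([[1.0, 0.0]])
ref_feat = np.array([[1.0, 0.0], [0.9, 0.0]])
tgt_cls = np.array([0])
ref_cls = np.array([1, 0])
match = class_restricted_correspondence(tgt_feat, ref_feat, tgt_cls, ref_cls)
```

In this toy case the globally most similar reference pixel (index 0, similarity 1.0) belongs to a different pseudo-class, so the mask redirects the match to index 1; this is exactly the cross-class mismatch the abstract says is "explicitly excluded."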