Recently, image-to-image translation methods based on contrastive learning have achieved state-of-the-art results on many tasks. However, in previous work the negatives are sampled from the input feature space, which limits the diversity of the negatives. Moreover, in the latent space of the embeddings, previous methods ignore the domain consistency between the generated image and the real images of the target domain. In this paper, we propose a novel contrastive learning framework for unpaired image-to-image translation, called MCCUT.
We generate the negatives from multi-crop views via center cropping and random cropping, which improves the diversity of the negatives while also increasing their quality.
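As a rough illustration, the following PyTorch sketch shows how such multi-crop negatives could be produced from one center crop and several random crops of an image; the crop size, the number of random crops, and the helper name are illustrative assumptions, not details taken from the paper.

import torch
import torchvision.transforms as T

def multi_crop_views(image, crop_size=128, n_random=4):
    """Build one center-crop view and several random-crop views,
    which serve as additional sources of negative patches."""
    center = T.CenterCrop(crop_size)
    random_crop = T.RandomCrop(crop_size)
    views = [center(image)]
    views += [random_crop(image) for _ in range(n_random)]
    return views

# Usage: encode each view and pool their patch features into the
# negative set of a patch-wise contrastive (PatchNCE-style) loss.
image = torch.rand(1, 3, 256, 256)
views = multi_crop_views(image)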
To constrain the embeddings in the deep feature space, we formulate a new domain consistency loss, which encourages the generated images to be close to the real images of the same domain in the embedding space.
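One plausible way to realize such a loss is sketched below, under the assumption that it penalizes the distance between the embedding of a generated image and the mean embedding of real target-domain images; the function name and the choice of an L1 distance are ours for illustration and may differ from the paper's exact formulation.

import torch
import torch.nn.functional as F

def domain_consistency_loss(fake_embed, real_embeds):
    """fake_embed: (D,) embedding of the generated image.
    real_embeds: (N, D) embeddings of real target-domain images."""
    target = real_embeds.mean(dim=0)      # prototype of the target domain
    return F.l1_loss(fake_embed, target)  # pull the fake toward the domain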
Furthermore, we present a dual coordinate channel attention network built by embedding positional information into SENet, which we call the DCSE module. We employ the DCSE module in the design of the generator, which enables the generator to pay more attention to channels with greater weight.
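A minimal sketch of a DCSE-style block follows, under the assumption that it combines SENet's channel squeeze-and-excitation with coordinate-attention-style direction-wise (height and width) pooling; the layer sizes and reduction ratio are illustrative, not values from the paper.

import torch
import torch.nn as nn

class DCSE(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))  # pool width, keep height
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))  # pool height, keep width
        self.fc = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        # Direction-wise descriptors retain positional information
        # that plain SENet global average pooling discards.
        h = self.pool_h(x)              # (B, C, H, 1)
        w = self.pool_w(x)              # (B, C, 1, W)
        attn = self.fc(h) * self.fc(w)  # broadcasts to (B, C, H, W)
        return x * attn                 # reweight channels per position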
In many image-to-image translation tasks, our method achieves state-of-the-art results, and its advantages are demonstrated through extensive comparison experiments and ablation studies.