Infrared and visible image fusion plays a vital role in the field of computer
vision. Previous approaches strive to design various fusion rules in their loss
functions. However, these empirically designed fusion rules make the methods
increasingly complex. Besides, most of them focus only on improving visual
quality, and thus show unsatisfactory performance on follow-up high-level
vision tasks. To address these challenges, in this letter, we develop a
semantic-level fusion network that fully exploits semantic guidance, dispensing
with empirically designed fusion rules. In addition, to
achieve a better semantic understanding of the feature fusion process, a
transformer-based fusion block is presented in a multi-scale manner. Moreover,
we devise a regularization loss function, together with a training strategy, to
fully exploit the semantic guidance from high-level vision tasks. Unlike
state-of-the-art methods, our method does not depend on a hand-crafted fusion
loss function, yet it achieves superior performance in both visual quality and
the follow-up high-level vision tasks.