Thanks to advances in deep learning, vision transformers have
demonstrated competitive performance in various computer vision tasks.
However, vision transformers still face challenges such as high
computational complexity and a lack of desirable inductive biases. To alleviate
these problems, a novel Bi-Fovea Self-Attention (BFSA) mechanism is proposed,
inspired by the physiological structure and characteristics of bi-fovea vision in eagle
eyes. BFSA simulates the shallow-fovea and deep-fovea functions of
eagle vision, enabling the network to extract feature representations of targets
from coarse to fine and facilitating the interaction of multi-scale feature
representations. Additionally, a Bionic Eagle Vision (BEV) block based on BFSA
is designed. It combines the advantages of CNNs and vision
transformers to strengthen the network's global and local feature
representations. Furthermore, a unified and efficient family of general pyramid
backbone networks, called Eagle Vision Transformers (EViTs), is developed by
stacking BEV blocks. Experimental results on various computer
vision tasks, including image classification, object detection, instance
segmentation, and other transfer learning tasks, show that the proposed EViTs
perform effectively compared with baselines of the same model size and
run faster on graphics processing units (GPUs) than other models. Code is
available at https://github.com/nkusyl/EViT.

Comment: This work has been submitted to the IEEE for possible publication.
Copyright may be transferred without notice, after which this version may no
longer be accessible.