Rethinking Query, Key, and Value Embedding in Vision Transformer under Tiny Model Constraints
A vision transformer (ViT) is a dominant model in the computer vision field. Despite numerous studies that mainly focus on dealing with inductive bias and complexity, there remains the problem of finding better transformer networks. For example, conventional transformer-based models usually use a projection layer for each query (Q), key (K), and value (V) embedding before multi-head self-attention. Insufficient consideration of the semantics of the Q, K, and V embeddings may lead to a performance drop. In this paper, we propose three types of structures for the Q, K, and V embeddings. The first structure utilizes two layers with ReLU as a non-linear embedding for each of Q, K, and V. The second shares one of the non-linear layers to share knowledge among Q, K, and V. The third shares all non-linear layers and adds code parameters; the codes are trainable, and their values determine which of the Q, K, and V embeddings the shared layers produce.
Hence, we demonstrate the superior image classification performance of the proposed approaches in experiments compared to several state-of-the-art approaches. The proposed method achieved higher accuracy with fewer parameters on the ImageNet-1k dataset than the original XCiT-N12 transformer model. Additionally, in transfer learning averaged over the CIFAR-10, CIFAR-100, Stanford Cars, and STL-10 datasets, the method achieved better accuracy than the original XCiT-N12 model while using fewer parameters.
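To make the three structures concrete, below is a minimal PyTorch sketch of the separate, partially-shared, and fully-shared non-linear embeddings described above. The module names, the hidden width, which of the two layers is shared in the partially-shared variant, and the additive use of the code parameters are illustrative assumptions; the abstract does not fix these details.

```python
import torch
import torch.nn as nn

def mlp(dim, hidden_dim):
    # Two-layer non-linear embedding with ReLU, as described in the paper.
    return nn.Sequential(nn.Linear(dim, hidden_dim), nn.ReLU(),
                         nn.Linear(hidden_dim, dim))

class SNE(nn.Module):
    """Structure 1: separate non-linear embedding, one two-layer MLP per branch."""
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.q, self.k, self.v = mlp(dim, hidden_dim), mlp(dim, hidden_dim), mlp(dim, hidden_dim)

    def forward(self, x):
        return self.q(x), self.k(x), self.v(x)

class PSNE(nn.Module):
    """Structure 2: partially-shared embedding. Sharing the *first* layer
    is an assumption; the paper only says one non-linear layer is shared."""
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(dim, hidden_dim), nn.ReLU())
        self.q = nn.Linear(hidden_dim, dim)
        self.k = nn.Linear(hidden_dim, dim)
        self.v = nn.Linear(hidden_dim, dim)

    def forward(self, x):
        h = self.shared(x)
        return self.q(h), self.k(h), self.v(h)

class FSNE(nn.Module):
    """Structure 3: fully-shared embedding with trainable codes. One code
    vector per branch tells the shared MLP whether to produce Q, K, or V.
    Additive conditioning is an assumption; the abstract does not specify
    how the codes steer the shared layers."""
    def __init__(self, dim, hidden_dim):
        super().__init__()
        # Small random init breaks the symmetry between the three branches.
        self.codes = nn.Parameter(torch.randn(3, dim) * 0.02)
        self.shared = mlp(dim, hidden_dim)

    def forward(self, x):
        # x: (batch, tokens, dim); each code broadcasts over batch and tokens.
        return tuple(self.shared(x + c) for c in self.codes)

# Usage: q, k, v = FSNE(dim=128, hidden_dim=256)(torch.randn(2, 196, 128))
```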
Redesigning Embedding Layers for Queries, Keys, and Values in Cross-Covariance Image Transformers
Several attempts have been made in vision transformers to reduce the time complexity from quadratic to linear in the number of tokens. Cross-covariance image transformers (XCiT) are one of the techniques utilized to address this issue. However, despite these efforts, increasing the token dimension still results in quadratic growth in time complexity, and the dimension is a key parameter for achieving superior generalization performance. In this paper, a novel method is proposed to improve the generalization performance of XCiT models without increasing the token dimension. We redesigned the embedding layers of queries, keys, and values, proposing separate non-linear embedding (SNE), partially-shared non-linear embedding (P-SNE), and fully-shared non-linear embedding (F-SNE). Finally, the proposed structure with different model size settings achieved 71.4%, 77.8%, and 82.1% on ImageNet-1k, compared with 69.9%, 77.1%, and 82.0% acquired by the original XCiT models, namely XCiT-N12, XCiT-T12, and XCiT-S12, respectively. Additionally, the proposed model achieved 94.8% in transfer learning experiments, on average, over CIFAR-10, CIFAR-100, Stanford Cars, and STL-10, which is superior to the baseline XCiT-S12 model (94.5%). In particular, the proposed models demonstrated considerable improvements on the out-of-distribution detection task compared to the original XCiT models.
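For context, below is a minimal sketch of XCiT-style cross-covariance attention (XCA), illustrating why the cost is linear in the number of tokens N but quadratic in the head dimension d, the bottleneck the redesigned embeddings aim to avoid enlarging. The tensor layout and the fixed scalar temperature are simplifying assumptions; XCiT itself uses a learnable per-head temperature.

```python
import torch
import torch.nn.functional as F

def xca(q, k, v, tau=1.0):
    """Cross-covariance attention sketch: attention is computed between
    channels rather than tokens. q, k, v: (batch, heads, head_dim, tokens)."""
    q = F.normalize(q, dim=-1)                 # l2-normalize along the token axis
    k = F.normalize(k, dim=-1)
    attn = (q @ k.transpose(-2, -1)) / tau     # (batch, heads, d, d): O(N * d^2)
    attn = attn.softmax(dim=-1)
    return attn @ v                            # (batch, heads, d, tokens)

# The d x d attention map grows quadratically with the head dimension d,
# while the token count N enters only linearly.
```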