2 research outputs found

    Rethinking Query, Key, and Value Embedding in Vision Transformer under Tiny Model Constraints

    Full text link
    A vision transformer (ViT) is the dominant model in the computer vision field. Despite numerous studies that mainly focus on dealing with inductive bias and complexity, the problem of finding better transformer networks remains. For example, conventional transformer-based models usually use a projection layer for each query (Q), key (K), and value (V) embedding before multi-head self-attention. Insufficient consideration of semantic Q, K, and V embedding may lead to a performance drop. In this paper, we propose three types of structures for Q, K, and V embedding. The first structure utilizes two layers with ReLU, which is a non-linear embedding for Q, K, and V. The second shares one of the non-linear layers among Q, K, and V to share knowledge. The third shares all non-linear layers and uses trainable code parameters whose values determine whether the embedding produces Q, K, or V. We demonstrate in experiments that the proposed approaches achieve superior image classification performance compared to several state-of-the-art approaches. The proposed method achieved 71.4% with few parameters (3.1M) on the ImageNet-1k dataset, compared to 69.9% for the original XCiT-N12 transformer model. Additionally, the method achieved 93.3% with only 2.9M parameters in transfer learning, on average, on the CIFAR-10, CIFAR-100, Stanford Cars, and STL-10 datasets, which is better than the 92.2% obtained with the original XCiT-N12 model.
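
    The three embedding structures described in the abstract can be sketched as follows in PyTorch. The module names, layer widths, and the element-wise use of the code parameters are illustrative assumptions, not the authors' implementation.

        import torch
        import torch.nn as nn

        class SeparateNonLinearEmbedding(nn.Module):
            """First structure: each of Q, K, V gets its own two-layer MLP with ReLU."""
            def __init__(self, dim):
                super().__init__()
                self.q = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
                self.k = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
                self.v = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

            def forward(self, x):  # x: (batch, tokens, dim)
                return self.q(x), self.k(x), self.v(x)

        class PartiallySharedNonLinearEmbedding(nn.Module):
            """Second structure: one non-linear layer is shared across Q, K, and V."""
            def __init__(self, dim):
                super().__init__()
                self.shared = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())
                self.q = nn.Linear(dim, dim)
                self.k = nn.Linear(dim, dim)
                self.v = nn.Linear(dim, dim)

            def forward(self, x):
                h = self.shared(x)
                return self.q(h), self.k(h), self.v(h)

        class FullySharedNonLinearEmbedding(nn.Module):
            """Third structure: all non-linear layers are shared; trainable code
            parameters select whether the shared embedding acts as Q, K, or V.
            Element-wise scaling by the codes is an assumption for illustration."""
            def __init__(self, dim):
                super().__init__()
                self.shared = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
                self.codes = nn.Parameter(torch.ones(3, dim))  # one code each for Q, K, V

            def forward(self, x):
                h = self.shared(x)
                q, k, v = (h * c for c in self.codes)
                return q, k, v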

    Redesigning Embedding Layers for Queries, Keys, and Values in Cross-Covariance Image Transformers

    No full text
    There have been several attempts in vision transformers to reduce the quadratic time complexity in the number of tokens to linear time complexity. Cross-covariance image transformers (XCiT) are one such technique. However, despite these efforts, increasing the token dimension still results in quadratic growth in time complexity, and the dimension is a key parameter for achieving superior generalization performance. In this paper, a novel method is proposed to improve the generalization performance of XCiT models without increasing the token dimension. We redesigned the embedding layers of queries, keys, and values, proposing separate non-linear embedding (SNE), partially-shared non-linear embedding (P-SNE), and fully-shared non-linear embedding (F-SNE). The proposed structure, at different model size settings, achieved 71.4%, 77.8%, and 82.1% on ImageNet-1k, compared with 69.9%, 77.1%, and 82.0% for the original XCiT-N12, XCiT-T12, and XCiT-S12 models, respectively. Additionally, the proposed model achieved 94.8% in transfer learning experiments, on average, on CIFAR-10, CIFAR-100, Stanford Cars, and STL-10, which is superior to the baseline XCiT-S12 model (94.5%). In particular, the proposed models demonstrated considerable improvements on the out-of-distribution detection task compared to the original XCiT models.
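
    To illustrate why the token dimension, rather than the token count, drives the cost in cross-covariance attention, the following sketch pairs a channel-wise (d x d) attention map with a partially-shared non-linear embedding. The head count, normalization choices, and layer shapes are assumptions for illustration, not the published XCiT architecture.

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class CrossCovarianceAttentionPSNE(nn.Module):
            """Channel ("cross-covariance") attention: the attention map is
            head_dim x head_dim, so the cost is linear in the number of tokens N
            but quadratic in the token dimension, which motivates improving the
            Q/K/V embeddings instead of enlarging the dimension."""
            def __init__(self, dim, num_heads=4):
                super().__init__()
                self.num_heads = num_heads
                # P-SNE-style embedding: one shared non-linear layer, separate projections
                self.shared = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())
                self.q = nn.Linear(dim, dim)
                self.k = nn.Linear(dim, dim)
                self.v = nn.Linear(dim, dim)
                self.temperature = nn.Parameter(torch.ones(num_heads, 1, 1))
                self.proj = nn.Linear(dim, dim)

            def forward(self, x):                      # x: (B, N, dim)
                B, N, D = x.shape
                h = self.shared(x)
                q, k, v = self.q(h), self.k(h), self.v(h)
                # reshape to (B, heads, head_dim, N)
                q, k, v = (t.reshape(B, N, self.num_heads, -1).permute(0, 2, 3, 1)
                           for t in (q, k, v))
                # L2-normalize along the token axis before the channel attention map
                q, k = F.normalize(q, dim=-1), F.normalize(k, dim=-1)
                attn = (q @ k.transpose(-2, -1)) * self.temperature  # (B, heads, head_dim, head_dim)
                attn = attn.softmax(dim=-1)
                out = attn @ v                                       # (B, heads, head_dim, N)
                out = out.permute(0, 3, 1, 2).reshape(B, N, D)
                return self.proj(out)

    In this sketch, swapping the shared layer and projections for the SNE or F-SNE variants above changes only the embedding stage; the d x d attention map, and hence the complexity argument, stays the same.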