2 research outputs found

    Rethinking Query, Key, and Value Embedding in Vision Transformer under Tiny Model Constraints

    Full text link
    A vision transformer (ViT) is the dominant model in the computer vision field. Despite numerous studies that mainly focus on dealing with inductive bias and complexity, the problem of finding better transformer networks remains. For example, conventional transformer-based models usually use a projection layer for each query (Q), key (K), and value (V) embedding before multi-head self-attention. Insufficient consideration of semantic Q, K, and V embedding may lead to a performance drop. In this paper, we propose three types of structures for Q, K, and V embedding. The first structure utilizes two layers with ReLU, which is a non-linear embedding for Q, K, and V. The second shares one of the non-linear layers among Q, K, and V to share knowledge. The third shares all non-linear layers and uses trainable code parameters whose values determine whether the embedding produces Q, K, or V. We demonstrate in experiments that the proposed approaches achieve superior image classification performance compared to several state-of-the-art approaches. The proposed method achieved 71.4% with few parameters (3.1M) on the ImageNet-1k dataset, compared to 69.9% for the original XCiT-N12 transformer model. Additionally, the method achieved 93.3% with only 2.9M parameters in transfer learning, on average, on the CIFAR-10, CIFAR-100, Stanford Cars, and STL-10 datasets, which is better than the 92.2% obtained with the original XCiT-N12 model.
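
    The three embedding structures described in the abstract can be sketched as follows in PyTorch. The module names, layer widths, and the element-wise use of the code parameters are illustrative assumptions, not the authors' implementation.

        import torch
        import torch.nn as nn

        class SeparateNonLinearEmbedding(nn.Module):
            """First structure: each of Q, K, V gets its own two-layer MLP with ReLU."""
            def __init__(self, dim):
                super().__init__()
                self.q = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
                self.k = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
                self.v = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

            def forward(self, x):  # x: (batch, tokens, dim)
                return self.q(x), self.k(x), self.v(x)

        class PartiallySharedNonLinearEmbedding(nn.Module):
            """Second structure: one non-linear layer is shared across Q, K, and V."""
            def __init__(self, dim):
                super().__init__()
                self.shared = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())
                self.q = nn.Linear(dim, dim)
                self.k = nn.Linear(dim, dim)
                self.v = nn.Linear(dim, dim)

            def forward(self, x):
                h = self.shared(x)
                return self.q(h), self.k(h), self.v(h)

        class FullySharedNonLinearEmbedding(nn.Module):
            """Third structure: all non-linear layers are shared; trainable code
            parameters select whether the shared embedding acts as Q, K, or V.
            Element-wise scaling by the codes is an assumption for illustration."""
            def __init__(self, dim):
                super().__init__()
                self.shared = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
                self.codes = nn.Parameter(torch.ones(3, dim))  # one code each for Q, K, V

            def forward(self, x):
                h = self.shared(x)
                q, k, v = (h * c for c in self.codes)
                return q, k, v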

    Redesigning Embedding Layers for Queries, Keys, and Values in Cross-Covariance Image Transformers

    No full text
    There have been several attempts in vision transformers to reduce the quadratic time complexity in the number of tokens to linear time complexity. Cross-covariance image transformers (XCiT) are one such technique. However, despite these efforts, increasing the token dimension still results in quadratic growth in time complexity, and the dimension is a key parameter for achieving superior generalization performance. In this paper, a novel method is proposed to improve the generalization performance of XCiT models without increasing the token dimension. We redesigned the embedding layers of queries, keys, and values, proposing separate non-linear embedding (SNE), partially-shared non-linear embedding (P-SNE), and fully-shared non-linear embedding (F-SNE). The proposed structure, at different model size settings, achieved 71.4%, 77.8%, and 82.1% on ImageNet-1k, compared with 69.9%, 77.1%, and 82.0% for the original XCiT-N12, XCiT-T12, and XCiT-S12 models, respectively. Additionally, the proposed model achieved 94.8% in transfer learning experiments, on average, on CIFAR-10, CIFAR-100, Stanford Cars, and STL-10, which is superior to the baseline XCiT-S12 model (94.5%). In particular, the proposed models demonstrated considerable improvements on the out-of-distribution detection task compared to the original XCiT models.
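
    To illustrate why the token dimension, rather than the token count, drives the cost in cross-covariance attention, the following sketch pairs a channel-wise (d x d) attention map with a partially-shared non-linear embedding. The head count, normalization choices, and layer shapes are assumptions for illustration, not the published XCiT architecture.

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class CrossCovarianceAttentionPSNE(nn.Module):
            """Channel ("cross-covariance") attention: the attention map is
            head_dim x head_dim, so the cost is linear in the number of tokens N
            but quadratic in the token dimension, which motivates improving the
            Q/K/V embeddings instead of enlarging the dimension."""
            def __init__(self, dim, num_heads=4):
                super().__init__()
                self.num_heads = num_heads
                # P-SNE-style embedding: one shared non-linear layer, separate projections
                self.shared = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())
                self.q = nn.Linear(dim, dim)
                self.k = nn.Linear(dim, dim)
                self.v = nn.Linear(dim, dim)
                self.temperature = nn.Parameter(torch.ones(num_heads, 1, 1))
                self.proj = nn.Linear(dim, dim)

            def forward(self, x):                      # x: (B, N, dim)
                B, N, D = x.shape
                h = self.shared(x)
                q, k, v = self.q(h), self.k(h), self.v(h)
                # reshape to (B, heads, head_dim, N)
                q, k, v = (t.reshape(B, N, self.num_heads, -1).permute(0, 2, 3, 1)
                           for t in (q, k, v))
                # L2-normalize along the token axis before the channel attention map
                q, k = F.normalize(q, dim=-1), F.normalize(k, dim=-1)
                attn = (q @ k.transpose(-2, -1)) * self.temperature  # (B, heads, head_dim, head_dim)
                attn = attn.softmax(dim=-1)
                out = attn @ v                                       # (B, heads, head_dim, N)
                out = out.permute(0, 3, 1, 2).reshape(B, N, D)
                return self.proj(out)

    In this sketch, swapping the shared layer and projections for the SNE or F-SNE variants above changes only the embedding stage; the d x d attention map, and hence the complexity argument, stays the same.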