Transformer-based pre-trained language models, such as BERT, achieve great
success in various natural language understanding tasks. Prior research found
that BERT captures a rich hierarchy of linguistic information at different
layers. However, vanilla BERT applies the same self-attention mechanism at every
layer to model these different contextual features. In this paper, we propose
HybridBERT, a model that combines self-attention and pooling networks to encode
different contextual features in each layer. Additionally, we propose
a simple DropMask method to address the mismatch between pre-training and
fine-tuning caused by excessive use of special mask tokens during Masked
Language Modeling pre-training. Experiments show that HybridBERT outperforms
BERT in pre-training, achieving lower loss, 8% (relative) faster training, and
13% (relative) lower memory cost, and in transfer learning, where it reaches
1.5% (relative) higher accuracy on downstream tasks. Additionally, DropMask improves the
accuracy of BERT on downstream tasks across various masking rates.
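
As a rough illustration of the hybrid-layer idea, the PyTorch-style sketch below shows an encoder layer whose token-mixing sublayer is either multi-head self-attention or a lightweight pooling network. The class name, the choice of average pooling, and the residual/normalization wiring are assumptions for illustration, not the paper's exact architecture.

import torch
import torch.nn as nn

class HybridEncoderLayer(nn.Module):
    # Illustrative only: token mixing is either self-attention or a cheap
    # pooling network; the paper's actual layer design may differ.
    def __init__(self, d_model=768, nhead=12, mixer="attention", pool_size=3):
        super().__init__()
        self.mixer_type = mixer
        if mixer == "attention":
            self.mixer = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        else:
            # Local average pooling over an odd-sized window keeps seq_len fixed.
            self.mixer = nn.AvgPool1d(pool_size, stride=1, padding=pool_size // 2)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, x):  # x: (batch, seq_len, d_model)
        if self.mixer_type == "attention":
            mixed, _ = self.mixer(x, x, x)
        else:
            # AvgPool1d expects (batch, channels, seq_len).
            mixed = self.mixer(x.transpose(1, 2)).transpose(1, 2)
        x = self.norm1(x + mixed)
        x = self.norm2(x + self.ffn(x))
        return x

A pooling sublayer has no quadratic attention matrix, which is consistent with the reported savings in training time and memory relative to attention-only layers.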
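Similarly, a minimal sketch of a DropMask-style step is given below, assuming it suppresses the embeddings at [MASK] positions before the encoder so that fewer artificial mask tokens are seen during pre-training. The abstract does not specify the mechanism, so the function name, mask_token_id, and drop_prob are hypothetical, and the paper's actual procedure may differ.

import torch

def drop_mask_embeddings(input_embeds, input_ids, mask_token_id, drop_prob=1.0):
    # Illustration only: zero the embeddings of [MASK] positions with
    # probability drop_prob, reducing the pre-training/fine-tuning mismatch
    # caused by mask tokens that never appear at fine-tuning time.
    is_mask = (input_ids == mask_token_id).unsqueeze(-1)  # (B, T, 1)
    drop = (torch.rand(input_ids.shape, device=input_ids.device) < drop_prob).unsqueeze(-1)
    return input_embeds.masked_fill(is_mask & drop, 0.0)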