Transformer-based pre-trained language models, such as BERT, achieve great
success in various natural language understanding tasks. Prior research found
that BERT captures a rich hierarchy of linguistic information at different
layers. However, vanilla BERT applies the same self-attention mechanism at every
layer to model these different contextual features. In this paper, we propose
HybridBERT, a model that combines self-attention and pooling networks to encode
different contextual features in each layer. Additionally, we propose
a simple DropMask method to address the mismatch between pre-training and
fine-tuning caused by excessive use of special mask tokens during Masked
Language Modeling pre-training. Experiments show that HybridBERT outperforms
BERT in pre-training, achieving lower loss, 8% (relative) faster training, and
13% (relative) lower memory cost, and in transfer learning, where it reaches
1.5% (relative) higher accuracy on downstream tasks. Additionally, DropMask improves the
accuracy of BERT on downstream tasks across various masking rates.
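
As a rough illustration of the hybrid-layer idea, the PyTorch-style sketch below shows an encoder layer whose token-mixing sublayer is either multi-head self-attention or a lightweight pooling network. The class name, the choice of average pooling, and the residual/normalization wiring are assumptions for illustration, not the paper's exact architecture.

import torch
import torch.nn as nn

class HybridEncoderLayer(nn.Module):
    # Illustrative only: token mixing is either self-attention or a cheap
    # pooling network; the paper's actual layer design may differ.
    def __init__(self, d_model=768, nhead=12, mixer="attention", pool_size=3):
        super().__init__()
        self.mixer_type = mixer
        if mixer == "attention":
            self.mixer = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        else:
            # Local average pooling over an odd-sized window keeps seq_len fixed.
            self.mixer = nn.AvgPool1d(pool_size, stride=1, padding=pool_size // 2)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, x):  # x: (batch, seq_len, d_model)
        if self.mixer_type == "attention":
            mixed, _ = self.mixer(x, x, x)
        else:
            # AvgPool1d expects (batch, channels, seq_len).
            mixed = self.mixer(x.transpose(1, 2)).transpose(1, 2)
        x = self.norm1(x + mixed)
        x = self.norm2(x + self.ffn(x))
        return x

A pooling sublayer has no quadratic attention matrix, which is consistent with the reported savings in training time and memory relative to attention-only layers.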
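Similarly, a minimal sketch of a DropMask-style step is given below, assuming it suppresses the embeddings at [MASK] positions before the encoder so that fewer artificial mask tokens are seen during pre-training. The abstract does not specify the mechanism, so the function name, mask_token_id, and drop_prob are hypothetical, and the paper's actual procedure may differ.

import torch

def drop_mask_embeddings(input_embeds, input_ids, mask_token_id, drop_prob=1.0):
    # Illustration only: zero the embeddings of [MASK] positions with
    # probability drop_prob, reducing the pre-training/fine-tuning mismatch
    # caused by mask tokens that never appear at fine-tuning time.
    is_mask = (input_ids == mask_token_id).unsqueeze(-1)  # (B, T, 1)
    drop = (torch.rand(input_ids.shape, device=input_ids.device) < drop_prob).unsqueeze(-1)
    return input_embeds.masked_fill(is_mask & drop, 0.0)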