While GPTs with traditional fine-tuning fail to achieve strong results on
natural language understanding (NLU), we show that GPTs can be better than or
comparable to similar-sized BERTs on NLU tasks with P-tuning, a novel method
that employs trainable continuous prompt embeddings. On the knowledge probing
which employs trainable continuous prompt embeddings. On the knowledge probing
(LAMA) benchmark, the best GPT recovers 64\% (P@1) of world knowledge without
any additional text provided during test time, which substantially improves the
previous best by more than 20 percentage points. On the SuperGLUE benchmark,
GPTs achieve performance comparable to, and sometimes better than, that of
similar-sized BERTs in supervised learning. Importantly, we find that P-tuning also improves BERTs'
performance in both few-shot and supervised settings while largely reducing the
need for prompt engineering. Consequently, P-tuning outperforms the
state-of-the-art approaches on the few-shot SuperGLUE benchmark.
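
To make the idea of trainable continuous prompt embeddings concrete, the sketch below prepends a small set of learned prompt vectors to a transformer's input embeddings. It is only an illustration under assumptions, not the paper's exact design: the backbone, class name, and hyperparameters (e.g. \texttt{num\_prompt\_tokens}) are hypothetical, and it assumes a HuggingFace-style model exposing \texttt{get\_input\_embeddings()} and accepting \texttt{inputs\_embeds}.

\begin{verbatim}
# Minimal sketch of trainable continuous prompt embeddings (illustration only;
# the paper's full P-tuning method may differ, e.g. in how and where prompt
# tokens are encoded and inserted into the template).
import torch
import torch.nn as nn

class ContinuousPromptModel(nn.Module):
    def __init__(self, backbone, num_prompt_tokens=20):
        super().__init__()
        self.backbone = backbone  # HuggingFace-style transformer (assumption)
        hidden = backbone.config.hidden_size
        # Trainable prompt embeddings, randomly initialized (hypothetical choice).
        self.prompt = nn.Parameter(torch.randn(num_prompt_tokens, hidden) * 0.02)

    def forward(self, input_ids, attention_mask):
        # Ordinary token embeddings from the backbone's embedding table.
        tok_emb = self.backbone.get_input_embeddings()(input_ids)
        batch = tok_emb.size(0)
        prompt = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        # Prepend the continuous prompt and extend the attention mask accordingly.
        inputs_embeds = torch.cat([prompt, tok_emb], dim=1)
        prompt_mask = torch.ones(batch, self.prompt.size(0),
                                 dtype=attention_mask.dtype,
                                 device=attention_mask.device)
        attention_mask = torch.cat([prompt_mask, attention_mask], dim=1)
        return self.backbone(inputs_embeds=inputs_embeds,
                             attention_mask=attention_mask)
\end{verbatim}

During training, one would optimize \texttt{self.prompt} with a standard task loss, optionally updating the backbone parameters as well.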