Spurred by advancements in scale, large language models (LLMs) have
demonstrated the ability to perform a variety of natural language processing
(NLP) tasks zero-shot -- i.e., without adaptation on downstream data. Recently,
the debut of ChatGPT has drawn a great deal of attention from the natural
language processing (NLP) community due to the fact that it can generate
high-quality responses to human input and self-correct previous mistakes based
on subsequent conversations. However, it is not yet known whether ChatGPT can
serve as a generalist model that can perform many NLP tasks zero-shot. In this
work, we empirically analyze the zero-shot learning ability of ChatGPT by
evaluating it on 20 popular NLP datasets covering 7 representative task
categories. With extensive empirical studies, we demonstrate both the
effectiveness and limitations of the current version of ChatGPT. We find that
ChatGPT performs well on many tasks favoring reasoning capabilities (e.g.,
arithmetic reasoning) while it still faces challenges when solving specific
tasks such as sequence tagging. We additionally provide in-depth analysis
through qualitative case studies