The emergence of Large Language Models (LLMs), such as ChatGPT, has
revolutionized general natural language preprocessing (NLP) tasks. However,
their expertise in the financial domain lacks a comprehensive evaluation. To
assess the ability of LLMs to solve financial NLP tasks, we present FinLMEval,
a framework for Financial Language Model Evaluation, comprising nine datasets
designed to evaluate the performance of language models. This study compares
the performance of encoder-only language models and the decoder-only language
models. Our findings reveal that while some decoder-only LLMs demonstrate
notable performance across most financial tasks via zero-shot prompting, they
generally lag behind the fine-tuned expert models, especially when dealing with
proprietary datasets. We hope this study provides foundation evaluations for
continuing efforts to build more advanced LLMs in the financial domain.Comment: Findings of EMNLP 2023 (short paper