Large Language Models (LLMs) achieve impressive performance on a wide range
of tasks, even though they are often trained with the sole objective of
conversing fluently with users. Among other skills, LLMs show emergent abilities on
mathematical reasoning benchmarks, which can be elicited with appropriate
prompting methods. In this work, we systematically investigate the capabilities
and limitations of popular open-source LLMs on different symbolic reasoning
tasks. We evaluate three models of the Llama 2 family on two datasets that
require solving mathematical formulas of varying degrees of difficulty. We test
a generalist LLM (Llama 2 Chat) as well as two fine-tuned versions of Llama 2
(MAmmoTH and MetaMath) specifically designed to tackle mathematical problems.
We observe that both increasing the scale of the model and fine-tuning it on
relevant tasks lead to significant performance gains. Furthermore, using
fine-grained evaluation measures, we find that such performance gains are
mostly observed with mathematical formulas of low complexity, which
nevertheless often remain challenging even for the largest fine-tuned models.

Comment: Accepted at the 33rd International Conference on Artificial Neural Networks (ICANN 2024)