The derivation of mathematical results in specialised fields using Large
Language Models (LLMs) is an emerging research direction that can help identify
models' limitations and potentially support mathematical discovery. In this
paper, we leverage a symbolic engine to generate derivations of equations at
scale, and investigate the capabilities of LLMs when deriving goal equations
from premises. Specifically, we employ in-context learning for GPT and
fine-tune a range of T5 models to compare the robustness and generalisation of
pre-training strategies with those of specialised models. Empirical results show that
fine-tuned FLAN-T5-large (MathT5) outperforms GPT models on all static and
out-of-distribution test sets in terms of absolute performance. However, an
in-depth analysis reveals that the fine-tuned models are more sensitive to
perturbations involving unseen symbols and (to a lesser extent) changes to
equation structure. In addition, we analyse 1.7K equations and over 200
derivations to highlight common reasoning errors such as the inclusion of
incorrect, irrelevant, and redundant equations, along with the tendency to skip
derivation steps. Finally, we explore the suitability of existing metrics for
evaluating mathematical derivations, finding evidence that, while they capture
general properties such as sensitivity to perturbations, they fail to highlight
fine-grained reasoning errors and essential differences between models.
Overall, this work demonstrates that training models on synthetic data can
improve their mathematical capabilities beyond larger architectures.
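
To make the notion of a perturbation involving unseen symbols concrete, the following is a minimal sketch, not the procedure used in this work. It assumes derivations are stored as lists of plain equation strings; the function name `rename_symbols`, the example derivation, and the symbol mapping are all hypothetical.

```python
import re

def rename_symbols(derivation, mapping):
    """Return a copy of the derivation with whole-token symbols renamed.

    Hypothetical helper: `derivation` is a list of equation strings and
    `mapping` maps each original symbol to its replacement.
    """
    if not mapping:
        return list(derivation)
    # One combined pattern so that all renames happen simultaneously;
    # this keeps a swap such as {"m": "a", "a": "m"} from chaining.
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, mapping)) + r")\b")
    return [pattern.sub(lambda m: mapping[m.group(1)], eq) for eq in derivation]

# Illustrative two-step derivation (an assumption, not taken from the dataset).
derivation = ["F = m*a", "a = F/m"]

# Replace symbols with ones a model may not have seen in this context.
perturbed = rename_symbols(derivation, {"m": "k", "a": "z"})
print(perturbed)
```

A model that is robust to this kind of perturbation should produce the same derivation up to the renaming, so comparing its outputs on `derivation` and `perturbed` gives a simple sensitivity check.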