Large language models (LLMs) have achieved remarkable performance on various
NLP tasks and can be augmented with tools for broader applications. Yet, how to
evaluate and analyze the tool-utilization capability of LLMs is still
under-explored. In contrast to previous works that evaluate models
holistically, we comprehensively decompose tool utilization into multiple
sub-processes: instruction following, planning, reasoning, retrieval,
understanding, and review. Based on this decomposition, we introduce T-Eval to
evaluate tool-utilization capability step by step. T-Eval disentangles the
evaluation of tool utilization into several sub-domains aligned with model
capabilities, facilitating a finer understanding of both the holistic and the
isolated competencies of LLMs. We conduct extensive experiments on T-Eval and
in-depth analyses of
various LLMs. T-Eval not only exhibits consistency with outcome-oriented
evaluation but also provides a more fine-grained analysis of model
capabilities, offering a new perspective on evaluating the tool-utilization
ability of LLMs. The benchmark will be available at
https://github.com/open-compass/T-Eval.