Despite the remarkable capabilities of Large Language Models (LLMs) like
GPT-4, producing complex, structured tabular data remains challenging. Our
study assesses LLMs' proficiency in generating structured tables and introduces
a novel structure-aware fine-tuning method to strengthen their performance.
We present Struc-Bench, a comprehensive benchmark spanning text tables, HTML,
and LaTeX formats, on which we evaluate prominent LLMs (GPT-NeoX-20B, GPT-3.5,
GPT-4, and Vicuna). Our proposed FormatCoT helps generate format-specific
instructions from the intended outputs to populate this benchmark. Addressing
the gap in task-centered evaluation, we propose two innovative metrics, P-Score
(Prompting Score) and H-Score (Heuristical Score), to more accurately gauge LLM
performance. Our experiments show that applying our structure-aware fine-tuning
to LLaMA-7B yields substantial performance gains, outperforming its LLM
counterparts on most measures. Our in-depth error analysis and an ability map
across six dimensions (coverage, formatting, reasoning, comprehension,
pragmatics, and hallucination) highlight areas for improvement and suggest
directions for future research. Our code and models
can be found at https://github.com/gersteinlab/Struc-Bench.