HyperTuning: Toward Adapting Large Language Models without Back-propagation
Fine-tuning large language models for different tasks can be costly and
inefficient, and even methods that reduce the number of tuned parameters still
require full gradient-based optimization. We propose HyperTuning, a novel
approach to model adaptation that uses a hypermodel to generate task-specific
parameters for a fixed downstream model. We demonstrate a simple setup for
hypertuning with HyperT5, a T5-based hypermodel that produces soft prefixes or
LoRA parameters for a frozen T5 model from few-shot examples. We train HyperT5
in two stages: first, hyperpretraining with a modified conditional language
modeling objective that trains a hypermodel to generate parameters; second,
multi-task fine-tuning (MTF) on a large number of diverse language tasks. We
evaluate HyperT5 on P3, MetaICL and Super-NaturalInstructions datasets, and
show that it can effectively generate parameters for unseen tasks. Moreover, we
show that using hypermodel-generated parameters as initializations for further
parameter-efficient fine-tuning improves performance. HyperTuning can thus be a
flexible and efficient way to leverage large language models for diverse
downstream applications.
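To make the idea concrete, the following is a minimal PyTorch sketch of the general pattern the abstract describes: a small hypermodel emits LoRA factors that adapt a frozen layer, so no gradient steps are taken on the downstream model itself. This is not the authors' HyperT5; the module names (`HyperModel`, `LoRALinear`), dimensions, and rank are illustrative assumptions.

```python
# Minimal sketch (not the authors' HyperT5): a tiny "hypermodel" maps an
# encoding of few-shot examples to LoRA factors (A, B) that adapt a single
# frozen linear layer. All dimensions and module names are assumptions.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer adapted by externally supplied LoRA factors."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.base = nn.Linear(d_in, d_out)
        for p in self.base.parameters():
            p.requires_grad = False  # the downstream model stays frozen

    def forward(self, x, lora_A, lora_B, scale=1.0):
        # LoRA-style update: W x + scale * B (A x)
        return self.base(x) + scale * (x @ lora_A.T) @ lora_B.T

class HyperModel(nn.Module):
    """Maps a pooled representation of few-shot examples to LoRA parameters."""
    def __init__(self, d_task, d_in, d_out, rank=8):
        super().__init__()
        self.rank, self.d_in, self.d_out = rank, d_in, d_out
        self.to_A = nn.Linear(d_task, rank * d_in)
        self.to_B = nn.Linear(d_task, d_out * rank)

    def forward(self, task_repr):
        A = self.to_A(task_repr).view(self.rank, self.d_in)
        B = self.to_B(task_repr).view(self.d_out, self.rank)
        return A, B

d_task, d_in, d_out = 64, 32, 32
hyper = HyperModel(d_task, d_in, d_out)
layer = LoRALinear(d_in, d_out)

task_repr = torch.randn(d_task)   # stand-in for an encoding of few-shot examples
A, B = hyper(task_repr)           # task-specific parameters, no tuning of the frozen model
y = layer(torch.randn(4, d_in), A, B)
print(y.shape)  # torch.Size([4, 32])
```

In the paper's setup the generated parameters can also serve as an initialization for subsequent parameter-efficient fine-tuning; the sketch above only shows the generation step.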
Struc-Bench: Are Large Language Models Really Good at Generating Complex Structured Data?
Despite the remarkable capabilities of Large Language Models (LLMs) like
GPT-4, producing complex, structured tabular data remains challenging. Our
study assesses LLMs' proficiency in structuring tables and introduces a novel
fine-tuning method, cognizant of data structures, to bolster their performance.
We unveil Struc-Bench, a comprehensive benchmark featuring prominent LLMs
(GPT-NeoX-20B, GPT-3.5, GPT-4, and Vicuna), which spans text tables, HTML, and
LaTeX formats. Our proposed FormatCoT aids in crafting format-specific
instructions from the intended outputs to populate this benchmark. Addressing
the gap in task-centered evaluation, we propose two innovative metrics, P-Score
(Prompting Score) and H-Score (Heuristical Score), to more accurately gauge LLM
performance. Our experiments show that applying our structure-aware fine-tuning
to LLaMA-7B leads to substantial performance gains, outshining its LLM
counterparts across most measures. An in-depth error analysis and an ability
map spanning six dimensions -- coverage, formatting, reasoning, comprehension,
pragmatics, and hallucination -- highlight areas for future improvement and
suggest directions for future research. Our code and models can be found at
https://github.com/gersteinlab/Struc-Bench.
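The exact P-Score and H-Score definitions are not reproduced here, but the toy sketch below gives a flavor of what a heuristic, structure-aware evaluation of generated tables can look like: it compares a generated pipe-delimited table to a reference on structure (row and column counts) and cell-level content overlap. The function names, parsing scheme, and weights are all assumptions made for illustration.

```python
# Illustrative only: not Struc-Bench's actual metrics. Compares a generated
# text table to a reference on structure and cell content; weights are assumed.
def parse_table(text):
    """Parse a pipe-delimited text table into a list of rows of cells."""
    rows = [line.strip().strip("|") for line in text.strip().splitlines()]
    return [[cell.strip() for cell in row.split("|")] for row in rows]

def toy_structure_score(generated, reference):
    gen, ref = parse_table(generated), parse_table(reference)
    # Structural agreement: matching row count and per-row column counts.
    row_match = min(len(gen), len(ref)) / max(len(gen), len(ref))
    col_match = sum(1 for g, r in zip(gen, ref) if len(g) == len(r)) / max(len(ref), 1)
    structure = (row_match + col_match) / 2
    # Content agreement: exact cell matches over aligned positions.
    cells_ref = sum(len(r) for r in ref)
    cells_hit = sum(1 for g, r in zip(gen, ref)
                    for gc, rc in zip(g, r) if gc == rc)
    content = cells_hit / max(cells_ref, 1)
    return 0.5 * structure + 0.5 * content

ref = "| name | score |\n| Alice | 3 |\n| Bob | 5 |"
gen = "| name | score |\n| Alice | 3 |\n| Bob | 4 |"
print(round(toy_structure_score(gen, ref), 3))  # 0.917: structure correct, one cell wrong
```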
Two Failures of Self-Consistency in the Multi-Step Reasoning of LLMs
Large language models (LLMs) have achieved widespread success on a variety of
in-context few-shot tasks, but this success is typically evaluated via
correctness rather than consistency. We argue that self-consistency is an
important criterion for valid multi-step reasoning in tasks where the solution
is composed of the answers to multiple sub-steps. We propose two types of
self-consistency that are particularly important for multi-step reasoning --
hypothetical consistency (a model's ability to predict what its output would be
in a hypothetical other context) and compositional consistency (consistency of
a model's final outputs when intermediate sub-steps are replaced with the
model's outputs for those steps). We demonstrate that multiple variants of the
GPT-3/-4 models exhibit poor consistency rates across both types of consistency
on a variety of tasks.
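A compositional-consistency check of the kind the abstract describes can be scripted against any completion endpoint. The sketch below assumes a generic `generate(prompt) -> str` callable (hypothetical; substitute your own API client) and uses a toy stand-in model so it runs offline; the function names and prompts are illustrative, not the paper's evaluation harness.

```python
# Sketch of a compositional-consistency check: compare the model's end-to-end
# answer with the answer obtained after substituting the model's own output
# for an intermediate sub-step. `generate` is a hypothetical completion function.
def compositional_consistency(generate, question, substep_prompt, final_template):
    direct_answer = generate(question)                 # 1. end-to-end answer
    substep_answer = generate(substep_prompt)          # 2. intermediate sub-step alone
    composed_answer = generate(                        # 3. final step with own sub-step answer
        final_template.format(substep=substep_answer))
    return direct_answer.strip() == composed_answer.strip()

# Toy stand-in "model" so the sketch runs without any API.
def fake_generate(prompt):
    if "boxes" in prompt and "final" not in prompt:
        return "6 boxes"
    return "42"

consistent = compositional_consistency(
    fake_generate,
    question="How many apples in total? Think step by step.",
    substep_prompt="How many boxes are there?",
    final_template="Given that there are {substep}, answer the final question: total apples?",
)
print(consistent)  # True when the composed and direct answers agree
```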
