A Call for Standardization and Validation of Text Style Transfer Evaluation
Text Style Transfer (TST) evaluation is, in practice, inconsistent.
Therefore, we conduct a meta-analysis on human and automated TST evaluation and
experimentation that thoroughly examines existing literature in the field. The
meta-analysis reveals a substantial standardization gap in human and automated
evaluation. In addition, we find a validation gap: only a few automated
metrics have been validated using human experiments. To this end, we thoroughly
scrutinize both the standardization and the validation gap and reveal the resulting
pitfalls. This work also paves the way to closing the standardization and
validation gaps in TST evaluation by setting out requirements to be met by
future research.
Comment: Accepted to Findings of ACL 2023
Deep Learning for Text Style Transfer: A Survey
Text style transfer is an important task in natural language generation,
which aims to control certain attributes in the generated text, such as
politeness, emotion, humor, and many others. It has a long history in the field
of natural language processing, and has recently regained significant
attention thanks to the promising performance brought by deep neural models. In
this paper, we present a systematic survey of the research on neural text style
transfer, spanning over 100 representative articles since the first neural text
style transfer work in 2017. We discuss the task formulation, existing datasets
and subtasks, evaluation, as well as the rich methodologies in the presence of
parallel and non-parallel data. We also provide discussions on a variety of
important topics regarding the future development of this task. Our curated
paper list is at https://github.com/zhijing-jin/Text_Style_Transfer_Survey
Comment: Computational Linguistics Journal 2022
Gamma Sampling: Fine-grained Controlling Language Models without Training
The dominant approaches to controlling language models are effective at
controlling high-level attributes (e.g., topic and sentiment). However, these
methods often require condition-specific data or are computationally expensive.
We propose a new simple guided decoding method, Gamma Sampling, which does not
require any training data to achieve fine-grained controllable text generation
while maintaining a fast generation speed. Gamma Sampling introduces
attribute-related information (provided by humans or language models
themselves) into the sampling process to guide language models to generate
texts with desired attributes. Since no training is involved, Gamma Sampling
can be easily applied to any language model for controllable text generation.
Through experiments, we show that Gamma Sampling-steered GPT2-small (117M)
outperforms baselines such as PPLM (345M) and CTRL (1.6B) in diversity,
attribute relevance, and overall quality of generated samples.
Comment: 20 pages, 5 figures
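The abstract does not give the exact update rule, but the core idea, nudging the next-token distribution toward attribute-related tokens at decoding time without any training, can be sketched as follows. The redistribution scheme, the token-id list, and the default gamma value are illustrative assumptions, not the paper's formulation:

```python
import torch

def gamma_sample_step(logits: torch.Tensor,
                      attribute_token_ids: list[int],
                      gamma: float = 0.3) -> int:
    """Sample one token, shifting probability mass toward attribute
    tokens. `attribute_token_ids` is a hypothetical, externally
    supplied list; the paper's exact rule may differ from this sketch."""
    probs = torch.softmax(logits, dim=-1)
    attr = torch.zeros_like(probs, dtype=torch.bool)
    attr[attribute_token_ids] = True

    # Move a gamma-sized share of the non-attribute mass onto the
    # attribute tokens, proportionally to their current probabilities.
    moved = gamma * probs[~attr].sum()
    probs[attr] += moved * probs[attr] / probs[attr].sum().clamp_min(1e-12)
    probs[~attr] *= 1.0 - gamma
    probs /= probs.sum()  # guard against numerical drift

    return torch.multinomial(probs, num_samples=1).item()
```

In practice the attribute token ids would come from a human-written word list or from the language model itself, as the abstract describes; because the intervention happens purely at sampling time, the same function applies to any autoregressive model.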
Evaluating the Evaluators: Subjective Bias and Consistency in Human Evaluation of Natural Language Generation
The Natural Language Generation (NLG) community relies on shared evaluation techniques to understand progress in the field. Based on an analysis of papers published over 10 years (from 2008 to 2018) in NLG-specific conferences and on an observational study, this thesis identifies shortcomings with existing approaches to reporting the reliability of evaluation studies in NLG. It proposes a new set of methods for identifying judges' bias and reporting reliability, specifically for human intrinsic evaluation of NLG systems.
In this thesis, we propose to use the correlation statistic and Item Response Theory (IRT) to analyse judges' bias for cases that involve a high level of language variability. Both techniques provide insights into the trustworthiness of human judgements. Whereas the correlation statistic offers an approach to measure judges' relative consistency, IRT provides a tool to identify judges' bias.
We found support for the use of the correlation statistic through three case studies that show the limits of considering agreement coefficients as the only criterion for checking evaluation reliability. Given the variability of human language --- specifically variability in language interpretation and quality judgement --- expecting judges to always arrive at exactly the same judgement seems both unrealistic and over-constrained. The correlation coefficients can
be used to measure the extent to which judges follow a systematic pattern in their assessments, even when their individual interpretations of the phenomena are not identical.
Regarding IRT, we introduce a new interpretation and application of the technique to describe judges' bias. Using the QG-STEC evaluation dataset and applying IRT to each judge, we show how to use IRT's probabilistic analysis to compare judges' bias and, as a result, better characterize annotation disagreement. The new approach that we propose can be used, for example, to spot judges who are outliers, improve annotation guidelines, and arrive at an improved interpretation of the agreement coefficients.
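As a concrete illustration of the correlation-based view, a sketch under assumed inputs rather than the thesis's own code, pairwise rank correlation distinguishes judges who disagree on the rating scale from judges who disagree on the ranking:

```python
from itertools import combinations
from scipy.stats import spearmanr

def pairwise_consistency(ratings: dict[str, list[float]]) -> dict:
    """Spearman rank correlation for every pair of judges who rated
    the same items, as a measure of relative consistency.
    Illustrative sketch; not the thesis's implementation."""
    return {
        (a, b): spearmanr(ratings[a], ratings[b]).correlation
        for a, b in combinations(sorted(ratings), 2)
    }

# Hypothetical judges who rank the outputs identically but use the
# rating scale differently: raw agreement is imperfect, yet their
# relative consistency is perfect.
ratings = {"judge_A": [1, 2, 3, 4, 5], "judge_B": [2, 3, 3.5, 4, 5]}
print(pairwise_consistency(ratings))  # {('judge_A', 'judge_B'): 1.0}
```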
Towards Verifiable Text Generation with Symbolic References
Large language models (LLMs) have demonstrated an impressive ability to
synthesize plausible and fluent text. However, they remain vulnerable to
hallucinations, and thus their outputs generally require manual human
verification for high-stakes applications, which can be time-consuming and
difficult. This paper proposes symbolically grounded generation (SymGen) as a
simple approach for enabling easier validation of an LLM's output. SymGen
prompts an LLM to interleave its regular output text with explicit symbolic
references to fields present in some conditioning data (e.g., a table in JSON
format). The references can be used to display the provenance of different
spans of text in the generation, reducing the effort required for manual
verification. Across data-to-text and question answering experiments, we find
that LLMs are able to directly output text that makes use of symbolic
references while maintaining fluency and accuracy.
Comment: 46 pages, 4 figures, 6 tables
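A minimal sketch of the reference-resolution step, assuming a {{field}} placeholder syntax (the paper's actual notation may differ) and a flat JSON record:

```python
import re

def resolve_symgen(text: str, record: dict) -> str:
    """Replace symbolic references of the assumed form {{field}} with
    values from the conditioning record, so every resolved span can
    be traced back to a source field for verification."""
    def lookup(match: re.Match) -> str:
        return str(record[match.group(1)])
    return re.sub(r"\{\{(\w+)\}\}", lookup, text)

record = {"team": "Dragons", "wins": 11}
draft = "The {{team}} finished the season with {{wins}} wins."
print(resolve_symgen(draft, record))
# -> The Dragons finished the season with 11 wins.
```

Because each value in the final text is produced by a lookup rather than free generation, a reviewer only needs to check that the surrounding prose uses the referenced fields appropriately.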
PaLM: Scaling Language Modeling with Pathways
Large language models have been shown to achieve remarkable performance
across a variety of natural language tasks using few-shot learning, which
drastically reduces the number of task-specific training examples needed to
adapt the model to a particular application. To further our understanding of
the impact of scale on few-shot learning, we trained a 540-billion parameter,
densely activated, Transformer language model, which we call the Pathways
Language Model (PaLM). We trained PaLM on 6144 TPU v4 chips using Pathways, a new ML
system which enables highly efficient training across multiple TPU Pods. We
demonstrate continued benefits of scaling by achieving state-of-the-art
few-shot learning results on hundreds of language understanding and generation
benchmarks. On a number of these tasks, PaLM 540B achieves breakthrough
performance, outperforming the finetuned state-of-the-art on a suite of
multi-step reasoning tasks, and outperforming average human performance on the
recently released BIG-bench benchmark. A significant number of BIG-bench tasks
showed discontinuous improvements from model scale, meaning that performance
steeply increased as we scaled to our largest model. PaLM also has strong
capabilities in multilingual tasks and source code generation, which we
demonstrate on a wide array of benchmarks. We additionally provide a
comprehensive analysis on bias and toxicity, and study the extent of training
data memorization with respect to model scale. Finally, we discuss the ethical
considerations related to large language models and potential
mitigation strategies.
Semantic consistency in text generation
Automatic input-grounded text generation tasks process input texts and generate human-understandable natural language text that conveys the processed information. The development of neural sequence-to-sequence (seq2seq) models, usually trained end-to-end, has rapidly pushed the frontier of performance on text generation tasks. However, these models are often semantically inconsistent with their input texts. The models are not solely to blame: the corpora themselves frequently include examples whose output is semantically inconsistent with its input.
Any model that is agnostic to such data divergence issues will be prone to semantic inconsistency. Meanwhile, the most widely used overlap-based evaluation metrics, which compare the generated texts to their corresponding references, do not explicitly evaluate input-output semantic consistency, which makes this problem hard to detect.
In this thesis, we study semantic consistency in three automatic text generation scenarios: Data-to-text Generation, Single Document Abstractive Summarization, and Chit-chat Dialogue Generation. We seek answers to three research questions: (1) how should input-output semantic consistency be defined in different text generation tasks? (2) how can it be evaluated quantitatively? (3) how can better semantic consistency be achieved in each task?
We systematically define the semantic inconsistency phenomena in these three tasks as omission, intrinsic hallucination, and extrinsic hallucination. For Data-to-text Generation, we jointly learn a sentence planner, which tightly controls which parts of the input get generated and in what order, together with a neural seq2seq text generator, to decrease all three types of semantic inconsistency in model-generated texts. The evaluation results confirm that the texts generated by our model contain far fewer omissions while maintaining a low level of extrinsic hallucination, without sacrificing fluency compared to seq2seq models. For Single Document Abstractive Summarization, we reduce the level of extrinsic hallucination in the training data by automatically introducing assisting articles for each document-summary instance, supplying the world knowledge that is present in the summary but missing from the document. With the help of a novel metric, we show that seq2seq models trained with assisting articles exhibit less extrinsic hallucination than those trained without them. For Chit-chat Dialogue Generation, we filter omitted and hallucinated examples out of the training set using a newly introduced evaluation metric, and encode that metric into the neural seq2seq response generation models as a control factor, diminishing the level of omission and extrinsic hallucination in the generated dialogue responses.
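To make the categories concrete, here is a deliberately crude bag-of-words check in their spirit; `content_words` is a hypothetical pre-extracted set (e.g., entities and numbers), and real metrics, including those introduced in the thesis, use far richer signals:

```python
def consistency_flags(input_text: str, output_text: str,
                      content_words: set[str]) -> dict[str, set[str]]:
    """Flag omissions (input content missing from the output) and
    extrinsic hallucinations (output content unsupported by the
    input) via simple lexical overlap. Illustrative sketch only."""
    inp = {w for w in input_text.lower().split() if w in content_words}
    out = {w for w in output_text.lower().split() if w in content_words}
    return {"omissions": inp - out, "extrinsic_hallucinations": out - inp}

flags = consistency_flags(
    "barcelona beat madrid 3 1 at home",
    "madrid beat barcelona 3 1",
    content_words={"barcelona", "madrid", "3", "1", "home"},
)
print(flags)  # {'omissions': {'home'}, 'extrinsic_hallucinations': set()}
```

Note that the swapped winner in this example goes undetected: the output reuses input content but misuses it, which is exactly the intrinsic hallucination category that overlap checks cannot capture.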