Data-to-text Generation with Variational Sequential Planning
We consider the task of data-to-text generation, which aims to create textual
output from non-linguistic input. We focus on generating long-form text, i.e.,
documents with multiple paragraphs, and propose a neural model enhanced with a
planning component responsible for organizing high-level information in a
coherent and meaningful way. We infer latent plans sequentially with a
structured variational model, while interleaving the steps of planning and
generation. Text is generated by conditioning on previous variational decisions
and previously generated text. Experiments on two data-to-text benchmarks
(RotoWire and MLB) show that our model outperforms strong baselines and is
sample efficient in the face of limited training data (e.g., a few hundred
instances).
Comment: To appear in Transactions of the Association for Computational Linguistics (TACL); 18 pages
Data-to-text generation with neural planning
In this thesis, we consider the task of data-to-text generation, which takes non-linguistic
structures as input and produces textual output. The inputs can take the form of
database tables, spreadsheets, charts, and so on. The main application of data-to-text
generation is to present information in a textual format which makes it accessible to
a layperson who may otherwise find it difficult to understand numerical figures.
The task can also automate routine document generation jobs, thus improving human
efficiency. We focus on generating long-form text, i.e., documents with multiple paragraphs. Recent approaches to data-to-text generation have adopted the very successful
encoder-decoder architecture or its variants. These models generate fluent (but often
imprecise) text and perform quite poorly at selecting appropriate content and ordering
it coherently. This thesis focuses on overcoming these issues by integrating content
planning with neural models. We hypothesize data-to-text generation will benefit from
explicit planning, which manifests itself in (a) micro planning, (b) latent entity planning, and (c) macro planning. Throughout this thesis, we assume the inputs to our generator are tables (with records) in the sports domain, and the outputs are summaries describing what happened in the game (e.g., who won/lost, ..., scored, etc.).
We first describe our work on integrating fine-grained or micro plans with data-to-text generation. As part of this, we generate a micro plan highlighting which records
should be mentioned and in which order, and then generate the document while taking
the micro plan into account.
We then show how data-to-text generation can benefit from higher-level latent entity planning. Here, we make use of entity-specific representations which are dynamically updated. The text is generated conditioned on entity representations and the records corresponding to the entities by using hierarchical attention at each time step.
We then combine planning with the high level organization of entities, events, and
their interactions. Such coarse-grained macro plans are learnt from data and given
as input to the generator. Finally, we present work on making macro plans latent
while incrementally generating a document paragraph by paragraph. We infer latent
plans sequentially with a structured variational model while interleaving the steps of
planning and generation. Text is generated by conditioning on previous variational
decisions and previously generated text.
Overall, our results show that planning makes data-to-text generation more interpretable, improves the factuality and coherence of the generated documents, and reduces redundancy in the output.
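The interleaved planning-and-generation loop described in the thesis can be sketched as a toy example. This is an illustration only, not the thesis model: `sample_plan` is a hypothetical stand-in for the structured variational planner, and `generate_paragraph` stands in for the neural decoder.

```python
def sample_plan(records, used, prev_paragraphs):
    """Hypothetical stand-in for the variational planner: choose which
    records the next paragraph should cover, given what was already said."""
    unused = [r for r in records if r not in used]
    return unused[:2]  # toy heuristic: next two unmentioned records

def generate_paragraph(plan, prev_paragraphs):
    """Hypothetical stand-in for the decoder: realize the plan as text,
    conditioned on previously generated paragraphs."""
    return "Paragraph about " + " and ".join(plan) + "."

def generate_document(records, n_paragraphs):
    used, paragraphs = [], []
    for _ in range(n_paragraphs):
        plan = sample_plan(records, used, paragraphs)             # planning step
        used.extend(plan)
        paragraphs.append(generate_paragraph(plan, paragraphs))   # generation step
    return "\n".join(paragraphs)

doc = generate_document(["TEAM-A win", "PLAYER-X 30 pts", "TEAM-B loss"], 2)
```

The key structural point survives even in this sketch: each planning decision conditions on earlier plans and earlier text, so planning and generation alternate paragraph by paragraph.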
Data-to-Text Generation with Content Selection and Planning
Recent advances in data-to-text generation have led to the use of large-scale
datasets and neural network models which are trained end-to-end, without
explicitly modeling what to say and in what order. In this work, we present a
neural network architecture which incorporates content selection and planning
without sacrificing end-to-end training. We decompose the generation task into
two stages. Given a corpus of data records (paired with descriptive documents),
we first generate a content plan highlighting which information should be
mentioned and in which order and then generate the document while taking the
content plan into account. Automatic and human-based evaluation experiments
show that our model outperforms strong baselines, improving the state of the art on the recently released RotoWire dataset.
Comment: Added link to code
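The two-stage decomposition can be sketched as a toy pipeline; the record scorer and the template realizer below are illustrative stand-ins for the paper's neural content-planning and generation stages, and the data is invented.

```python
def select_and_order(records, scorer, k):
    """Stage 1: build a content plan -- pick the k highest-scoring
    records and order them (here simply by score, descending)."""
    return sorted(records, key=scorer, reverse=True)[:k]

def realize(plan):
    """Stage 2: generate the document conditioned on the plan
    (a template realizer in place of the neural decoder)."""
    return " ".join(f"{entity} recorded {value} {attr}."
                    for entity, attr, value in plan)

records = [("Lakers", "points", 110), ("Celtics", "points", 98),
           ("J. Brown", "rebounds", 7)]
plan = select_and_order(records, scorer=lambda r: r[2], k=2)
summary = realize(plan)
```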
CTQScorer: Combining Multiple Features for In-context Example Selection for Machine Translation
Large language models have demonstrated the capability to perform machine translation when prompted with a few examples (in-context learning). Translation quality depends on various features of the selected
examples, such as their quality and relevance, but previous work has
predominantly focused on individual features in isolation. In this paper, we
propose a general framework for combining different features influencing
example selection. We learn a regression model, CTQ Scorer (Contextual
Translation Quality), that selects examples based on multiple features in order
to maximize the translation quality. On multiple language pairs and language
models, we show that CTQ Scorer helps significantly outperform random selection
as well as strong single-factor baselines reported in the literature. We also
see an improvement of over 2.5 COMET points on average with respect to a strong
BM25 retrieval-based baseline.
Comment: Accepted to EMNLP 2023 Findings
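The idea of combining multiple example features with a learned regressor can be sketched as below. The feature values and weights are invented for illustration; the paper's CTQ Scorer learns the regression from data.

```python
def ctq_score(features, weights, bias=0.0):
    """Predict translation quality from example features with a linear
    regressor (stand-in for the learned CTQ Scorer)."""
    return bias + sum(w * f for w, f in zip(weights, features))

def select_examples(pool, weights, k):
    """Rank candidate in-context examples by predicted quality and
    keep the top k for the prompt."""
    ranked = sorted(pool, key=lambda ex: ctq_score(ex["features"], weights),
                    reverse=True)
    return ranked[:k]

# Each candidate pairs a translation example with features such as
# retrieval relevance to the test sentence, example quality, and length ratio.
pool = [
    {"pair": ("bonjour", "hello"),  "features": [0.9, 0.8, 1.0]},
    {"pair": ("chat",    "cat"),    "features": [0.2, 0.9, 0.7]},
    {"pair": ("merci",   "thanks"), "features": [0.7, 0.6, 0.9]},
]
chosen = select_examples(pool, weights=[0.5, 0.3, 0.2], k=2)
```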
VerityMath: Advancing Mathematical Reasoning by Self-Verification Through Unit Consistency
Large Language Models (LLMs) combined with program-based solving techniques
are increasingly demonstrating proficiency in mathematical reasoning. However,
such progress is mostly demonstrated in closed-source models such as
OpenAI-GPT4 and Claude. In this paper, we seek to study the performance of
strong open-source LLMs. Specifically, we analyze the outputs of Code Llama
(7B) when applied to math word problems. We identify a category of problems
that pose a challenge for the model, particularly those involving quantities
that span multiple types or units. To address this issue, we propose a
systematic approach by defining units for each quantity and ensuring the
consistency of these units during mathematical operations. We developed Unit
Consistency Programs (UCPs), an annotated dataset of math word problems, each
paired with programs that contain unit specifications and unit verification
routines. Finally, we finetune the Code Llama (7B) model with UCPs to produce
VerityMath and present our preliminary findings.
Comment: Work in Progress
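The unit-consistency idea can be sketched as a small checked arithmetic layer; this illustrates the principle of tagging quantities with units and verifying them during operations, and is not the paper's actual UCP program format.

```python
class Quantity:
    """A value tagged with a unit; arithmetic checks unit consistency,
    in the spirit of the paper's unit verification routines."""
    def __init__(self, value, unit):
        self.value, self.unit = value, unit

    def __add__(self, other):
        # adding quantities requires identical units
        assert self.unit == other.unit, f"unit mismatch: {self.unit} vs {other.unit}"
        return Quantity(self.value + other.value, self.unit)

    def __mul__(self, rate):
        # multiplying by a rate such as dollars/apples cancels the shared unit
        num, denom = rate.unit.split("/")
        assert denom == self.unit, f"cannot cancel {self.unit} with {rate.unit}"
        return Quantity(self.value * rate.value, num)

# toy word problem: 3 apples plus 2 apples, bought at 2 dollars per apple
apples = Quantity(3, "apples") + Quantity(2, "apples")
cost = apples * Quantity(2, "dollars/apples")
assert cost.unit == "dollars"  # unit verification before reporting the answer
```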
Multi-Document Summarization with Centroid-Based Pretraining
In multi-document summarization (MDS), the input is a cluster of documents,
and the output is the cluster summary. In this paper, we focus on pretraining
objectives for MDS. Specifically, we introduce a simple pretraining objective
of choosing the ROUGE-based centroid of each document cluster as a proxy for
its summary. Our objective thus does not require human written summaries and
can be used for pretraining on a dataset containing only clusters of documents.
Through zero-shot and fully supervised experiments on multiple MDS datasets, we
show that our model Centrum is better or comparable to a state-of-the-art
model. We release our pretrained and finetuned models at
https://github.com/ratishsp/centrum.
Comment: 4 pages, work-in-progress
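The centroid objective can be sketched with a unigram-overlap F1 as a crude proxy for ROUGE (the paper uses ROUGE proper); the cluster below is invented.

```python
def unigram_f1(a, b):
    """Crude stand-in for ROUGE-1 F1 between two texts."""
    ta, tb = set(a.split()), set(b.split())
    if not ta or not tb:
        return 0.0
    overlap = len(ta & tb)
    prec, rec = overlap / len(tb), overlap / len(ta)
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

def centroid(cluster):
    """Pick the document most similar on average to the rest of the
    cluster -- the proxy summary used as a pretraining target."""
    return max(cluster,
               key=lambda d: sum(unigram_f1(d, o) for o in cluster if o is not d))

cluster = [
    "the team won the final",
    "the team won the final match yesterday",
    "fans said the team played well in the final",
]
proxy_summary = centroid(cluster)
```

Because the target is computed from the cluster itself, no human-written summaries are needed during pretraining.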
Decomposed Prompting for Machine Translation Between Related Languages using Large Language Models
This study investigates machine translation between related languages, i.e.,
languages within the same family that share linguistic characteristics such as
word order and lexical similarity. Machine translation through few-shot
prompting leverages a small set of translation pair examples to generate
translations for test sentences. This procedure requires the model to learn how
to generate translations while simultaneously ensuring that token ordering is
maintained to produce a fluent and accurate translation. We propose that for
related languages, the task of machine translation can be simplified by
leveraging the monotonic alignment characteristic of such languages. We
introduce DecoMT, a novel approach of few-shot prompting that decomposes the
translation process into a sequence of word chunk translations. Through
automatic and human evaluation conducted on multiple related language pairs
across various language families, we demonstrate that our proposed approach of
decomposed prompting surpasses multiple established few-shot baseline
approaches. For example, DecoMT outperforms the strong few-shot prompting BLOOM
model with an average improvement of 8 chrF++ points across the examined languages.
Comment: EMNLP 2023 (Main, Long paper)
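The chunk-wise decomposition can be sketched as below. The word-level lexicon is a toy stand-in for the few-shot-prompted language model, and the sketch shows only the decomposition step (DecoMT additionally refines the independent chunk translations in context).

```python
def chunks(words, size):
    """Split the source sentence into contiguous word chunks."""
    return [words[i:i + size] for i in range(0, len(words), size)]

def translate_chunk(chunk, lexicon):
    """Toy stand-in for the prompted LM translating one chunk
    independently; unknown words pass through unchanged."""
    return [lexicon.get(w, w) for w in chunk]

def decomposed_translate(sentence, lexicon, size=2):
    """Translate chunk by chunk and concatenate, relying on the
    monotonic word alignment of closely related languages."""
    out = []
    for c in chunks(sentence.split(), size):
        out.extend(translate_chunk(c, lexicon))
    return " ".join(out)

# illustrative German -> English word lexicon (related Germanic languages)
lexicon = {"gute": "good", "nacht": "night", "mein": "my", "freund": "friend"}
translation = decomposed_translate("gute nacht mein freund", lexicon)
```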
IndicBART: A Pre-trained Model for Indic Natural Language Generation
In this paper, we study pre-trained sequence-to-sequence models for a group
of related languages, with a focus on Indic languages. We present IndicBART, a
multilingual, sequence-to-sequence pre-trained model focusing on 11 Indic
languages and English. IndicBART utilizes the orthographic similarity between
Indic scripts to improve transfer learning between similar Indic languages. We
evaluate IndicBART on two NLG tasks: Neural Machine Translation (NMT) and
extreme summarization. Our experiments on NMT and extreme summarization show
that a model specific to related languages like IndicBART is competitive with
large pre-trained models like mBART50 despite being significantly smaller. It
also performs well on very low-resource translation scenarios where languages
are not included in pre-training or fine-tuning. Script sharing, multilingual
training, and better utilization of limited model capacity contribute to the
good performance of the compact IndicBART model.
Comment: Published at ACL 2022, 15 pages
IndicNLG Benchmark: Multilingual Datasets for Diverse NLG Tasks in Indic Languages
Natural Language Generation (NLG) for non-English languages is hampered by
the scarcity of datasets in these languages. In this paper, we present the
IndicNLG Benchmark, a collection of datasets for benchmarking NLG for 11 Indic
languages. We focus on five diverse tasks, namely, biography generation using
Wikipedia infoboxes, news headline generation, sentence summarization,
paraphrase generation, and question generation. We describe the created
datasets and use them to benchmark the performance of several monolingual and
multilingual baselines that leverage pre-trained sequence-to-sequence models.
Our results exhibit the strong performance of multilingual language-specific
pre-trained models, and the utility of models trained on our dataset for other
related NLG tasks. Our dataset creation methods can be easily applied to
modest-resource languages as they involve simple steps such as scraping news
articles and Wikipedia infoboxes, light cleaning, and pivoting through machine
translation data. To the best of our knowledge, the IndicNLG Benchmark is the
first NLG benchmark for Indic languages and the most diverse multilingual NLG
dataset, with approximately 8M examples across 5 tasks and 11 languages. The
datasets and models are publicly available at
https://ai4bharat.iitm.ac.in/indicnlg-suite.
Comment: Accepted at EMNLP 2022