Weakly Supervised Content Selection for Improved Image Captioning
Image captioning involves identifying semantic concepts in the scene and
describing them in fluent natural language. Recent approaches do not explicitly
model the semantic concepts and train the model only for the end goal of
caption generation. Such models lack interpretability and controllability,
primarily due to sub-optimal content selection. We address this problem by
breaking down the captioning task into two simpler, more manageable, and more
controllable tasks -- skeleton prediction and skeleton-based caption
generation. We approach the former as a weakly supervised task, using a simple
off-the-shelf language syntax parser and avoiding the need for additional human
annotations; the latter uses a supervised-learning approach. We investigate
three methods of conditioning the caption on the skeleton: in the encoder, in
the decoder, and in both. Our compositional model generates significantly
higher-quality captions on out-of-domain test images, as judged by human annotators.
Additionally, we demonstrate the cross-lingual effectiveness of the English
skeleton in other languages, including French, Italian, German, Spanish, and
Hindi. The compositional nature of captioning exhibits the potential for
unpaired image captioning, thereby reducing the dependence on expensive
image-caption pairs. Furthermore, we investigate the use of skeletons as a knob
to control certain properties of the generated image caption, such as length,
content, and gender expression.
GEMv2: Multilingual NLG benchmarking in a single line of code
Evaluation in machine learning is usually informed by past choices, for example, which datasets or metrics to use. This standardization enables comparison on an equal footing using leaderboards, but the evaluation choices become sub-optimal as better alternatives arise. This problem is especially pertinent in natural language generation, which requires ever-improving suites of datasets, metrics, and human evaluation to make definitive claims. To make it easier to follow best practices in model evaluation, we introduce GEMv2. The new version of the Generation, Evaluation, and Metrics Benchmark introduces a modular infrastructure for dataset, model, and metric developers to benefit from each other's work. GEMv2 supports 40 documented datasets in 51 languages. Models for all datasets can be evaluated online, and our interactive data card creation and rendering tools make it easier to add new datasets to the living benchmark.
The GEM Benchmark: Natural Language Generation, its Evaluation and Metrics
We introduce GEM, a living benchmark for natural language Generation (NLG),
its Evaluation, and Metrics. Measuring progress in NLG relies on a constantly
evolving ecosystem of automated metrics, datasets, and human evaluation
standards. Due to this moving target, new models often still evaluate on
divergent, Anglo-centric corpora with well-established, but flawed, metrics.
This disconnect makes it challenging to identify the limitations of current
models and opportunities for progress. Addressing this limitation, GEM provides
an environment in which models can easily be applied to a wide set of tasks and
in which evaluation strategies can be tested. Regular updates to the benchmark
will help NLG research become more multilingual and evolve the challenge
alongside models. This paper serves as the description of the data for the
shared task we are organizing at our ACL 2021 Workshop, in which we invite
the entire NLG community to participate.