LOGEN: Few-shot Logical Knowledge-Conditioned Text Generation with Self-training
Natural language generation from structured data mainly focuses on
surface-level descriptions, suffering from uncontrollable content selection and
low fidelity. Previous works leverage logical forms to facilitate logical
knowledge-conditioned text generation. Although they achieve remarkable
progress, these methods are data-hungry, which makes their adoption in
real-world applications with limited data challenging. To this end, this paper
proposes a unified framework for logical knowledge-conditioned text generation
in the few-shot setting. With only a few seed logical forms (e.g., 20/100
shots), our approach leverages self-training and samples pseudo logical forms
based on content and structure consistency. Experimental results demonstrate
that our approach can obtain better few-shot performance than baselines.
Comment: Work in progress
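The self-training loop sketched in the abstract can be pictured as follows. The
function names, the consistency scorer, and the threshold below are hypothetical
placeholders, not the authors' implementation: pseudo logical forms are kept
only when they agree with the text they were parsed from.

```python
from typing import Callable, List, Tuple

def self_train(
    seed_pairs: List[Tuple[str, str]],         # (logical_form, text) seed examples
    unlabeled_texts: List[str],                # texts lacking logical forms
    train: Callable[[List[Tuple[str, str]]], object],  # fits a generator/parser
    parse: Callable[[object, str], str],       # predicts a pseudo logical form
    consistency: Callable[[str, str], float],  # content/structure agreement score
    threshold: float = 0.8,
    rounds: int = 3,
):
    """Grow the training set with high-consistency pseudo pairs, then retrain."""
    model = train(seed_pairs)
    for _ in range(rounds):
        pseudo = []
        for text in unlabeled_texts:
            form = parse(model, text)
            # Keep the pseudo pair only if the sampled form agrees with the text.
            if consistency(form, text) >= threshold:
                pseudo.append((form, text))
        model = train(seed_pairs + pseudo)
    return model
```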
A Call for Standardization and Validation of Text Style Transfer Evaluation
Text Style Transfer (TST) evaluation is, in practice, inconsistent.
Therefore, we conduct a meta-analysis on human and automated TST evaluation and
experimentation that thoroughly examines existing literature in the field. The
meta-analysis reveals a substantial standardization gap in human and automated
evaluation. In addition, we also find a validation gap: only a few automated
metrics have been validated using human experiments. To this end, we thoroughly
scrutinize both the standardization and validation gaps and reveal the
resulting pitfalls. This work also paves the way to closing the standardization
and validation gaps in TST evaluation by calling out requirements to be met by
future research.
Comment: Accepted to Findings of ACL 202
Sequence-Level Certainty Reduces Hallucination In Knowledge-Grounded Dialogue Generation
Model hallucination has been a crucial research interest in Natural
Language Generation (NLG). In this work, we propose sequence-level certainty as
a common theme over hallucination in NLG, and explore the correlation between
sequence-level certainty and the level of hallucination in model responses. We
categorize sequence-level certainty into two aspects: probabilistic certainty
and semantic certainty, and reveal through experiments on Knowledge-Grounded
Dialogue Generation (KGDG) task that both a higher level of probabilistic
certainty and a higher level of semantic certainty in model responses are
significantly correlated with a lower level of hallucination. Moreover, we
provide theoretical proof and analysis to show that semantic certainty is a
good estimator of probabilistic certainty, and therefore has the potential to
serve as an alternative to probability-based certainty estimation in black-box
scenarios. Based on the observed relationship between certainty and
hallucination, we further propose Certainty-based Response Ranking (CRR), a
decoding-time method for mitigating hallucination in NLG. Based on our
categorization of sequence-level certainty, we propose 2 types of CRR approach:
Probabilistic CRR (P-CRR) and Semantic CRR (S-CRR). P-CRR ranks individually
sampled model responses by the arithmetic mean log-probability of the entire
sequence. S-CRR approaches certainty estimation from the meaning space, and
ranks a number of model response candidates based on their semantic certainty
level, which is estimated by the entailment-based Agreement Score (AS). Through
extensive experiments across 3 KGDG datasets, 3 decoding methods, and 4
different models, we validate the effectiveness of our 2 proposed CRR methods
in reducing model hallucination.
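A minimal sketch of the two ranking schemes, assuming candidate responses have
already been sampled; the helper names and the entailment scorer are
placeholders rather than the authors' code. P-CRR averages per-token
log-probabilities, while S-CRR averages pairwise entailment-based agreement:

```python
from typing import Callable, List

def p_crr_rank(candidates: List[List[float]]) -> List[int]:
    """Probabilistic CRR: rank candidates by the arithmetic mean of their
    per-token log-probabilities (higher mean log-prob = higher certainty)."""
    means = [sum(logps) / len(logps) for logps in candidates]
    return sorted(range(len(candidates)), key=lambda i: means[i], reverse=True)

def s_crr_rank(texts: List[str], entails: Callable[[str, str], float]) -> List[int]:
    """Semantic CRR: score each candidate by its average entailment-based
    agreement with the other candidates (the Agreement Score), then rank."""
    n = len(texts)
    scores = []
    for i in range(n):
        others = [entails(texts[i], texts[j]) for j in range(n) if j != i]
        scores.append(sum(others) / max(len(others), 1))
    return sorted(range(n), key=lambda i: scores[i], reverse=True)
```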
Quantifying the Plausibility of Context Reliance in Neural Machine Translation
Establishing whether language models can use contextual information in a
human-plausible way is important to ensure their safe adoption in real-world
settings. However, the questions of when and which parts of the context affect
model generations are typically tackled separately, and current plausibility
evaluations are practically limited to a handful of artificial benchmarks. To
address this, we introduce Plausibility Evaluation of Context Reliance
(PECoRe), an end-to-end interpretability framework designed to quantify context
usage in language models' generations. Our approach leverages model internals
to (i) contrastively identify context-sensitive target tokens in generated
texts and (ii) link them to contextual cues justifying their prediction. We use
PECoRe to quantify the plausibility of context-aware machine translation
models, comparing model rationales with human annotations across several
discourse-level phenomena. Finally, we apply our method to unannotated
generations to identify context-mediated predictions and highlight instances of
(im)plausible context usage in model translations.
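As a rough illustration of the contrastive step (i), one can flag generated
tokens whose probability shifts most when the context is dropped; the scoring
and the top-k cutoff below are simplifications, not PECoRe's actual metrics:

```python
from typing import List, Tuple

def context_sensitive_tokens(
    tokens: List[str],
    logp_with_ctx: List[float],   # log P(token | source + context)
    logp_no_ctx: List[float],     # log P(token | source only)
    top_k: int = 3,
) -> List[Tuple[int, str, float]]:
    """Return the top-k tokens with the largest contrastive log-prob gap."""
    gaps = [lw - ln for lw, ln in zip(logp_with_ctx, logp_no_ctx)]
    ranked = sorted(enumerate(gaps), key=lambda x: abs(x[1]), reverse=True)
    return [(i, tokens[i], gaps[i]) for i, _ in ranked[:top_k]]
```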
Beyond Triplet: Leveraging the Most Data for Multimodal Machine Translation
Multimodal machine translation (MMT) aims to improve translation quality by
incorporating information from other modalities, such as vision. Previous MMT
systems mainly focus on better access and use of visual information and tend to
validate their methods on image-related datasets. These studies face two
challenges. First, they can only utilize triple data (bilingual texts with
images), which is scarce; second, current benchmarks are relatively restricted
and do not correspond to realistic scenarios. Therefore, this paper
establishes both new methods and new datasets for MMT. First, we
propose a framework 2/3-Triplet with two new approaches to enhance MMT by
utilizing large-scale non-triple data: monolingual image-text data and parallel
text-only data. Second, we construct an English-Chinese e-commerce multimodal
translation dataset (including training and test sets), named EMMT, whose test
set is carefully selected so that some words are ambiguous and would be
mistranslated without the help of images. Experiments show
that our method is more suitable for real-world scenarios and can significantly
improve translation performance by using more non-triple data. In addition, our
model also rivals various SOTA models in conventional multimodal translation
benchmarks.
Comment: 8 pages, ACL 2023 Findings
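The following sketch only illustrates the general idea of mixing scarce triple
data with larger non-triple corpora during training; the sampling weights,
data-source names, and function are assumptions, not the 2/3-Triplet recipe:

```python
import itertools
import random

def mixed_batches(triple_data, image_text_data, text_only_data, steps=1000):
    """Yield (example, kind) pairs, drawing from each data source according to
    illustrative mixing weights (placeholders, not tuned values)."""
    sources = {"triple": triple_data,
               "image_text": image_text_data,
               "text_only": text_only_data}
    weights = {"triple": 0.2, "image_text": 0.4, "text_only": 0.4}
    iters = {name: itertools.cycle(data) for name, data in sources.items()}
    names = list(sources)
    for _ in range(steps):
        name = random.choices(names, weights=[weights[n] for n in names])[0]
        yield next(iters[name]), name
```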
LightNER: A Lightweight Tuning Paradigm for Low-resource NER via Pluggable Prompting
Most NER methods rely on extensive labeled data for model training and struggle
in low-resource scenarios with limited training data. Existing dominant
approaches usually suffer from the challenge that the target domain has
different label sets than a resource-rich source domain, which can be
summarized as class transfer and domain transfer. In this paper, we propose a
lightweight tuning paradigm for low-resource NER via pluggable prompting
(LightNER). Specifically, we construct a unified learnable verbalizer of
entity categories to generate the entity span sequence and entity categories
without any label-specific classifiers, thus addressing the class transfer
issue. We further propose a pluggable guidance module by incorporating
learnable parameters into the self-attention layer as guidance, which can
re-modulate the attention and adapt pre-trained weights. Note that we only tune
the inserted modules while keeping all parameters of the pre-trained language
model fixed, making our approach lightweight and flexible for low-resource
scenarios and better able to transfer knowledge across domains. Experimental
results show that LightNER can obtain comparable performance in the standard
supervised setting and outperform strong baselines in low-resource settings.
Code is available at
https://github.com/zjunlp/DeepKE/tree/main/example/ner/few-shot
Comment: Accepted by COLING 202
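A conceptual sketch of a pluggable guidance module, assuming a
prefix-tuning-style design in which learnable key/value vectors are prepended
inside self-attention while the pre-trained projections stay frozen; the
dimensions, names, and single-head layout are illustrative, not the released
implementation:

```python
import math
import torch
import torch.nn as nn

class GuidedSelfAttention(nn.Module):
    def __init__(self, hidden: int, n_guidance: int = 10):
        super().__init__()
        # Frozen projections standing in for the pre-trained attention weights.
        self.q = nn.Linear(hidden, hidden)
        self.k = nn.Linear(hidden, hidden)
        self.v = nn.Linear(hidden, hidden)
        for p in (*self.q.parameters(), *self.k.parameters(), *self.v.parameters()):
            p.requires_grad = False
        # Only these guidance parameters are tuned.
        self.guide_k = nn.Parameter(torch.randn(n_guidance, hidden) * 0.02)
        self.guide_v = nn.Parameter(torch.randn(n_guidance, hidden) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, seq, hidden)
        q, k, v = self.q(x), self.k(x), self.v(x)
        # Prepend guidance keys/values so the attention is re-modulated by them.
        gk = self.guide_k.unsqueeze(0).expand(x.size(0), -1, -1)
        gv = self.guide_v.unsqueeze(0).expand(x.size(0), -1, -1)
        k, v = torch.cat([gk, k], dim=1), torch.cat([gv, v], dim=1)
        attn = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(x.size(-1)), dim=-1)
        return attn @ v
```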
An Investigation of Evaluation Metrics for Automated Medical Note Generation
Recent studies on automatic note generation have shown that doctors can save
significant amounts of time when using automatic clinical note generation
(Knoll et al., 2022). Summarization models have been used for this task to
generate clinical notes as summaries of doctor-patient conversations (Krishna
et al., 2021; Cai et al., 2022). However, assessing which model would best
serve clinicians in their daily practice is still a challenging task due to the
large set of possible correct summaries, and the potential limitations of
automatic evaluation metrics. In this paper, we study evaluation methods and
metrics for the automatic generation of clinical notes from medical
conversations. In particular, we propose new task-specific metrics and we
compare them to SOTA evaluation metrics in text summarization and generation,
including: (i) knowledge-graph embedding-based metrics, (ii) customized
model-based metrics, (iii) domain-adapted/fine-tuned metrics, and (iv) ensemble
metrics. To study the correlation between the automatic metrics and manual
judgments, we evaluate automatic notes/summaries by comparing the system and
reference facts and computing the factual correctness, and the hallucination
and omission rates for critical medical facts. This study relied on seven
datasets manually annotated by domain experts. Our experiments show that
automatic evaluation metrics can have substantially different behaviors on
different types of clinical notes datasets. However, the results highlight one
stable subset of metrics as the most correlated with human judgments with a
relevant aggregation of different evaluation criteria.
Comment: Accepted to ACL Findings 202
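As a simplified, set-based illustration of the fact-level rates mentioned above
(the study's actual fact extraction and matching are considerably richer), the
correctness, hallucination, and omission rates can be computed as follows:

```python
from typing import Dict, Set

def fact_rates(system_facts: Set[str], reference_facts: Set[str]) -> Dict[str, float]:
    """Compute factual correctness, hallucination rate, and omission rate
    from exact-match comparison of system and reference fact sets."""
    matched = system_facts & reference_facts
    correctness = len(matched) / len(system_facts) if system_facts else 0.0
    hallucination = len(system_facts - reference_facts) / len(system_facts) if system_facts else 0.0
    omission = len(reference_facts - system_facts) / len(reference_facts) if reference_facts else 0.0
    return {"correctness": correctness, "hallucination": hallucination, "omission": omission}
```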