Learning to Identify Ambiguous and Misleading News Headlines
Accuracy is one of the basic principles of journalism. However, it is
increasingly hard to manage due to the diversity of news media. Some editors of
online news tend to use catchy headlines which trick readers into clicking.
These headlines are either ambiguous or misleading, degrading the reading
experience of the audience. Thus, identifying inaccurate news headlines is a
task worth studying. Previous work names these headlines "clickbait" and
focuses mainly on features extracted from the headlines themselves, which limits
performance because the consistency between headlines and news bodies is
underappreciated. In this paper, we clearly redefine the problem and identify
ambiguous and misleading headlines separately. We utilize class sequential
rules to exploit structure information when detecting ambiguous headlines. For
the identification of misleading headlines, we extract features based on the
congruence between headlines and bodies. To make use of the large unlabeled
data set, we apply a co-training method and gain an increase in performance.
The experimental results show the effectiveness of our methods. We then use our
classifiers to detect inaccurate headlines crawled from different sources and
conduct a data analysis.
Comment: Accepted by IJCAI 201
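The co-training step described above, where two classifiers trained on complementary feature views pseudo-label unlabeled data for each other, can be sketched on toy one-dimensional views. The threshold classifier, the small pseudo-labeling batches, and the agreement vote below are illustrative simplifications, not the paper's actual models:

```python
def fit_threshold(xs, ys):
    """One-feature threshold classifier: midpoint between the class means."""
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    thr = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: 1 if x >= thr else 0

def co_train(labeled, unlabeled, rounds=2):
    """labeled: [((view_a, view_b), y)]; unlabeled: [(view_a, view_b)]."""
    set_a = [(a, y) for (a, _), y in labeled]
    set_b = [(b, y) for (_, b), y in labeled]
    pool = list(unlabeled)
    for _ in range(rounds):
        if not pool:
            break
        clf_a = fit_threshold(*zip(*set_a))
        clf_b = fit_threshold(*zip(*set_b))
        # each view's classifier pseudo-labels a batch of the pool
        # for the *other* view's training set
        batch, pool = pool[:2], pool[2:]
        for a, b in batch:
            set_b.append((b, clf_a(a)))
            set_a.append((a, clf_b(b)))
    clf_a = fit_threshold(*zip(*set_a))
    clf_b = fit_threshold(*zip(*set_b))
    # final prediction: positive only when both view classifiers agree
    return lambda a, b: int(clf_a(a) == 1 and clf_b(b) == 1)

predict = co_train(
    labeled=[((0.1, 0.2), 0), ((0.2, 0.1), 0), ((0.8, 0.9), 1), ((0.9, 0.8), 1)],
    unlabeled=[(0.15, 0.1), (0.85, 0.9)],
)
```

The point of the two views is that each classifier's mistakes are (ideally) independent, so the pseudo-labels one view contributes are informative for the other.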
An Empirical Study of Automatic Post-Editing
Automatic post-editing (APE) aims to reduce manual post-editing efforts by
automatically correcting errors in machine-translated output. Due to the
limited amount of human-annotated training data, data scarcity is one of the
main challenges faced by all APE systems. To alleviate the lack of genuine
training data, most of the current APE systems employ data augmentation methods
to generate large-scale artificial corpora. In view of the importance of data
augmentation in APE, we separately study the impact of the construction method
of artificial corpora and artificial data domain on the performance of APE
models. Moreover, the difficulty of APE varies across different machine
translation (MT) systems. We study the outputs of the state-of-the-art APE
model on a difficult APE dataset to analyze the problems in existing APE systems.
Primarily, we find that 1) Artificial corpora with high-quality source text and
machine-translated text more effectively improve the performance of APE models;
2) In-domain artificial training data can better improve the performance of APE
models, while irrelevant out-of-domain data actually interferes with the model;
3) Existing APE models struggle with cases containing long source text or
high-quality machine-translated text; 4) The state-of-the-art APE model handles
grammatical and semantic addition problems well, but its output is prone to
entity and semantic omission errors.
Comment: 14 pages, 4 figures
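As a minimal sketch of the APE input/output contract described above, the toy post-editor below applies hand-written correction rules to MT output. Real APE systems are neural models trained on (source, MT output, post-edit) triples; the rule table here is purely hypothetical and matches naively on substrings:

```python
# Hypothetical correction rules standing in for a learned APE model.
RULES = {
    "informations": "information",  # spurious plural from the MT system
    "a apple": "an apple",          # article agreement error
}

def post_edit(mt_output):
    """Apply each correction rule to the machine-translated text in order."""
    for wrong, right in RULES.items():
        mt_output = mt_output.replace(wrong, right)
    return mt_output

post_edit("He gave me informations about a apple")
# → "He gave me information about an apple"
```

A rule table like this is exactly what data scarcity makes insufficient in practice, which is why the abstract's focus is on augmenting training data for neural post-editors instead.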
Visual Information Guided Zero-Shot Paraphrase Generation
Zero-shot paraphrase generation has drawn much attention, as large-scale,
high-quality paraphrase corpora are limited. Back-translation, also known as the
pivot-based method, is a typical approach to this end. Several works leverage
different kinds of information as the "pivot", such as language, semantic
representation, and so on. In this paper, we explore using visual information,
i.e., images, as the "pivot" of back-translation. Unlike the pipeline
back-translation method, we
propose visual information guided zero-shot paraphrase generation (ViPG) based
only on paired image-caption data. It jointly trains an image captioning model
and a paraphrasing model, and leverages the image captioning model to guide the
training of the paraphrasing model. Both automatic and human evaluation show
that our model can generate paraphrases with good relevancy, fluency, and
diversity, and that images are a promising kind of pivot for zero-shot
paraphrase generation.
Comment: Accepted by COLING 202
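The pipeline back-translation baseline mentioned above routes a sentence through a pivot and back, and the paraphrase falls out of the round trip. A toy sketch, with word-lookup tables invented here to stand in for the two MT systems:

```python
# Invented word-level "translators" standing in for real MT systems.
EN_TO_PIVOT = {"big": "grand", "house": "maison"}
PIVOT_TO_EN = {"grand": "large", "maison": "home"}

def translate(sentence, table):
    """Word-by-word lookup translation; unknown words pass through."""
    return " ".join(table.get(w, w) for w in sentence.split())

def paraphrase_via_pivot(sentence):
    """Pipeline back-translation: source -> pivot language -> source."""
    return translate(translate(sentence, EN_TO_PIVOT), PIVOT_TO_EN)

paraphrase_via_pivot("big house")  # → "large home"
```

ViPG replaces the pivot language with images and trains jointly rather than running such a two-stage pipeline, which avoids compounding errors across the two translation steps.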
A Comprehensive Evaluation of Constrained Text Generation for Large Language Models
Advancements in natural language generation (NLG) and large language models
(LLMs) have led to proficient text generation in various tasks. However,
integrating intricate constraints into neural text generation remains
challenging due to the opacity of LLMs. This study investigates constrained
text generation for LLMs, where predefined constraints are applied during the
generation process. Our research examines multiple LLMs, including ChatGPT and
GPT-4, categorizing constraints into lexical, structural, and relation-based
types. We also present various benchmarks to facilitate fair evaluation. The
study addresses some key research questions, including the extent of LLMs'
compliance with constraints. The results illuminate LLMs' capacities and
deficiencies in incorporating constraints and provide insights for future
developments in
constrained text generation. Code and datasets will be released upon
acceptance.
Comment: Work in progress
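Lexical constraints of the kind categorized above can be verified mechanically once an output is generated. A minimal sketch of such a checker, assuming the constraint is simply that every required keyword appears as a whole word in the output:

```python
import re

def satisfies_lexical(text, required_words):
    """Check that every required keyword occurs as a whole word in text."""
    tokens = set(re.findall(r"[a-z']+", text.lower()))
    return all(w.lower() in tokens for w in required_words)

satisfies_lexical("The cat sat on the mat.", ["cat", "mat"])  # True
satisfies_lexical("The cat sat down.", ["cat", "mat"])        # False
```

Structural and relation-based constraints need richer checkers (length counters, parsers, relation extractors), which is part of why benchmarking constraint compliance is nontrivial.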