Learning to Identify Ambiguous and Misleading News Headlines
Accuracy is one of the basic principles of journalism. However, it is
increasingly hard to manage due to the diversity of news media. Some editors of
online news tend to use catchy headlines which trick readers into clicking.
These headlines are either ambiguous or misleading, degrading the reading
experience of the audience. Thus, identifying inaccurate news headlines is a
task worth studying. Previous work names these headlines "clickbait" and
focuses mainly on features extracted from the headlines themselves, which limits
performance because the consistency between headlines and news bodies is
underappreciated. In this paper, we clearly redefine the problem and identify
ambiguous and misleading headlines separately. We utilize class sequential
rules to exploit structure information when detecting ambiguous headlines. For
the identification of misleading headlines, we extract features based on the
congruence between headlines and bodies. To make use of the large unlabeled
data set, we apply a co-training method and gain an increase in performance.
The experimental results show the effectiveness of our methods. We then use our
classifiers to detect inaccurate headlines crawled from different sources and
conduct a data analysis.
Comment: Accepted by IJCAI 201
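The co-training step described above, where two classifiers trained on complementary feature views pseudo-label unlabeled data for each other, can be sketched on toy one-dimensional views. The threshold classifier, the small pseudo-labeling batches, and the agreement vote below are illustrative simplifications, not the paper's actual models:

```python
def fit_threshold(xs, ys):
    """One-feature threshold classifier: midpoint between the class means."""
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    thr = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: 1 if x >= thr else 0

def co_train(labeled, unlabeled, rounds=2):
    """labeled: [((view_a, view_b), y)]; unlabeled: [(view_a, view_b)]."""
    set_a = [(a, y) for (a, _), y in labeled]
    set_b = [(b, y) for (_, b), y in labeled]
    pool = list(unlabeled)
    for _ in range(rounds):
        if not pool:
            break
        clf_a = fit_threshold(*zip(*set_a))
        clf_b = fit_threshold(*zip(*set_b))
        # each view's classifier pseudo-labels a batch of the pool
        # for the *other* view's training set
        batch, pool = pool[:2], pool[2:]
        for a, b in batch:
            set_b.append((b, clf_a(a)))
            set_a.append((a, clf_b(b)))
    clf_a = fit_threshold(*zip(*set_a))
    clf_b = fit_threshold(*zip(*set_b))
    # final prediction: positive only when both view classifiers agree
    return lambda a, b: int(clf_a(a) == 1 and clf_b(b) == 1)

predict = co_train(
    labeled=[((0.1, 0.2), 0), ((0.2, 0.1), 0), ((0.8, 0.9), 1), ((0.9, 0.8), 1)],
    unlabeled=[(0.15, 0.1), (0.85, 0.9)],
)
```

The point of the two views is that each classifier's mistakes are (ideally) independent, so the pseudo-labels one view contributes are informative for the other.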
An Empirical Study of Automatic Post-Editing
Automatic post-editing (APE) aims to reduce manual post-editing efforts by
automatically correcting errors in machine-translated output. Due to the
limited amount of human-annotated training data, data scarcity is one of the
main challenges faced by all APE systems. To alleviate the lack of genuine
training data, most of the current APE systems employ data augmentation methods
to generate large-scale artificial corpora. In view of the importance of data
augmentation in APE, we separately study the impact of the construction method
of artificial corpora and artificial data domain on the performance of APE
models. Moreover, the difficulty of APE varies across different machine
translation (MT) systems. We study the outputs of the state-of-the-art APE
model on a difficult APE dataset to analyze the problems in existing APE systems.
Primarily, we find that 1) Artificial corpora with high-quality source text and
machine-translated text more effectively improve the performance of APE models;
2) In-domain artificial training data can better improve the performance of APE
models, while irrelevant out-of-domain data actually interferes with the model;
3) Existing APE models struggle with cases containing long source text or
high-quality machine-translated text; 4) The state-of-the-art APE model handles
grammatical and semantic addition problems well, but its output is prone to
entity and semantic omission errors.
Comment: 14 pages, 4 figures
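As a minimal sketch of the APE input/output contract described above, the toy post-editor below applies hand-written correction rules to MT output. Real APE systems are neural models trained on (source, MT output, post-edit) triples; the rule table here is purely hypothetical and matches naively on substrings:

```python
# Hypothetical correction rules standing in for a learned APE model.
RULES = {
    "informations": "information",  # spurious plural from the MT system
    "a apple": "an apple",          # article agreement error
}

def post_edit(mt_output):
    """Apply each correction rule to the machine-translated text in order."""
    for wrong, right in RULES.items():
        mt_output = mt_output.replace(wrong, right)
    return mt_output

post_edit("He gave me informations about a apple")
# → "He gave me information about an apple"
```

A rule table like this is exactly what data scarcity makes insufficient in practice, which is why the abstract's focus is on augmenting training data for neural post-editors instead.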
Visual Information Guided Zero-Shot Paraphrase Generation
Zero-shot paraphrase generation has drawn much attention, as large-scale,
high-quality paraphrase corpora are limited. Back-translation, also known as the
pivot-based method, is a typical approach to this end. Several works leverage
different kinds of information as the "pivot", such as language, semantic
representation, and so on. In this paper, we explore using visual information,
i.e., images, as the "pivot" of back-translation. Unlike the pipeline
back-translation method, we
propose visual information guided zero-shot paraphrase generation (ViPG) based
only on paired image-caption data. It jointly trains an image captioning model
and a paraphrasing model, and leverages the image captioning model to guide the
training of the paraphrasing model. Both automatic and human evaluation show
that our model can generate paraphrases with good relevancy, fluency, and
diversity, and that images are a promising kind of pivot for zero-shot
paraphrase generation.
Comment: Accepted by COLING 202
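The pipeline back-translation baseline mentioned above routes a sentence through a pivot and back, and the paraphrase falls out of the round trip. A toy sketch, with word-lookup tables invented here to stand in for the two MT systems:

```python
# Invented word-level "translators" standing in for real MT systems.
EN_TO_PIVOT = {"big": "grand", "house": "maison"}
PIVOT_TO_EN = {"grand": "large", "maison": "home"}

def translate(sentence, table):
    """Word-by-word lookup translation; unknown words pass through."""
    return " ".join(table.get(w, w) for w in sentence.split())

def paraphrase_via_pivot(sentence):
    """Pipeline back-translation: source -> pivot language -> source."""
    return translate(translate(sentence, EN_TO_PIVOT), PIVOT_TO_EN)

paraphrase_via_pivot("big house")  # → "large home"
```

ViPG replaces the pivot language with images and trains jointly rather than running such a two-stage pipeline, which avoids compounding errors across the two translation steps.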
A Comprehensive Evaluation of Constrained Text Generation for Large Language Models
Advancements in natural language generation (NLG) and large language models
(LLMs) have led to proficient text generation in various tasks. However,
integrating intricate constraints into neural text generation remains
challenging due to the opacity of LLMs. This study investigates constrained
text generation for LLMs, where predefined constraints are applied during the
generation process. Our research examines multiple LLMs, including ChatGPT and
GPT-4, categorizing constraints into lexical, structural, and relation-based
types. We also present various benchmarks to facilitate fair evaluation. The
study addresses some key research questions, including the extent of LLMs'
compliance with constraints. The results illuminate LLMs' capacities and
deficiencies in incorporating constraints and provide insights for future
developments in
constrained text generation. Code and datasets will be released upon
acceptance.
Comment: Work in progress
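Lexical constraints of the kind categorized above can be verified mechanically once an output is generated. A minimal sketch of such a checker, assuming the constraint is simply that every required keyword appears as a whole word in the output:

```python
import re

def satisfies_lexical(text, required_words):
    """Check that every required keyword occurs as a whole word in text."""
    tokens = set(re.findall(r"[a-z']+", text.lower()))
    return all(w.lower() in tokens for w in required_words)

satisfies_lexical("The cat sat on the mat.", ["cat", "mat"])  # True
satisfies_lexical("The cat sat down.", ["cat", "mat"])        # False
```

Structural and relation-based constraints need richer checkers (length counters, parsers, relation extractors), which is part of why benchmarking constraint compliance is nontrivial.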