Extracting sentence simplification pairs from French comparable corpora using a two-step filtering method

Ormaechea Grijalba, Lucía; Tsourakis, Nikolaos

Authors: Lucía Ormaechea Grijalba, Nikolaos Tsourakis
Publication date: 1 January 2023

Abstract

Automatic Text Simplification (ATS) aims to simplify texts by reducing their linguistic complexity while retaining their meaning. Although ATS is an interesting task from both a societal and a computational perspective, the lack of monolingual parallel data prevents an agile implementation of ATS models, especially in languages less resource-rich than English. For these reasons, this paper investigates how to create a general-language parallel simplification dataset for French, using a method that extracts complex-simple sentence pairs from comparable corpora such as Wikipedia and its simplified counterpart, Vikidia. Using a two-step automatic filtering process, we sequentially address the two primary conditions that must be satisfied for a simplified sentence to be considered valid: (1) preservation of the original meaning, and (2) simplicity gain with respect to the source text. With this approach, we provide a dataset of parallel sentence simplifications (WiViCo) that can later be used to train French sequence-to-sequence general-language ATS models.
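
The sequential logic of the two-step filter can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the measures used, so the token-overlap (Jaccard) similarity, the length-based complexity score, both thresholds, and all function names below are assumptions chosen only to show how a candidate pair must pass condition (1) before condition (2) is checked.

```python
import re


def tokens(sentence):
    # Lowercased word tokens; \w is Unicode-aware in Python 3,
    # so accented French letters are kept inside words.
    return re.findall(r"\w+", sentence.lower())


def jaccard(a, b):
    # Assumed stand-in for a meaning-preservation measure:
    # overlap between the two sentences' vocabularies.
    ta, tb = set(tokens(a)), set(tokens(b))
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def complexity(sentence):
    # Assumed stand-in for a complexity measure:
    # sentence length in tokens plus mean word length.
    toks = tokens(sentence)
    if not toks:
        return 0.0
    return len(toks) + sum(len(t) for t in toks) / len(toks)


def two_step_filter(candidates, min_similarity=0.3, min_gain=1.0):
    # candidates: iterable of (complex_sentence, simple_sentence) pairs.
    # Both thresholds are hypothetical values for illustration.
    kept = []
    for complex_sent, simple_sent in candidates:
        # Step 1: discard pairs that do not preserve the meaning.
        if jaccard(complex_sent, simple_sent) < min_similarity:
            continue
        # Step 2: discard pairs without a simplicity gain.
        if complexity(complex_sent) - complexity(simple_sent) < min_gain:
            continue
        kept.append((complex_sent, simple_sent))
    return kept
```

Applying the steps in this order means the simplicity comparison is only computed for pairs that already share their meaning, which mirrors the sequential treatment of the two conditions described above.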

Available Versions

Archive ouverte UNIGE

oai:unige.ch:aou:unige:169798

Last time updated on 07/06/2024