Extracting sentence simplification pairs from French comparable corpora using a two-step filtering method

Abstract

Automatic Text Simplification (ATS) aims to reduce the linguistic complexity of texts while retaining their meaning. Although the task is of interest from both a societal and a computational perspective, the lack of monolingual parallel data hampers the rapid implementation of ATS models, especially for languages with fewer resources than English. This paper therefore investigates how to create a general-language parallel simplification dataset for French by extracting complex-simple sentence pairs from comparable corpora such as Wikipedia and its simplified counterpart, Vikidia. A two-step automatic filtering process sequentially addresses the two primary conditions that a simplified sentence must satisfy to be considered valid: (1) preservation of the original meaning, and (2) a gain in simplicity with respect to the source text. Using this approach, we provide a dataset of parallel sentence simplifications (WiViCo) that can later be used to train French sequence-to-sequence general-language ATS models.
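As a rough illustration of the sequential filtering idea, the sketch below applies a meaning-preservation check first and a simplicity-gain check second. It is not the method used to build WiViCo: the abstract does not specify the metrics, so the Jaccard token overlap, the letter-count complexity proxy, the thresholds, and all function names here are assumptions chosen only to make the two steps concrete.

```python
import re


def tokens(sentence):
    """Lowercase word tokens; \\w also matches accented French letters."""
    return re.findall(r"\w+", sentence.lower())


def meaning_score(complex_sent, simple_sent):
    """Assumed stand-in for meaning preservation: Jaccard overlap of token sets."""
    a, b = set(tokens(complex_sent)), set(tokens(simple_sent))
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def complexity(sentence):
    """Assumed stand-in for complexity: total letters across all tokens,
    a crude proxy that grows with both sentence length and word length."""
    return sum(len(t) for t in tokens(sentence))


def filter_pairs(candidates, min_meaning=0.3, min_gain=1):
    """Two-step filter over (complex, simple) candidate pairs.

    Step 1 keeps pairs whose meaning score reaches min_meaning.
    Step 2 keeps, among those, pairs whose simple side is less complex
    than the complex side by at least min_gain.
    Both thresholds are arbitrary illustrative values.
    """
    step1 = [p for p in candidates if meaning_score(*p) >= min_meaning]
    step2 = [p for p in step1 if complexity(p[0]) - complexity(p[1]) >= min_gain]
    return step2


# Hypothetical candidate pairs (not taken from Wikipedia or Vikidia).
candidates = [
    ("Le chat domestique est un mammifère carnivore de la famille des félidés.",
     "Le chat est un mammifère carnivore."),
    ("Paris est la capitale de la France.",
     "La baguette est un pain."),
    ("Le chien aboie.",
     "Le chien domestique aboie très fréquemment."),
]

kept = filter_pairs(candidates)
for complex_sent, simple_sent in kept:
    print(complex_sent, "->", simple_sent)
```

With these toy inputs, the second pair is rejected at step 1 because the sentences share too few tokens, and the third is rejected at step 2 because the candidate simplification is longer than its source. Running the steps in this order means the simplicity check is only applied to pairs that already appear to convey the same content.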
