4,233 research outputs found
A Continuously Growing Dataset of Sentential Paraphrases
A major challenge in paraphrase research is the lack of parallel corpora. In
this paper, we present a new method to collect large-scale sentential
paraphrases from Twitter by linking tweets through shared URLs. The main
advantage of our method is its simplicity, as it gets rid of the classifier or
human in the loop needed to select data before annotation and subsequent
application of paraphrase identification algorithms in the previous work. We
present the largest human-labeled paraphrase corpus to date of 51,524 sentence
pairs and the first cross-domain benchmarking for automatic paraphrase
identification. In addition, we show that more than 30,000 new sentential
paraphrases can be easily and continuously captured every month at ~70%
precision, and demonstrate their utility for downstream NLP tasks through
phrasal paraphrase extraction. We make our code and data freely available.Comment: 11 pages, accepted to EMNLP 201
A Large-Scale Comparison of Historical Text Normalization Systems
There is no consensus on the state-of-the-art approach to historical text
normalization. Many techniques have been proposed, including rule-based
methods, distance metrics, character-based statistical machine translation, and
neural encoder--decoder models, but studies have used different datasets,
different evaluation methods, and have come to different conclusions. This
paper presents the largest study of historical text normalization done so far.
We critically survey the existing literature and report experiments on eight
languages, comparing systems spanning all categories of proposed normalization
techniques, analysing the effect of training data quantity, and using different
evaluation methods. The datasets and scripts are made publicly available.Comment: Accepted at NAACL 201
Unsupervised Data Augmentation for Less-Resourced Languages with no Standardized Spelling
International audienceNon-standardized languages are a challenge to the construction of representative linguistic resources and to the development of efficient natural language processing tools: when spelling is not determined by a consensual norm, a multiplicity of alternative written forms can be encountered for a given word, inducing a large proportion of out-of-vocabulary words. To embrace this diversity, we propose a methodology based on crowdsourcing alternative spellings from which variation rules are automatically extracted. The rules are further used to match out-of-vocabulary words with one of their spelling variants. This virtuous process enables the unsupervised augmentation of multi-variant lexicons without requiring manual rule definition by experts. We apply this multilingual methodology on Al-satian, a French regional language and provide (i) an intrinsic evaluation of the correctness of the obtained variants pairs, (ii) an extrinsic evaluation on a downstream task: part-of-speech tagging. We show that in a low-resource scenario, collecting spelling variants for only 145 words can lead to (i) the generation of 876 additional variant pairs, (ii) a diminution of out-of-vocabulary words improving the tagging performance by 1 to 4%
- …