Preparing, restructuring, and augmenting a French treebank: lexicalised parsers or coherent treebanks?

Schluter, Natalie; van Genabith, Josef

Publication date: 1 January 2007

Abstract

We present the Modified French Treebank (MFT), a completely revamped French treebank derived from the Paris 7 Treebank (P7T). The MFT is cleaner and more coherent, has several transformed structures, and introduces new linguistic analyses. To determine the effect of these changes, we investigate how the MFT fares in statistical parsing. Probabilistic parsers trained on the MFT training set (currently 3,800 trees) already perform better than their counterparts trained on five times as much data from the P7T (18,548 trees), providing an extreme example of the importance of data quality over quantity in statistical parsing. Moreover, regression analysis on the learning curve of parsers trained on the MFT leads to the prediction that parsers trained on the full projected 18,548-tree MFT training set will far outscore their counterparts trained on the full P7T. These analyses also show how problematic data can lead to problematic conclusions: in particular, we find that lexicalisation in the probabilistic parsing of French is probably not as crucial as was once thought (Arun and Keller, 2005).

Available Versions

DCU Online Research Access Service

oai:doras.dcu.ie:15265
