55 research outputs found
Deep Syntax Annotation of the Sequoia French Treebank
We define a deep syntactic representation scheme for French, which abstracts away from surface syntactic variation and diathesis alternations, and describe the annotation of deep syntactic representations on top of the surface dependency trees of the Sequoia corpus. The resulting deep-annotated corpus, named DEEP-SEQUOIA, is freely available; we hope it will prove useful for corpus linguistics studies and for training deep analyzers as a step towards semantic analysis.
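To make the surface-vs-deep distinction concrete, here is a toy sketch of how a passive surface tree can be mapped onto a deep graph restoring the canonical arguments. The words, labels, and rewrite logic are invented for illustration; this is not the actual DEEP-SEQUOIA scheme.

```python
# Illustrative only: a surface dependency tree for a passive sentence
# ("the book was written by Ann", content words only) and the deep graph
# that restores the canonical (active) argument structure.

surface = {
    # dependent -> (head, surface label)
    "book": ("written", "subj"),   # passive surface subject
    "was":  ("written", "aux"),    # passive auxiliary
    "by":   ("written", "obl"),    # case marker of the agent phrase
    "Ann":  ("by", "pobj"),        # demoted agent
}

def deepen(surface):
    """Toy rewrite: in a passive clause, the surface subject becomes the
    deep object, and the 'by'-phrase argument the deep subject."""
    deep = {}
    is_passive = any(lbl == "aux" for _, lbl in surface.values())
    for dep, (head, lbl) in surface.items():
        if lbl == "aux":
            continue                        # auxiliaries are not content words
        if is_passive and lbl == "subj":
            deep[dep] = (head, "obj")       # canonical object
        elif lbl == "pobj":
            verb = surface[head][0]         # attach directly to the verb
            deep[dep] = (verb, "subj")      # canonical subject
        elif lbl == "obl" and dep == "by":
            continue                        # case marker dropped
        else:
            deep[dep] = (head, lbl)
    return deep

deep = deepen(surface)
print(deep)  # {'book': ('written', 'obj'), 'Ann': ('written', 'subj')}
```

The deep graph keeps only content words and canonical relations, which is the kind of abstraction the scheme above formalizes far more carefully.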
Semi-Automatic Deep Syntactic Annotations of the French Treebank
We describe and evaluate the semi-automatic addition of a deep syntactic layer to the French Treebank (Abeillé and Barrier [1]), using an existing scheme (Candito et al. [6]). While some rare or highly ambiguous deep phenomena are handled manually, the remainder is derived using a graph-rewriting system (Ribeyre et al. [22]). Although not manually corrected, the resulting deep representations can, we believe, pave the way for the emergence of deep syntactic parsers for French.
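The graph-rewriting idea can be sketched as a tiny rule engine that rewrites edge labels until a fixpoint. The labels and rules below are invented for illustration and are not those of the system of Ribeyre et al.

```python
# Toy graph-rewriting pass: each rule matches edges by label and
# rewrites the label; rules apply until nothing changes.

def rewrite(edges, rules):
    """edges: set of (head, label, dependent); rules: old_label -> new_label."""
    changed = True
    while changed:
        changed = False
        for head, label, dep in list(edges):   # snapshot, since we mutate
            if label in rules:
                edges.remove((head, label, dep))
                edges.add((head, rules[label], dep))
                changed = True
    return edges

# Hypothetical passive clause: surface labels rewritten to canonical ones.
edges = {("mange", "suj_passif", "pomme"), ("mange", "p_obj", "Pierre")}
rules = {"suj_passif": "obj", "p_obj": "suj"}
result = rewrite(edges, rules)
print(result)
```

A real system matches richer subgraph patterns with side conditions, but the control flow (match, rewrite, iterate to fixpoint) is the same.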
Corpus annotation within the French FrameNet: a domain-by-domain methodology
This paper reports on the development of a French FrameNet within the ASFALDA project. While the first phase of the project focused on the development of a French set of frames and a corresponding lexicon (Candito et al., 2014), this paper concentrates on the subsequent corpus annotation phase, which covered four notional domains (commercial transactions, cognitive stances, causality, and verbal communication). Since full coverage is not attainable for a relatively new FrameNet project, we argue that focusing on specific notional domains allowed us to obtain full lexical coverage for the frames of those domains, while partially reflecting word sense ambiguities. Furthermore, as frames and roles were annotated on two French treebanks (the French Treebank (Abeillé and Barrier, 2004) and the Sequoia Treebank (Candito and Seddah, 2012)), we were able to extract a syntactico-semantic lexicon from the annotated frames. In its current state, the resource contains 98 frames, 662 frame-evoking words, 872 senses, and about 13,000 annotated frame instances, with their semantic roles assigned to portions of text. The French FrameNet is freely available at alpage.inria.fr/asfalda.
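To illustrate the kind of data a frame-annotated corpus pairs with text, here is a hypothetical frame instance. The sentence, character spans, and the use of the Commerce_sell frame are invented for illustration, not drawn from the ASFALDA corpus.

```python
# One hypothetical frame annotation: a frame-evoking target word plus
# semantic roles, each anchored to a character span of the sentence.

annotation = {
    "sentence": "Pierre a vendu sa voiture à Marie.",
    "frame": "Commerce_sell",
    "target": {"word": "vendu", "span": (9, 14)},
    "roles": {
        "Seller": (0, 6),    # "Pierre"
        "Goods":  (15, 25),  # "sa voiture"
        "Buyer":  (28, 33),  # "Marie"
    },
}

sent = annotation["sentence"]
seller = sent[slice(*annotation["roles"]["Seller"])]
print(seller)  # → Pierre
```

Keeping roles as spans over the raw sentence (rather than token indices) makes it easy to project the annotation onto any tokenization or onto the underlying treebank trees.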
An improved neural network model for joint POS tagging and dependency parsing
We propose a novel neural network model for joint part-of-speech (POS)
tagging and dependency parsing. Our model extends the well-known BIST
graph-based dependency parser (Kiperwasser and Goldberg, 2016) by incorporating
a BiLSTM-based tagging component to produce automatically predicted POS tags
for the parser. On the benchmark English Penn treebank, our model obtains
strong UAS and LAS scores at 94.51% and 92.87%, respectively, producing 1.5+%
absolute improvements to the BIST graph-based parser, and also obtaining a
state-of-the-art POS tagging accuracy at 97.97%. Furthermore, experimental
results on parsing 61 "big" Universal Dependencies treebanks from raw texts
show that our model outperforms the baseline UDPipe (Straka and Straková,
2017) with 0.8% higher average POS tagging score and 3.6% higher average LAS
score. In addition, with our model, we also obtain state-of-the-art downstream
task scores for biomedical event extraction and opinion analysis applications.
Our code is available together with all pre-trained models at:
https://github.com/datquocnguyen/jPTDP
Comment: 11 pages; In Proceedings of the CoNLL 2018 Shared Task: Multilingual
Parsing from Raw Text to Universal Dependencies, to appear
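As a rough illustration of the joint architecture described above, where the tagger's prediction is fed into the parser's word representations, here is a toy pure-Python sketch. The vectors and scores are hand-picked stand-ins, not trained BiLSTM states, and the "biaffine" score is reduced to a plain dot product.

```python
# Toy joint POS tagging + dependency parsing pipeline:
# 1) pick the best tag per word, 2) append the tag embedding to the
# word vector, 3) score head candidates over the enriched vectors.

tag_scores = {                       # stand-in for tagger output
    "the":    {"DET": 0.9, "NOUN": 0.1},
    "cat":    {"DET": 0.2, "NOUN": 0.8},
    "sleeps": {"VERB": 1.0},
}
tag_emb  = {"DET": [1.0, 0.0], "NOUN": [0.0, 1.0], "VERB": [0.5, 0.5]}
word_vec = {"the": [0.1, 0.2], "cat": [0.3, 0.1], "sleeps": [0.2, 0.4]}

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def joint_parse(words):
    # tagging step: argmax over tag scores
    tags = {w: max(tag_scores[w], key=tag_scores[w].get) for w in words}
    # enrich each word vector with its *predicted* tag embedding
    reps = {w: word_vec[w] + tag_emb[tags[w]] for w in words}
    # parsing step: each word's head is the candidate with the best score
    heads = {}
    for d in words:
        heads[d] = max((h for h in words if h != d),
                       key=lambda h: dot(reps[h], reps[d]))
    return tags, heads

tags, heads = joint_parse(["the", "cat", "sleeps"])
```

The point of the joint setup is step 2: because the tag embedding enters the parser's representations, tagging errors and parsing scores interact during training, which is where the reported UAS/LAS gains come from.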
An Overview of the French QuestionBank: Building a Treebank of Questions for French
We present the French QuestionBank, a treebank of 2,600 questions annotated with both dependency and phrase-based structures. Two thirds of the corpus are aligned with the English QuestionBank (Judge et al., 2006). Being freely available, this treebank should prove useful for building robust NLP systems. We also discuss the development costs of such resources.
A type-logical treebank for French
The goal of this paper is to describe the TLGbank, a treebank of type-logical proofs semi-automatically extracted from the French Treebank. Though the framework chosen for the treebank is that of multimodal type-logical grammars, we have ensured that the analyses are compatible with other modern type-logical grammars, such as the displacement calculus and first-order linear logic. We describe the extraction procedure, analyse first results, and compare the treebank to the CCGbank.
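For readers unfamiliar with type-logical proofs, a standard Lambek-style derivation illustrates the kind of object such a treebank stores. This example is ours, using plain product-free Lambek types; the TLGbank itself uses richer multimodal types.

```latex
% Lexical type assignments (illustrative):
%   Jean |- np      aime |- (np\s)/np      Marie |- np
\[
\frac{\textit{aime} \vdash (np\backslash s)/np \qquad \textit{Marie} \vdash np}
     {\textit{aime Marie} \vdash np\backslash s}\;[/E]
\qquad
\frac{\textit{Jean} \vdash np \qquad \textit{aime Marie} \vdash np\backslash s}
     {\textit{Jean aime Marie} \vdash s}\;[\backslash E]
\]
```

The transitive verb first consumes its object to the right (rule /E), then the resulting verb phrase consumes its subject to the left (rule \E), yielding a proof that the whole string is a sentence of type s.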
Crowdsourcing Complex Language Resources: Playing to Annotate Dependency Syntax
This article presents the results we obtained on a complex annotation task (that of dependency syntax) using a specifically designed Game with a Purpose, ZombiLingo. We show that with suitable mechanisms (decomposition of the task, training of the players, and regular control of annotation quality during the game), it is possible to obtain annotations whose quality is significantly higher than that obtainable with a parser, provided that enough players participate. The source code of the game and the resulting annotated corpora (for French) are freely available.
A Deep Syntactic Dependency Annotation Scheme for French
We describe in this article an annotation scheme for deep dependency syntax, built as an abstraction over the surface annotation scheme of the Sequoia corpus and expressing the grammatical relations between content words. When these grammatical relations take part in verbal diatheses, we view the diatheses as resulting from redistributions of a canonical diathesis, and it is this canonical diathesis that we retain in our annotation scheme.
CamemBERT: a Tasty French Language Model
Pretrained language models are now ubiquitous in Natural Language Processing.
Despite their success, most available models have either been trained on
English data or on the concatenation of data in multiple languages. This makes
practical use of such models (in all languages except English) very limited.
In this paper, we investigate the feasibility of training monolingual
Transformer-based language models for other languages, taking French as an
example and evaluating our language models on part-of-speech tagging,
dependency parsing, named entity recognition and natural language inference
tasks. We show that the use of web crawled data is preferable to the use of
Wikipedia data. More surprisingly, we show that a relatively small web crawled
dataset (4GB) leads to results that are as good as those obtained using larger
datasets (130+GB). Our best performing model CamemBERT reaches or improves the
state of the art in all four downstream tasks.
Comment: ACL 2020 long paper. Web site: https://camembert-model.f
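The masked-language-modelling objective behind BERT-style models such as CamemBERT can be sketched as a data-preparation step: hide a fraction of the tokens and train the model to recover them. This is a simplification (real BERT-style masking also sometimes keeps the original token or substitutes a random one), and the masking rate and sentence are illustrative.

```python
# Simplified MLM data preparation: replace ~15% of tokens with a mask
# symbol and record the gold labels the model must predict back.
import random

MASK = "<mask>"

def mask_tokens(tokens, rate=0.15, seed=0):
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < rate:
            masked.append(MASK)
            labels.append(tok)    # position contributes to the loss
        else:
            masked.append(tok)
            labels.append(None)   # position ignored by the loss
    return masked, labels

toks = "le chat dort sur le canapé".split()
masked, labels = mask_tokens(toks)
print(masked)
```

Because the objective needs only raw text, it is what makes pretraining on large web-crawled corpora (as opposed to annotated data) possible in the first place.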