55 research outputs found
Deep Syntax Annotation of the Sequoia French Treebank
We define a deep syntactic representation scheme for French, which abstracts away from surface syntactic variation and diathesis alternations, and describe the annotation of deep syntactic representations on top of the surface dependency trees of the Sequoia corpus. The resulting deep-annotated corpus, named DEEP-SEQUOIA, is freely available; we hope it will prove useful for corpus linguistics studies and for training deep analyzers as a step towards semantic analysis.
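To make the surface-vs-deep distinction concrete, here is a toy sketch of how a passive surface tree can be mapped onto a deep graph restoring the canonical arguments. The words, labels, and rewrite logic are invented for illustration; this is not the actual DEEP-SEQUOIA scheme.

```python
# Illustrative only: a surface dependency tree for a passive sentence
# ("the book was written by Ann", content words only) and the deep graph
# that restores the canonical (active) argument structure.

surface = {
    # dependent -> (head, surface label)
    "book": ("written", "subj"),   # passive surface subject
    "was":  ("written", "aux"),    # passive auxiliary
    "by":   ("written", "obl"),    # case marker of the agent phrase
    "Ann":  ("by", "pobj"),        # demoted agent
}

def deepen(surface):
    """Toy rewrite: in a passive clause, the surface subject becomes the
    deep object, and the 'by'-phrase argument the deep subject."""
    deep = {}
    is_passive = any(lbl == "aux" for _, lbl in surface.values())
    for dep, (head, lbl) in surface.items():
        if lbl == "aux":
            continue                        # auxiliaries are not content words
        if is_passive and lbl == "subj":
            deep[dep] = (head, "obj")       # canonical object
        elif lbl == "pobj":
            verb = surface[head][0]         # attach directly to the verb
            deep[dep] = (verb, "subj")      # canonical subject
        elif lbl == "obl" and dep == "by":
            continue                        # case marker dropped
        else:
            deep[dep] = (head, lbl)
    return deep

deep = deepen(surface)
print(deep)  # {'book': ('written', 'obj'), 'Ann': ('written', 'subj')}
```

The deep graph keeps only content words and canonical relations, which is the kind of abstraction the scheme above formalizes far more carefully.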
Semi-Automatic Deep Syntactic Annotations of the French Treebank
We describe and evaluate the semi-automatic addition of a deep syntactic layer to the French Treebank (Abeillé and Barrier [1]), using an existing scheme (Candito et al. [6]). While some rare or highly ambiguous deep phenomena are handled manually, the remainder is derived using a graph-rewriting system (Ribeyre et al. [22]). Although not manually corrected, the resulting deep representations can, we believe, pave the way for the emergence of deep syntactic parsers for French.
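The graph-rewriting idea can be sketched as a tiny rule engine that rewrites edge labels until a fixpoint. The labels and rules below are invented for illustration and are not those of the system of Ribeyre et al.

```python
# Toy graph-rewriting pass: each rule matches edges by label and
# rewrites the label; rules apply until nothing changes.

def rewrite(edges, rules):
    """edges: set of (head, label, dependent); rules: old_label -> new_label."""
    changed = True
    while changed:
        changed = False
        for head, label, dep in list(edges):   # snapshot, since we mutate
            if label in rules:
                edges.remove((head, label, dep))
                edges.add((head, rules[label], dep))
                changed = True
    return edges

# Hypothetical passive clause: surface labels rewritten to canonical ones.
edges = {("mange", "suj_passif", "pomme"), ("mange", "p_obj", "Pierre")}
rules = {"suj_passif": "obj", "p_obj": "suj"}
result = rewrite(edges, rules)
print(result)
```

A real system matches richer subgraph patterns with side conditions, but the control flow (match, rewrite, iterate to fixpoint) is the same.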
Corpus annotation within the French FrameNet: a domain-by-domain methodology
This paper reports on the development of a French FrameNet within the ASFALDA project. While the first phase of the project focused on the development of a French set of frames and a corresponding lexicon (Candito et al., 2014), this paper concentrates on the subsequent corpus annotation phase, which covered four notional domains (commercial transactions, cognitive stances, causality, and verbal communication). Since full coverage is not attainable for a relatively new FrameNet project, we argue that focusing on specific notional domains allowed us to obtain full lexical coverage for the frames of those domains, while partially reflecting word sense ambiguities. Furthermore, as frames and roles were annotated on two French treebanks (the French Treebank (Abeillé and Barrier, 2004) and the Sequoia Treebank (Candito and Seddah, 2012)), we were able to extract a syntactico-semantic lexicon from the annotated frames. In its current state, the resource contains 98 frames, 662 frame-evoking words, 872 senses, and about 13,000 annotated frame instances, with their semantic roles assigned to portions of text. The French FrameNet is freely available at alpage.inria.fr/asfalda.
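To illustrate the kind of data a frame-annotated corpus pairs with text, here is a hypothetical frame instance. The sentence, character spans, and the use of the Commerce_sell frame are invented for illustration, not drawn from the ASFALDA corpus.

```python
# One hypothetical frame annotation: a frame-evoking target word plus
# semantic roles, each anchored to a character span of the sentence.

annotation = {
    "sentence": "Pierre a vendu sa voiture à Marie.",
    "frame": "Commerce_sell",
    "target": {"word": "vendu", "span": (9, 14)},
    "roles": {
        "Seller": (0, 6),    # "Pierre"
        "Goods":  (15, 25),  # "sa voiture"
        "Buyer":  (28, 33),  # "Marie"
    },
}

sent = annotation["sentence"]
seller = sent[slice(*annotation["roles"]["Seller"])]
print(seller)  # → Pierre
```

Keeping roles as spans over the raw sentence (rather than token indices) makes it easy to project the annotation onto any tokenization or onto the underlying treebank trees.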
An improved neural network model for joint POS tagging and dependency parsing
We propose a novel neural network model for joint part-of-speech (POS)
tagging and dependency parsing. Our model extends the well-known BIST
graph-based dependency parser (Kiperwasser and Goldberg, 2016) by incorporating
a BiLSTM-based tagging component to produce automatically predicted POS tags
for the parser. On the benchmark English Penn treebank, our model obtains
strong UAS and LAS scores at 94.51% and 92.87%, respectively, producing 1.5+%
absolute improvements to the BIST graph-based parser, and also obtaining a
state-of-the-art POS tagging accuracy at 97.97%. Furthermore, experimental
results on parsing 61 "big" Universal Dependencies treebanks from raw texts
show that our model outperforms the baseline UDPipe (Straka and Straková,
2017) with 0.8% higher average POS tagging score and 3.6% higher average LAS
score. In addition, with our model, we also obtain state-of-the-art downstream
task scores for biomedical event extraction and opinion analysis applications.
Our code is available together with all pre-trained models at:
https://github.com/datquocnguyen/jPTDP
Comment: 11 pages; In Proceedings of the CoNLL 2018 Shared Task: Multilingual
Parsing from Raw Text to Universal Dependencies, to appear
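As a rough illustration of the joint architecture described above, where the tagger's prediction is fed into the parser's word representations, here is a toy pure-Python sketch. The vectors and scores are hand-picked stand-ins, not trained BiLSTM states, and the "biaffine" score is reduced to a plain dot product.

```python
# Toy joint POS tagging + dependency parsing pipeline:
# 1) pick the best tag per word, 2) append the tag embedding to the
# word vector, 3) score head candidates over the enriched vectors.

tag_scores = {                       # stand-in for tagger output
    "the":    {"DET": 0.9, "NOUN": 0.1},
    "cat":    {"DET": 0.2, "NOUN": 0.8},
    "sleeps": {"VERB": 1.0},
}
tag_emb  = {"DET": [1.0, 0.0], "NOUN": [0.0, 1.0], "VERB": [0.5, 0.5]}
word_vec = {"the": [0.1, 0.2], "cat": [0.3, 0.1], "sleeps": [0.2, 0.4]}

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def joint_parse(words):
    # tagging step: argmax over tag scores
    tags = {w: max(tag_scores[w], key=tag_scores[w].get) for w in words}
    # enrich each word vector with its *predicted* tag embedding
    reps = {w: word_vec[w] + tag_emb[tags[w]] for w in words}
    # parsing step: each word's head is the candidate with the best score
    heads = {}
    for d in words:
        heads[d] = max((h for h in words if h != d),
                       key=lambda h: dot(reps[h], reps[d]))
    return tags, heads

tags, heads = joint_parse(["the", "cat", "sleeps"])
```

The point of the joint setup is step 2: because the tag embedding enters the parser's representations, tagging errors and parsing scores interact during training, which is where the reported UAS/LAS gains come from.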
An Overview of the French QuestionBank: Building a Treebank of Questions for French
We present the French QuestionBank, a treebank of 2,600 questions annotated with both dependency and phrase-based structures. Two thirds of the corpus are aligned with the English QuestionBank (Judge et al., 2006). Being freely available, this treebank should prove useful for building robust NLP systems. We also discuss the development costs of such resources.
A type-logical treebank for French
The goal of this paper is to describe the TLGbank, a treebank of type-logical proofs semi-automatically extracted from the French Treebank. Though the framework chosen for the treebank is that of multimodal type-logical grammars, we have ensured that the analyses are compatible with other modern type-logical grammars, such as the displacement calculus and first-order linear logic. We describe the extraction procedure, analyse first results, and compare the treebank to the CCGbank.
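For readers unfamiliar with type-logical proofs, a standard Lambek-style derivation illustrates the kind of object such a treebank stores. This example is ours, using plain product-free Lambek types; the TLGbank itself uses richer multimodal types.

```latex
% Lexical type assignments (illustrative):
%   Jean |- np      aime |- (np\s)/np      Marie |- np
\[
\frac{\textit{aime} \vdash (np\backslash s)/np \qquad \textit{Marie} \vdash np}
     {\textit{aime Marie} \vdash np\backslash s}\;[/E]
\qquad
\frac{\textit{Jean} \vdash np \qquad \textit{aime Marie} \vdash np\backslash s}
     {\textit{Jean aime Marie} \vdash s}\;[\backslash E]
\]
```

The transitive verb first consumes its object to the right (rule /E), then the resulting verb phrase consumes its subject to the left (rule \E), yielding a proof that the whole string is a sentence of type s.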
Crowdsourcing Complex Language Resources: Playing to Annotate Dependency Syntax
This article presents the results we obtained on a complex annotation task (that of dependency syntax) using a specifically designed Game with a Purpose, ZombiLingo. We show that with suitable mechanisms (decomposition of the task, training of the players, and regular control of annotation quality during the game), it is possible to obtain annotations whose quality is significantly higher than that obtainable with a parser, provided that enough players participate. The source code of the game and the resulting annotated corpora (for French) are freely available.
A Deep Syntactic Dependency Annotation Scheme for French
We describe in this article an annotation scheme for deep dependency syntax, built as an abstraction over the surface annotation scheme of the Sequoia corpus and expressing the grammatical relations between content words. When these grammatical relations take part in verbal diatheses, we view the diatheses as resulting from redistributions of a canonical diathesis, and it is this canonical diathesis that we retain in our annotation scheme.
CamemBERT: a Tasty French Language Model
Pretrained language models are now ubiquitous in Natural Language Processing.
Despite their success, most available models have either been trained on
English data or on the concatenation of data in multiple languages. This makes
practical use of such models (in all languages except English) very limited.
In this paper, we investigate the feasibility of training monolingual
Transformer-based language models for other languages, taking French as an
example and evaluating our language models on part-of-speech tagging,
dependency parsing, named entity recognition and natural language inference
tasks. We show that the use of web crawled data is preferable to the use of
Wikipedia data. More surprisingly, we show that a relatively small web crawled
dataset (4GB) leads to results that are as good as those obtained using larger
datasets (130+GB). Our best performing model CamemBERT reaches or improves the
state of the art in all four downstream tasks.
Comment: ACL 2020 long paper. Web site: https://camembert-model.f
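The masked-language-modelling objective behind BERT-style models such as CamemBERT can be sketched as a data-preparation step: hide a fraction of the tokens and train the model to recover them. This is a simplification (real BERT-style masking also sometimes keeps the original token or substitutes a random one), and the masking rate and sentence are illustrative.

```python
# Simplified MLM data preparation: replace ~15% of tokens with a mask
# symbol and record the gold labels the model must predict back.
import random

MASK = "<mask>"

def mask_tokens(tokens, rate=0.15, seed=0):
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < rate:
            masked.append(MASK)
            labels.append(tok)    # position contributes to the loss
        else:
            masked.append(tok)
            labels.append(None)   # position ignored by the loss
    return masked, labels

toks = "le chat dort sur le canapé".split()
masked, labels = mask_tokens(toks)
print(masked)
```

Because the objective needs only raw text, it is what makes pretraining on large web-crawled corpora (as opposed to annotated data) possible in the first place.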