55 research outputs found

    Deep Syntax Annotation of the Sequoia French Treebank

    We define a deep syntactic representation scheme for French, which abstracts away from surface syntactic variation and diathesis alternations, and describe the annotation of deep syntactic representations on top of the surface dependency trees of the Sequoia corpus. The resulting deep-annotated corpus, named DEEP-SEQUOIA, is freely available, and will hopefully prove useful for corpus-linguistic studies and for training deep analyzers as a step towards semantic analysis.

    Semi-Automatic Deep Syntactic Annotations of the French Treebank

    We describe and evaluate the semi-automatic addition of a deep syntactic layer to the French Treebank (Abeillé and Barrier [1]), using an existing scheme (Candito et al. [6]). While some rare or highly ambiguous deep phenomena are handled manually, the remainder is derived using a graph-rewriting system (Ribeyre et al. [22]). Although not manually corrected, the resulting deep representations can, we believe, pave the way for the emergence of deep syntactic parsers for French.

    Corpus annotation within the French FrameNet: a domain-by-domain methodology

    This paper reports on the development of a French FrameNet, within the ASFALDA project. While the first phase of the project focused on the development of a French set of frames and corresponding lexicon (Candito et al., 2014), this paper concentrates on the subsequent corpus annotation phase, which focused on four notional domains (commercial transactions, cognitive stances, causality and verbal communication). Given that full coverage is not attainable for a relatively new FrameNet project, we advocate that focusing on specific notional domains allowed us to obtain full lexical coverage for the frames of these domains, while partially reflecting word sense ambiguities. Furthermore, as frames and roles were annotated on two French treebanks (the French Treebank (Abeillé and Barrier, 2004) and the Sequoia Treebank (Candito and Seddah, 2012)), we were able to extract a syntactico-semantic lexicon from the annotated frames. In the resource's current state, there are 98 frames, 662 frame-evoking words, 872 senses, and about 13,000 annotated frames, with their semantic roles assigned to portions of text. The French FrameNet is freely available at alpage.inria.fr/asfalda.
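    A syntactico-semantic lexicon of this kind pairs frame-evoking words with frames and their core roles. As an illustration only (the words, frame names and roles below are invented stand-ins, not the actual resource contents), such a lexicon can be represented and queried like this:

```python
# Illustrative toy lexicon in the spirit of the one extracted from the
# French FrameNet annotations. Entries below are invented examples.
FRAME_LEXICON = {
    "acheter": {  # "to buy" evokes a commercial-transaction frame
        "frame": "Commerce_buy",
        "core_roles": ["Buyer", "Goods"],
    },
    "penser": {  # "to think" evokes a cognitive-stance frame
        "frame": "Opinion",
        "core_roles": ["Cognizer", "Opinion"],
    },
}

def frames_for(word):
    """Return the frame evoked by a word, or None if it is not in the lexicon."""
    entry = FRAME_LEXICON.get(word)
    return entry["frame"] if entry else None
```

    In the real resource each word can of course have several senses, hence several frames; a single frame per word is the simplification made here.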

    An improved neural network model for joint POS tagging and dependency parsing

    We propose a novel neural network model for joint part-of-speech (POS) tagging and dependency parsing. Our model extends the well-known BIST graph-based dependency parser (Kiperwasser and Goldberg, 2016) by incorporating a BiLSTM-based tagging component to produce automatically predicted POS tags for the parser. On the benchmark English Penn Treebank, our model obtains strong UAS and LAS scores of 94.51% and 92.87%, respectively, an absolute improvement of over 1.5% over the BIST graph-based parser, and also obtains a state-of-the-art POS tagging accuracy of 97.97%. Furthermore, experimental results on parsing 61 "big" Universal Dependencies treebanks from raw texts show that our model outperforms the baseline UDPipe (Straka and Straková, 2017) with a 0.8% higher average POS tagging score and a 3.6% higher average LAS score. In addition, with our model, we also obtain state-of-the-art downstream task scores for biomedical event extraction and opinion analysis applications. Our code is available together with all pre-trained models at https://github.com/datquocnguyen/jPTDP.
    Comment: 11 pages; in Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, to appear.
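    The coupling described above, where the tagging component's predictions are fed to the parser, amounts to a simple data-flow contract. A minimal sketch of that interface (the components here are dummy stand-ins for illustration, not the actual BiLSTM model):

```python
def joint_tag_and_parse(words, tagger, parser):
    """Data flow of a joint model: predict POS tags first, then pass
    both the words and the predicted tags to the graph-based parser."""
    tags = tagger(words)
    arcs = parser(words, tags)
    return tags, arcs

# Dummy stand-ins for the neural components, for illustration only:
def toy_tagger(words):
    return ["DET" if w in ("the", "le") else "NOUN" for w in words]

def toy_parser(words, tags):
    # Attach each determiner to the following token; nouns to the root (0).
    return [(i + 1, i + 2) if t == "DET" else (i + 1, 0)
            for i, t in enumerate(tags)]
```

    The point of the sketch is only that the parser consumes predicted rather than gold tags, which is what makes the tagging and parsing objectives trainable jointly.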

    An Overview of the French QuestionBank: Building a Treebank of Questions for French

    We present the French QuestionBank, a treebank of 2600 questions annotated with dependency and phrase-based structures. Two thirds of it are aligned with the English QuestionBank (Judge et al., 2006) and, being freely available, this treebank will prove useful for building robust NLP systems. We also discuss the development costs of such resources.

    A type-logical treebank for French

    The goal of the current paper is to describe the TLGbank, a treebank of type-logical proofs semi-automatically extracted from the French Treebank. Though the framework chosen for the treebank is multimodal type-logical grammars, we have ensured that the analysis is compatible with other modern type-logical grammars, such as the displacement calculus and first-order linear logic. We describe the extraction procedure, analyse the first results and compare the treebank to the CCGbank.

    Crowdsourcing Complex Language Resources: Playing to Annotate Dependency Syntax

    This article presents the results we obtained on a complex annotation task (that of dependency syntax) using a specifically designed Game with a Purpose, ZombiLingo. We show that with suitable mechanisms (decomposition of the task, training of the players and regular control of the annotation quality during the game), it is possible to obtain annotations whose quality is significantly higher than that obtainable with a parser, provided that enough players participate. The source code of the game and the resulting annotated corpora (for French) are freely available.
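    One plausible aggregation mechanism behind such quality control (a generic sketch, not the actual ZombiLingo implementation) is majority voting over the head/label pairs proposed by different players for the same token:

```python
from collections import Counter

def aggregate_votes(votes):
    """Return the (head, label) pair proposed most often by players.
    votes: list of (head_index, dependency_label) pairs for one token.
    Ties are broken by first-seen order, since Counter preserves
    insertion order for equal counts."""
    return Counter(votes).most_common(1)[0][0]
```

    Real games typically weight votes by each player's past accuracy on control items; plain majority voting is the simplest instance of the idea.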

    An Annotation Scheme for Deep Syntactic Dependencies in French

    We describe in this article an annotation scheme for deep dependency syntax, built from the surface annotation scheme of the Sequoia corpus, abstracting away from it and expressing the grammatical relations between content words. When these grammatical relations take part in verbal diatheses, we consider the diatheses as resulting from redistributions from the canonical diathesis, which is the one retained in our annotation scheme.
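    The redistribution idea can be illustrated with a toy rewrite on a French passive (the edge format and label names here are simplified stand-ins, not the exact Sequoia label set):

```python
# Toy surface-to-deep rewrite: map the surface functions of a French passive
# back to the canonical (active) diathesis.
# An edge is (governor, function, dependent); the example is a simplified
# analysis of "Le livre est lu par Marie" ("The book is read by Marie").
surface = [
    ("lu", "suj", "livre"),          # surface subject of the passive participle
    ("lu", "p_obj.agent", "Marie"),  # agent complement introduced by "par"
]

def to_canonical(edges):
    """Rewrite passive surface functions to canonical deep functions:
    the surface subject becomes the deep object, and the agent
    complement becomes the deep subject. Other edges pass through."""
    mapping = {"suj": "obj", "p_obj.agent": "suj"}
    return [(gov, mapping.get(fct, fct), dep) for gov, fct, dep in edges]
```

    Systems like the graph-rewriting approach cited above express such redistributions as declarative rules over dependency graphs rather than a hard-coded table; the table here is only the smallest possible instance.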

    CamemBERT: a Tasty French Language Model

    Pretrained language models are now ubiquitous in Natural Language Processing. Despite their success, most available models have either been trained on English data or on the concatenation of data in multiple languages. This makes practical use of such models (in all languages except English) very limited. In this paper, we investigate the feasibility of training monolingual Transformer-based language models for other languages, taking French as an example and evaluating our language models on part-of-speech tagging, dependency parsing, named entity recognition and natural language inference tasks. We show that the use of web-crawled data is preferable to the use of Wikipedia data. More surprisingly, we show that a relatively small web-crawled dataset (4GB) leads to results that are as good as those obtained using larger datasets (130+GB). Our best-performing model, CamemBERT, reaches or improves the state of the art in all four downstream tasks.
    Comment: ACL 2020 long paper. Web site: https://camembert-model.f