18 research outputs found

Morphology based automatic acquisition of large-coverage lexica

Author: Clément Lionel
Lang Bernard
Sagot Benoît
Publication venue: HAL CCSD
Publication date: 01/01/2004

In this article, we introduce a new technique for constructing wide-coverage morphological lexica from large corpora and morphological knowledge, with an application to French. It relies on the idea that the existence of a hypothetical lemma can be guessed if several different words found in the corpus are best interpreted as morphological variants of this lemma. We first validated our technique by extracting verbs and adjectives from a general French corpus of 25 million words. Compared with other lexical resources available for French, our results are very satisfying, since we cover many words, often derived words, that are not always present in other lexica. Applying our algorithm to the acquisition of domain-specific adjectives from a botanical corpus also gave very good results, demonstrating that it can be used to extract domain-specific lexica. Moreover, it can be generalized to any language with substantial morphology.
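The lemma-guessing idea in this abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: the ending list, the `guess_lemmas` function, and the `min_forms` threshold are all hypothetical, chosen only to show how a lemma gains support when several distinct corpus words can be read as its inflected forms.

```python
from collections import defaultdict

# Hypothetical toy paradigm: a few endings of regular French -er verbs.
# This is an illustrative assumption, not the rule set used in the article.
ER_VERB_ENDINGS = ["e", "es", "ons", "ez", "ent", "er", "ait", "é"]


def guess_lemmas(corpus_words, endings=ER_VERB_ENDINGS, min_forms=3):
    """Return {lemma: attested forms} for lemmas backed by >= min_forms distinct words."""
    support = defaultdict(set)
    for word in set(corpus_words):
        for ending in endings:
            stem = word[: len(word) - len(ending)]
            # Each word that matches an ending votes for the lemma stem + "er".
            if word.endswith(ending) and len(stem) >= 2:
                support[stem + "er"].add(word)
    # Keep only lemmas supported by enough different surface forms.
    return {
        lemma: sorted(forms)
        for lemma, forms in support.items()
        if len(forms) >= min_forms
    }


words = ["chante", "chantons", "chantez", "chanter", "table", "tables"]
candidates = guess_lemmas(words)
```

Here "chanter" is kept because four distinct words support it, while the spurious lemma "tabler" is rejected because only two forms ("table", "tables") vote for it. A realistic system would also weigh competing analyses, which this sketch does not attempt.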

CiteSeerX

INRIA a CCSD electronic archive server

VfrLPL

Author: RAUZY Stéphane
Publication venue: http://lpl-aix.fr
Publication date: 11/05/2007

We present a syntactic lexicon of French verbs. The resource contains about 8,800 entries (6,700 distinct verbs), for which we provide the conjugated forms, their corresponding phonetized forms, and an indication of their frequency of use. For each verb we give its auxiliary, whether it is pronominal, and the information characterizing its transitivity. While building this resource, we took particular care to validate the entries produced by cross-checking our results against other reference resources. We make a preliminary version of the lexicon available to the community, the electronic resource VfrLPL1.0.xml, for which the usage frequencies have not been recomputed. This work is part of a program conducted for several years at the Laboratoire Parole et Langage, aimed at developing and maintaining a reliable, wide-coverage lexical resource for French.

Speech & Language Data Repository (SLDR)

From GLÀFF to PsychoGLÀFF: a large psycholinguistics-oriented French lexical resource

Author: Calderone Basilio
Hathout Nabil
Sajous Franck
Publication venue: HAL CCSD
Publication date: 01/01/2014

In this paper, we present two French lexical resources, GLÀFF and PsychoGLÀFF. The former, automatically extracted from the collaborative online dictionary Wiktionary, is a large-scale versatile lexicon exploitable in Natural Language Processing applications and linguistic studies. The latter, based on GLÀFF, is a lexicon specifically designed for psycholinguistic research. GLÀFF, with more than 1.4 million entries, is of unprecedented size. It reports lemmas, main syntactic categories, inflectional features and phonemic transcriptions. PsychoGLÀFF contains additional information related to formal aspects of the lexicon and its distribution. It contains about 340,000 corpus-attested entries (120,000 lemmas). We explain how the resources were created and compare them to other known resources in terms of coverage and quality. Regarding PsychoGLÀFF, the comparison shows that it has an exceptionally large repertoire while being of comparable quality.

Scientific Publications of the University of Toulouse II Le Mirail

HAL Descartes

A Bare-bones Constraint Grammar

Author: Bick Eckhard
Publication venue: Institute of Digital Enhancement of Cognitive Processing, Waseda University
Publication date: 01/01/2011

Waseda University Repository

University of Southern Denmark Research Output

The Lefff 2 syntactic lexicon for French: architecture, acquisition, use

Author: Boullier Pierre
Clément Lionel
Sagot Benoît
Villemonte de La Clergerie Éric
Publication venue: HAL CCSD
Publication date: 01/01/2006

In this paper, we introduce a new lexical resource for French which is freely available as the second version of the Lefff (Lexique des formes fléchies du français – Lexicon of French inflected forms). It is a wide-coverage morphosyntactic and syntactic lexicon whose architecture relies on property inheritance, which makes it more compact and more easily maintainable, and allows lexical entries to be described independently of the formalisms it is used for. For these two reasons, we define it as a meta-lexicon. We describe its architecture, several automatic or semi-automatic approaches we use to acquire, correct and/or enrich such a lexicon, as well as the way it is used both with an LFG parser and with a TAG parser based on a meta-grammar, so as to build two large-coverage parsers for French.

INRIA a CCSD electronic archive server

GLÀFF, un Gros Lexique À tout Faire du Français

Author: Calderone Basilio
Hathout Nabil
Sajous Franck
Publication venue: HAL CCSD
Publication date: 17/06/2013

This paper introduces GLÀFF, a large-scale versatile French lexicon extracted from Wiktionary, the collaborative online dictionary. GLÀFF contains, for each entry, a morphosyntactic description and a phonemic transcription. It distinguishes itself from the other available lexicons mainly by its size, its potential for constant updating and its copyleft license, which makes it available for use, modification and redistribution. We explain how we built GLÀFF and compare it to other known resources. We show that its size and quality are strong assets that could allow GLÀFF to become a reference lexicon for NLP, linguistics and psycholinguistics.

Scientific Publications of the University of Toulouse II Le Mirail

HAL Descartes

Évaluer SynLex

Author: Falk Ingrid
Francopoulo Gil
Gardent Claire
Publication venue: HAL CCSD
Publication date: 05/06/2007

SYNLEX is a syntactic lexicon extracted semi-automatically from the LADL tables. Like the other syntactic lexicons for French which are both available and usable for NLP (LEFFF, DICOVALENCE), it is incomplete and its recall and precision with respect to a gold standard are unknown. We present an approach which goes some way towards addressing these shortcomings. The approach draws on methods used for the automatic acquisition of syntactic lexicons. First, a new syntactic lexicon is acquired from an 82-million-word corpus. This lexicon is then used to validate and extend SYNLEX. Finally, the recall and precision of the extended version of SYNLEX are computed against a gold standard extracted from DICOVALENCE.
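The final evaluation step mentioned in this abstract, precision and recall against a gold standard, can be sketched as set arithmetic. This is only an illustration under assumptions: the (verb, frame) pairs, the frame labels, and the `precision_recall` helper are hypothetical and are not taken from SYNLEX or DICOVALENCE.

```python
def precision_recall(acquired, gold):
    """Compute (precision, recall) of acquired lexicon entries against gold entries."""
    acquired, gold = set(acquired), set(gold)
    true_positives = len(acquired & gold)
    # Precision: share of acquired entries that the gold standard confirms.
    precision = true_positives / len(acquired) if acquired else 0.0
    # Recall: share of gold entries that the acquired lexicon contains.
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical (verb, subcategorization frame) entries, for illustration only.
acquired_entries = {
    ("donner", "subj-obj-iobj"),
    ("dormir", "subj"),
    ("parler", "subj-iobj"),
    ("dormir", "subj-obj"),  # not in the gold standard
}
gold_entries = {
    ("donner", "subj-obj-iobj"),
    ("dormir", "subj"),
    ("parler", "subj-iobj"),
    ("parler", "subj"),
    ("manger", "subj-obj"),
    ("manger", "subj"),
}

precision, recall = precision_recall(acquired_entries, gold_entries)
```

With these toy sets, three of the four acquired entries are confirmed and three of the six gold entries are found.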

INRIA a CCSD electronic archive server

Chaînes de traitement syntaxique

Author: Boullier Pierre
Clément Lionel
Sagot Benoît
Villemonte de La Clergerie Éric
Publication venue: HAL CCSD
Publication date: 01/01/2005

This paper presents a method for obtaining a good syntactic representation of written texts.

INRIA a CCSD electronic archive server

A Semi-automatic and Low Cost Approach to Build Scalable Lemma-based Lexical Resources for Arabic Verbs

Author: Ahmed Abdelali
Ahmed Lehireche
Denis Maurel
Noureddine Doumi
Publication venue: IJIT
Publication date: 01/02/2016

This work presents a method that enables the Arabic NLP community to build scalable lexical resources. The proposed method is low-cost and time-efficient, as well as scalable and extensible. Its extensibility is reflected in the method's ability to be incremental in both respects: processing resources and generating lexicons. Starting from a corpus, tokens are first drawn from the corpus and lemmatized. Second, finite state transducers (FSTs) are generated semi-automatically. Finally, the FSTs are used to produce all possible inflected verb forms with their full morphological features. One of the algorithm's strengths is its ability to generate transducers with 184 transitions, which would be very cumbersome to design manually. The second strength is a new inflection scheme for Arabic verbs, which increases the efficiency of the FST generation algorithm. The experiments use a representative corpus of Modern Standard Arabic. The number of semi-automatically generated transducers is 171. The coverage of the resulting open lexical resources is high: they cover more than 70% of Arabic verbs. The resources contain 16,855 verb lemmas and 11,080,355 fully vocalized, partially vocalized and unvocalized inflected verb forms. All these resources are being made public and are currently used as an open package in the Unitex framework, available under the LGPL license.

Crossref

Directory of Open Access Journals

HAL Descartes

HAL Université de Tours

Hal-Diderot

The Lefff, a freely available and large-coverage morphological and syntactic lexicon for French

Author: Sagot Benoît
Publication venue: HAL CCSD
Publication date: 01/01/2010

In this paper, we introduce the Lefff, a freely available, accurate and large-coverage morphological and syntactic lexicon for French, used in many NLP tools such as large-coverage parsers. We first describe Alexina, the lexical framework in which the Lefff is developed, as well as the linguistic notions and formalisms it is based on. Next, we describe the various sources of lexical data we used for building the Lefff, in particular semi-automatic lexical development techniques and conversion and merging of existing resources. Finally, we illustrate the coverage and precision of the resource by comparing it with other resources and by assessing its impact in various NLP tools.

CiteSeerX

INRIA a CCSD electronic archive server

Hal-Diderot