18 research outputs found

    Morphology based automatic acquisition of large-coverage lexica

    In this article, we introduce a new technique for constructing wide-coverage morphological lexica from large corpora and morphological knowledge, with an application to French. It relies on the idea that the existence of a hypothetical lemma can be guessed if several different words found in the corpus are best interpreted as morphological variants of this lemma. We first validated our technique by extracting verbs and adjectives from a general French corpus of 25 million words. Compared with other lexical resources available for French, our results are very satisfying, since we cover many words, often derived words, that are not always present in other lexica. Applying our algorithm to the acquisition of domain-specific adjectives from a botanical corpus also gave very good results, demonstrating that it can be used to extract domain-specific lexica. Moreover, it is generalizable to any language with substantial morphology.
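
    The core idea of this abstract (a lemma is hypothesized when several attested words are best read as its inflected variants) can be illustrated with a minimal sketch. This is not the authors' algorithm: the ending list `ER_VERB_ENDINGS`, the function `guess_lemmas`, and the `min_forms` threshold are all assumptions made for illustration, and the sketch only covers a toy fragment of the French first-group (-er) verb paradigm.

```python
from collections import defaultdict

# Hypothetical toy fragment of the French -er verb paradigm (an assumption,
# not the morphological description used in the paper).
ER_VERB_ENDINGS = ["e", "es", "ons", "ez", "ent", "ait", "é", "er"]


def guess_lemmas(corpus_words, min_forms=3):
    """Return candidate -er lemmas supported by at least `min_forms`
    distinct attested word forms."""
    support = defaultdict(set)
    for word in set(corpus_words):
        for ending in ER_VERB_ENDINGS:
            # Require a stem of at least two characters.
            if word.endswith(ending) and len(word) > len(ending) + 1:
                stem = word[: len(word) - len(ending)]
                support[stem + "er"].add(word)
    # Keep only lemmas backed by enough distinct forms.
    return {
        lemma: sorted(forms)
        for lemma, forms in support.items()
        if len(forms) >= min_forms
    }


if __name__ == "__main__":
    words = ["chante", "chantons", "chantez", "table"]
    print(guess_lemmas(words))
```

    In this sketch, "table" alone also suggests a candidate lemma "tabler", but a single supporting form falls below the threshold, which is the intuition behind requiring several variants before accepting a lemma.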


    We present a syntactic lexicon of French verbs. The resource contains about 8,800 entries (6,700 distinct verbs), for which we provide the conjugated forms, their corresponding phonetic transcriptions, and an indication of their usage frequency. For each verb, we give its auxiliary, whether it is pronominal, and the information characterizing its transitivity. While building this resource, we took particular care to validate the entries produced by cross-checking our results against other reference resources. We are making a preliminary version of the lexicon available to the community, the electronic resource VfrLPL1.0.xml, for which usage frequencies have not been recomputed. This work is part of a program carried out at the Laboratoire Parole et Langage over the past few years, aimed at developing and maintaining a reliable, wide-coverage lexical resource for French.

    From GLÀFF to PsychoGLÀFF: a large psycholinguistics-oriented French lexical resource

    In this paper, we present two French lexical resources, GLÀFF and PsychoGLÀFF. The former, automatically extracted from the collaborative online dictionary Wiktionary, is a large-scale versatile lexicon exploitable in Natural Language Processing applications and linguistic studies. The latter, based on GLÀFF, is a lexicon specifically designed for psycholinguistic research. GLÀFF, with more than 1.4 million entries, is of unprecedented size. It reports lemmas, main syntactic categories, inflectional features and phonemic transcriptions. PsychoGLÀFF contains additional information related to formal aspects of the lexicon and its distribution. It contains about 340,000 entries (120,000 lemmas) that are attested in corpora. We explain how the resources were created and compare them to other known resources in terms of coverage and quality. Regarding PsychoGLÀFF, the comparison shows that it has an exceptionally large repertoire while being of comparable quality.

    A Bare-bones Constraint Grammar


    The Lefff 2 syntactic lexicon for French: architecture, acquisition, use

    In this paper, we introduce a new lexical resource for French which is freely available as the second version of the Lefff (Lexique des formes fléchies du français, Lexicon of French inflected forms). It is a wide-coverage morphosyntactic and syntactic lexicon whose architecture relies on property inheritance, which makes it more compact and more easily maintainable, and allows lexical entries to be described independently of the formalisms it is used for. For these two reasons, we define it as a meta-lexicon. We describe its architecture, several automatic or semi-automatic approaches we use to acquire, correct and/or enrich such a lexicon, and the way it is used both with an LFG parser and with a TAG parser based on a meta-grammar, so as to build two large-coverage parsers for French.

    GLÀFF, un Gros Lexique À tout Faire du Français

    This paper introduces GLÀFF, a large-scale versatile French lexicon extracted from Wiktionary, the collaborative online dictionary. GLÀFF contains, for each entry, a morphosyntactic description and a phonetic transcription. It distinguishes itself from the other available lexicons mainly by its size, its potential for constant updating and its copyleft license, which makes it available for use, modification and redistribution. We explain how we built GLÀFF and compare it to other known resources. We show that its size and quality are strong assets that could allow GLÀFF to become a reference lexicon for NLP, linguistics and psycholinguistics.

    Évaluer SynLex

    SYNLEX is a syntactic lexicon extracted semi-automatically from the LADL tables. Like the other syntactic lexicons for French which are both available and usable for NLP (LEFFF, DICOVALENCE), it is incomplete, and its recall and precision with respect to a gold standard are unknown. We present an approach which goes some way towards addressing these shortcomings. The approach draws on methods used for the automatic acquisition of syntactic lexicons. First, a new syntactic lexicon is acquired from an 82-million-word corpus. This lexicon is then used to validate and extend SYNLEX. Finally, the recall and precision of the extended version of SYNLEX are computed against a gold standard extracted from DICOVALENCE.
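
    As a rough illustration of the final evaluation step, the sketch below computes precision and recall of a lexicon against a gold standard. It assumes, purely for illustration, that each lexicon entry can be represented as a (verb, frame) pair and that entries match only when identical; the frame labels and the function name `precision_recall` are hypothetical and do not come from SYNLEX or DICOVALENCE.

```python
def precision_recall(predicted, gold):
    """Compute precision and recall of `predicted` entries against `gold`.

    Both arguments are iterables of hashable entries, here assumed to be
    (verb, frame) pairs compared by exact equality.
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    # Guard against empty sets to avoid division by zero.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical toy entries; the frame notation is made up for this sketch.
    lexicon = {
        ("donner", "SUJ OBJ"),
        ("donner", "SUJ OBJ AOBJ"),
        ("dormir", "SUJ"),
        ("dormir", "SUJ OBJ"),
    }
    gold = {
        ("donner", "SUJ OBJ"),
        ("dormir", "SUJ"),
        ("parler", "SUJ AOBJ"),
        ("parler", "SUJ"),
    }
    print(precision_recall(lexicon, gold))
```

    Exact matching of whole frames is the strictest possible choice; a real evaluation might instead compare frames argument by argument.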

    Chaînes de traitement syntaxique

    This paper presents a method for obtaining a good syntactic representation of written texts.

    A Semi-automatic and Low Cost Approach to Build Scalable Lemma-based Lexical Resources for Arabic Verbs

    This work presents a method that enables the Arabic NLP community to build scalable lexical resources. The proposed method is low-cost, time-efficient, scalable and extensible: it is incremental both in the resources it processes and in the lexicons it generates. First, tokens are drawn from a corpus and lemmatized. Second, finite-state transducers (FSTs) are generated semi-automatically. Finally, the FSTs are used to produce all possible inflected verb forms with their full morphological features. One strength of the algorithm is its ability to generate transducers with 184 transitions, which would be very cumbersome to design manually. A second strength is a new inflection scheme for Arabic verbs, which increases the efficiency of the FST generation algorithm. The experiments use a representative corpus of Modern Standard Arabic. The number of semi-automatically generated transducers is 171. The resulting open lexical resources have high coverage: they cover more than 70% of Arabic verbs. The resources contain 16,855 verb lemmas and 11,080,355 fully, partially and non-vocalized inflected verb forms. All these resources are being made public and are currently used as an open package in the Unitex framework, available under the LGPL license.

    The Lefff, a freely available and large-coverage morphological and syntactic lexicon for French

    In this paper, we introduce the Lefff, a freely available, accurate and large-coverage morphological and syntactic lexicon for French, used in many NLP tools such as large-coverage parsers. We first describe Alexina, the lexical framework in which the Lefff is developed, as well as the linguistic notions and formalisms it is based on. Next, we describe the various sources of lexical data we used for building the Lefff, in particular semi-automatic lexical development techniques and the conversion and merging of existing resources. Finally, we illustrate the coverage and precision of the resource by comparing it with other resources and by assessing its impact in various NLP tools.