288 research outputs found

    Modular resource development and diagnostic evaluation framework for fast NLP system improvement

    Get PDF
    Natural Language Processing systems are large-scale softwares, whose development involves many man-years of work, in terms of both coding and resource development. Given a dictionary of 110k lemmas, a few hundred syntactic analysis rules, 20k ngrams matrices and other resources, what will be the impact on a syntactic analyzer of adding a new possible category to a given verb? What will be the consequences of a new syntactic rules addition? Any modification may imply, besides what was expected, unforeseeable side-effects and the complexity of the system makes it difficult to guess the overall impact of even small changes. We present here a framework designed to effectively and iteratively improve the accuracy of our linguistic analyzer LIMA by iterative refinements of its linguistic resources. These improvements are continuously assessed by evaluating the analyzer performance against a reference corpus. Our first results show that this framework is really helpful towards this goal

    Constraint-Based Parsing as an Efficient Solution: Results from the Parsing Evaluation Campaign EASy

    No full text
    International audienceThis paper describes the unfolding of the EASy evaluation campaign for French parsers as well as the techniques employed for the participation of laboratory LPL to this campaign. Three symbolic parsers based on a same resource and a same formalism (Property Grammars) are described and evaluated. The first results of this evaluation are analyzed and lead to the conclusion that symbolic parsing in a constraint-based formalism is efficient and robust

    D4.1. Technologies and tools for corpus creation, normalization and annotation

    Get PDF
    The objectives of the Corpus Acquisition and Annotation (CAA) subsystem are the acquisition and processing of monolingual and bilingual language resources (LRs) required in the PANACEA context. Therefore, the CAA subsystem includes: i) a Corpus Acquisition Component (CAC) for extracting monolingual and bilingual data from the web, ii) a component for cleanup and normalization (CNC) of these data and iii) a text processing component (TPC) which consists of NLP tools including modules for sentence splitting, POS tagging, lemmatization, parsing and named entity recognition

    Automatic rich annotation of large corpus of conversational transcribed speech

    Get PDF
    International audienceThis paper describes the use of the CasSys platform in order to achieve the chunking of conversational speech transcripts by means of cascades of Unitex transducers. Our system is involved in the EPAC project of the French National agency of Research (ANR). The aim of this project is to develop robust methods for the annotation of audio/multimedia document collections which contains conversational speech sequences such as TV or radio programs. At first, this paper presents the EPAC project and the adaptation of a former chunking system (Romus) which was developed in the restricted framework of dedicated spoken man-machine dialogue. Then, it describes the problems that are arising due to 1) spontaneous speech disfluencies and 2) errors for the previous stages of processing (automatic speech recognition and POS tagging)

    Des relations d'alignement pour décrire l'interaction des domaines linguistiques : vers des Grammaires Multimodales

    Get PDF
    International audienceUn des problèmes majeurs de la linguistique aujourd'hui réside dans la prise en compte de phénomènes relevant de domaines et de modalités différentes. Dans la littérature, la réponse consiste à représenter les relations pouvant exister entre ces domaines de façon externe, en termes de relation de structure à structure, s'appuyant donc sur une description distincte de chaque domaine ou chaque modalité. Nous proposons dans cet article une approche différente permettant représenter ces phénomènes dans un cadre formel unique, permettant de rendre compte au sein d'une même grammaire tous les phénomènes concernés. Cette représentation précise de l'interaction entre domaines et modalités s'appuie sur la définition de relations d'alignement

    Proceedings

    Get PDF
    Proceedings of the Ninth International Workshop on Treebanks and Linguistic Theories. Editors: Markus Dickinson, Kaili Müürisep and Marco Passarotti. NEALT Proceedings Series, Vol. 9 (2010), 268 pages. © 2010 The editors and contributors. Published by Northern European Association for Language Technology (NEALT) http://omilia.uio.no/nealt . Electronically published at Tartu University Library (Estonia) http://hdl.handle.net/10062/15891

    Evaluation of a Grammar of French Determiners

    Get PDF
    Existing syntactic grammars of natural languages, even with a far from complete coverage, are complex objects. Assessments of the quality of parts of such grammars are useful for the validation of their construction. We evaluated the quality of a grammar of French determiners that takes the form of a recursive transition network. The result of the application of this local grammar gives deeper syntactic information than chunking or information available in treebanks. We performed the evaluation by comparison with a corpus independently annotated with information on determiners. We obtained 86% precision and 92% recall on text not tagged for parts of speech.Comment: 10 page
    corecore