3,672 research outputs found

    Corpora and evaluation tools for multilingual named entity grammar development

    Get PDF
    We present an effort for the development of multilingual named entity grammars in a unification-based finite-state formalism (SProUT). Following an extended version of the MUC7 standard, we have developed Named Entity Recognition grammars for German, Chinese, Japanese, French, Spanish, English, and Czech. The grammars recognize person names, organizations, geographical locations, currency, time and date expressions. Subgrammars and gazetteers are shared as much as possible for the grammars of the different languages. Multilingual corpora from the business domain are used for grammar development and evaluation. The annotation format (named entity and other linguistic information) is described. We present an evaluation tool which provides detailed statistics and diagnostics, allows for partial matching of annotations, and supports user-defined mappings between different annotation and grammar output formats

    Political Discourse Analysis through Solving Problems of Graph Theory

    Get PDF
    In this article, we show how, using graph theory, we can make a content analysis of political discourse. Assumptions of this analysis are: • we have a corpus of speech of each party or candidate; • we consider that speech conveys economic, political, socio-cultural values, these taking the form of words or word families; • we consider that there are interdependences between the values of a political discourse; they are given by the co-occurrence of two values, as words in the text, within a well defined fragment, or they are determined by the internal logic of political discourse; • established links between values in a political speech have associated positive numbers indicating the "power" of those links; these "powers" are defined according to both the number of co-occurrences of values, and the internal logic of the discourse where they occur. In this context we intend to highlight the following: a) which is the dominant value in a political speech; b) which groups of values have ties between them and have no connection with the rest; c) which is the order in which political values should be set in order to obtain an equivalent but more synthetic speech compared to the already given one; d) which are the links between values that form the "core" political speech. To solve these problems, we shall use the Political Analyst program. After that, we shall present the concepts necessary to the understanding of the introductory graph theory, useful in understanding the analysis of the software and then the operation of the program. This paper extends the previous paper [6]graph theory, discourse analysis, political programs

    AmAMorph: Finite State Morphological Analyzer for Amazighe

    Get PDF
    This paper presents AmAMorph, a morphological analyzer for Amazighe language using a system based on the NooJ linguistic development environment. The paper begins with the development of Amazighe lexicons with large coverage formalization. The built electronic lexicons, named ‘NAmLex’, ‘VAmLex’ and ‘PAmLex’ which stand for ‘Noun Amazighe Lexicon’, ‘Verb Amazighe Lexicon’ and ‘Particles Amazighe Lexicon’, link inflectional, morphological, and syntacticsemantic information to the list of lemmas. Automated inflectional and derivational routines are applied to each lemma producing over inflected forms. To our knowledge,AmAMorph is the first morphological analyzer for Amazighe. It identifies the component morphemes of the forms using large coverage morphological grammars. Along with the description of how the analyzer is implemented, this paper gives an evaluation of the analyzer

    IDEF5 Ontology Description Capture Method: Concept Paper

    Get PDF
    The results of research towards an ontology capture method referred to as IDEF5 are presented. Viewed simply as the study of what exists in a domain, ontology is an activity that can be understood to be at work across the full range of human inquiry prompted by the persistent effort to understand the world in which it has found itself - and which it has helped to shape. In the contest of information management, ontology is the task of extracting the structure of a given engineering, manufacturing, business, or logistical domain and storing it in an usable representational medium. A key to effective integration is a system ontology that can be accessed and modified across domains and which captures common features of the overall system relevant to the goals of the disparate domains. If the focus is on information integration, then the strongest motivation for ontology comes from the need to support data sharing and function interoperability. In the correct architecture, an enterprise ontology base would allow th e construction of an integrated environment in which legacy systems appear to be open architecture integrated resources. If the focus is on system/software development, then support for the rapid acquisition of reliable systems is perhaps the strongest motivation for ontology. Finally, ontological analysis was demonstrated to be an effective first step in the construction of robust knowledge based systems

    Natural Language Processing for Under-resourced Languages: Developing a Welsh Natural Language Toolkit

    Get PDF
    Language technology is becoming increasingly important across a variety of application domains which have become common place in large, well-resourced languages. However, there is a danger that small, under-resourced languages are being increasingly pushed to the technological margins. Under-resourced languages face significant challenges in delivering the underlying language resources necessary to support such applications. This paper describes the development of a natural language processing toolkit for an under-resourced language, Cymraeg (Welsh). Rather than creating the Welsh Natural Language Toolkit (WNLT) from scratch, the approach involved adapting and enhancing the language processing functionality provided for other languages within an existing framework and making use of external language resources where available. This paper begins by introducing the GATE NLP framework, which was used as the development platform for the WNLT. It then describes each of the core modules of the WNLT in turn, detailing the extensions and adaptations required for Welsh language processing. An evaluation of the WNLT is then reported. Following this, two demonstration applications are presented. The first is a simple text mining application that analyses wedding announcements. The second describes the development of a Twitter NLP application, which extends the core WNLT pipeline. As a relatively small-scale project, the WNLT makes use of existing external language resources where possible, rather than creating new resources. This approach of adaptation and reuse can provide a practical and achievable route to developing language resources for under-resourced languages
    corecore