22,221 research outputs found

    Corpora and evaluation tools for multilingual named entity grammar development

    Get PDF
    We present an effort for the development of multilingual named entity grammars in a unification-based finite-state formalism (SProUT). Following an extended version of the MUC7 standard, we have developed Named Entity Recognition grammars for German, Chinese, Japanese, French, Spanish, English, and Czech. The grammars recognize person names, organizations, geographical locations, currency, time and date expressions. Subgrammars and gazetteers are shared as much as possible for the grammars of the different languages. Multilingual corpora from the business domain are used for grammar development and evaluation. The annotation format (named entity and other linguistic information) is described. We present an evaluation tool which provides detailed statistics and diagnostics, allows for partial matching of annotations, and supports user-defined mappings between different annotation and grammar output formats

    Evaluating two methods for Treebank grammar compaction

    Get PDF
    Treebanks, such as the Penn Treebank, provide a basis for the automatic creation of broad coverage grammars. In the simplest case, rules can simply be ‘read off’ the parse-annotations of the corpus, producing either a simple or probabilistic context-free grammar. Such grammars, however, can be very large, presenting problems for the subsequent computational costs of parsing under the grammar. In this paper, we explore ways by which a treebank grammar can be reduced in size or ‘compacted’, which involve the use of two kinds of technique: (i) thresholding of rules by their number of occurrences; and (ii) a method of rule-parsing, which has both probabilistic and non-probabilistic variants. Our results show that by a combined use of these two techniques, a probabilistic context-free grammar can be reduced in size by 62% without any loss in parsing performance, and by 71% to give a gain in recall, but some loss in precision

    From treebank resources to LFG F-structures

    Get PDF
    We present two methods for automatically annotating treebank resources with functional structures. Both methods define systematic patterns of correspondence between partial PS configurations and functional structures. These are applied to PS rules extracted from treebanks, or directly to constraint set encodings of treebank PS trees

    Syntactic parsing of unrestricted Spanish text

    Get PDF
    This research focusses on the syntactical parsing of morphologycal tagged corpora. A proposal for a corpus oriented Spanish grammar is presented in this document. This work has been developed in the framework of the ITEM project and its main goal is to provide multilingual background for information extraction and retrieval tasks. The main goal of Tacat analyser is to provide a way of obtaining large amounts of bracketed and parsed corpora, both general land specific domain. Tacat uses context free grammars and has as input following categories of Parole specification.The incremental methodology that we use allows us to recognise different levels of complexity in the analysis and to produce compatible outputs of all the grammars.Postprint (published version

    Automatic acquisition of LFG resources for German - as good as it gets

    Get PDF
    We present data-driven methods for the acquisition of LFG resources from two German treebanks. We discuss problems specific to semi-free word order languages as well as problems arising fromthe data structures determined by the design of the different treebanks. We compare two ways of encoding semi-free word order, as done in the two German treebanks, and argue that the design of the TiGer treebank is more adequate for the acquisition of LFG resources. Furthermore, we describe an architecture for LFG grammar acquisition for German, based on the two German treebanks, and compare our results with a hand-crafted German LFG grammar
    corecore