885 research outputs found

    Part-of-Speech Tag Disambiguation by Cross-Linguistic Majority Vote


    Combining ontologies and neural networks for analyzing historical language varieties: a case study in Middle Low German

    In this paper, we describe experiments on the morphosyntactic annotation of historical language varieties, using the example of Middle Low German (MLG), the official language of the German Hanse during the Middle Ages and a dominant language around the Baltic Sea at the time. To the best of our knowledge, this is the first experiment in automatically producing morphosyntactic annotations for Middle Low German, and accordingly, no part-of-speech (POS) tagset is currently agreed upon. In our experiment, we illustrate how ontology-based specifications of projected annotations can be employed to circumvent this issue: instead of training and evaluating against a given tagset, we decompose the annotation into independent features, each predicted separately by a neural network. The predicted feature probabilities are then decoded into a sound ontological representation using consistency constraints (axioms) from the ontology. From these representations, we can finally bootstrap a POS tagset capturing only those morphosyntactic features which could be reliably predicted. In this way, our approach is capable of optimizing precision and recall of morphosyntactic annotations simultaneously with bootstrapping a tagset rather than performing iterative cycles
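    The decoding step described above can be sketched as follows. This is a minimal illustration, not the paper's code: the feature names, probability values, and axioms are invented for the example, and the exhaustive search stands in for whatever decoding procedure the authors actually use.

```python
# Hypothetical sketch: decode independently predicted morphosyntactic
# feature probabilities into an ontologically consistent combination.
# Feature names, values, and axioms are illustrative only.
from itertools import product

# Independently predicted probabilities per feature (e.g. from a neural net).
feature_probs = {
    "pos":   {"NOUN": 0.6, "VERB": 0.3, "ADJ": 0.1},
    "case":  {"nom": 0.5, "acc": 0.3, "none": 0.2},
    "tense": {"pres": 0.2, "past": 0.1, "none": 0.7},
}

# Consistency axioms from the ontology: a combination is sound only if
# every axiom holds (here: verbs carry no case, nominals carry no tense).
def consistent(combo):
    if combo["pos"] == "VERB" and combo["case"] != "none":
        return False
    if combo["pos"] in ("NOUN", "ADJ") and combo["tense"] != "none":
        return False
    return True

def decode(probs):
    """Return the highest-probability feature combination that
    satisfies all consistency axioms (exhaustive search)."""
    feats = list(probs)
    best, best_p = None, -1.0
    for values in product(*(probs[f] for f in feats)):
        combo = dict(zip(feats, values))
        if not consistent(combo):
            continue
        p = 1.0
        for f in feats:
            p *= probs[f][combo[f]]
        if p > best_p:
            best, best_p = combo, p
    return best

print(decode(feature_probs))
```

    Features that never survive this consistency filtering with acceptable reliability would then be dropped when bootstrapping the tagset.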

    Domain-Specific Knowledge Acquisition for Conceptual Sentence Analysis

    The availability of on-line corpora is rapidly changing the field of natural language processing (NLP) from one dominated by theoretical models of often very specific linguistic phenomena to one guided by computational models that simultaneously account for a wide variety of phenomena that occur in real-world text. Thus far, among the best-performing and most robust systems for reading and summarizing large amounts of real-world text are knowledge-based natural language systems. These systems rely heavily on domain-specific, handcrafted knowledge to handle the myriad syntactic, semantic, and pragmatic ambiguities that pervade virtually all aspects of sentence analysis. Not surprisingly, however, generating this knowledge for new domains is time-consuming, difficult, and error-prone, and requires the expertise of computational linguists familiar with the underlying NLP system. This thesis presents Kenmore, a general framework for domain-specific knowledge acquisition for conceptual sentence analysis. To ease the acquisition of knowledge in new domains, Kenmore exploits an on-line corpus using symbolic machine learning techniques and robust sentence analysis while requiring only minimal human intervention. Unlike most approaches to knowledge acquisition for natural language systems, the framework uniformly addresses a range of subproblems in sentence analysis, each of which has traditionally required a separate computational mechanism. The thesis presents the results of using Kenmore with corpora from two real-world domains: (1) to perform part-of-speech tagging, semantic feature tagging, and concept tagging of all open-class words in the corpus; (2) to acquire heuristics for part-of-speech disambiguation, semantic feature disambiguation, and concept activation; and (3) to find the antecedents of relative pronouns
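    A symbolic, corpus-driven approach of this kind can be illustrated with a tiny case-based disambiguator. This is an assumption-laden sketch, not Kenmore itself: the feature window, similarity function, and training cases are all invented, and serve only to show how stored context "cases" can resolve an ambiguous word.

```python
# Illustrative sketch (not the Kenmore code): case-based disambiguation.
# Training occurrences of a word are stored as (context features, label)
# cases; a new occurrence receives the label of the most similar case.

def context_features(words, i):
    """Very small context window: previous word and next word."""
    prev_w = words[i - 1] if i > 0 else "<s>"
    next_w = words[i + 1] if i < len(words) - 1 else "</s>"
    return (prev_w, next_w)

class CaseBase:
    def __init__(self):
        self.cases = []  # list of (features, label)

    def add(self, words, i, label):
        self.cases.append((context_features(words, i), label))

    def classify(self, words, i):
        feats = context_features(words, i)
        # Similarity = number of matching feature slots (overlap metric).
        def sim(case):
            return sum(a == b for a, b in zip(case[0], feats))
        return max(self.cases, key=sim)[1]

cb = CaseBase()
cb.add("the saw was sharp".split(), 1, "NOUN")
cb.add("we saw the boat".split(), 1, "VERB")
print(cb.classify("they saw the light".split(), 1))
```

    The same retrieval mechanism can, in principle, serve several of the subproblems listed above (part-of-speech, semantic features, concept activation) by changing only the labels stored in the cases.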

    Combining Language Independent Part-of-Speech Tagging Tools

    Part-of-speech tagging is a fundamental task of natural language processing. For languages with a very rich agglutinating morphology, generic PoS tagging algorithms do not yield very high accuracy due to data sparseness issues. Though integrating a morphological analyzer can solve this problem effectively, it is a resource-intensive solution. In this paper we present a method of combining language-independent statistical PoS-tagging solutions -- including a statistical machine translation tool -- to effectively boost tagging accuracy. Our experiments show that, using the same training set, our combination of language-independent tools yields an accuracy that approaches that of a language-dependent system with an integrated morphological analyzer
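    One common way to combine component taggers is weighted voting; the sketch below illustrates the general idea, assuming the paper's combination is of this flavor. The stand-in taggers and the weights (e.g. each tagger's held-out accuracy) are invented for the demo.

```python
# A minimal sketch of tagger combination by weighted per-token voting.
# The component taggers and weights are stand-ins, not the paper's tools.
from collections import defaultdict

def combine(taggers, weights, sentence):
    """For each token, pick the tag with the highest total weight
    across the component taggers' outputs."""
    outputs = [tagger(sentence) for tagger in taggers]
    combined = []
    for i in range(len(sentence)):
        votes = defaultdict(float)
        for out, w in zip(outputs, weights):
            votes[out[i]] += w
        combined.append(max(votes, key=votes.get))
    return combined

# Stand-in taggers returning fixed outputs for the demo sentence.
t1 = lambda s: ["DET", "NOUN", "VERB"]
t2 = lambda s: ["DET", "VERB", "VERB"]
t3 = lambda s: ["DET", "NOUN", "NOUN"]

print(combine([t1, t2, t3], [0.95, 0.90, 0.85], "a dog runs".split()))
```

    Because each component is language-independent, the same combination scheme transfers to a new language whenever training data for the individual taggers is available.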

    Memory-Based Shallow Parsing

    We present memory-based learning approaches to shallow parsing and apply these to five tasks: base noun phrase identification, arbitrary base phrase recognition, clause detection, noun phrase parsing and full parsing. We use feature selection techniques and system combination methods to improve the performance of the memory-based learner. Our approach is evaluated on standard data sets and the results are compared with those of other systems. This reveals that our approach works well for base phrase identification, while its application to recognizing embedded structures leaves some room for improvement
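    Chunking tasks like base noun phrase identification are commonly cast as per-token IOB tagging. The small helper below (an illustration of that representation, not the paper's code) recovers base NP spans from an IOB-tagged sequence.

```python
# Convert a sequence of IOB tags into (start, end) chunk spans,
# with end exclusive. "B-NP" begins a chunk, "I-NP" continues one,
# anything else ends the current chunk.
def iob_to_spans(tags, chunk_type="NP"):
    spans, start = [], None
    for i, tag in enumerate(tags):
        if tag == f"B-{chunk_type}" or (tag == f"I-{chunk_type}" and start is None):
            if start is not None:      # a B- tag directly after a chunk
                spans.append((start, i))
            start = i
        elif tag != f"I-{chunk_type}":  # O or a different chunk type
            if start is not None:
                spans.append((start, i))
                start = None
    if start is not None:               # chunk running to end of sentence
        spans.append((start, len(tags)))
    return spans

tags = ["B-NP", "I-NP", "O", "B-NP", "B-NP", "I-NP"]
print(iob_to_spans(tags))
```

    With this representation, a memory-based learner only has to predict one IOB tag per token from local features; span structure is recovered afterwards.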

    Statistical model of human lexical category disambiguation

    Research in sentence processing is concerned with discovering the mechanism by which linguistic utterances are mapped onto meaningful representations within the human mind. Models of the Human Sentence Processing Mechanism (HSPM) can be divided into those in which such mapping is performed by a number of limited modular processes and those in which there is a single interactive process. A further, and increasingly important, distinction is between models which rely on innate preferences to guide decision processes and those which make use of experience-based statistics. In this context, the aims of the current thesis are two-fold:
    • To argue that the correct architecture of the HSPM is both modular and statistical - the Modular Statistical Hypothesis (MSH).
    • To propose and provide empirical support for a position in which human lexical category disambiguation occurs within a modular process, distinct from syntactic parsing and guided by a statistical decision process.
    Arguments are given for why a modular statistical architecture should be preferred on both methodological and rational grounds. We then turn to the (often ignored) problem of lexical category disambiguation and propose the existence of a presyntactic Statistical Lexical Category Module (SLCM). A number of variants of the SLCM are introduced. By empirically investigating this particular architecture we also hope to provide support for the more general hypothesis - the MSH. The SLCM has some interesting behavioural properties; the remainder of the thesis empirically investigates whether these behaviours are observable in human sentence processing. We first consider whether the results of existing studies might be attributable to SLCM behaviour. Such evaluation provides support for an HSPM architecture that includes this SLCM and allows us to determine which SLCM variant is empirically most plausible.
Predictions are made, using this variant, about SLCM behaviour in the face of novel utterances; these predictions are then tested using a self-paced reading paradigm. The results of this experimentation fully support the inclusion of the SLCM in a model of the HSPM and are not compatible with other existing models. As the SLCM is a modular and statistical process, empirical evidence for the SLCM also directly supports an HSPM architecture which is modular and statistical. We therefore conclude that our results strongly support both the SLCM and the MSH. However, more work is needed, both to produce additional evidence and to define the model further
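    A presyntactic statistical category disambiguator of the kind proposed here can be sketched as a small sequence model over lexical categories. The toy transition and emission probabilities below are invented for the example, and the Viterbi search is one plausible decision procedure, not necessarily the thesis's own.

```python
# Illustrative sketch of statistical lexical category disambiguation:
# choose each word's category using category-bigram statistics
# P(category | previous category) and lexical statistics P(word | category).
# All probabilities below are toy values invented for this example.
import math

trans = {
    ("<s>", "DET"): 0.6, ("<s>", "NOUN"): 0.2, ("<s>", "VERB"): 0.2,
    ("DET", "NOUN"): 0.8, ("DET", "VERB"): 0.1, ("DET", "DET"): 0.1,
    ("NOUN", "VERB"): 0.6, ("NOUN", "NOUN"): 0.3, ("NOUN", "DET"): 0.1,
    ("VERB", "DET"): 0.5, ("VERB", "NOUN"): 0.4, ("VERB", "VERB"): 0.1,
}
emit = {
    ("the", "DET"): 1.0,
    ("duck", "NOUN"): 0.7, ("duck", "VERB"): 0.3,  # ambiguous word
    ("swims", "VERB"): 1.0,
}

def disambiguate(words, categories=("DET", "NOUN", "VERB")):
    """Viterbi search for the most probable category sequence."""
    best = {"<s>": (0.0, [])}  # state -> (log-prob, path)
    for w in words:
        nxt = {}
        for c in categories:
            e = emit.get((w, c), 0.0)
            if e == 0.0:
                continue
            for pc, (lp, path) in best.items():
                t = trans.get((pc, c), 0.0)
                if t == 0.0:
                    continue
                score = lp + math.log(t) + math.log(e)
                if c not in nxt or score > nxt[c][0]:
                    nxt[c] = (score, path + [c])
        best = nxt
    return max(best.values())[1]

print(disambiguate(["the", "duck", "swims"]))
```

    The point of the module is that a decision like this can be made before syntactic parsing, purely from category statistics, resolving "duck" as a noun here without consulting the parser.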