
    Attribute Multiset Grammars for Global Explanations of Activities


    What sort of cognitive hypothesis is a derivational theory of grammar?

    This paper has two closely related aims. The main aim is to lay out one specific way in which the derivational aspects of a grammatical theory can contribute to the cognitive claims made by that theory, and so to demonstrate that testable cognitive hypotheses derive not only from a theory’s posited representations. This requires, however, an understanding of grammatical derivations that initially appears somewhat unnatural in the context of modern generative syntax. The second aim is to argue that this impression is misleading: certain accidents of the way our theories developed over the decades have made it artificially difficult to apply the understanding of derivations that I adopt to modern generative grammar. Comparisons with other derivational formalisms and with earlier generative grammars serve to clarify the question of how derivational systems can, in general, constitute hypotheses about mental phenomena.

    The use of non-formal information in reverse engineering and software reuse

    This thesis was submitted for the degree of Doctor of Philosophy and awarded by Brunel University. Within the field of software maintenance, both reverse engineering and software reuse have been suggested as ways of salvaging some of the investment made in software that is now out of date. One goal shared by both reverse engineering and reuse is a desire to be able to redescribe source code, that is, to produce higher-level descriptions of existing code. The fundamental theme of this thesis is that, from a maintenance perspective, source code should be considered primarily as a text. This emphasizes its role as a medium for communication between humans rather than as a medium for human-computer communication. Characteristic of this view is the need to incorporate the analysis of non-formal information, such as comments and identifier names, when developing tools to redescribe code. Many existing tools fail to do this. To justify this text-based view of source code, an investigation into the possible use of non-formal information to index pieces of source code was undertaken. This involved attempting to assign descriptors that represent the code's function to pieces of source code from IBM's CICS project. The results of this investigation support the view that the use of non-formal information can be of practical value in redescribing source code. However, the results fail to suggest that using non-formal information will overcome any of the major difficulties associated with developing tools to redescribe code. This is used to suggest future directions for research.
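The indexing experiment described in the abstract can be sketched in miniature. The snippet below is a toy illustration, not the thesis's actual method: the descriptor lexicon, cue words and sample code are all invented for the example. It extracts words from comments and identifier names (the non-formal information) and assigns a functional descriptor whenever the extracted vocabulary overlaps a descriptor's cue words.

```python
import re

def nonformal_tokens(source: str) -> set[str]:
    """Collect words from comments and identifier names in a C-like source text."""
    words = set()
    # Words appearing inside /* ... */ and // comments.
    for c in re.findall(r"/\*.*?\*/|//[^\n]*", source, flags=re.DOTALL):
        words |= {w.lower() for w in re.findall(r"[A-Za-z]+", c)}
    # Identifier names, split on underscores and camelCase boundaries.
    for ident in re.findall(r"\b[A-Za-z_][A-Za-z0-9_]*\b", source):
        for part in re.split(r"_|(?<=[a-z])(?=[A-Z])", ident):
            if part:
                words.add(part.lower())
    return words

# Hypothetical descriptor lexicon: functional descriptor -> cue words.
DESCRIPTORS = {
    "sorting": {"sort", "order", "compare", "swap"},
    "io": {"read", "write", "file", "buffer"},
}

def assign_descriptors(source: str) -> list[str]:
    """Assign every descriptor whose cue words overlap the code's vocabulary."""
    toks = nonformal_tokens(source)
    return [d for d, cues in DESCRIPTORS.items() if toks & cues]

code = """
/* Sort records in ascending order. */
void sort_records(int *recs, int n) { /* swap-based */ }
"""
print(assign_descriptors(code))  # ['sorting']
```

A keyword-overlap scheme like this is deliberately crude; its point is only that comments and identifier names alone already carry enough signal to index a fragment by function.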

    Unsupervised Morphological Segmentation and Part-of-Speech Tagging for Low-Resource Scenarios

    With the high cost of manually labeling data and the increasing interest in low-resource languages, for which human annotators might not even be available, unsupervised approaches have become essential for processing a typologically diverse set of languages, whether high-resource or low-resource. In this work, we propose new fully unsupervised approaches for two tasks in morphology: unsupervised morphological segmentation and unsupervised cross-lingual part-of-speech (POS) tagging, which are essential subtasks for several downstream NLP applications, such as machine translation, speech recognition, information extraction and question answering. We propose a new unsupervised morphological-segmentation approach that utilizes Adaptor Grammars (AGs), nonparametric Bayesian models that generalize probabilistic context-free grammars (PCFGs), where a PCFG models word structure in the task of morphological segmentation. We implement the approach as a publicly available morphological-segmentation framework, MorphAGram, which enables unsupervised morphological segmentation through the use of several proposed language-independent grammars. In addition, the framework allows for the use of scholar knowledge, when available, in the form of affixes that can be seeded into the grammars. The framework handles the cases where the scholar-seeded knowledge is either generated from language resources, possibly by someone who does not know the language, as weak linguistic priors, or generated by an expert in the underlying language as strong linguistic priors. Another form of linguistic priors is the design of a grammar that models language-dependent specifications. We also propose a fully unsupervised learning setting that approximates the effect of scholar-seeded knowledge through self-training.
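The role a PCFG plays in segmentation can be illustrated with a deliberately tiny sketch: a two-rule grammar, Word -> Stem Suffix | Stem, whose rule and morpheme probabilities are invented for this example rather than learned the way an Adaptor Grammar (or MorphAGram) would learn them. The highest-probability derivation of a word then yields its segmentation.

```python
import math

# Toy morpheme probabilities, assumed for illustration only. An Adaptor
# Grammar instead learns whole-morpheme probabilities nonparametrically
# from unlabeled data rather than using a fixed lexicon like this.
STEM_P = {"walk": 0.4, "talk": 0.3, "walke": 0.01}
SUFFIX_P = {"ed": 0.5, "ing": 0.4, "s": 0.08, "d": 0.02}

def best_segmentation(word: str) -> tuple:
    """Return the highest-probability segmentation under the toy grammar
    Word -> Stem Suffix (prob 0.7) | Stem (prob 0.3)."""
    candidates = []
    # Word -> Stem: the whole word is a bare stem.
    if word in STEM_P:
        candidates.append((math.log(0.3) + math.log(STEM_P[word]), (word,)))
    # Word -> Stem Suffix: try every split point.
    for i in range(1, len(word)):
        stem, suf = word[:i], word[i:]
        if stem in STEM_P and suf in SUFFIX_P:
            lp = math.log(0.7) + math.log(STEM_P[stem]) + math.log(SUFFIX_P[suf])
            candidates.append((lp, (stem, suf)))
    return max(candidates)[1] if candidates else (word,)

print(best_segmentation("walked"))  # ('walk', 'ed')
print(best_segmentation("walk"))    # ('walk',)
```

Note that "walked" admits two analyses under the toy lexicon (walk+ed and walke+d); the grammar's probabilities, not hand-written rules, select the right one. Replacing the fixed lexicons with distributions learned from raw text is what turns this sketch into an unsupervised segmenter.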
Moreover, since no single grammar works best across all languages, we propose an approach that picks a nearly optimal configuration (a learning setting and a grammar) for an unseen language, one not seen during development. Finally, we examine multilingual learning for unsupervised morphological segmentation in low-resource setups. For unsupervised POS tagging, two cross-lingual approaches have been widely adopted: 1) annotation projection, where POS annotations are projected through an aligned parallel text from a source language, for which a POS tagger is available, to the target language prior to training a POS model; and 2) zero-shot model transfer, where a model of a source language is directly applied to texts in the target language. We propose an end-to-end architecture for unsupervised cross-lingual POS tagging via annotation projection in truly low-resource scenarios that do not assume access to parallel corpora that are large in size or represent a specific domain. We integrate and expand the best practices in alignment and projection and design a rich neural architecture that exploits non-contextualized and transformer-based contextualized word embeddings, affix embeddings and word-cluster embeddings. Additionally, since parallel data might be available between the target language and multiple source ones, as in the case of the Bible, we propose different approaches for learning from multiple sources. Finally, we combine our work on unsupervised morphological segmentation and unsupervised cross-lingual POS tagging by conducting unsupervised stem-based cross-lingual POS tagging via annotation projection, which relies on the stem as the core unit of abstraction for alignment and projection, an approach beneficial to low-resource morphologically complex languages.
We also examine morpheme-based alignment and projection, the use of linguistic priors towards better POS models and the use of segmentation information as learning features in the neural architecture. We conduct comprehensive evaluation and analysis to assess the performance of our approaches to unsupervised morphological segmentation and unsupervised POS tagging and show that they achieve state-of-the-art performance on the two morphology tasks when evaluated on a large set of languages of different typologies: analytic, fusional, agglutinative and synthetic/polysynthetic.
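The core of annotation projection can be sketched minimally: given POS tags on a source sentence and automatic word alignments into the target sentence, each target token takes the majority tag among the source tokens aligned to it. The snippet below is a bare sketch with invented tags and alignments, not the abstract's architecture; real systems add alignment filtering, projection heuristics and a neural tagger trained on the projected labels.

```python
from collections import Counter

def project_tags(src_tags, alignments, tgt_len):
    """Project POS tags from a tagged source sentence onto a target
    sentence through word alignments. alignments is a list of
    (source index, target index) pairs, e.g. from an automatic aligner.
    Unaligned target tokens receive 'X' (unknown)."""
    votes = [Counter() for _ in range(tgt_len)]
    for s, t in alignments:
        votes[t][src_tags[s]] += 1
    # Majority vote per target token; many-to-one alignments are resolved here.
    return [v.most_common(1)[0][0] if v else "X" for v in votes]

# Source "the dog runs" projected onto a hypothetical 2-token target
# whose first token aligns to "dog" and second to "runs".
src_tags = ["DET", "NOUN", "VERB"]
alignments = [(1, 0), (2, 1)]
print(project_tags(src_tags, alignments, 2))  # ['NOUN', 'VERB']
```

Stem-based projection, as described above, would run the same procedure over stems rather than surface words, so that morphologically rich target tokens align more reliably to their source counterparts.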

    Rich Linguistic Structure from Large-Scale Web Data

    The past two decades have shown an unexpected effectiveness of Web-scale data in natural language processing. Even the simplest models, when paired with unprecedented amounts of unstructured and unlabeled Web data, have been shown to outperform sophisticated ones. It has been argued that the effectiveness of Web-scale data has undermined the necessity of sophisticated modeling or laborious data-set curation. In this thesis, we argue for and illustrate an alternative view: Web-scale data not only serves to improve the performance of simple models but can also allow the use of qualitatively more sophisticated models that would not otherwise be deployable, leading to even further performance gains.