321 research outputs found

    Proceedings

    Proceedings of the Ninth International Workshop on Treebanks and Linguistic Theories. Editors: Markus Dickinson, Kaili Müürisep and Marco Passarotti. NEALT Proceedings Series, Vol. 9 (2010), 268 pages. © 2010 The editors and contributors. Published by Northern European Association for Language Technology (NEALT) http://omilia.uio.no/nealt . Electronically published at Tartu University Library (Estonia) http://hdl.handle.net/10062/15891

    Towards a machine-learning architecture for lexical functional grammar parsing

    Data-driven grammar induction aims at producing wide-coverage grammars of human languages. Initial efforts in this field produced relatively shallow linguistic representations such as phrase-structure trees, which only encode constituent structure. Recent work on inducing deep grammars from treebanks addresses this shortcoming by also recovering non-local dependencies and grammatical relations. My aim is to investigate the issues arising when adapting an existing Lexical Functional Grammar (LFG) induction method to a new language and treebank, and to find solutions which will generalize robustly across multiple languages. The research hypothesis is that by exploiting machine-learning algorithms to learn morphological features, lemmatization classes and grammatical functions from treebanks we can reduce the amount of manual specification and improve robustness, accuracy, and domain- and language-independence for LFG parsing systems. Function labels can often be mapped to LFG grammatical functions in a relatively straightforward way. Learning them reliably permits grammar induction to depend less on language-specific LFG annotation rules. I therefore propose ways to improve acquisition of function labels from treebanks and translate those improvements into better-quality f-structure parsing. In a lexicalized grammatical formalism such as LFG, a large amount of syntactically relevant information comes from lexical entries. It is therefore important to be able to perform morphological analysis in an accurate and robust way for morphologically rich languages. I propose a fully data-driven supervised method to simultaneously lemmatize and morphologically analyze text, and obtain competitive or improved results on a range of typologically diverse languages.

    Proceedings of the Seventh International Conference Formal Approaches to South Slavic and Balkan languages

    Proceedings of the Seventh International Conference Formal Approaches to South Slavic and Balkan Languages publishes 17 papers that were presented at the conference organised in Dubrovnik, Croatia, 4-6 October 2010.

    Proceedings of the Second Workshop on Annotation of Corpora for Research in the Humanities (ACRH-2). 29 November 2012, Lisbon, Portugal

    Proceedings of the Second Workshop on Annotation of Corpora for Research in the Humanities (ACRH-2), held in Lisbon, Portugal on 29 November 2012

    Automatic grammar induction from free text using insights from cognitive grammar

    Automatic identification of the grammatical structure of a sentence is useful in many Natural Language Processing (NLP) applications such as Document Summarisation, Question Answering systems and Machine Translation. With the availability of syntactic treebanks, supervised parsers have been developed successfully for many major languages. However, for low-resourced minority languages with fewer digital resources, this poses more of a challenge. Moreover, there are a number of syntactic annotation schemes motivated by different linguistic theories and formalisms which are sometimes language specific and they cannot always be adapted for developing syntactic parsers across different language families. This project aims to develop a linguistically motivated approach to the automatic induction of grammatical structures from raw sentences. Such an approach can be readily adapted to different languages including low-resourced minority languages. We draw the basic approach to linguistic analysis from usage-based, functional theories of grammar such as Cognitive Grammar, Computational Paninian Grammar and insights from psycholinguistic studies. Our approach identifies grammatical structure of a sentence by recognising domain-independent, general, cognitive patterns of conceptual organisation that occur in natural language. It also reflects some of the general psycholinguistic properties of parsing by humans - such as incrementality, connectedness and expectation. Our implementation has three components: Schema Definition, Schema Assembly and Schema Prediction. Schema Definition and Schema Assembly components were implemented algorithmically as a dictionary and rules. An Artificial Neural Network was trained for Schema Prediction. 
By using part-of-speech tags to bootstrap the simplest case of token-level schema definitions, a sentence is passed through all three components incrementally until all the words are exhausted and the entire sentence is analysed as an instance of one final construction schema. The order in which all intermediate schemas are assembled to form the final schema can be viewed as the parse of the sentence. Parsers for English and Welsh (a low-resource minority language) were developed using the same approach with some changes to the Schema Definition component. We evaluated parser performance in four ways: (a) quantitative evaluation, comparing the parsed chunks against the constituents in a phrase-structure tree; (b) manual evaluation, listing the range of linguistic constructions covered by the parser and performing error analysis on the parser outputs; (c) evaluation by the number of edits required for a correct assembly; and (d) qualitative evaluation based on Likert scales in online surveys.

    Respecting Relations: Memory Access and Antecedent Retrieval in Incremental Sentence Processing

    This dissertation uses the processing of anaphoric relations to probe how linguistic information is encoded in and retrieved from memory during real-time sentence comprehension. More specifically, the dissertation attempts to resolve a tension between the demands of a linguistic processor implemented in a general-purpose cognitive architecture and the demands of abstract grammatical constraints that govern language use. The source of the tension is the role that abstract configurational relations (such as c-command, Reinhart 1983) play in constraining computations. Anaphoric dependencies are governed by formal grammatical constraints stated in terms of relations. For example, Binding Principle A (Chomsky 1981) requires that antecedents for local anaphors (like the English reciprocal each other) bear the c-command relation to those anaphors. In incremental sentence processing, antecedents of anaphors must be retrieved from memory. Recent research has motivated a model of processing that exploits a cue-based, associative retrieval process in content-addressable memory (e.g. Lewis, Vasishth & Van Dyke 2006) in which relations such as c-command are difficult to use as cues for retrieval. As such, the c-command constraints of formal grammars are predicted to be poorly implemented by the retrieval mechanism. I examine retrieval's sensitivity to three constraints on anaphoric dependencies: Principle A (via Hindi local reciprocal licensing), the Scope Constraint on bound-variable pronoun licensing (often stated as a c-command constraint, though see Barker 2012), and Crossover constraints on pronominal binding (Postal 1971, Wasow 1972). The data suggest that retrieval exhibits fidelity to the constraints: structurally inaccessible NPs that match an anaphoric element in morphological features do not interfere with the retrieval of an antecedent in most cases considered. 
In spite of this alignment, I argue that retrieval's apparent sensitivity to c-command constraints need not motivate a memory access procedure that makes direct reference to c-command relations. Instead, proxy features and general parsing operations conspire to mimic the extension of a system that respects c-command constraints. These strategies provide a robust approximation of grammatical performance while remaining within the confines of an independently motivated general-purpose cognitive architecture.
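
The contrast drawn in this abstract between feature-based cues and relational constraints can be illustrated with a toy model. The sketch below is not the dissertation's model: the `Chunk` class, the feature names, and in particular the `clausemate_subject` feature (standing in for a "proxy feature") are hypothetical choices made only for illustration. It shows how a retrieval that uses morphological cues alone cannot separate two plural NPs, while adding a proxy feature lets the structurally accessible NP win without any direct reference to c-command.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """A memory item encoded as a bundle of features (all names are illustrative)."""
    label: str
    features: dict = field(default_factory=dict)


def match_scores(memory, cues):
    """Count, for each chunk, how many retrieval cues its features match."""
    return {
        chunk.label: sum(1 for name, value in cues.items()
                         if chunk.features.get(name) == value)
        for chunk in memory
    }


def retrieve(memory, cues):
    """Return the label of the best-matching chunk (the first one wins a tie)."""
    scores = match_scores(memory, cues)
    return max(scores, key=scores.get)


# Two plural NPs; only the second is a licit antecedent for a local reciprocal.
memory = [
    Chunk("inaccessible NP", {"number": "pl", "clausemate_subject": False}),
    Chunk("accessible NP", {"number": "pl", "clausemate_subject": True}),
]

# Morphological cue only: both chunks match equally, so interference is possible.
morph_only = match_scores(memory, {"number": "pl"})

# Adding the hypothetical proxy cue separates the two candidates.
with_proxy = match_scores(memory, {"number": "pl", "clausemate_subject": True})
best = retrieve(memory, {"number": "pl", "clausemate_subject": True})
```

In this toy setting the proxy feature does the work that a c-command cue would do, which is the general shape of the argument the abstract describes; how such features are actually assigned during parsing is outside the scope of the sketch.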

    Automatic Generation of Morpheme Level Reordering Rules for Korean to English Machine Translation

    Master's thesis, Seoul National University Graduate School, Department of Linguistics, February 2017. 신효필. Word order is one of the main challenges that machine translation systems must overcome when dealing with any linguistically divergent language pair, such as Korean and English. Statistical machine translation (SMT) models often perform poorly at long-distance reordering because of the distortion penalties they impose. Rule-based systems, on the other hand, are often costly, in both time and money, to build and maintain. The present research proposes a new hybrid approach for Korean to English machine translation. While previous approaches have focused on the word, our approach considers the morpheme as the basic unit of translation for this language pair. We begin by developing a classification model to disambiguate Korean functional morphemes based on alignment pairs and context feature data. Then, according to our automatically generated rules, we apply this model in a preprocessing step to reorder the morphemes to better match English sentence structure. After retraining our statistical translation system, Moses, results indicate an improvement in overall translation quality. When the SMT system's internal lexicalized reordering is restricted, we note an increase in the BLEU score of 3.5% over the SMT-only baseline. In the case where we do not limit decoding-time reordering, an even greater BLEU score increase of 4.42% is observed. We also find evidence to suggest that our changes enable Moses to execute additional reordering operations at decoding time that it was previously unable to perform. Contents: Chapter 1. Introduction. Chapter 2. Literature Review: 2.1 Machine Translation; 2.2 Reordering; 2.3 Korean to English MT. Chapter 3. Corpus Data and SMT System: 3.1 Background; 3.2 Preparation; 3.3 Moses. Chapter 4. Rule Generation: 4.1 Corpus Processing (4.1.1 Suggested Korean-English Alignments; 4.1.2 Feature Sets; 4.1.3 Reordering Movement); 4.2 Rule Creation; 4.3 Input Preprocessing (4.3.1 Rule Matching; 4.3.2 Morpheme Reordering); 4.4 Examples. Chapter 5. Results. Chapter 6. Conclusion. References. Appendix A: Rules. Abstract in Korean.
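
The preprocessing step described in this abstract (match a rule, then reorder morphemes before translation) can be sketched in a few lines. This is only an illustration under stated assumptions: the thesis generates its rules automatically from alignment data, whereas the single rule below is hand-written, and the tag names (`NOUN`, `POSTP`, `VERB`), the romanized morphemes, and the `apply_rules` helper are all hypothetical.

```python
def apply_rules(morphemes, rules):
    """Scan left to right; where a rule's tag pattern matches, emit the
    matched morphemes in the order the rule specifies."""
    out = []
    i = 0
    while i < len(morphemes):
        for pattern, order in rules:
            window = morphemes[i:i + len(pattern)]
            if tuple(tag for _, tag in window) == pattern:
                out.extend(window[j] for j in order)
                i += len(pattern)
                break
        else:
            # No rule matched at this position: copy the morpheme unchanged.
            out.append(morphemes[i])
            i += 1
    return out


# Hypothetical rule: move a postposition in front of its noun,
# so that it sits where an English preposition would.
rules = [(("NOUN", "POSTP"), (1, 0))]

# Romanized, hand-tagged input (roughly "school" + "at" + "study").
source = [("hakgyo", "NOUN"), ("eseo", "POSTP"), ("gongbuha", "VERB")]
reordered = apply_rules(source, rules)
```

A real system would also need the disambiguation classifier mentioned in the abstract to decide which functional morpheme reading applies before any rule is matched; that step is omitted here.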