82 research outputs found

    Proceedings of the 12th European Workshop on Natural Language Generation (ENLG 2009)

    Get PDF

    ANNOTATED DISJUNCT FOR MACHINE TRANSLATION

    Get PDF
    Most information found in the Internet is available in English version. However, most people in the world are non-English speaker. Hence, it will be of great advantage to have reliable Machine Translation tool for those people. There are many approaches for developing Machine Translation (MT) systems, some of them are direct, rule-based/transfer, interlingua, and statistical approaches. This thesis focuses on developing an MT for less resourced languages i.e. languages that do not have available grammar formalism, parser, and corpus, such as some languages in South East Asia. The nonexistence of bilingual corpora motivates us to use direct or transfer approaches. Moreover, the unavailability of grammar formalism and parser in the target languages motivates us to develop a hybrid between direct and transfer approaches. This hybrid approach is referred as a hybrid transfer approach. This approach uses the Annotated Disjunct (ADJ) method. This method, based on Link Grammar (LG) formalism, can theoretically handle one-to-one, many-to-one, and many-to-many word(s) translations. This method consists of transfer rules module which maps source words in a source sentence (SS) into target words in correct position in a target sentence (TS). The developed transfer rules are demonstrated on English ā†’ Indonesian translation tasks. An experimental evaluation is conducted to measure the performance of the developed system over available English-Indonesian MT systems. The developed ADJ-based MT system translated simple, compound, and complex English sentences in present, present continuous, present perfect, past, past perfect, and future tenses with better precision than other systems, with the accuracy of 71.17% in Subjective Sentence Error Rate metric

    Acquisition and modeling of lexical knowledge: a corpus-based investigation of systematic polysemy

    Get PDF

    Towards generic relation extraction

    Get PDF
    A vast amount of usable electronic data is in the form of unstructured text. The relation extraction task aims to identify useful information in text (e.g., PersonW works for OrganisationX, GeneY encodes ProteinZ) and recode it in a format such as a relational database that can be more effectively used for querying and automated reasoning. However, adapting conventional relation extraction systems to new domains or tasks requires significant effort from annotators and developers. Furthermore, previous adaptation approaches based on bootstrapping start from example instances of the target relations, thus requiring that the correct relation type schema be known in advance. Generic relation extraction (GRE) addresses the adaptation problem by applying generic techniques that achieve comparable accuracy when transferred, without modification of model parameters, across domains and tasks. Previous work on GRE has relied extensively on various lexical and shallow syntactic indicators. I present new state-of-the-art models for GRE that incorporate governordependency information. I also introduce a dimensionality reduction step into the GRE relation characterisation sub-task, which serves to capture latent semantic information and leads to significant improvements over an unreduced model. Comparison of dimensionality reduction techniques suggests that latent Dirichlet allocation (LDA) ā€“ a probabilistic generative approach ā€“ successfully incorporates a larger and more interdependent feature set than a model based on singular value decomposition (SVD) and performs as well as or better than SVD on all experimental settings. Finally, I will introduce multi-document summarisation as an extrinsic test bed for GRE and present results which demonstrate that the relative performance of GRE models is consistent across tasks and that the GRE-based representation leads to significant improvements over a standard baseline from the literature. Taken together, the experimental results 1) show that GRE can be improved using dependency parsing and dimensionality reduction, 2) demonstrate the utility of GRE for the content selection step of extractive summarisation and 3) validate the GRE claim of modification-free adaptation for the first time with respect to both domain and task. This thesis also introduces data sets derived from publicly available corpora for the purpose of rigorous intrinsic evaluation in the news and biomedical domains

    Looking Beyond the Canonical Formulation and Evaluation Paradigm of Prepositional Phrase Attachment

    Get PDF
    Prepositional phrase attachment has long been considered one of the most difficult tasks in automated syntactic parsing of natural language text. In this thesis, we examine several aspects of what has become the dominant view of PP attachment in natural language processing with an eye toward extending this view to a more realistic account of the problem. In particular, we take issue with the manner in which most PP attachment work is evaluated, and the degree to which traditional assumptions and simplifications no longer allow for realistically meaningful assessments. We also argue for looking beyond the canonical subset of attachment problems, where almost all attention has been focused, toward a fuller view of the task, both in terms of the types of ambiguities addressed and the contextual information considered

    Nodalida 2005 - proceedings of the 15th NODALIDA conference

    Get PDF

    ANNOTATED DISJUNCT FOR MACHINE TRANSLATION

    Get PDF
    Most information found in the Internet is available in English version. However, most people in the world are non-English speaker. Hence, it will be of great advantage to have reliable Machine Translation tool for those people. There are many approaches for developing Machine Translation (MT) systems, some of them are direct, rule-based/transfer, interlingua, and statistical approaches. This thesis focuses on developing an MT for less resourced languages i.e. languages that do not have available grammar formalism, parser, and corpus, such as some languages in South East Asia. The nonexistence of bilingual corpora motivates us to use direct or transfer approaches. Moreover, the unavailability of grammar formalism and parser in the target languages motivates us to develop a hybrid between direct and transfer approaches. This hybrid approach is referred as a hybrid transfer approach. This approach uses the Annotated Disjunct (ADJ) method. This method, based on Link Grammar (LG) formalism, can theoretically handle one-to-one, many-to-one, and many-to-many word(s) translations. This method consists of transfer rules module which maps source words in a source sentence (SS) into target words in correct position in a target sentence (TS). The developed transfer rules are demonstrated on English ā†’ Indonesian translation tasks. An experimental evaluation is conducted to measure the performance of the developed system over available English-Indonesian MT systems. The developed ADJ-based MT system translated simple, compound, and complex English sentences in present, present continuous, present perfect, past, past perfect, and future tenses with better precision than other systems, with the accuracy of 71.17% in Subjective Sentence Error Rate metric

    Supervised Training on Synthetic Languages: A Novel Framework for Unsupervised Parsing

    Get PDF
    This thesis focuses on unsupervised dependency parsingā€”parsing sentences of a language into dependency trees without accessing the training data of that language. Different from most prior work that uses unsupervised learning to estimate the parsing parameters, we estimate the parameters by supervised training on synthetic languages. Our parsing framework has three major components: Synthetic language generation gives a rich set of training languages by mix-and-match over the real languages; surface-form feature extraction maps an unparsed corpus of a language into a fixed-length vector as the syntactic signature of that language; and, finally, language-agnostic parsing incorporates the syntactic signature during parsing so that the decision on each word token is reliant upon the general syntax of the target language. The fundamental question we are trying to answer is whether some useful information about the syntax of a language could be inferred from its surface-form evidence (unparsed corpus). This is the same question that has been implicitly asked by previous papers on unsupervised parsing, which only assumes an unparsed corpus to be available for the target language. We show that, indeed, useful features of the target language can be extracted automatically from an unparsed corpus, which consists only of gold part-of-speech (POS) sequences. Providing these features to our neural parser enables it to parse sequences like those in the corpus. Strikingly, our system has no supervision in the target language. Rather, it is a multilingual system that is trained end-to-end on a variety of other languages, so it learns a feature extractor that works well. This thesis contains several large-scale experiments requiring hundreds of thousands of CPU-hours. To our knowledge, this is the largest study of unsupervised parsing yet attempted. We show experimentally across multiple languages: (1) Features computed from the unparsed corpus improve parsing accuracy. (2) Including thousands of synthetic languages in the training yields further improvement. (3) Despite being computed from unparsed corpora, our learned task-specific features beat previous worksā€™ interpretable typological features that require parsed corpora or expert categorization of the language
    • ā€¦
    corecore