3,285 research outputs found

    Bootstrapping Lexical Choice via Multiple-Sequence Alignment

    Get PDF
    An important component of any generation system is the mapping dictionary, a lexicon of elementary semantic expressions and corresponding natural language realizations. Typically, labor-intensive knowledge-based methods are used to construct the dictionary. We instead propose to acquire it automatically via a novel multiple-pass algorithm employing multiple-sequence alignment, a technique commonly used in bioinformatics. Crucially, our method leverages latent information contained in multi-parallel corpora -- datasets that supply several verbalizations of the corresponding semantics rather than just one. We used our techniques to generate natural language versions of computer-generated mathematical proofs, with good results on both a per-component and overall-output basis. For example, in evaluations involving a dozen human judges, our system produced output whose readability and faithfulness to the semantic input rivaled that of a traditional generation system.Comment: 8 pages; to appear in the proceedings of EMNLP-200

    A prototype system for machine translation from English to South African Sign Language using synchronous tree adjoining grammars

    Get PDF
    Thesis (MSc)--University of Stellenbosch, 2007.ENGLISH ABSTRACT: Machine translation, especially machine translation for sign languages, remains an active research area. Sign language machine translation presents unique challenges to the whole machine translation process. In this thesis a prototype machine translation system is presented. This system is designed to translate English text into a gloss based representation of South African Sign Language (SASL). In order to perform the machine translation, a transfer based approach was taken. English text is parsed into an intermediate representation. Translation rules are then applied to this intermediate representation to transform it into an equivalent intermediate representation for the SASL glosses. For both these intermediate representations, a tree adjoining grammar (TAG) formalism is used. As part of the prototype machine translation system, a TAG parser was implemented. The translation rules used by the system were derived from a SASL phrase book. This phrase book was also used to create a small gloss based SASL TAG grammar. Lastly, some additional tools, for the editing of TAG trees, were also added to the prototype system.AFRIKAANSE OPSOMMING: Masjienvertaling, veral masjienvertaling vir gebaretale, bly ’n aktiewe navorsingsgebied. Masjienvertaling vir gebaretale bied unieke uitdagings tot die hele masjienvertalingproses. In hierdie tesis bied ons ’n prototipe masjienvertalingstelsel aan. Hierdie stelsel is ontwerp om Engelse teks te vertaal na ’n glos gebaseerde voorstelling van Suid-Afrikaanse Gebaretaal (SAG). Ons vertalingstelsel maak gebruik van ’n oorplasingsbenadering tot masjienvertaling. Engelse teks word ontleed na ’n intermediˆere vorm. Vertalingre¨els word toegepas op hierdie intermediˆere vorm om dit te transformeer na ’n ekwivalente intermediˆere vorm vir die SAG glosse. Vir beide hierdie intermediˆere vorms word boomkoppelingsgrammatikas (BKGs) gebruik. As deel van die prototipe masjienvertalingstelsel, is ’n BKG sintaksontleder ge¨ımplementeer. Die vertalingre¨els wat gebruik word deur die stelsel, is afgelei vanaf ’n SAG fraseboek. Hierdie fraseboek was ook gebruik om ’n klein BKG vir SAG glosse te ontwikkel. Laastens was addisionele nutsfasiliteite, vir die redigering van BKG bome, ontwikkel

    A prototype system for machine translation from English to South African Sign Language using synchronous tree adjoining grammars

    Get PDF
    Thesis (MSc)--University of Stellenbosch, 2007.ENGLISH ABSTRACT: Machine translation, especially machine translation for sign languages, remains an active research area. Sign language machine translation presents unique challenges to the whole machine translation process. In this thesis a prototype machine translation system is presented. This system is designed to translate English text into a gloss based representation of South African Sign Language (SASL). In order to perform the machine translation, a transfer based approach was taken. English text is parsed into an intermediate representation. Translation rules are then applied to this intermediate representation to transform it into an equivalent intermediate representation for the SASL glosses. For both these intermediate representations, a tree adjoining grammar (TAG) formalism is used. As part of the prototype machine translation system, a TAG parser was implemented. The translation rules used by the system were derived from a SASL phrase book. This phrase book was also used to create a small gloss based SASL TAG grammar. Lastly, some additional tools, for the editing of TAG trees, were also added to the prototype system.AFRIKAANSE OPSOMMING: Masjienvertaling, veral masjienvertaling vir gebaretale, bly ’n aktiewe navorsingsgebied. Masjienvertaling vir gebaretale bied unieke uitdagings tot die hele masjienvertalingproses. In hierdie tesis bied ons ’n prototipe masjienvertalingstelsel aan. Hierdie stelsel is ontwerp om Engelse teks te vertaal na ’n glos gebaseerde voorstelling van Suid-Afrikaanse Gebaretaal (SAG). Ons vertalingstelsel maak gebruik van ’n oorplasingsbenadering tot masjienvertaling. Engelse teks word ontleed na ’n intermediˆere vorm. Vertalingre¨els word toegepas op hierdie intermediˆere vorm om dit te transformeer na ’n ekwivalente intermediˆere vorm vir die SAG glosse. Vir beide hierdie intermediˆere vorms word boomkoppelingsgrammatikas (BKGs) gebruik. As deel van die prototipe masjienvertalingstelsel, is ’n BKG sintaksontleder ge¨ımplementeer. Die vertalingre¨els wat gebruik word deur die stelsel, is afgelei vanaf ’n SAG fraseboek. Hierdie fraseboek was ook gebruik om ’n klein BKG vir SAG glosse te ontwikkel. Laastens was addisionele nutsfasiliteite, vir die redigering van BKG bome, ontwikkel

    Syntax-based machine translation using dependency grammars and discriminative machine learning

    Get PDF
    Machine translation underwent huge improvements since the groundbreaking introduction of statistical methods in the early 2000s, going from very domain-specific systems that still performed relatively poorly despite the painstakingly crafting of thousands of ad-hoc rules, to general-purpose systems automatically trained on large collections of bilingual texts which manage to deliver understandable translations that convey the general meaning of the original input. These approaches however still perform quite below the level of human translators, typically failing to convey detailed meaning and register, and producing translations that, while readable, are often ungrammatical and unidiomatic. This quality gap, which is considerably large compared to most other natural language processing tasks, has been the focus of the research in recent years, with the development of increasingly sophisticated models that attempt to exploit the syntactical structure of human languages, leveraging the technology of statistical parsers, as well as advanced machine learning methods such as marging-based structured prediction algorithms and neural networks. The translation software itself became more complex in order to accommodate for the sophistication of these advanced models: the main translation engine (the decoder) is now often combined with a pre-processor which reorders the words of the source sentences to a target language word order, or with a post-processor that ranks and selects a translation according according to fine model from a list of candidate translations generated by a coarse model. In this thesis we investigate the statistical machine translation problem from various angles, focusing on translation from non-analytic languages whose syntax is best described by fluid non-projective dependency grammars rather than the relatively strict phrase-structure grammars or projectivedependency grammars which are most commonly used in the literature. We propose a framework for modeling word reordering phenomena between language pairs as transitions on non-projective source dependency parse graphs. We quantitatively characterize reordering phenomena for the German-to-English language pair as captured by this framework, specifically investigating the incidence and effects of the non-projectivity of source syntax and the non-locality of word movement w.r.t. the graph structure. We evaluated several variants of hand-coded pre-ordering rules in order to assess the impact of these phenomena on translation quality. We propose a class of dependency-based source pre-ordering approaches that reorder sentences based on a flexible models trained by SVMs and and several recurrent neural network architectures. We also propose a class of translation reranking models, both syntax-free and source dependency-based, which make use of a type of neural networks known as graph echo state networks which is highly flexible and requires extremely little training resources, overcoming one of the main limitations of neural network models for natural language processing tasks

    Coherence in Machine Translation

    Get PDF
    Coherence ensures individual sentences work together to form a meaningful document. When properly translated, a coherent document in one language should result in a coherent document in another language. In Machine Translation, however, due to reasons of modeling and computational complexity, sentences are pieced together from words or phrases based on short context windows and with no access to extra-sentential context. In this thesis I propose ways to automatically assess the coherence of machine translation output. The work is structured around three dimensions: entity-based coherence, coherence as evidenced via syntactic patterns, and coherence as evidenced via discourse relations. For the first time, I evaluate existing monolingual coherence models on this new task, identifying issues and challenges that are specific to the machine translation setting. In order to address these issues, I adapted a state-of-the-art syntax model, which also resulted in improved performance for the monolingual task. The results clearly indicate how much more difficult the new task is than the task of detecting shuffled texts. I proposed a new coherence model, exploring the crosslingual transfer of discourse relations in machine translation. This model is novel in that it measures the correctness of the discourse relation by comparison to the source text rather than to a reference translation. I identified patterns of incoherence common across different language pairs, and created a corpus of machine translated output annotated with coherence errors for evaluation purposes. I then examined lexical coherence in a multilingual context, as a preliminary study for crosslingual transfer. Finally, I determine how the new and adapted models correlate with human judgements of translation quality and suggest that improvements in general evaluation within machine translation would benefit from having a coherence component that evaluated the translation output with respect to the source text

    Research in the Language, Information and Computation Laboratory of the University of Pennsylvania

    Get PDF
    This report takes its name from the Computational Linguistics Feedback Forum (CLiFF), an informal discussion group for students and faculty. However the scope of the research covered in this report is broader than the title might suggest; this is the yearly report of the LINC Lab, the Language, Information and Computation Laboratory of the University of Pennsylvania. It may at first be hard to see the threads that bind together the work presented here, work by faculty, graduate students and postdocs in the Computer Science and Linguistics Departments, and the Institute for Research in Cognitive Science. It includes prototypical Natural Language fields such as: Combinatorial Categorial Grammars, Tree Adjoining Grammars, syntactic parsing and the syntax-semantics interface; but it extends to statistical methods, plan inference, instruction understanding, intonation, causal reasoning, free word order languages, geometric reasoning, medical informatics, connectionism, and language acquisition. Naturally, this introduction cannot spell out all the connections between these abstracts; we invite you to explore them on your own. In fact, with this issue it’s easier than ever to do so: this document is accessible on the “information superhighway”. Just call up http://www.cis.upenn.edu/~cliff-group/94/cliffnotes.html In addition, you can find many of the papers referenced in the CLiFF Notes on the net. Most can be obtained by following links from the authors’ abstracts in the web version of this report. The abstracts describe the researchers’ many areas of investigation, explain their shared concerns, and present some interesting work in Cognitive Science. We hope its new online format makes the CLiFF Notes a more useful and interesting guide to Computational Linguistics activity at Penn

    Software maintenance by program transformation in a wide spectrum language

    Get PDF
    This thesis addresses the software maintenance problem of extracting high-level designs from code. The investigated solution is to use a mathematically-based formal program transformation system. The resulting tool, the Maintainer's Assistant, is based on Ward's [177] WSL (wide spectrum language) and method of proving program equivalence. The problems addressed include: how to reverse engineer from code alone (the only reliable source of information about a program [158]), how to express program transformations within the system, what kinds of transformations should be incorporated, how to make the tool simple to use, how to perform abstraction and how to create a tool suitable for use with large programs. Using the Maintainer's Assistant, the program code is automatically translated into WSL and the transformations, although tested for valid applicability by the system, are interactively applied by the user. Notable features include a mathematical simplifier, a large flexible transformation catalogue and, significantly, the use of an extension of WSL, A4etaWSL, for representing the transformations. MetaWSL expands WSL by incorporating a variety of extensions, including: program editing statements, pattern matching and template filling functions, symbolic mathematics and logic functions, statements for moving within the program’s syntax tree and statements for repeating an operation at each node of the tree. Using MetaWSL, 80% of the 601 transformations can be expressed in less than 20 program statements. The Maintainer's Assistant has been used on a wide variety of examples of up to several thousand lines, including commercial software written in IBM 370 assembler. It has been possible to transform initially unstructured programs into a hierarchy of procedures, facilitating subsequent design recovery. These results show that program transformation is a viable method of renovating old (370 assembler) code in a cost elective way, and that MetaWSL provides an effective basis for clearly and concisely expressing the required transformations
    corecore