
    Robust large-scale EBMT with marker-based segmentation

    Previous work on marker-based EBMT [Gough & Way, 2003; Way & Gough, 2004] suffered from problems such as data sparseness and disparity between the training and test data. We have developed a large-scale, robust EBMT system. In a comparison with the systems listed in [Somers, 2003], ours is the third-largest EBMT system and certainly the largest English–French EBMT system. Previous work used the on-line MT system Logomedia to translate source-language material as a means of populating the system’s database where bitexts were unavailable. We instead derive our sententially aligned strings from a Sun Translation Memory (TM) and limit the use of Logomedia to the derivation of our word-level lexicon. We also use Logomedia as a baseline for comparison and observe that our system outperforms both Logomedia and previous marker-based EBMT systems in a number of tests.
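    The "marker hypothesis" behind such systems holds that a small set of closed-class words (determiners, prepositions, conjunctions, etc.) signals phrase boundaries, so sentences can be chunked at those words. A minimal sketch of this segmentation step, with an illustrative (not the system's actual) marker set:

    ```python
    # Toy marker-based segmentation: start a new chunk at each marker word.
    # The MARKERS set below is an assumption for illustration only.
    MARKERS = {"the", "a", "an", "in", "on", "of", "with", "and", "but", "to"}

    def marker_segment(sentence):
        """Split a whitespace-tokenized sentence into marker-initial chunks."""
        chunks, current = [], []
        for token in sentence.lower().split():
            if token in MARKERS and current:
                chunks.append(current)   # close the chunk built so far
                current = []
            current.append(token)
        if current:
            chunks.append(current)
        return [" ".join(c) for c in chunks]

    print(marker_segment("The dog ran in the park with a ball"))
    # ['the dog ran', 'in', 'the park', 'with', 'a ball']
    ```

    In an EBMT setting, chunks like these are aligned across the bitext and stored as sub-sentential translation fragments.
    
    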

    f-align: An open-source alignment tool for LFG f-structures

    Lexical-Functional Grammar (LFG) f-structures (Kaplan and Bresnan, 1982) have attracted some attention in recent years as an intermediate data representation for statistical machine translation. So far, however, there are no alignment tools capable of aligning f-structures directly, and plain word-alignment tools are used for this purpose. In this way, no use is made of the structural information contained in f-structures. We present the first version of a specialized open-source tool for aligning f-structures.

    Teaching and assessing empirical approaches to machine translation

    Empirical methods in Natural Language Processing (NLP) and Machine Translation (MT) have become mainstream in the research field. Accordingly, it is important that the tools and techniques of these paradigms be taught to potential future researchers and developers in university courses. While many dedicated courses on statistical NLP can be found, there are few, if any, courses on empirical approaches to MT. This paper presents the development and assessment of one such course as taught to final-year undergraduates taking a degree in NLP.

    The Effectiveness of Two-Phase Translation Method compared to Every-Match Method in Vocabulary Translation

    Dictionary use is more efficient with a digital dictionary, because it can be consulted anytime and anywhere. Vocabulary search in digital dictionaries can rely on several search methods, including the Every-Match translator method and the Two-Phase translator method. In this research, both methods were used to search Indonesian, English, and Arabic vocabulary, and their accuracy was then analyzed to determine which is the more accurate translation method. The study follows an experimental methodology, with the aim of determining the effect of a treatment on the experimental group's results. The Every-Match and Two-Phase translation methods are applied in a multilingual digital dictionary whose core is a process of matching keywords against a database; keywords in Indonesian serve as the basis for searching. The Two-Phase search proceeds in two phases: a search using the exact keyword entered, followed by a search over database entries whose keywords are merely similar to the one entered, the goal being to tolerate keyword typing errors. Based on the data analysis, a Mann-Whitney U test shows a highly significant difference between the Every-Match and Two-Phase methods in translating words from Indonesian into English and Arabic: the average score of Every-Match was 34.8 (70%), while the Two-Phase method scored 41.7 (83%).
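    The two-phase lookup described above can be sketched as an exact-match pass followed by a fuzzy fallback. The dictionary entries and the 0.6 similarity cutoff below are illustrative assumptions, not the paper's actual data or threshold:

    ```python
    import difflib

    def two_phase_lookup(keyword, dictionary):
        """Phase 1: exact keyword match; phase 2: closest similar keyword
        (tolerates typing errors). Returns None if neither phase matches."""
        if keyword in dictionary:            # phase 1: exact match
            return dictionary[keyword]
        # phase 2: fall back to the most similar stored keyword
        close = difflib.get_close_matches(keyword, dictionary.keys(),
                                          n=1, cutoff=0.6)
        return dictionary[close[0]] if close else None

    # Toy Indonesian -> English/Arabic entries (transliterations assumed).
    entries = {"buku":  {"en": "book",  "ar": "kitab"},
               "rumah": {"en": "house", "ar": "bayt"}}

    print(two_phase_lookup("buku", entries))   # exact hit in phase 1
    print(two_phase_lookup("bukv", entries))   # typo resolved in phase 2
    ```

    By contrast, an every-match method would stop after the exact-match pass, which is why misspelled keywords go untranslated under it.
    
    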

    Selective Sampling for Example-based Word Sense Disambiguation

    This paper proposes an efficient example sampling method for example-based word sense disambiguation systems. To construct a database of practical size, a considerable overhead for manual sense disambiguation (overhead for supervision) is required. In addition, the time complexity of searching a large-sized database poses a considerable problem (overhead for search). To counter these problems, our method selectively samples a smaller, effective subset from a given example set for use in word sense disambiguation. Our method is characterized by its reliance on the notion of training utility: the degree to which each example is informative for future example sampling when used for the training of the system. The system progressively collects examples by selecting those with greatest utility. The paper reports the effectiveness of our method through experiments on about one thousand sentences. Compared to experiments with other example sampling methods, our method reduced both the overhead for supervision and the overhead for search, without degrading the performance of the system.
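    The selection step can be sketched as greedy utility-ranked sampling. The paper's "training utility" is its own formal notion; here utility is approximated by classifier uncertainty (one minus the top sense probability), purely for illustration, and the pool of examples is invented:

    ```python
    def select_examples(pool, utility_fn, budget):
        """Greedily pick the `budget` examples with greatest utility
        from the unlabeled pool (a hypothetical one-shot variant of the
        paper's progressive collection loop)."""
        ranked = sorted(pool, key=utility_fn, reverse=True)
        return ranked[:budget]

    def uncertainty(example):
        # example = (sentence, per-sense probabilities); the less confident
        # the current model, the more informative the example.
        _, probs = example
        return 1.0 - max(probs)

    pool = [("bank of the river", [0.55, 0.45]),   # hard: high utility
            ("bank account fees", [0.95, 0.05]),   # easy: low utility
            ("steep river bank",  [0.60, 0.40])]

    print(select_examples(pool, uncertainty, budget=2))
    ```

    Only the selected examples are then hand-disambiguated and stored, which is how the method cuts both the supervision and the search overheads.
    
    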

    Inducing translation templates with type constraints

    This paper presents a generalization technique that induces translation templates from a given set of translation examples by replacing differing parts in the examples with typed variables. Since the type of each variable is inferred during the learning process, each induced template is also associated with a set of type constraints. The type constraints that are associated with a translation template restrict the usage of the translation template in certain contexts in order to avoid some of the wrong translations. The types of variables are induced using type lattices designed for both the source and target languages. The proposed generalization technique has been implemented as a part of an example-based machine translation system. © Springer Science+Business Media 2007
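    A drastically simplified sketch of the generalization step: given two example source sentences that differ in exactly one token, replace that token with a variable and record the observed fillers as a toy "type constraint". The real system works over aligned bilingual examples and infers types from source- and target-language type lattices; none of that is modeled here:

    ```python
    def induce_template(ex1, ex2):
        """Generalize two equal-length example sentences that differ in one
        token into a template plus a filler set (a toy type constraint)."""
        t1, t2 = ex1.split(), ex2.split()
        if len(t1) != len(t2):
            return None                  # toy version: equal-length only
        diffs = [i for i, (a, b) in enumerate(zip(t1, t2)) if a != b]
        if len(diffs) != 1:
            return None                  # generalize exactly one slot
        i = diffs[0]
        template = " ".join(t1[:i] + ["<X>"] + t1[i + 1:])
        return template, {t1[i], t2[i]}  # template + allowed fillers

    print(induce_template("I drank some water", "I drank some milk"))
    # ('I drank some <X>', {...})  -- filler set order may vary
    ```

    In the full system, the filler set would be abstracted to the lowest common type in the lattice (e.g. LIQUID), and that type then restricts which words may instantiate <X> at translation time.
    
    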

    wEBMT: developing and validating an example-based machine translation system using the world wide web

    We have developed an example-based machine translation (EBMT) system that uses the World Wide Web for two different purposes. First, we populate the system’s memory with translations gathered from rule-based MT systems located on the Web. The source strings input to these systems were extracted automatically from an extremely small subset of the rule types in the Penn-II Treebank. In subsequent stages, the (source, target) translation pairs obtained are automatically transformed into a series of resources that render the translation process more successful. Despite the fact that the output of on-line MT systems is often faulty, we demonstrate in a number of experiments that, when used to seed the memories of an EBMT system, they can in fact prove useful for generating translations of high quality in a robust fashion. In addition, we demonstrate the relative gain of EBMT in comparison to on-line systems. Second, despite the perception that the documents available on the Web are of questionable quality, we demonstrate, in contrast, that such resources are extremely useful for automatically post-editing translation candidates proposed by our system.