77 research outputs found

    Parallel Strands: A Preliminary Investigation into Mining the Web for Bilingual Text

    Parallel corpora are a valuable resource for machine translation, but at present their availability and utility are limited by genre- and domain-specificity, licensing restrictions, and the basic difficulty of locating parallel texts in all but the most dominant of the world's languages. A parallel corpus resource not yet explored is the World Wide Web, which hosts an abundance of pages in parallel translation, offering a potential solution to some of these problems and unique opportunities of its own. This paper presents the necessary first step in that exploration: a method for automatically finding parallel translated documents on the Web. The technique is conceptually simple, fully language-independent, and scalable, and preliminary evaluation results indicate that the method may be accurate enough to apply without human intervention. Comment: LaTeX2e, 11 pages, 7 EPS figures; uses psfig, llncs.cls, theapa.sty. An Appendix at http://umiacs.umd.edu/~resnik/amta98/amta98_appendix.html contains test data.
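
    A method like this must first pair pages that are plausibly translations of each other before any deeper comparison. A minimal sketch of one such pairing heuristic, matching URLs that differ only in a language substring; the token table and function below are illustrative assumptions, not the paper's implementation:

        import re

        # Language substrings whose substitution often signals a translated
        # page pair (e.g. ".../en/001.html" vs ".../fr/001.html").
        # An illustrative subset, not the paper's actual token list.
        LANG_TOKENS = {"en": "fr", "english": "french", "eng": "fre"}

        def candidate_pairs(urls):
            """Pair URLs that differ only by a language substring."""
            url_set = set(urls)
            pairs = []
            for url in urls:
                for src, tgt in LANG_TOKENS.items():
                    # Swap the source-language token and check whether the
                    # resulting URL also exists in the crawl.
                    counterpart = re.sub(rf"\b{src}\b", tgt, url)
                    if counterpart != url and counterpart in url_set:
                        pairs.append((url, counterpart))
            return pairs

        print(candidate_pairs([
            "http://example.org/en/press/001.html",
            "http://example.org/fr/press/001.html",
        ]))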

    Preference Learning for Machine Translation

    Automatic translation of natural language remains (as of 2017) a long-standing but unmet promise. While advancing at a fast rate, the underlying methods are still far from being able to reliably capture the syntax or semantics of arbitrary natural-language utterances, let alone carry the encoded meaning into a second language. However, it is possible to build useful translating machines when the target domain is well known and the machine is able to learn and adapt efficiently and promptly from new inputs. This is possible thanks to efficient and effective machine learning methods which can be applied to automatic translation. In this work we present and evaluate methods for three distinct scenarios: a) we develop algorithms that can learn from very large amounts of data by exploiting pairwise preferences defined over competing translations, which can be used to make a machine translation system robust to arbitrary texts from varied sources, but also enable it to adapt effectively to new domains of data; b) we describe a method that efficiently learns external models adhering to fine-grained preferences extracted from a restricted selection of translated material, e.g. for adapting to users or groups of users in a computer-aided translation scenario; c) we develop methods for two machine translation paradigms, neural and traditional statistical machine translation, to directly adapt to user-defined preferences in an interactive post-editing scenario, learning precisely adapted machine translation systems. In all of these settings, we show that machine translation can be made significantly more useful by careful optimization via preference learning.
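
    Pairwise preference learning of this kind reduces to learning a scoring function that ranks a preferred translation above a dispreferred one. A minimal sketch using a perceptron-style update over feature vectors of competing translations; a generic illustration of the idea, not the thesis's specific algorithms:

        import numpy as np

        def pairwise_perceptron(pairs, dim, epochs=10, lr=0.1):
            """Learn a linear scoring function from (better, worse)
            feature-vector pairs of competing translations."""
            w = np.zeros(dim)
            for _ in range(epochs):
                for better, worse in pairs:
                    # Update only when the model ranks the worse translation
                    # at least as high as the better one.
                    if w @ better <= w @ worse:
                        w += lr * (better - worse)
            return w

        # Toy features, e.g. (language-model score, translation-model score).
        pairs = [(np.array([0.9, 0.7]), np.array([0.4, 0.6])),
                 (np.array([0.8, 0.9]), np.array([0.7, 0.2]))]
        print(pairwise_perceptron(pairs, dim=2))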

    Building and Annotating the Linguistically Diverse NTU-MC (NTU-Multilingual Corpus)


    Example-based machine translation using the marker hypothesis

    The development of large-scale rules and grammars for a Rule-Based Machine Translation (RBMT) system is labour-intensive, error-prone and expensive. Current research in Machine Translation (MT) tends to focus on the development of corpus-based systems which can overcome the problem of knowledge acquisition. Corpus-Based Machine Translation (CBMT) can take the form of Statistical Machine Translation (SMT) or Example-Based Machine Translation (EBMT). Despite the benefits of EBMT, SMT is currently the dominant paradigm, and many systems classified as example-based integrate additional rule-based and statistical techniques. The benefits of an EBMT system which does not require extensive linguistic resources and can produce reasonably intelligible and accurate translations cannot be overlooked. We show that our linguistics-lite EBMT system can outperform an SMT system trained on the same data.

    The work reported in this thesis describes the development of a linguistics-lite EBMT system which does not have recourse to extensive linguistic resources. We apply the Marker Hypothesis (Green, 1979), a psycholinguistic theory which states that all natural languages are 'marked' for complex syntactic structure at surface form by a closed set of specific lexemes and morphemes. We use this technique in different environments to segment aligned (English, French) phrases and sentences. We then apply an alignment algorithm which can deduce smaller aligned chunks and words. Following a process similar to that of Block (2000), we generalise these alignments by replacing certain function words with an associated tag. In so doing, we cluster on marker words and add flexibility to our matching process. In a post hoc stage we treat the World Wide Web as a large corpus and validate and correct instances of determiner-noun and noun-verb boundary friction.

    We have applied our marker-based EBMT system to different bitexts and have explored its applicability in various environments. We have developed a phrase-based EBMT system (Gough et al., 2002; Way and Gough, 2003). We show that despite the perceived low quality of on-line MT systems, our EBMT system can produce good-quality translations when such systems are used to seed its memories. Carl (2003a) and Schaler et al. (2003) suggest that EBMT is more suited to controlled translation than RBMT, as it has been known to overcome the 'knowledge acquisition bottleneck'. To this end, we developed the first controlled EBMT system (Gough and Way, 2003; Way and Gough, 2004). Given the lack of controlled bitexts, we used the on-line MT system Logomedia to translate a set of controlled English sentences. We performed experiments using controlled analysis and generation and assessed the performance of our system at each stage. We made a number of improvements to our sub-sentential alignment algorithm, and following some minimal adjustments to our system, we show that our controlled EBMT system can outperform an RBMT system.

    We applied the Marker Hypothesis to a more scalable data set. We trained our system on 203,529 sentences extracted from a Sun Microsystems Translation Memory, thus reducing problems of data-sparseness and limiting our dependence on Logomedia. We show that scaling up the data in a marker-based EBMT system improves the quality of our translations. We also report on the benefits of extracting lexical equivalences from the corpus using Mutual Information.
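
    The Marker Hypothesis drives the segmentation step: chunks begin at closed-class 'marker' words. A minimal sketch of such a segmenter; the marker list below is an illustrative English subset, not the thesis's full marker sets:

        # Closed-class marker words (determiners, prepositions, conjunctions,
        # pronouns, ...); an illustrative subset only.
        MARKERS = {"the", "a", "an", "in", "on", "at", "of", "with", "and",
                   "but", "he", "she", "it", "they", "that", "to", "for"}

        def marker_chunks(sentence):
            """Split a sentence into chunks that begin at marker words."""
            chunks, current = [], []
            for word in sentence.lower().split():
                # A marker word opens a new chunk, unless it starts the sentence.
                if word in MARKERS and current:
                    chunks.append(" ".join(current))
                    current = []
                current.append(word)
            if current:
                chunks.append(" ".join(current))
            return chunks

        print(marker_chunks("The chef prepared a meal for the guests"))
        # ['the chef prepared', 'a meal', 'for', 'the guests']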

    Computational Etymology: Word Formation and Origins

    While there are over seven thousand languages in the world, substantial language technologies exist for only a small percentage of them. The large majority of world languages do not have enough bilingual or even monolingual data for developing technologies like machine translation using current approaches. The computational study and modeling of word origins and word formation is a key step in developing comprehensive translation dictionaries for low-resource languages. This dissertation presents novel foundational work in computational etymology, a promising field which this work pioneers. The dissertation also includes novel models of core vocabulary, of dictionary information distillation, and of the diverse linguistic processes of word formation and concept realization between languages, including compounding, derivation, sense-extension, borrowing, and historical cognate relationships, utilizing statistical and neural models trained on the unprecedented scale of thousands of languages. Collectively these are important components in tackling the grand challenges of universal translation, endangered language documentation and revitalization, and supporting technologies for speakers of thousands of underserved languages.
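
    A common baseline in this area is to flag possible cognates or borrowings by orthographic similarity before heavier modeling. A minimal sketch of that baseline; the dissertation's own models are statistical and neural, so this is an assumed illustration, not its method:

        from difflib import SequenceMatcher

        def similarity(w1, w2):
            """Orthographic similarity in [0, 1] between two word forms."""
            return SequenceMatcher(None, w1, w2).ratio()

        # Toy candidate pairs across languages (German/English, etc.).
        for a, b in [("nacht", "night"), ("hund", "hound"), ("agua", "water")]:
            print(a, b, round(similarity(a, b), 2))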

    Automatic construction of English/Chinese parallel corpus.

    Li Kar Wing. Thesis (M.Phil.), Chinese University of Hong Kong, 2001. Includes bibliographical references (leaves 88-96). Abstracts in English and Chinese.

    Contents:
    1. INTRODUCTION
       1.1 Application of corpus-based techniques
           1.1.1 Machine Translation (MT)
                 1.1.1.1 Linguistic
                 1.1.1.2 Statistical
                 1.1.1.3 Lexicon construction
           1.1.2 Cross-lingual Information Retrieval (CLIR)
                 1.1.2.1 Controlled vocabulary
                 1.1.2.2 Free text
                 1.1.2.3 Application of the corpus-based approach in CLIR
       1.2 Overview of linguistic resources
       1.3 Written language corpora
           1.3.1 Types of corpora
           1.3.2 Limitation of comparable corpora
       1.4 Outline of the dissertation
    2. LITERATURE REVIEW
       2.1 Research in automatic corpus construction
       2.2 Research in translation alignment
           2.2.1 Sentence alignment
           2.2.2 Word alignment
       2.3 Research in alignment of sequences
    3. ALIGNMENT AT WORD LEVEL AND CHARACTER LEVEL
       3.1 Title alignment
           3.1.1 Lexical features
           3.1.2 Grammatical features
           3.1.3 The English/Chinese alignment model
       3.2 Alignment at word level and character level
           3.2.1 Alignment at word level
           3.2.2 Alignment at character level: longest matching
           3.2.3 Longest common subsequence (LCS)
           3.2.4 Applying LCS in the English/Chinese alignment model
       3.3 Reducing overlapping ambiguity
           3.3.1 Edit distance
           3.3.2 Overlapping in the algorithm model
    4. ALIGNMENT AT TITLE LEVEL
       4.1 Review of score functions
       4.2 The score function
           4.2.1 (C matches E) and (E matches C)
           4.2.2 Length similarity
    5. EXPERIMENTAL RESULTS
       5.1 Hong Kong government press release articles
       5.2 Hang Seng Bank economic monthly reports
       5.3 Hang Seng Bank press release articles
       5.4 Hang Seng Bank speech articles
       5.5 Quality of the collections and future work
    6. CONCLUSION
    Bibliography
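
    The thesis's character-level matching (Chapter 3) rests on the longest common subsequence. A minimal sketch of LCS plus a symmetric coverage score in the spirit of Chapter 4's score function; the exact weighting used in the thesis is not reproduced here:

        def lcs_len(a, b):
            """Length of the longest common subsequence of two strings,
            computed by dynamic programming."""
            dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
            for i, ca in enumerate(a, 1):
                for j, cb in enumerate(b, 1):
                    dp[i][j] = (dp[i - 1][j - 1] + 1 if ca == cb
                                else max(dp[i - 1][j], dp[i][j - 1]))
            return dp[len(a)][len(b)]

        def match_score(c, e):
            """Symmetric score: the share of each title covered by the LCS.
            An assumed combination, not the thesis's exact formula."""
            if not c or not e:
                return 0.0
            common = lcs_len(c, e)
            return min(common / len(c), common / len(e))

        print(match_score("machine translation", "statistical machine translation"))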

    Using parallel corpora for translation-oriented term extraction

    In many scientific, technological or political fields, terminology and the production of up-to-date reference works lag behind, which causes problems for translators and results in inconsistent translations. Parallel corpora of texts already translated can be used as a resource for the automatic extraction of terms and terminological collocations. The paper describes how a methodology for multi-word term extraction and bilingual conceptual mapping was developed for Slovene-English terms. We used word-to-word alignment to extract a bilingual glossary of single-word terms; for multi-word terms, two methods were tested and compared. The statistical method is broadly applicable but gives results of very limited use, while the method of syntactic patterns extracts highly useful terminological phrases, though only from a tagged corpus. The paper closes with a vision of further development and of how these methods might be incorporated into existing translation tools.
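
    The two multi-word methods the paper compares can be sketched together: match candidate bigrams against syntactic (POS) patterns, then rank them with an association statistic. The tagset, patterns, and scoring below are illustrative assumptions, not the paper's exact setup:

        import math
        from collections import Counter

        PATTERNS = frozenset({("JJ", "NN"), ("NN", "NN")})  # assumed patterns

        def extract_terms(tagged_sents):
            """Collect two-word term candidates matching a POS pattern and
            rank them by pointwise mutual information."""
            unigrams, bigrams, n = Counter(), Counter(), 0
            for sent in tagged_sents:
                for (w1, t1), (w2, t2) in zip(sent, sent[1:]):
                    if (t1, t2) in PATTERNS:
                        bigrams[(w1, w2)] += 1
                for w, _ in sent:
                    unigrams[w] += 1
                    n += 1
            def pmi(pair):
                w1, w2 = pair
                return math.log(bigrams[pair] * n / (unigrams[w1] * unigrams[w2]))
            return sorted(bigrams, key=pmi, reverse=True)

        corpus = [[("parallel", "JJ"), ("corpus", "NN"), ("helps", "VBZ"),
                   ("term", "NN"), ("extraction", "NN")]]
        print(extract_terms(corpus))  # [('parallel', 'corpus'), ('term', 'extraction')]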