    Word-to-Word Models of Translational Equivalence

    Parallel texts (bitexts) have properties that distinguish them from other kinds of parallel data. First, most words translate to only one other word. Second, bitext correspondence is noisy. This article presents methods for biasing statistical translation models to reflect these properties. Analysis of the expected behavior of these biases in the presence of sparse data predicts that they will result in more accurate models. The prediction is confirmed by evaluation with respect to a gold standard: translation models that are biased in this fashion are significantly more accurate than a baseline knowledge-poor model. This article also shows how a statistical translation model can take advantage of various kinds of pre-existing knowledge that might be available about particular language pairs. Even the simplest kinds of language-specific knowledge, such as the distinction between content words and function words, are shown to reliably boost translation model performance on some tasks. Statistical models that are informed by pre-existing knowledge about the model domain combine the best of both the rationalist and empiricist traditions.
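The one-to-one bias described above can be enforced greedily, in the spirit of competitive linking: rank candidate word pairs by an association score and link each word at most once. This is only a minimal sketch; the scoring dictionary and the word pairs here are illustrative, not taken from the article.

```python
def greedy_one_to_one(scores):
    """Greedily accept the highest-scoring (source, target) pair whose
    words are both still unlinked, enforcing a one-to-one bias.

    scores: dict mapping (src_word, tgt_word) -> association score.
    Returns the accepted links in the order they were made."""
    linked_src, linked_tgt, links = set(), set(), []
    for (s, t), _ in sorted(scores.items(), key=lambda kv: -kv[1]):
        if s not in linked_src and t not in linked_tgt:
            links.append((s, t))
            linked_src.add(s)
            linked_tgt.add(t)
    return links

# Illustrative association scores for two English-French sentence pairs.
scores = {("house", "maison"): 0.9, ("house", "la"): 0.4,
          ("the", "la"): 0.7, ("the", "maison"): 0.3}
links = greedy_one_to_one(scores)
```

Because "house" is linked to "maison" first, the weaker indirect association ("house", "la") is blocked, and "the" falls through to its correct partner "la".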

    Capturing translational divergences with a statistical tree-to-tree aligner

    Parallel treebanks, which comprise paired source-target parse trees aligned at sub-sentential level, could be useful for many applications, particularly data-driven machine translation. In this paper, we focus on how translational divergences are captured within a parallel treebank using a fully automatic statistical tree-to-tree aligner. We observe that while the algorithm performs well at the phrase level, performance on lexical-level alignments is compromised by an inappropriate bias towards precision rather than coverage. This preference for high precision rather than broad coverage in terms of expressing translational divergences through tree-alignment stands in direct opposition to the situation for SMT word-alignment models. We suggest that this has implications not only for tree-alignment itself but also for the broader area of induction of syntax-aware models for SMT.

    Partial Perception and Approximate Understanding

    The present paper discusses the assumption that human perception of the external world is narrowed, and, resulting from this, the basically approximate nature of the concepts that portray it. Apart from perceptual vagueness, other types of vagueness are also discussed, involving the nature of things, the indeterminacy of linguistic expressions, and the psycho-sociological conditioning of discourse actions, both within one language and in translational contexts. The second part of the paper discusses the concept of conceptual and linguistic resemblance (similarity, equivalence) and discourse approximating strategies, and proposes a Resemblance Matrix presenting ways used to narrow the approximation gap between the interacting parties in monolingual and translational discourses.

    Models of Co-occurrence

    A model of co-occurrence in bitext is a boolean predicate that indicates whether a given pair of word tokens co-occur in corresponding regions of the bitext space. Co-occurrence is a precondition for the possibility that two tokens might be mutual translations. Models of co-occurrence are the glue that binds methods for mapping bitext correspondence with methods for estimating translation models into an integrated system for exploiting parallel texts. Different models of co-occurrence are possible, depending on the kind of bitext map that is available, the language-specific information that is available, and the assumptions made about the nature of translational equivalence. Although most statistical translation models are based on models of co-occurrence, modeling co-occurrence correctly is more difficult than it may at first appear.
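The simplest instance of such a boolean predicate assumes a bitext map that pairs corresponding segments, and declares two tokens co-occurring when they appear in a paired segment. A minimal sketch under that assumption (function names and data are illustrative, not from the article):

```python
def build_cooccurrence(bitext_pairs):
    """Build the co-occurrence relation from a segment-aligned bitext.

    bitext_pairs: iterable of (source_segment, target_segment),
    each segment a list of word tokens."""
    cooc = set()
    for src_seg, tgt_seg in bitext_pairs:
        for s in src_seg:
            for t in tgt_seg:
                cooc.add((s, t))
    return cooc

def cooccur(cooc, s, t):
    """Boolean predicate: do tokens s and t co-occur in some
    corresponding region of the bitext?"""
    return (s, t) in cooc

# Two aligned English-French segments (illustrative).
pairs = [(["the", "house"], ["la", "maison"]),
         (["the", "cat"], ["le", "chat"])]
cooc = build_cooccurrence(pairs)
```

Richer models would replace the segment pairing with a finer-grained bitext map or add language-specific constraints, but the predicate interface stays the same.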

    Meaningfulness, the unsaid and translatability. Instead of an introduction

    The present paper opens this topical issue on translation techniques by drawing a theoretical basis for the discussion of translational issues in a linguistic perspective. In order to put forward an audience-oriented definition of translation, I will describe different forms of linguistic variability, highlighting how they present different difficulties to translators, with an emphasis on the semantic and communicative complexity that a source text can exhibit. The problem is then further discussed through a comparison between Quine's radically holistic position and the translatability principle supported by such semanticists as Katz. General translatability — at the expense of additional complexity — is eventually proposed as a possible synthesis of this debate. In describing the meaningfulness levels of source texts through Hjelmslevian semiotics, and his semiotic hierarchy in particular, the paper attempts to go beyond denotative semiotic, and reframe some translational issues in a connotative semiotic and metasemiotic perspective.

    Manual Annotation of Translational Equivalence: The Blinker Project

    Bilingual annotators were paid to link roughly sixteen thousand corresponding words between on-line versions of the Bible in modern French and modern English. These annotations are freely available to the research community from http://www.cis.upenn.edu/~melamed . The annotations can be used for several purposes. First, they can be used as a standard data set for developing and testing translation lexicons and statistical translation models. Second, researchers in lexical semantics will be able to mine the annotations for insights about cross-linguistic lexicalization patterns. Third, the annotations can be used in research into certain recently proposed methods for monolingual word-sense disambiguation. This paper describes the annotated texts, the specially designed annotation tool, and the strategies employed to increase the consistency of the annotations. The annotation process was repeated five times by different annotators. Inter-annotator agreement rates indicate that the annotations are reasonably reliable and that the method is easy to replicate.
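One simple way to quantify agreement between two annotators' link sets is the size of their intersection relative to their average size (a Dice-style rate). This is an illustrative measure only; the paper does not specify that this is the statistic it reports.

```python
def pairwise_agreement(links_a, links_b):
    """Dice-style agreement rate between two annotators' word-link
    sets: twice the shared links over the total links annotated.
    Returns a value in [0, 1]; 1.0 means identical annotations."""
    a, b = set(links_a), set(links_b)
    if not a and not b:
        return 1.0  # two empty annotations trivially agree
    return 2 * len(a & b) / (len(a) + len(b))

# Two annotators linking word positions (src_index, tgt_index).
annotator1 = {(0, 0), (1, 2), (2, 1)}
annotator2 = {(0, 0), (1, 2), (3, 3)}
rate = pairwise_agreement(annotator1, annotator2)
```

With five independent annotation passes, such a rate can be computed for every annotator pair and averaged to summarize overall reliability.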

    An annotation scheme and gold standard for Dutch-English word alignment

    The importance of sentence-aligned parallel corpora has been widely acknowledged. Reference corpora in which sub-sentential translational correspondences are indicated manually are more labour-intensive to create, and hence less widespread. Such manually created reference alignments - also called Gold Standards - have been used in research projects to develop or test automatic word alignment systems. In most translations, translational correspondences are rather complex; for example, word-by-word correspondences can be found only for a limited number of words. A reference corpus in which those complex translational correspondences are aligned manually is therefore also a useful resource for the development of translation tools and for translation studies. In this paper, we describe how we created a Gold Standard for the Dutch-English language pair. We present the annotation scheme, annotation guidelines, annotation tool and inter-annotator results. To cover a wide range of syntactic and stylistic phenomena that emerge from different writing and translation styles, our Gold Standard data set contains texts from different text types. The Gold Standard will be publicly available as part of the Dutch Parallel Corpus.
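A gold standard of this kind is typically used to score automatic aligners with precision, recall, and alignment error rate over sure and possible links, in the style of Och and Ney. The sketch below assumes that convention; it is not taken from this paper's evaluation protocol.

```python
def alignment_scores(predicted, sure, possible):
    """Score a predicted set of word links (src_idx, tgt_idx) against
    a gold standard with sure links S and possible links P, S <= P.

    precision = |A & P| / |A|
    recall    = |A & S| / |S|
    AER       = 1 - (|A & S| + |A & P|) / (|A| + |S|)"""
    a = set(predicted)
    s = set(sure)
    p = set(possible) | s  # every sure link is also possible
    precision = len(a & p) / len(a) if a else 0.0
    recall = len(a & s) / len(s) if s else 0.0
    aer = 1 - (len(a & s) + len(a & p)) / (len(a) + len(s))
    return precision, recall, aer

# A predicted alignment that matches one sure and one possible link.
prec, rec, aer = alignment_scores(
    predicted={(0, 0), (1, 1)}, sure={(0, 0)}, possible={(1, 1)})
```

Distinguishing sure from possible links lets the metric avoid punishing an aligner for the genuinely ambiguous correspondences that the abstract notes are common in real translations.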

    Automatic Construction of Clean Broad-Coverage Translation Lexicons

    Word-level translational equivalences can be extracted from parallel texts by surprisingly simple statistical techniques. However, these techniques are easily fooled by indirect associations: pairs of unrelated words whose statistical properties resemble those of mutual translations. Indirect associations pollute the resulting translation lexicons, drastically reducing their precision. This paper presents an iterative lexicon cleaning method. On each iteration, most of the remaining incorrect lexicon entries are filtered out, without significant degradation in recall. This lexicon cleaning technique can produce translation lexicons with recall and precision both exceeding 90%, as well as dictionary-sized translation lexicons that are over 99% correct. (To appear in Proceedings of AMTA-9.)
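The iterative structure of such a cleaning method can be sketched as a fixed-point loop: rescore the surviving entries against the current lexicon, drop those below a threshold, and stop when nothing changes. The scoring function and threshold here are illustrative placeholders, not the paper's actual filtering criterion.

```python
def clean_lexicon(entries, score, threshold, max_iters=10):
    """Iteratively filter a candidate translation lexicon.

    entries: iterable of (src, tgt) candidate translation pairs.
    score: function(entry, current_lexicon) -> confidence; it may
    depend on which entries are still in the lexicon, so scores can
    change between iterations.
    Stops when an iteration removes nothing, or after max_iters."""
    current = set(entries)
    for _ in range(max_iters):
        kept = {e for e in current if score(e, current) >= threshold}
        if kept == current:
            break  # fixed point reached: lexicon is stable
        current = kept
    return current

# Illustrative candidate entries with fixed confidence scores.
base = {("a", "x"): 0.9, ("a", "y"): 0.2, ("b", "y"): 0.8}
cleaned = clean_lexicon(base, lambda e, cur: base[e], threshold=0.5)
```

Because the score may be recomputed over the shrinking lexicon, removing a strong direct association's competitors can expose further indirect associations on later iterations, which is what makes the iteration worthwhile.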

    Synonymy and Polysemy in Legal Terminology and Their Applications to Bilingual and Bijural Translation

    The paper focuses on synonymy and polysemy in the language of law in English-speaking countries. The introductory part briefly outlines the process of legal translation and tackles the specificity of bijural translation. Then, the traditional understanding of what a term is, and its application to legal terminology, is considered; three different levels of vocabulary used in legal texts are outlined and their relevance to bijural translation explained. Next, synonyms in the language of law are considered with respect to their intension and distribution, and examples are given to show that most expressions or phrases which are interchangeable synonyms in the general language should be treated carefully in legal translation. Finally, polysemes in legal terminology are discussed and examples are given to illustrate problems potentially encountered by translators.