
    Mutual terminology extraction using a statistical framework

    In this paper, we explore a statistical framework for mutual bilingual terminology extraction. We propose three probabilistic models to test the proposition that automatic alignment can play an active role in bilingual terminology extraction, turning it into mutual bilingual terminology extraction. The results indicate that the models are valid and that mutual bilingual terminology extraction is a viable approach.

    Linguistically-based sub-sentential alignment for terminology extraction from a bilingual automotive corpus

    We present a sub-sentential alignment system that links linguistically motivated phrases in parallel texts based on lexical correspondences and syntactic similarity. We compare the performance of our sub-sentential alignment system with different symmetrization heuristics that combine the GIZA++ alignments of both translation directions. We demonstrate that the aligned linguistically motivated phrases are a useful means of extracting bilingual terminology and, more specifically, complex multiword terms.
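As an illustration of what a symmetrization heuristic does, the sketch below combines two directional word alignments by intersection and by union, the two simplest such heuristics. It is a minimal sketch, not the system described in the abstract: the function name `symmetrize` and the representation of alignments as sets of index pairs are assumptions made for the example, and no GIZA++ output format is parsed.

```python
def symmetrize(src_to_tgt, tgt_to_src):
    """Combine two directional word alignments (hypothetical helper).

    src_to_tgt: set of (source_index, target_index) links from one direction.
    tgt_to_src: set of (target_index, source_index) links from the other.
    Returns (intersection, union), both as sets of (source_index, target_index).
    """
    # Flip the reverse direction so both sets use (source, target) order.
    flipped = {(s, t) for (t, s) in tgt_to_src}
    return src_to_tgt & flipped, src_to_tgt | flipped


# Toy alignments for a four-word source and three-word target sentence.
forward = {(0, 0), (1, 2), (2, 1)}
backward = {(0, 0), (2, 1), (3, 2)}  # stored as (target, source)

intersection, union = symmetrize(forward, backward)
```

The intersection keeps only links both directions agree on (high precision), while the union keeps every link proposed by either direction (high recall); other heuristics typically start from the intersection and add selected links from the union.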

    Terminology extraction from English-Portuguese and English-Galician parallel corpora based on probabilistic translation dictionaries and bilingual syntactic patterns

    This paper presents research on bilingual terminology extraction from parallel corpora, based on the occurrence of bilingual morphosyntactic patterns in the probabilistic translation dictionaries generated by NATools. To evaluate this method, we carried out an experiment that took into account both the lexical cohesion of the term candidates and their specificity with respect to a non-terminological corpus of the target language. The evaluation results show that terminology extraction based on probabilistic translation dictionaries, complemented by bilingual syntactic patterns, achieves a high degree of accuracy.
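The abstract does not say how specificity against the non-terminological corpus was measured. One common choice is a relative-frequency ratio, sketched below under that assumption. The function name `specificity`, the add-one smoothing, and the toy counts are all hypothetical and are not taken from the paper.

```python
from collections import Counter


def specificity(term, domain_counts, general_counts):
    """Relative-frequency ratio of a term: domain corpus vs. general corpus.

    Values well above 1 suggest the term is characteristic of the domain.
    Add-one smoothing on the general corpus avoids division by zero for
    terms that never occur there (an assumption of this sketch).
    """
    domain_total = sum(domain_counts.values())
    general_total = sum(general_counts.values())
    domain_rel = domain_counts[term] / domain_total
    general_rel = (general_counts[term] + 1) / (general_total + 1)
    return domain_rel / general_rel


# Invented toy counts, for illustration only.
domain = Counter({"glaze": 30, "kiln": 20, "the": 50})
general = Counter({"the": 900, "glaze": 1, "house": 99})

glaze_score = specificity("glaze", domain, general)
the_score = specificity("the", domain, general)
```

With these toy counts, a domain word such as "glaze" scores far above 1, while a function word such as "the" scores below 1.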

    Corpus exploitation for terminological purposes: a proposed term extraction process for bilingual specialised dictionary elaboration

    The method and outcomes described in this research are part of a wider project aimed at creating a specialised, bilingual dictionary on industrial ceramics. One of the key stages in the elaboration of such a work is the term extraction process, since it determines which terminological units are part of the domain and should thus be included as dictionary entries. This paper presents a proposal for accurately detecting the units of specialised knowledge that constitute the terminology of a domain from a raw, bilingual corpus that was compiled ad hoc. The proposal takes the form of a series of stages in which prospective terms are progressively retrieved, identified and analysed with the concordance software program WordSmith Tools (WST) 5.0. In the method proposed here, the WST application WordList first generates monolexical frequency lists, which are compared with the lists generated by the tool KeyWords, providing saliency data. Afterwards, Mutual Information lists, again provided by WordList, constitute the first approach to the combinatorial aspect of terms. Subsequently, the WST application Concord, with its different options, provides further evidence on the way prospective terms collocate and combine, as well as on their contextual nature, thus completing the term extraction process. This methodology, always combined with the terminographer's "manual" work, observation and intuition, has proved to be effective for the dictionary under development.
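To make the Mutual Information stage concrete, the sketch below scores adjacent word pairs with the standard pointwise mutual information formula, log2(P(w1, w2) / (P(w1) P(w2))). This is an assumption-laden illustration: it does not reproduce how WordSmith Tools computes its Mutual Information lists, and the function name `pmi_scores` and the toy token sequence are invented for the example.

```python
import math
from collections import Counter


def pmi_scores(tokens):
    """Pointwise mutual information for each adjacent word pair in tokens."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    scores = {}
    for (w1, w2), count in bigrams.items():
        p_pair = count / n_bi          # observed probability of the pair
        p1 = unigrams[w1] / n_uni      # probability of the first word
        p2 = unigrams[w2] / n_uni      # probability of the second word
        scores[(w1, w2)] = math.log2(p_pair / (p1 * p2))
    return scores


# Invented toy corpus, for illustration only.
tokens = "ceramic tile glaze ceramic tile kiln the glaze the kiln".split()
scores = pmi_scores(tokens)
```

Pairs that co-occur more often than their individual frequencies predict, such as "ceramic tile" in the toy data, receive higher scores and are candidates for multiword terms. In practice a minimum frequency threshold is usually applied, since this measure overrates rare pairs.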

    Automatic Discovery of Non-Compositional Compounds in Parallel Data

    Automatic segmentation of text into minimal content-bearing units is an unsolved problem even for languages like English. Spaces between words offer an easy first approximation, but this approximation is not good enough for machine translation (MT), where many word sequences are not translated word-for-word. This paper presents an efficient automatic method for discovering sequences of words that are translated as a unit. The method proceeds by comparing pairs of statistical translation models induced from parallel texts in two languages. It can discover hundreds of non-compositional compounds on each iteration, and constructs longer compounds out of shorter ones. Objective evaluation on a simple machine translation task has shown the method's potential to improve the quality of MT output. The method makes few assumptions about the data, so it can be applied to parallel data other than parallel texts, such as word spellings and pronunciations.

    Parallel corpus-based bilingual terminology extraction

    This paper presents a method for bilingual terminology extraction from parallel corpora, based on the occurrence of bilingual morphosyntactic patterns in probabilistic translation dictionaries. We discuss an experiment focused on two language pairs, English-Galician and English-Portuguese, and show results that experimentally confirm the high degree of accuracy of the proposed extraction technique.