8 research outputs found

    Acquiring Compound Word Translations both Automatically and Dynamically

    Get PDF
    This paper addresses the problem of compound word translation and proposes approaches for acquiring translations. The proposed approaches focus on exploring web data and using English translations as a pivot to link source-language words with their correspondents in the target language. The paper uses the Japanese-Chinese language pair for illustration and shows initial experimental results. The proposed method is language-independent and can therefore be applied to other language pairs.
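The pivot idea described above can be sketched in a few lines: source and target terms are linked whenever they share an English gloss. The dictionaries below are illustrative stand-ins, not the paper's data.

```python
# Toy Japanese->English and Chinese->English glossaries (illustrative only).
ja_to_en = {
    "æƒ…å ±ç§‘ć­Š": {"information science"},
    "機械翻蚳": {"machine translation"},
}
zh_to_en = {
    "äżĄæŻç§‘ć­Š": {"information science"},
    "æœșć™šçż»èŻ‘": {"machine translation"},
}

def pivot_translate(ja_term):
    """Return Chinese candidates that share an English gloss with ja_term."""
    en_glosses = ja_to_en.get(ja_term, set())
    return sorted(zh for zh, glosses in zh_to_en.items() if en_glosses & glosses)

print(pivot_translate("機械翻蚳"))  # -> ['æœșć™šçż»èŻ‘']
```

In practice the glossaries would be harvested from web data, and ambiguity in the English pivot (one gloss mapping to several target terms) is where the ranking problem arises.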

    TakeTwo: A Word Aligner based on Self Learning

    Get PDF

    Learning source-target surface patterns for web-based terminology translation

    Full text link
    This paper introduces a method for learning to find the translation of a given source term on the Web. In the approach, the source term is used as a query and as part of patterns to retrieve and extract translations in Web pages. The method involves using a bilingual term list to learn source-target surface patterns. At runtime, the given term is submitted to a search engine, then the candidate translations are extracted from the returned summaries and subsequently ranked based on the surface patterns, occurrence counts, and transliteration knowledge. We present a prototype called TermMine that applies the method to translate terms. Evaluation on a set of encyclopedia terms shows that the method significantly outperforms state-of-the-art online machine translation systems.

    Machine transliteration of proper names between English and Persian

    Get PDF
    Machine transliteration is the process of automatically transforming a word from a source language to a target language while preserving pronunciation. The transliterated words in the target language are called out-of-dictionary, or sometimes out-of-vocabulary, meaning that they have been borrowed from other languages with a change of script. When a whole text is translated, for example, proper nouns and technical terms are subject to transliteration. Machine translation, and other applications that make use of this technology, such as cross-lingual information retrieval and cross-language question answering, deal with the problem of transliteration. Since proper nouns and technical terms - which require phonetic translation - are part of most text documents, transliteration is an important problem to study. We explore the problem of English to Persian and Persian to English transliteration using methods that work on the grapheme of the source word. One major problem in handling Persian text is its lack of written short vowels. When transliterating Persian words to English, we need to develop a method of inserting vowels to make them pronounceable. Many different approaches using n-grams are explored and compared in this thesis, and we propose language-specific transliteration methods that improve transliteration accuracy. Our novel approaches use consonant-vowel sequences, and show significant improvements over baseline systems. We also develop a new alignment algorithm, and examine novel techniques for combining systems; these approaches improve the effectiveness of the systems. We also investigate the properties of bilingual corpora that affect transliteration accuracy. Our experiments suggest that the origin of the source words has a strong effect on the performance of transliteration systems.
From a careful analysis of the corpus construction process, we conclude that at least five human transliterators are needed to construct a representative bilingual corpus for the training and testing of transliteration systems.
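The consonant-vowel sequence idea can be illustrated with a minimal sketch: a word is reduced to a C/V pattern, and runs are split into segments that a transliteration model would align with target-script units. The grouping shown here is illustrative, not the thesis's exact scheme.

```python
VOWELS = set("aeiou")

def cv_pattern(word):
    """Map each letter to C (consonant) or V (vowel)."""
    return "".join("V" if ch in VOWELS else "C" for ch in word.lower())

def cv_segments(word):
    """Split a word at every consonant/vowel boundary."""
    segs, cur = [], word[0]
    for prev, ch in zip(word, word[1:]):
        if (prev.lower() in VOWELS) == (ch.lower() in VOWELS):
            cur += ch  # same class: extend the current run
        else:
            segs.append(cur)
            cur = ch
    segs.append(cur)
    return segs

print(cv_pattern("tehran"))   # -> CVCCVC
print(cv_segments("tehran"))  # -> ['t', 'e', 'hr', 'a', 'n']
```

For Persian-to-English transliteration, segments of this kind give the model natural slots in which to insert the short vowels that Persian script leaves unwritten.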

    Extraction et Complétion de Terminologies Multilingues (Extraction and Completion of Multilingual Terminologies)

    Get PDF
    Until now, automatic terminology extraction techniques have mostly been targeted at monolingual corpora that are homogeneous in language register. This work, carried out in the context of a CIFRE agreement, extends this objective to non-edited textual data written in typologically diverse languages, in order to extract "field terms". It focuses on the analysis of verbatim responses produced by employee surveys carried out within multinational companies and processed by the company Verbatim Analysis - VERA. It involves the design and development of a processing pipeline for automatically extracting terminologies in a virtually language-independent, register-independent and domain-independent way. Based on an assessment of the typological properties of seven diverse languages, we propose a preliminary text pre-processing step that prepares the training of models. This step is partly necessary (tokenization) and partly optional (removal of part of the morphological information). From the resulting data we compute a series of numerical features (statistical and frequency-based) used to train statistical models (CRFs). We select a first set of best models by means of a dedicated automatic evaluation of the terms extracted in each of the experimental settings considered for each language. We then carry out a second series of evaluations assessing the usability of these models on languages that differ from their training languages. Our results suggest that the quality of the field terms we extract is satisfactory: the best scores we obtain (in a monolingual setting) are above an f-score of 0.9 for most languages. These scores can be further improved for several languages by using some of the best models trained on other languages; as a result, our approach could prove useful for extracting terminologies in languages for which such models are not available.
The second part of our work presents the automatic completion of multilingual structured terminologies. We propose and evaluate two completion algorithms that take as input a multilingual translation graph (which we build from free resources) and a structured multilingual terminology, and produce new candidate terms for the latter. Our approach makes it possible to complete a structured terminology in a language it already covers, but also to extend its coverage to new languages. One of these algorithms is also applied to WOLF, the French wordnet, substantially improving its coverage.
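The feature-computation step feeding the CRF models can be sketched as follows: each token of a verbatim is mapped to numerical features (corpus frequency, length, position). The feature names and the toy corpus are illustrative, not the thesis's actual feature set.

```python
from collections import Counter

# Toy verbatim corpus (illustrative employee-survey responses).
corpus = [
    "the new expense tool is slow",
    "the expense tool crashes daily",
]

# Corpus-wide token frequencies, the statistical signal used as a feature.
freq = Counter(tok for line in corpus for tok in line.split())

def token_features(tokens, i):
    """Numerical features for token i, of the kind fed to a CRF labeller."""
    tok = tokens[i]
    return {
        "freq": freq[tok],                                  # corpus frequency
        "len": len(tok),                                    # token length
        "first": i == 0,                                    # sentence-initial?
        "prev_freq": freq[tokens[i - 1]] if i > 0 else 0,   # left-context frequency
    }

tokens = corpus[0].split()
feats = [token_features(tokens, i) for i in range(len(tokens))]
print(feats[2])  # features for "expense"
```

A CRF trained on such per-token feature dictionaries then labels each token as inside or outside a field term; the point of purely numerical, surface-level features is that no language-specific lexicon is required.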

    La dimensione cognitiva nella traduzione assistita da computer e nella traduzione automatica (The cognitive dimension in computer-assisted translation and machine translation)

    Get PDF
    The thesis develops a critical reflection on the use of translation technologies, seeking to understand how they have been built on differing representations of the human translation process, and to identify the advantages and drawbacks of their adoption within the translation process, both in practice and at a deeper level, i.e. in the cognitive dynamics that characterize it and in the new way in which translators perceive these tools, the text to be translated and even their own role.

    Using the Web as a Bilingual Dictionary

    No full text
    We present a system for extracting an English translation of a given Japanese technical term by collecting and scoring translation candidates from the web.