8 research outputs found

    Acquiring Compound Word Translations both Automatically and Dynamically

    Get PDF
    This paper addresses the problem of compound word translation and proposes approaches for acquiring translations. The proposed approaches focus on exploring web data and using English translations as a pivot to link source-language words with their correspondents in the target language. The paper uses the Japanese-Chinese language pair for illustration and shows initial experimental results. The proposed method is language-independent and can therefore be applied to other language pairs.
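The pivot idea described above can be sketched in a few lines: source and target terms are linked whenever they share an English gloss. The dictionaries below are illustrative stand-ins, not the paper's data.

```python
# Toy Japanese->English and Chinese->English glossaries (illustrative only).
ja_to_en = {
    "æƒ…å ±ç§‘ć­Š": {"information science"},
    "機械翻蚳": {"machine translation"},
}
zh_to_en = {
    "äżĄæŻç§‘ć­Š": {"information science"},
    "æœșć™šçż»èŻ‘": {"machine translation"},
}

def pivot_translate(ja_term):
    """Return Chinese candidates that share an English gloss with ja_term."""
    en_glosses = ja_to_en.get(ja_term, set())
    return sorted(zh for zh, glosses in zh_to_en.items() if en_glosses & glosses)

print(pivot_translate("機械翻蚳"))  # -> ['æœșć™šçż»èŻ‘']
```

In practice the glossaries would be harvested from web data, and ambiguity in the English pivot (one gloss mapping to several target terms) is where the ranking problem arises.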

    TakeTwo: A Word Aligner based on Self Learning

    Get PDF

    Learning source-target surface patterns for web-based terminology translation

    Full text link
    This paper introduces a method for learning to find the translation of a given source term on the Web. In the approach, the source term is used as a query and as part of patterns to retrieve and extract translations in Web pages. The method involves using a bilingual term list to learn source-target surface patterns. At runtime, the given term is submitted to a search engine, then the candidate translations are extracted from the returned summaries and subsequently ranked based on the surface patterns, occurrence counts, and transliteration knowledge. We present a prototype called TermMine that applies the method to translate terms. Evaluation on a set of encyclopedia terms shows that the method significantly outperforms state-of-the-art online machine translation systems.

    Machine transliteration of proper names between English and Persian

    Get PDF
    Machine transliteration is the process of automatically transforming a word from a source language to a target language while preserving pronunciation. The transliterated words in the target language are called out-of-dictionary, or sometimes out-of-vocabulary, meaning that they have been borrowed from other languages with a change of script. When a whole text is translated, for example, proper nouns and technical terms are subject to transliteration. Machine translation, and other applications that make use of this technology, such as cross-lingual information retrieval and cross-language question answering, deal with the problem of transliteration. Since proper nouns and technical terms - which require phonetic translation - are part of most text documents, transliteration is an important problem to study. We explore the problem of English to Persian and Persian to English transliteration using methods that work on the grapheme of the source word. One major problem in handling Persian text is its lack of written short vowels. When transliterating Persian words to English, we need to develop a method of inserting vowels to make them pronounceable. Many different approaches using n-grams are explored and compared in this thesis, and we propose language-specific transliteration methods that improve transliteration accuracy. Our novel approaches use consonant-vowel sequences, and show significant improvements over baseline systems. We also develop a new alignment algorithm, and examine novel techniques for combining systems; these approaches improve the effectiveness of the systems. We also investigate the properties of bilingual corpora that affect transliteration accuracy. Our experiments suggest that the origin of the source words has a strong effect on the performance of transliteration systems.
From a careful analysis of the corpus construction process, we conclude that at least five human transliterators are needed to construct a representative bilingual corpus for the training and testing of transliteration systems.
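The consonant-vowel sequence idea can be illustrated with a minimal sketch: a word is reduced to a C/V pattern, and runs are split into segments that a transliteration model would align with target-script units. The grouping shown here is illustrative, not the thesis's exact scheme.

```python
VOWELS = set("aeiou")

def cv_pattern(word):
    """Map each letter to C (consonant) or V (vowel)."""
    return "".join("V" if ch in VOWELS else "C" for ch in word.lower())

def cv_segments(word):
    """Split a word at every consonant/vowel boundary."""
    segs, cur = [], word[0]
    for prev, ch in zip(word, word[1:]):
        if (prev.lower() in VOWELS) == (ch.lower() in VOWELS):
            cur += ch  # same class: extend the current run
        else:
            segs.append(cur)
            cur = ch
    segs.append(cur)
    return segs

print(cv_pattern("tehran"))   # -> CVCCVC
print(cv_segments("tehran"))  # -> ['t', 'e', 'hr', 'a', 'n']
```

For Persian-to-English transliteration, segments of this kind give the model natural slots in which to insert the short vowels that Persian script leaves unwritten.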

    Extraction et Complétion de Terminologies Multilingues (Extraction and Completion of Multilingual Terminologies)

    Get PDF
    Until now, automatic terminology extraction techniques have mostly been targeted at monolingual corpora that are homogeneous in language register. This work, carried out in the context of a CIFRE agreement, extends this objective to non-edited textual data written in typologically diverse languages, in order to extract "field terms". It focuses on the analysis of verbatim responses produced by employee surveys carried out within multinational companies and processed by the company Verbatim Analysis - VERA. It involves the design and development of a processing pipeline for automatically extracting terminologies in a virtually language-independent, register-independent and domain-independent way. Based on an assessment of the typological properties of seven diverse languages, we propose a preliminary text pre-processing step that prepares the training of models. This step is partly necessary (tokenization) and partly optional (removal of part of the morphological information). From the resulting data we compute a series of numerical features (statistical and frequency-based) used to train statistical models (CRFs). We select a first set of best models by means of a dedicated automatic evaluation of the terms extracted in each of the experimental settings considered for each language. We then carry out a second series of evaluations assessing the usability of these models on languages that differ from their training languages. Our results suggest that the quality of the field terms we extract is satisfactory: the best scores we obtain (in a monolingual setting) are above an f-score of 0.9 for most languages. These scores can be further improved for several languages by using some of the best models trained on other languages; as a result, our approach could prove useful for extracting terminologies in languages for which such models are not available.
The second part of our work presents the automatic completion of multilingual structured terminologies. We propose and evaluate two completion algorithms that take as input a multilingual translation graph (which we build from free resources) and a structured multilingual terminology, and produce new candidate terms for the latter. Our approach makes it possible to complete a structured terminology in a language it already covers, but also to extend its coverage to new languages. One of these algorithms is also applied to WOLF, the French wordnet, substantially improving its coverage.
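The feature-computation step feeding the CRF models can be sketched as follows: each token of a verbatim is mapped to numerical features (corpus frequency, length, position). The feature names and the toy corpus are illustrative, not the thesis's actual feature set.

```python
from collections import Counter

# Toy verbatim corpus (illustrative employee-survey responses).
corpus = [
    "the new expense tool is slow",
    "the expense tool crashes daily",
]

# Corpus-wide token frequencies, the statistical signal used as a feature.
freq = Counter(tok for line in corpus for tok in line.split())

def token_features(tokens, i):
    """Numerical features for token i, of the kind fed to a CRF labeller."""
    tok = tokens[i]
    return {
        "freq": freq[tok],                                  # corpus frequency
        "len": len(tok),                                    # token length
        "first": i == 0,                                    # sentence-initial?
        "prev_freq": freq[tokens[i - 1]] if i > 0 else 0,   # left-context frequency
    }

tokens = corpus[0].split()
feats = [token_features(tokens, i) for i in range(len(tokens))]
print(feats[2])  # features for "expense"
```

A CRF trained on such per-token feature dictionaries then labels each token as inside or outside a field term; the point of purely numerical, surface-level features is that no language-specific lexicon is required.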

    La dimensione cognitiva nella traduzione assistita da computer e nella traduzione automatica (The cognitive dimension in computer-assisted translation and machine translation)

    Get PDF
    The thesis develops a critical reflection on the use of translation technologies, seeking to understand how they have been built on differing representations of the human translation process, and to identify the advantages and drawbacks of their adoption within the translation process, both in practice and at a deeper level, i.e. in the cognitive dynamics that characterize it and in the new way in which translators perceive these tools, the text to be translated and even their own role.

    Using the Web as a Bilingual Dictionary

    No full text
    We present a system for extracting an English translation of a given Japanese technical term by collecting and scoring translation candidates from the web.