
    Sentence Alignment using MR and GA

    In this paper, two new approaches to aligning English-Arabic sentences in bilingual parallel corpora, based on mathematical regression (MR) and genetic algorithm (GA) classifiers, are presented. A feature vector is extracted from the text pair under consideration; it contains text features such as length, punctuation score, and cognate score. A set of manually prepared training data was used to train the MR and GA models, and another set was used for testing. Both the MR and GA approaches outperform the length-based approach. Moreover, the new approaches are valid for any language pair and are quite flexible, since the feature vector may contain more, fewer, or different features than the ones used in the current research, such as a lexical matching feature or Hanzi characters in Japanese-Chinese texts.
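    The kind of feature vector the abstract describes can be sketched as follows. This is a minimal illustration, not the paper's method: the feature definitions, the 4-letter cognate prefix, and the linear weights are all assumptions made for the example.

```python
import re

def feature_vector(src: str, tgt: str) -> list[float]:
    """Length, punctuation, and cognate features for a candidate sentence pair.
    (Illustrative definitions; the paper's exact formulas may differ.)"""
    # Length feature: ratio of character lengths, bounded to [0, 1].
    len_ratio = min(len(src), len(tgt)) / max(len(src), len(tgt), 1)

    # Punctuation score: overlap of punctuation marks between the sentences.
    p_src = [ch for ch in src if ch in ".,;:!?"]
    p_tgt = [ch for ch in tgt if ch in ".,;:!?"]
    matches = sum(min(p_src.count(ch), p_tgt.count(ch)) for ch in set(p_src))
    punct_score = matches / max(len(p_src), len(p_tgt), 1)

    # Cognate score: fraction of source tokens sharing a 4-letter prefix
    # with some target token (a common proxy for cognates).
    src_tokens = re.findall(r"\w+", src.lower())
    tgt_tokens = re.findall(r"\w+", tgt.lower())
    cognates = sum(
        1 for a in src_tokens
        if len(a) >= 4 and any(a[:4] == b[:4] for b in tgt_tokens)
    )
    cognate_score = cognates / max(len(src_tokens), 1)

    return [len_ratio, punct_score, cognate_score]

def alignment_score(x, weights=(0.4, 0.2, 0.4), bias=0.0):
    """Linear (regression-style) score for a feature vector; a GA would
    instead search for good `weights` rather than fitting them directly."""
    return bias + sum(w * f for w, f in zip(weights, x))
```

    With such a score, a mutual-translation pair should rank above an unrelated pair, which is the signal the MR and GA classifiers are trained on.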

    Computer Assisted Language Learning Based on Corpora and Natural Language Processing : The Experience of Project CANDLE

    This paper describes Project CANDLE, an ongoing three-year project that uses various corpora and NLP technologies to construct an online English learning environment for learners in Taiwan. This report focuses on the interim results obtained in the first eighteen months. First, an English-Chinese parallel corpus, Sinorama, was used as the main course material for reading, writing, and culture-based learning courses. Second, an online bilingual concordancer, TotalRecall, and a collocation reference tool, TANGO, were developed based on Sinorama and other corpora. Third, many online lessons, including extensive reading, verb-noun collocations, and vocabulary, were designed to be used alone or together with TotalRecall and TANGO. Fourth, an online collocation check program, MUST, was developed for detecting V-N miscollocations and suggesting adequate collocates in students' writing, based on the hypothesis of L1 interference, the BNC database, and the bilingual Sinorama Corpus. Other computational scaffoldings are under development. It is hoped that this project will help intermediate learners in Taiwan enhance their English proficiency with effective pedagogical approaches and versatile language reference tools.

    Evaluation of the Statistical Machine Translation Service for Croatian-English

    Much thought has been given to formalizing the translation process, and as a result various approaches to machine translation (MT) have been taken. With the exception of statistical translation, all approaches require cooperation between language and computer science experts, and most models use hybrid approaches. The statistical translation approach is completely language-independent, if we disregard the fact that it requires a huge parallel corpus that must be split into sentences and words. This paper compares and discusses state-of-the-art statistical machine translation (SMT) models and evaluation methods. Results of the statistically based Google Translate tool for Croatian-English translations are presented and a multilevel analysis is given. Three different types of texts are manually evaluated and the results are analysed using the χ²-test.
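    The evaluation step can be illustrated with a hand-rolled χ²-test over a contingency table of manual judgements. The counts, the three text-type rows, and the acceptable/unacceptable columns below are hypothetical stand-ins, not the paper's data.

```python
def chi_square(observed):
    """Pearson's chi-squared statistic for an r x c contingency table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (o - expected) ** 2 / expected
    return stat

# Hypothetical counts: rows = three text types, columns = sentences judged
# (acceptable, unacceptable) by the human evaluators.
table = [
    [70, 30],
    [55, 45],
    [40, 60],
]
stat = chi_square(table)
# Degrees of freedom = (3 - 1) * (2 - 1) = 2; the 5% critical value is 5.991,
# so translation quality differs significantly across text types if
# the statistic exceeds it.
significant = stat > 5.991
```

    In practice one would use a library routine such as `scipy.stats.chi2_contingency`, which also returns the p-value directly.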

    Teaching Specialised Translation: Error-Tagged Corpora of Trainee Translators

    This paper describes the method used in teaching specialised translation in the English Language Translation Master's programme at Masaryk University. After a brief description of the courses, the focus shifts to translation learner corpora (TLC) compiled in the new Hypal interface, which can be integrated in Moodle. Student translations are automatically aligned (with possible adjustments), PoS (part-of-speech) tagged, and manually error-tagged. Personal student reports, based on error statistics for individual translations, can easily be generated to show students' progress throughout the term or over the four-semester programme. Using data from the pilot run of the new software, the paper concludes with the first results of research examining a learner corpus of translations from Czech into English.
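    The per-student error statistics behind such reports could be aggregated along these lines. This is a sketch only: the tag names and the input shape are hypothetical and do not reflect Hypal's actual data format.

```python
from collections import Counter

def error_report(tagged_translations):
    """Aggregate error-tag counts per student across all their translations.
    (Tag names and input shape are hypothetical, not Hypal's actual format.)"""
    report = {}
    for student, errors in tagged_translations:
        report.setdefault(student, Counter()).update(errors)
    return report

# One entry per error-tagged translation: (student, list of error tags).
translations = [
    ("student_a", ["lexis", "grammar", "lexis"]),
    ("student_a", ["register"]),
    ("student_b", ["grammar"]),
]
summary = error_report(translations)
```

    Tracking such counters per translation over a term is enough to plot a student's error profile across the four-semester programme.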

    Statistical Augmentation of a Chinese Machine-Readable Dictionary

    We describe a method of using statistically collected Chinese character groups from a corpus to augment a Chinese dictionary. The method is particularly useful for extracting domain-specific and regional words not readily available in machine-readable dictionaries. Output was evaluated both by human evaluators and against a previously available dictionary. We also evaluated the performance improvement in automatic Chinese tokenization. Results show that our method outputs legitimate words, acronymic constructions, idioms, names and titles, as well as technical compounds, many of which were missing from the original dictionary.
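    One common way to collect such character groups statistically is to rank character n-grams by an association measure. The sketch below uses pointwise mutual information (PMI) as that measure; this is an assumption for illustration, since the abstract does not specify the paper's actual statistic or filtering.

```python
import math
from collections import Counter

def candidate_ngrams(text, n=2, min_count=2):
    """Rank character n-grams by pointwise mutual information (PMI):
    frequent groups whose characters co-occur far more often than chance
    are likely words. A sketch of the statistical idea only; the paper's
    actual association measure and thresholds may differ."""
    char_counts = Counter(text)
    ngram_counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total_chars = len(text)
    total_ngrams = sum(ngram_counts.values())
    scored = {}
    for gram, count in ngram_counts.items():
        if count < min_count:
            continue  # drop rare groups before scoring
        p_gram = count / total_ngrams
        p_indep = 1.0
        for ch in gram:
            p_indep *= char_counts[ch] / total_chars
        scored[gram] = math.log2(p_gram / p_indep)
    return sorted(scored, key=scored.get, reverse=True)
```

    On a toy corpus where 电脑 ("computer") recurs, the recurring bigram surfaces as the top candidate while one-off character pairs are filtered out; surviving candidates not already in the dictionary would then go to human evaluation.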

    Teaching Specialized Translation: Error-tagged Translation Learner Corpora

    This paper describes the method used in teaching specialised translation in the English Language Translation Master's programme at Masaryk University. After a brief description of the courses, the focus shifts to translation learner corpora (TLC) compiled in the new Hypal interface, which can be integrated in Moodle. Student translations are automatically aligned (with possible adjustments), PoS (part-of-speech) tagged, and manually error-tagged. Personal student reports, based on error statistics for individual translations, can easily be generated to show students' progress throughout the term or over the four-semester programme. Using data from the pilot run of the new software, the paper concludes with the first results of research examining a learner corpus of translations from Czech into English.

    Parallel texts alignment

    Work presented within the scope of the Master's in Computer Engineering, as a partial requirement for obtaining the degree of Master in Computer Engineering. Alignment of parallel texts (texts that are translations of each other) is a required step for many applications that use parallel texts, including statistical machine translation, automatic extraction of translation equivalents, and automatic creation of concordances. This dissertation presents a new methodology for parallel text alignment that departs from previous work in several ways. One important departure is a shift of goals concerning the use of lexicons for obtaining correspondences between the texts. Previous methods try to infer a bilingual lexicon as part of the alignment process and use it to obtain correspondences between the texts. Some of those methods can use external lexicons to complement the inferred one, but they tend to treat them as secondary. This dissertation presents several arguments supporting the thesis that lexicon inference should not be embedded in the alignment process. The method described complies with this statement and relies exclusively on externally managed lexicons to obtain correspondences. Moreover, the algorithms presented can handle very large lexicons containing terms of arbitrary length. Besides the exclusive use of external lexicons, this dissertation presents a new method for obtaining correspondences between translation equivalents found in the texts, using a decision criterion based on features that have been overlooked by prior work. The proposed method is iterative and refines the alignment at each iteration: the alignment obtained in one iteration guides the search for new correspondences in the next iteration, which in turn are used to compute a finer alignment. This iterative scheme allows the method to correct correspondence errors from previous iterations in the light of new information.
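    The two core ideas, correspondences taken only from an external lexicon, then refined against the current alignment, can be sketched as below. This is a deliberately simplified illustration: entries are single tokens, and the refinement pass is just a monotonicity filter, whereas the dissertation's algorithms handle terms of arbitrary length and richer decision features.

```python
def lexicon_correspondences(src_tokens, tgt_tokens, lexicon):
    """Anchor correspondences (src_index, tgt_index) taken from an external
    bilingual lexicon, echoing the thesis's stance that no lexicon is
    inferred during alignment. (Single-token entries only in this sketch.)"""
    tgt_pos = {tok: i for i, tok in enumerate(tgt_tokens)}
    pairs = []
    for i, tok in enumerate(src_tokens):
        translation = lexicon.get(tok)
        if translation is not None and translation in tgt_pos:
            pairs.append((i, tgt_pos[translation]))
    return pairs

def filter_monotone(pairs):
    """One refinement pass: keep only correspondences consistent with a
    monotone left-to-right alignment, discarding crossing outliers. Repeating
    such passes as new correspondences arrive gives an iterative refinement."""
    kept, last_tgt = [], -1
    for s, t in sorted(pairs):
        if t > last_tgt:
            kept.append((s, t))
            last_tgt = t
    return kept
```

    In an iterative scheme, the surviving pairs from one pass delimit the regions searched for new lexicon matches in the next, so earlier errors can be displaced by better-supported correspondences.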