6,427 research outputs found

    Embedding Web-based Statistical Translation Models in Cross-Language Information Retrieval

    Get PDF
    Although more and more language pairs are covered by machine translation services, there are still many pairs that lack translation resources. Cross-language information retrieval (CLIR) is an application which needs translation functionality of a relatively low level of sophistication since current models for information retrieval (IR) are still based on a bag-of-words. The Web provides a vast resource for the automatic construction of parallel corpora which can be used to train statistical translation models automatically. The resulting translation models can be embedded in several ways in a retrieval model. In this paper, we will investigate the problem of automatically mining parallel texts from the Web and different ways of integrating the translation models within the retrieval process. Our experiments on standard test collections for CLIR show that the Web-based translation models can surpass commercial MT systems in CLIR tasks. These results open the perspective of constructing a fully automatic query translation device for CLIR at a very low cost.Comment: 37 page

    Statistical Machine Translation of Japanese

    Get PDF
    The purpose of this research was to find ways to improve the performance of a statistical machine translation system that translates text from Japanese to English. Methods included altering the training and test data by adding a prior linguistic knowledge, altering sentence structures, and looking for better ways to statistically alter the way words align between the two languages. In addition, methods for properly segmenting words in Japanese text through statistical methods were examined. Finally, experiments were conducted on Japanese speech to produce the best text transcription of the speech. The best statistical machine translation methods implemented resulted in improvements that rivaled the best evaluations from the 2005 International Workshop on Spoken Language Translation from which training and test data was used. Recommendations, including how the methods presented may be altered for further improvements for future research, are also discussed

    Proceedings of the 17th Annual Conference of the European Association for Machine Translation

    Get PDF
    Proceedings of the 17th Annual Conference of the European Association for Machine Translation (EAMT

    Getting Past the Language Gap: Innovations in Machine Translation

    Get PDF
    In this chapter, we will be reviewing state of the art machine translation systems, and will discuss innovative methods for machine translation, highlighting the most promising techniques and applications. Machine translation (MT) has benefited from a revitalization in the last 10 years or so, after a period of relatively slow activity. In 2005 the field received a jumpstart when a powerful complete experimental package for building MT systems from scratch became freely available as a result of the unified efforts of the MOSES international consortium. Around the same time, hierarchical methods had been introduced by Chinese researchers, which allowed the introduction and use of syntactic information in translation modeling. Furthermore, the advances in the related field of computational linguistics, making off-the-shelf taggers and parsers readily available, helped give MT an additional boost. Yet there is still more progress to be made. For example, MT will be enhanced greatly when both syntax and semantics are on board: this still presents a major challenge though many advanced research groups are currently pursuing ways to meet this challenge head-on. The next generation of MT will consist of a collection of hybrid systems. It also augurs well for the mobile environment, as we look forward to more advanced and improved technologies that enable the working of Speech-To-Speech machine translation on hand-held devices, i.e. speech recognition and speech synthesis. We review all of these developments and point out in the final section some of the most promising research avenues for the future of MT

    PersoNER: Persian named-entity recognition

    Full text link
    © 1963-2018 ACL. Named-Entity Recognition (NER) is still a challenging task for languages with low digital resources. The main difficulties arise from the scarcity of annotated corpora and the consequent problematic training of an effective NER pipeline. To abridge this gap, in this paper we target the Persian language that is spoken by a population of over a hundred million people world-wide. We first present and provide ArmanPerosNERCorpus, the first manually-annotated Persian NER corpus. Then, we introduce PersoNER, an NER pipeline for Persian that leverages a word embedding and a sequential max-margin classifier. The experimental results show that the proposed approach is capable of achieving interesting MUC7 and CoNNL scores while outperforming two alternatives based on a CRF and a recurrent neural network

    Learner Modelling for Individualised Reading in a Second Language

    Get PDF
    Extensive reading is an effective language learning technique that involves fast reading of large quantities of easy and interesting second language (L2) text. However, graded readers used by beginner learners are expensive and often dull. The alternative is text written for native speakers (authentic text), which is generally too difficult for beginners. The aim of this research is to overcome this problem by developing a computer-assisted approach that enables learners of all abilities to perform effective extensive reading using freely-available text on the web. This thesis describes the research, development and evaluation of a complex software system called FERN that combines learner modelling and iCALL with narrow reading of electronic text. The system incorporates four key components: (1) automatic glossing of difficult words in texts, (2) individualised search engine for locating interesting texts of appropriate difficulty, (3) supplementary exercises for introducing key vocabulary and reviewing difficult words and (4) reliably monitoring reading and reporting progress. FERN was optimised for English speakers learning Spanish, but is easily adapted for learners of others languages. The suitability of the FERN system was evaluated through corpus analysis, machine translation analysis and a year-long study with second year university Spanish class. The machine translation analysis combined with the classroom study demonstrated that the word and phrase error rate generated in FERN is low enough to validate the use of machine translation to automatically generate glosses, but is high enough that a translation dictionary is required as a backup. The classroom study demonstrated that when aided by glosses students can read at over 100 words per minute if they know 95% of the words, whereas compared to the 98% word knowledge required for effective unaided extensive reading. A corpus analysis demonstrated that beginner learners of Spanish can do effective narrow reading of news articles using FERN after learning only 200–300 high-frequency word families, in addition to familiarity with English-Spanish cognates and proper nouns. FERN also reliably monitors reading speeds and word counts, and provides motivating progress reports, which enable teachers to set concrete reading goals that dramatically increase the quantity that students read, as demonstrated in the user study

    Development of an international written communication audit

    Get PDF
    Although localization, internationalization, and globalization efforts to meet international customers\u27 product and information needs are accepted strategies in the US computer industry, the needs of second language (L2) English speakers are less directly addressed in the international workplace. Application of strategies similar to these three technical communication strategies may benefit international workplace communication;The international writing approaches represented by these three communication strategies are related to the global management strategies of organizations (e.g., ethnocentric, polycentric, geocentric). This categorization, based on Perlmutter and Hedlund, considers organizations\u27 strategic missions and can be used to align management strategies with international writing approaches and individual rhetorical strategies. For example, an ethnocentric organization, entering the international market from a broad national base, instead of immediately changing its communication approach, might continue to use its source-localized information to communicate internationally. An organization might enter the global arena with an ethnocentric strategy, and, in reaction to emerging problems, focus on localization for each market and rely heavily on translation and translators, becoming more polycentric in its approach. A geocentric organization, balancing between ethnocentric and polycentric management strategies, is in constant communication across national and language borders, and might use both internationalization and globalization approaches in communication. Organizations\u27 global management strategies should align with their international communication practices, both for customers and in the workplace. An organization seeking a larger role in international ventures, yet with ethnocentric, localized communication strategies, might be less successful than one with similar goals and a more geocentric, globalized communication;In recognition of the diverse needs of organizations and individuals, an assessment method, an International Written Communication Audit (IWCA), is developed in this dissertation. The IWCA, based on linguistic and contrastive rhetoric research, focuses on cultural, pragmatic, and translation issues important to international workplace writing in US-English. The basic IWCA combines internationalization and globalization approaches. A localization module for the PRC is offered as an example of tailoring the audit methodology to the needs of L2 English readers from a specific language group. The construction of a workplace sampling frame and the analysis of the IWCA data are discussed

    In no uncertain terms : a dataset for monolingual and multilingual automatic term extraction from comparable corpora

    Get PDF
    Automatic term extraction is a productive field of research within natural language processing, but it still faces significant obstacles regarding datasets and evaluation, which require manual term annotation. This is an arduous task, made even more difficult by the lack of a clear distinction between terms and general language, which results in low inter-annotator agreement. There is a large need for well-documented, manually validated datasets, especially in the rising field of multilingual term extraction from comparable corpora, which presents a unique new set of challenges. In this paper, a new approach is presented for both monolingual and multilingual term annotation in comparable corpora. The detailed guidelines with different term labels, the domain- and language-independent methodology and the large volumes annotated in three different languages and four different domains make this a rich resource. The resulting datasets are not just suited for evaluation purposes but can also serve as a general source of information about terms and even as training data for supervised methods. Moreover, the gold standard for multilingual term extraction from comparable corpora contains information about term variants and translation equivalents, which allows an in-depth, nuanced evaluation

    ANNOTATED DISJUNCT FOR MACHINE TRANSLATION

    Get PDF
    Most information found in the Internet is available in English version. However, most people in the world are non-English speaker. Hence, it will be of great advantage to have reliable Machine Translation tool for those people. There are many approaches for developing Machine Translation (MT) systems, some of them are direct, rule-based/transfer, interlingua, and statistical approaches. This thesis focuses on developing an MT for less resourced languages i.e. languages that do not have available grammar formalism, parser, and corpus, such as some languages in South East Asia. The nonexistence of bilingual corpora motivates us to use direct or transfer approaches. Moreover, the unavailability of grammar formalism and parser in the target languages motivates us to develop a hybrid between direct and transfer approaches. This hybrid approach is referred as a hybrid transfer approach. This approach uses the Annotated Disjunct (ADJ) method. This method, based on Link Grammar (LG) formalism, can theoretically handle one-to-one, many-to-one, and many-to-many word(s) translations. This method consists of transfer rules module which maps source words in a source sentence (SS) into target words in correct position in a target sentence (TS). The developed transfer rules are demonstrated on English → Indonesian translation tasks. An experimental evaluation is conducted to measure the performance of the developed system over available English-Indonesian MT systems. The developed ADJ-based MT system translated simple, compound, and complex English sentences in present, present continuous, present perfect, past, past perfect, and future tenses with better precision than other systems, with the accuracy of 71.17% in Subjective Sentence Error Rate metric
    corecore