
    Filling Knowledge Gaps in a Broad-Coverage Machine Translation System

    Knowledge-based machine translation (KBMT) techniques yield high quality in domains with detailed semantic models, limited vocabulary, and controlled input grammar. Scaling up along these dimensions means acquiring large knowledge resources. It also means behaving reasonably when definitive knowledge is not yet available. This paper describes how we can fill various KBMT knowledge gaps, often using robust statistical techniques. We describe quantitative and qualitative results from JAPANGLOSS, a broad-coverage Japanese-English MT system. (Comment: 7 pages, compressed and uuencoded PostScript. To appear: IJCAI-95.)
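    The abstract leaves the gap-filling machinery at a high level. As a purely illustrative reading of the general idea (prefer hand-built semantic knowledge; back off to statistically acquired resources when definitive knowledge is missing), here is a minimal Python sketch; all names and data in it are hypothetical and are not JAPANGLOSS code.

```python
# Hypothetical sketch of knowledge-gap fallback in a KBMT pipeline:
# prefer the hand-crafted semantic lexicon, and back off to statistically
# acquired resources when definitive knowledge is not yet available.

semantic_lexicon = {"inu": "dog"}  # hand-crafted, high-precision knowledge
stat_dictionary = {"neko": [("cat", 0.92), ("kitty", 0.08)]}  # corpus-derived

def translate_word(word: str) -> str:
    if word in semantic_lexicon:       # definitive knowledge available
        return semantic_lexicon[word]
    if word in stat_dictionary:        # robust statistical fallback
        return max(stat_dictionary[word], key=lambda c: c[1])[0]
    return word                        # last resort: pass the word through

print([translate_word(w) for w in ["inu", "neko", "sushi"]])
# -> ['dog', 'cat', 'sushi']
```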

    Operationalization of interactive Multilingual Access Gateways (iMAGs) in the Traouiero project

    We will explain and demonstrate iMAGs (interactive Multilingual Access Gateways), in particular on a scientific laboratory web site and on the Greater Grenoble (La Métro) web site; this bilingual presentation itself has been obtained using an iMAG. This presentation is an adaptation and update of an article presented as a demonstration only at TALN-2010. The names of the files have been kept the same, although their contents are slightly different.

    The iMAG concept was proposed by Ch. Boitet and V. Bellynck in 2006 (Boitet & al. 2008, Boitet & al. 2005) and reached prototype status in November 2008, with a first demonstration on the LIG laboratory Web site. It was adapted to the DSR (Digital Silk Road) Web site in April 2009, and then to more than 50 other Web sites. These first prototypes are extensions of the SECTra_w (Huynh & al. 2008) online translation corpora support system. Since the beginning of 2011, we have been operationalizing this software with a view to deploying it as a multilingual access infrastructure, in the context of the French ANR (National Agency for Research) Traouiero "emergence" project.

    At first sight, an iMAG is very much like Google Translate: one gives it a URL (the starting Web site) and an access language, and then navigates in that access language. When the cursor hovers over a segment (usually a sentence or a title), a palette shows the source segment and proposes to contribute by correcting the target segment, in effect post-editing an MT result. With Google Translate, the page does not change after a contribution, and if another page contains the same segment, its translation is still the rough MT result, not the polished post-edited version. The more recent Google Translation Toolkit lets one MT-translate and then post-edit online full Web pages from sites such as Wikipedia, but again the corrected segments do not appear when one later browses the Wikipedia page in the access language.

    By contrast, an iMAG is dedicated to an elected Web site, or rather to the elected sublanguage defined by one or more URLs and their textual content. It contains a translation memory (TM) and a specific preterminological dictionary (pTD), both dedicated to the elected sublanguage. Segments are pretranslated not by a unique MT system but by a (selectable) set of MT systems. Systran and Google are mainly used now, but specialized systems developed from the post-edited part of the TM, and based on Moses, will also be used in the future. The powerful online contributive platforms SECTra_w and PIVAX (Nguyen & al. 2007) are used to support the TMs and pTDs.

    Translated pages are built with the best segment translations available so far. While reading a translated page, it is possible not only to contribute to the segment under the cursor, but also to switch seamlessly to the SECTra_w online post-editing environment, equipped with proactive dictionary help and good filtering and search-and-replace functions, and then back to the reading context. A translation relay is being implemented to define the iMAGs or other translation gateways used by an elected Web site, to select and parameterize the MT systems and translation routes used for various language pairs, and to manage users, groups, projects (some contributions may be organized, others opportunistic), and access rights.
    Finally, MT systems tailored to the elected sublanguage can be built (by combinations of empirical and expert methods) from the TM and the pTD dedicated to a given elected Web site. That approach should inherently raise the linguistic and terminological quality of the MT results, hopefully converting them from rough into raw translations. The demonstration will use some iMAGs created by the AXiMAG startup for various Web sites, such as those of the LIG lab (http://service.aximag.fr:8180/xwiki/bin/view/imag/liglab) and of La Métro (Greater Grenoble) (http://service.aximag.fr:8180/xwiki/bin/view/imag/lametro), where access in Chinese and English was enabled in 2010 for the Shanghai Expo.
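    The segment-serving behaviour described above (post-edited TM entries take precedence over rough MT pretranslations, and one contribution benefits every later page containing the same segment) can be made concrete with a small sketch. The following Python is purely illustrative; the class and method names are invented and do not reflect the actual SECTra_w/PIVAX implementation.

```python
# Illustrative sketch of an iMAG-style gateway: a translation memory (TM)
# dedicated to one elected Web site, consulted before any MT system.
# All names here are hypothetical, not AXiMAG's actual API.

class IMAGGateway:
    def __init__(self, mt_systems):
        self.mt_systems = mt_systems   # selectable set of MT back ends
        self.tm = {}                   # (segment, lang) -> (translation, postedited?)

    def serve_segment(self, segment, lang):
        """Return the best translation available so far for this segment."""
        if (segment, lang) in self.tm:
            return self.tm[(segment, lang)][0]
        # No TM hit: pretranslate with the first available MT system.
        translation = self.mt_systems[0](segment, lang)
        self.tm[(segment, lang)] = (translation, False)
        return translation

    def contribute(self, segment, lang, postedited):
        """Store a post-edit; all later pages reuse the polished version."""
        self.tm[(segment, lang)] = (postedited, True)

gateway = IMAGGateway(mt_systems=[lambda s, l: f"[MT:{l}] {s}"])
print(gateway.serve_segment("Bienvenue", "en"))   # rough MT result
gateway.contribute("Bienvenue", "en", "Welcome")
print(gateway.serve_segment("Bienvenue", "en"))   # 'Welcome', now from the TM
```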

    A tool for facilitating OCR postediting in historical documents

    Optical character recognition (OCR) for historical documents is a complex procedure subject to a unique set of material issues, including inconsistencies in typefaces and low-quality scanning. Consequently, even the most sophisticated OCR engines produce errors. This paper reports on a tool built for postediting the output of Tesseract, more specifically for correcting common errors in digitized historical documents. The proposed tool suggests alternatives for word forms not found in a specified vocabulary; the assumed error is then replaced in the post-edited output by a presumably correct alternative, chosen on the basis of language model (LM) scores. The tool is tested on a chapter of the book An Essay Towards Regulating the Trade and Employing the Poor of this Kingdom. As demonstrated below, the tool succeeds in correcting a number of common errors. Though sometimes unreliable, it is also transparent and open to human intervention.
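    The paper describes this mechanism only at a high level; the sketch below shows one plausible reading in Python: for a word form missing from the vocabulary, generate spelling-close candidates and keep the one the language model scores highest. The vocabulary and the lm_score stand-in are hypothetical, not the authors' code.

```python
import difflib

# Hypothetical stand-ins for the paper's resources.
VOCABULARY = {"trade", "employing", "poor", "kingdom", "the", "of"}

def lm_score(sentence):
    """Placeholder for a real language-model score (higher = more fluent)."""
    return -len(sentence)  # trivial stand-in; replace with an actual LM

def correct_word(word, context):
    if word.lower() in VOCABULARY:
        return word                      # known word form: leave untouched
    # Candidate corrections: vocabulary entries close in spelling.
    candidates = difflib.get_close_matches(word.lower(), sorted(VOCABULARY),
                                           n=5, cutoff=0.7)
    if not candidates:
        return word                      # no plausible alternative: keep as-is
    # Pick the candidate whose substitution the LM scores highest.
    return max(candidates, key=lambda c: lm_score(context.replace(word, c)))

sentence = "Employing the Pocr of this Kingdom"
print(" ".join(correct_word(w, sentence) for w in sentence.split()))
# -> "Employing the poor of this Kingdom"
```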

    Neural Network Architecture for Credibility Assessment of Textual Claims

    Text articles with false claims, especially news articles, have recently become a growing problem for Internet users. These articles are in wide circulation, and readers face difficulty discerning fact from fiction. Previous work on credibility assessment has focused on factual analysis and linguistic features. The task's main challenge is distinguishing the features of true articles from those of false ones. In this paper, we propose a novel approach called Credibility Outcome (CREDO), which aims at scoring the credibility of an article in an open-domain setting. CREDO consists of different modules capturing the various features responsible for the credibility of an article: the credibility of the article's source and author, the semantic similarity between the article and related credible articles retrieved from a knowledge base, and the sentiments conveyed by the article. A neural network architecture learns the contribution of each of these modules to the overall credibility of an article. Experiments on the Snopes dataset reveal that CREDO outperforms state-of-the-art approaches based on linguistic features. (Comment: Best Paper Award at the 19th International Conference on Computational Linguistics and Intelligent Text Processing, March 2018, Hanoi, Vietnam.)
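    As a rough illustration of the stated architecture (per-module feature scores combined by a learned network into one credibility score), here is a minimal PyTorch sketch. The layer sizes, the four-feature input, and all names are assumptions, not the paper's actual CREDO model.

```python
import torch
import torch.nn as nn

# Hedged sketch: each module emits one feature score (source credibility,
# author credibility, semantic similarity to retrieved credible articles,
# sentiment), and a small network learns how much each contributes to the
# overall credibility of an article.

class CredibilityScorer(nn.Module):
    def __init__(self, n_modules=4, hidden=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_modules, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
            nn.Sigmoid(),              # credibility score in [0, 1]
        )

    def forward(self, module_scores):
        return self.net(module_scores)

model = CredibilityScorer()
# One article: [source_cred, author_cred, semantic_sim, sentiment]
features = torch.tensor([[0.8, 0.6, 0.9, 0.2]])
print(model(features))  # untrained score; train with BCE loss on labels
```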

    Machine Translation and Audiovisual products: A case study

    Much has been said and written about the effects that machine translation (MT) is having on all kinds of published products. This paper discusses the introduction of MT into the localisation of audiovisual products in general, and of voice-over documentaries in particular. Incorporating MT into the translation of voice-over documentaries would boost the dissemination of non-commercial or minority products and could enhance the spread of culture. While it might at first seem that MT could be easily integrated into translation for documentaries, certain aspects make it difficult to use MT to translate for dubbing or voice-over. We therefore designed an exploratory study using a corpus containing different texts of a film, in order to compare the results of different automatic measures. The preliminary results show that the scores vary across types of speech, while the different automatic metrics yield similar results. Furthermore, we present the methodological design, which might prove useful for other studies of this kind.
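    For readers unfamiliar with such comparisons, a minimal sketch of scoring MT output per speech type with standard automatic metrics (here via the sacrebleu package) might look as follows; the segments are invented and are not from the study's corpus.

```python
import sacrebleu

# Minimal sketch of comparing automatic MT metrics per speech type,
# in the spirit of the study above. The segments are invented examples.

corpus = {
    "narration": (["the whales migrate south in winter"],           # MT output
                  [["the whales migrate south during winter"]]),    # references
    "interview": (["we was filming for three weeks"],
                  [["we were filming for three weeks"]]),
}

for speech_type, (hyps, refs) in corpus.items():
    bleu = sacrebleu.corpus_bleu(hyps, refs)
    chrf = sacrebleu.corpus_chrf(hyps, refs)
    print(f"{speech_type}: BLEU={bleu.score:.1f} chrF={chrf.score:.1f}")
```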

    Second language learning: finding ways to successfully integrate ICT resources and right strategies for language learning, translation and interpreting

    Second language learning has gained importance as language accreditations have become imperative for any profession or academic career. Undergraduate students in Philology, Translation Studies, Tourism Studies or the like follow language accreditation programmes in order to be able to compile a valid and solid CV when they complete their degree, master's or PhD programmes. Translation students, in particular, are strongly motivated to apply language learning to translation and interpreting tasks. Language technologies and tools constitute an essential part of their learning processes, and language teachers should find a way of optimising the use of these resources. For this purpose, we have conducted a survey among students, trying to find out which web resources they use, how they use them (or not), and why. Using these data, we considered new strategies to help students get the most out of these tools; in particular, we analysed the pros and cons of machine translation tools, such as DeepL and Google Translate, as well as corpus linguistics tools.
    Universidad de Málaga. Campus de Excelencia Internacional Andalucía Tech.