13 research outputs found

    Subtitling for a global audience. Handling the translation of culture-specific items in TEDx talks

    Full text link
    [EN] TED.com is a platform for sharing ideas through influential video talks on topics ranging from science and technology to business. It engages volunteers from all over the world to help transcribe, subtitle and translate talk scripts into more than 100 languages. The rationale for engaging volunteer transcribers is that transcribed talks can reach a wider audience: they are accessible to hearing-impaired individuals, can be indexed by search engines, and further TED's mission of spreading ideas by making transcripts available for translation through TED's Open Translation Project. Talk transcribers therefore play a crucial role in the overall translation workflow and dissemination process, as they transcribe the content that will later be translated into different languages. The objective of this paper is to analyse a corpus of talks originally delivered in different variants of Spanish in order to identify the most common strategies used by volunteer transcribers to handle local or idiomatic expressions and culturally biased items, so as to reach the widest possible audience and facilitate translation.
    Candel-Mora, MÁ.; González Pastor, DM. (2017). Subtitling for a global audience. Handling the translation of culture-specific items in TEDx talks. FORUM. Revue internationale d'interprétation et de traduction / International Journal of Interpretation and Translation. 15(2):288-304. doi:10.1075/forum.15.2.07can

    Generic and Specialized Word Embeddings for Multi-Domain Machine Translation

    Get PDF
    Supervised machine translation works well when the train and test data are sampled from the same distribution. When this is not the case, adaptation techniques help ensure that the knowledge learned from out-of-domain texts generalises to in-domain sentences. We study here a related setting, multi-domain adaptation, where the number of domains is potentially large and adapting separately to each domain would waste training resources. Our proposal transposes to neural machine translation the feature expansion technique of Daumé III (2007): it isolates domain-agnostic from domain-specific lexical representations, while sharing most of the network across domains. Our experiments use two architectures and two language pairs: they show that our approach, while simple and computationally inexpensive, outperforms several strong baselines and delivers a multi-domain system that successfully translates texts from diverse sources.
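    The feature-expansion idea can be sketched in a few lines. Below is a minimal, hypothetical PyTorch sketch (module and dimension names are illustrative, not taken from the paper) of an embedding layer that concatenates a shared, domain-agnostic vector with a domain-specific one:

    import torch
    import torch.nn as nn

    class MultiDomainEmbedding(nn.Module):
        """Feature augmentation in the spirit of Daumé III (2007): each
        token is represented by a shared (domain-agnostic) vector
        concatenated with a vector specific to the sentence's domain."""

        def __init__(self, vocab_size, n_domains, shared_dim=384, domain_dim=128):
            super().__init__()
            self.shared = nn.Embedding(vocab_size, shared_dim)
            # One small table per domain; only the table of the current
            # domain receives gradients for a given batch.
            self.domain_specific = nn.ModuleList(
                [nn.Embedding(vocab_size, domain_dim) for _ in range(n_domains)]
            )

        def forward(self, token_ids, domain_id):
            # token_ids: (batch, length); domain_id: int for the whole batch.
            shared_part = self.shared(token_ids)
            specific_part = self.domain_specific[domain_id](token_ids)
            return torch.cat([shared_part, specific_part], dim=-1)

    The shared table is updated by examples from every domain, while each domain-specific table sees only its own data; this is what lets the model separate generic lexical behaviour from domain-biased usage while keeping most parameters shared.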

    Semi-Supervised Acoustic Model Training by Discriminative Data Selection from Multiple ASR Systems' Hypotheses

    Get PDF
    While the performance of ASR systems depends on the size of the training data, it is very costly to prepare accurate and faithful transcripts. In this paper, we investigate a semi-supervised training scheme that takes advantage of a huge archive of unlabeled video lectures, particularly for the deep neural network (DNN) acoustic model. In the proposed method, we obtain ASR hypotheses from complementary GMM- and DNN-based ASR systems. Then, a set of CRF-based classifiers is trained to select the correct hypotheses and verify the selected data. The proposed hypothesis combination yields higher-quality selections than the conventional system combination method (ROVER). Moreover, compared with conventional data selection based on confidence measure scores, our method proves more effective at filtering usable data. A significant improvement in ASR accuracy is achieved over the baseline system and over models trained with the conventional system combination and data selection methods.
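    To make the selection step concrete, here is a deliberately simplified stand-in for the paper's CRF-based classifiers: it keeps an utterance only when the two complementary systems' hypotheses largely agree, on the assumption that GMM- and DNN-based recognisers rarely make identical errors. The function name and agreement heuristic are illustrative only, not the paper's method:

    import difflib

    def select_training_data(gmm_hyps, dnn_hyps, min_agreement=0.9):
        """Keep utterances whose GMM- and DNN-based hypotheses mostly
        agree; the surviving hypotheses serve as pseudo-transcripts for
        semi-supervised acoustic model training."""
        selected = []
        for gmm, dnn in zip(gmm_hyps, dnn_hyps):
            ratio = difflib.SequenceMatcher(None, gmm.split(), dnn.split()).ratio()
            if ratio >= min_agreement:
                selected.append(dnn)  # trust the (stronger) DNN hypothesis
        return selected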

    Phrase-Based Language Model in Statistical Machine Translation

    Get PDF
    As one of the most important modules in statistical machine translation (SMT), the language model measures whether one translation hypothesis is more grammatically correct than another. Current state-of-the-art SMT systems use standard word n-gram models, whereas the translation model is phrase-based. In this paper, the idea is to use a phrase-based language model. To that end, the target portions of the translation table are retrieved and used to rewrite the training corpus and to estimate a phrase n-gram language model. In this work, we perform experiments with two language models: word-based (WBLM) and phrase-based (PBLM). The different SMT systems are trained with three optimization algorithms: MERT, MIRA and PRO. The PBLM systems are then compared to the baseline system in terms of BLEU and TER. The experimental results show that using a phrase-based language model in SMT can improve results and is especially able to reduce the error rate.
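    The corpus rewriting step can be illustrated as follows; greedy longest-match segmentation is one plausible choice (the abstract does not commit to a specific strategy), and phrase tokens are joined with underscores so an off-the-shelf n-gram toolkit such as KenLM or SRILM can be run unchanged on the rewritten corpus:

    def segment_into_phrases(sentence, phrase_set, max_len=4):
        """Rewrite a sentence as a sequence of phrase tokens, where
        phrase_set holds the target sides of the translation table."""
        words, out, i = sentence.split(), [], 0
        while i < len(words):
            # Try the longest phrase first, fall back to a single word.
            for n in range(min(max_len, len(words) - i), 0, -1):
                candidate = " ".join(words[i:i + n])
                if n == 1 or candidate in phrase_set:
                    out.append(candidate.replace(" ", "_"))
                    i += n
                    break
        return " ".join(out)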

    WIT3: Web Inventory of Transcribed and Translated Talks

    Get PDF
    We describe here a Web inventory named WIT3 that offers access to a collection of transcribed and translated talks. The core of WIT3 is the TED Talks corpus, which redistributes the original content published by the TED Conference website (http://www.ted.com). Since 2007, the TED Conference, based in California, has been posting video recordings of all its talks together with subtitles in English and their translations in more than 80 languages. Aside from its cultural and social relevance, this content, which is published under the Creative Commons BY-NC-ND license, also represents a precious language resource for the machine translation research community, thanks to its size, variety of topics, and covered languages. This effort repurposes the original content in a way that is more convenient for machine translation researchers.

    The IWSLT Evaluation Campaign: Challenges, Achievements, Future Directions

    Get PDF
    Evaluation campaigns are the most successful means of promoting the assessment of the state of the art of a field on a specific task. Within the field of Machine Translation (MT), the International Workshop on Spoken Language Translation (IWSLT) is a yearly scientific workshop associated with an open evaluation campaign on spoken language translation. The IWSLT campaign, the only one addressing speech translation, started in 2004 and will feature its 13th installment in 2016. Since its beginning, the campaign has attracted around 70 different participating teams from all over the world. In this paper we present the main characteristics of the tasks offered within IWSLT, as well as the evaluation framework adopted and the data made available to the research community. We also analyse and discuss the progress made by the systems over the years on the most addressed and long-standing tasks, and we share ideas about new challenging data and interesting application scenarios for testing the utility of MT systems in real tasks.

    Does more data always yield better translations?

    Full text link
    Nowadays, large amounts of data are available to train statistical machine translation systems. However, it is not clear whether all the training data actually help. A system trained on a subset of such huge bilingual corpora might outperform one trained on all the bilingual data. This paper studies these issues by analysing two training data selection techniques: one based on approximating the probability of an in-domain corpus, and another based on infrequent n-gram occurrence. Experimental results report not only significant improvements over random sentence selection but also an improvement over a system trained with all the available data. Surprisingly, the improvements are obtained with just a small fraction of the data, accounting for less than 0.5% of the sentences. We then show that a much larger margin for improvement exists, although only under unrealistic conditions.
    The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement nr. 287755. This work was also supported by the Spanish MEC/MICINN under the MIPRCV "Consolider Ingenio 2010" program (CSD2007-00018) and the iTrans2 (TIN2009-14511) project, by the Spanish MITyC under the erudito.com (TSI-020110-2009-439) project, and by the Instituto Tecnológico de León, DGEST-PROMEP y CONACYT, México.
    Gascó Mora, G.; Rocha Sánchez, MA.; Sanchis Trilles, G.; Andrés Ferrer, J.; Casacuberta Nolla, F. (2012). Does more data always yield better translations?. Association for Computational Linguistics. 152-161. http://hdl.handle.net/10251/35214
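    The first selection criterion described in this abstract can be illustrated with a cross-entropy-difference ranking in the spirit of Moore and Lewis (2010); this sketch is not necessarily the paper's exact formulation, and the logprob(sentence) method on the two language model objects is a hypothetical interface:

    def rank_by_in_domain_score(candidates, in_domain_lm, general_lm):
        """Rank candidate sentences so that those an in-domain LM finds
        likely, and a general LM finds unlikely, come first."""
        def score(sentence):
            n = max(len(sentence.split()), 1)
            # Length-normalised log-probability difference.
            return (in_domain_lm.logprob(sentence) - general_lm.logprob(sentence)) / n
        return sorted(candidates, key=score, reverse=True)

    Training on only the top-ranked fraction of such a ranking is then what allows a subset accounting for less than 0.5% of the sentences to outperform the full corpus, as reported above.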

    Multilingual Neural Machine Translation for Low-Resource Languages

    Get PDF
    In recent years, Neural Machine Translation (NMT) has been shown to be more effective than phrase-based statistical methods, thus quickly becoming the state of the art in machine translation (MT). However, NMT systems are limited in translating low-resourced languages, due to the significant amount of parallel data required to learn useful mappings between languages. In this work, we show how so-called multilingual NMT can help to tackle the challenges associated with low-resourced language translation. The underlying principle of multilingual NMT is to force the creation of hidden representations of words in a shared semantic space across multiple languages, thus enabling positive parameter transfer across languages. Along this direction, we present multilingual translation experiments with three languages (English, Italian, Romanian) covering six translation directions, utilizing both recurrent neural networks and transformer (or self-attentive) neural networks. We then focus on the zero-shot translation problem, that is, how to leverage multilingual data to learn translation directions that are not covered by the available training material. To this end, we introduce our recently proposed iterative self-training method, which incrementally improves a multilingual NMT system on a zero-shot direction by relying only on monolingual data. Our results on TED talks data show that multilingual NMT outperforms conventional bilingual NMT, that transformer NMT outperforms recurrent NMT, and that zero-shot NMT outperforms conventional pivoting methods and even matches the performance of a fully-trained bilingual system.
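    The shared-semantic-space principle is commonly implemented by prepending a target-language tag to every source sentence, so that a single encoder-decoder serves all directions and the tag alone selects the output language at test time; the tag format below is illustrative, not necessarily the paper's:

    def make_multilingual_example(src_sentence, tgt_lang):
        """Prepend a target-language tag so one shared model can serve
        all translation directions, including zero-shot ones."""
        return f"<2{tgt_lang}> {src_sentence}"

    # Training mixes all six directions in a single model:
    make_multilingual_example("Good morning", "it")  # -> "<2it> Good morning"
    # At test time the same tag enables an unseen (zero-shot) direction:
    make_multilingual_example("Buongiorno", "ro")    # -> "<2ro> Buongiorno"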

    WIT3: the multilingual subtitle corpus of TED conference talks

    Get PDF
    In this work we present WIT3, the website we developed to distribute a ready-to-use version of the multilingual collection of subtitles of TED conference talks. We believe this collection represents a valuable resource for the machine translation research community, given its continuously growing size and its variety in terms of both languages and topics covered. Indeed, as of today, June 2013, the TED site hosts recordings of more than 2000 talks spanning all fields of human knowledge, from technology to entertainment, from economics to science; English transcripts are already available for most recordings, while translations are gradually being added and currently cover up to 100 different languages. Our ambition is to provide, through WIT3, an adequate service to the research community by distributing: (a) for a substantial number of language pairs, the material for training and evaluating statistical translation systems, together with automatically generated translations that can serve as references; (b) the original files from the TED site, along with processing tools that allow anyone to independently prepare the experimental setup for any language pair.

    The IWSLT 2015 Evaluation Campaign

    Get PDF
    The IWSLT 2015 Evaluation Campaign featured three tracks: automatic speech recognition (ASR), spoken language translation (SLT), and machine translation (MT). For ASR we offered two tasks, on English and German, while for SLT and MT a number of tasks were proposed, involving English, German, French, Chinese, Czech, Thai, and Vietnamese. All tracks involved the transcription or translation of TED talks, either made available by the official TED website or by other TEDx events. A notable change with respect to previous evaluations was the use of unsegmented speech in the SLT track, in order to better fit a real application scenario. Thus, on one side, participants were encouraged to develop advanced methods for sentence segmentation; on the other side, organisers had to cope with the automatic evaluation of SLT outputs not matching the sentence-wise arrangement of the human references. A new evaluation server was also developed to allow participants to score their MT and SLT systems on selected dev and test sets. This year 16 teams participated in the evaluation, for a total of 63 primary submissions. All runs were evaluated with objective metrics, and submissions for two of the MT translation tracks were also evaluated with human post-editing.
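    The evaluation problem raised by unsegmented SLT output is usually solved by re-segmenting the hypothesis against the references before scoring, as done by tools such as mwerSegmenter; the brute-force dynamic program below illustrates the idea and is not the campaign's actual tool:

    def edit_distance(a, b):
        """Word-level Levenshtein distance between two token lists."""
        prev = list(range(len(b) + 1))
        for i, x in enumerate(a, 1):
            cur = [i]
            for j, y in enumerate(b, 1):
                cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
            prev = cur
        return prev[-1]

    def resegment(hyp_words, ref_segments):
        """Split a hypothesis word stream into len(ref_segments) pieces
        minimising the total edit distance to the reference segments."""
        n, k = len(hyp_words), len(ref_segments)
        INF = float("inf")
        dp = [[INF] * (k + 1) for _ in range(n + 1)]
        back = [[0] * (k + 1) for _ in range(n + 1)]
        dp[0][0] = 0
        for j in range(1, k + 1):
            ref = ref_segments[j - 1].split()
            for i in range(n + 1):
                for p in range(i + 1):
                    if dp[p][j - 1] == INF:
                        continue
                    cost = dp[p][j - 1] + edit_distance(hyp_words[p:i], ref)
                    if cost < dp[i][j]:
                        dp[i][j], back[i][j] = cost, p
        cuts, i = [], n
        for j in range(k, 0, -1):
            cuts.append((back[i][j], i))
            i = back[i][j]
        return [" ".join(hyp_words[a:b]) for a, b in reversed(cuts)]

    The resegmented pieces line up one-to-one with the references, so standard sentence-wise metrics can then be applied.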