
    Neural Machine Translation into Language Varieties

    Both research and commercial machine translation have so far neglected the importance of properly handling the spelling, lexical and grammar divergences occurring among language varieties. Notable cases are standard national varieties such as Brazilian and European Portuguese, and Canadian and European French, which popular online machine translation services are not keeping distinct. We show that an evident side effect of modeling such varieties as unique classes is the generation of inconsistent translations. In this work, we investigate the problem of training neural machine translation from English to specific pairs of language varieties, assuming both labeled and unlabeled parallel texts, and low-resource conditions. We report experiments from English to two pairs of dialects, European-Brazilian Portuguese and European-Canadian French, and two pairs of standardized varieties, Croatian-Serbian and Indonesian-Malay. We show significant BLEU score improvements over baseline systems when translation into similar languages is learned as a multilingual task with shared representations. Comment: Published at EMNLP 2018: Third Conference on Machine Translation (WMT 2018).

    Analysis of errors in the automatic translation of questions for translingual QA systems

    Purpose – This study aims to focus on the evaluation of systems for the automatic translation of questions destined to translingual question-answer (QA) systems. The efficacy of online translators when performing as tools in QA systems is analysed using a collection of documents in the Spanish language. Design/methodology/approach – Automatic translation is evaluated in terms of the functionality of actual translations produced by three online translators (Google Translator, Promt Translator, and Worldlingo) by means of objective and subjective evaluation measures, and the typology of errors produced was identified. For this purpose, a comparative study of the quality of the translation of factual questions of the CLEF collection of queries was carried out, from German and French to Spanish. Findings – It was observed that the rates of error for the three systems evaluated here are greater in the translations pertaining to the language pair German-Spanish. Promt was identified as the most reliable translator of the three (on average) for the two linguistic combinations evaluated. However, for the Spanish-German pair, a good assessment of the Google online translator was obtained as well. Most errors (46.38 percent) tended to be of a lexical nature, followed by those due to a poor translation of the interrogative particle of the query (31.16 percent). Originality/value – The evaluation methodology applied focuses above all on the finality of the translation. That is, does the resulting question serve as effective input into a translingual QA system? Thus, instead of searching for “perfection”, the functionality of the question and its capacity to lead one to an adequate response are appraised. The results obtained contribute to the development of improved translingual QA systems.

    Stylization and representation in subtitles: can less be more?

    This article considers film dialogues and interlingual subtitles from the point of view of linguistic and cultural representation, and revisits from that perspective the question of loss, as a platform for considering alternative views on the topic and broader theoretical issues. The cross-cultural pragmatics perspective and focus on viewers’ reactions that dealing with representation entails cast the question of loss in a different light and open up avenues for alternative modes of analysis. They make room for subtitles to be construed as producing their own systems of multimodal textual representation and modes of interpretation, and for their text to be recognised as having a greater expressive and representational potential than face values might suggest. This is the argument, informed by Fowler's Theory of Mode (1991, 2000), that is taken up in the paper, and harnessed to the review of examples or observations from recent studies on subtitles, and complementary evidence from dubbing. The capacity of subtitles to produce insights into the cultures and languages represented is of particular interest, and has wider implications for the culturally instrumental functions of subtitles and translation strategies.

    Psychotherapy across languages: beliefs, attitudes and practices of monolingual and multilingual therapists with their multilingual patients

    The present study investigates beliefs, attitudes and practices of 101 monolingual and multilingual therapists in their interactions with multilingual patients. We adopted a mixed-method approach, using an on-line questionnaire with 27 closed questions which were analysed quantitatively and informed questions in interviews with one monolingual and two multilingual therapists. A principal component analysis yielded a four-factor solution accounting for 41% of the variance. The first dimension, which explained 17% of variance, reflects therapists’ attunement towards their bilingual patients (i.e., attunement versus collusion). Further analysis showed that the 18 monolingual therapists differed significantly from their 83 bi- or multilingual peers on this dimension. The follow-up interviews confirmed this result. Recommendations based on these findings are made for psychotherapy training and supervision to attend to a range of issues including: the psychological and therapeutic functions of multi/bilingualism; practice in making formulations in different languages; the creative therapeutic potential of the language gap.

    Representativeness as a Forgotten Lesson for Multilingual and Code-switched Data Collection and Preparation

    Multilingualism is widespread around the world and code-switching (CSW) is a common practice among different language pairs/tuples across locations and regions. However, there is still not much progress in building successful CSW systems, despite the recent advances in Massive Multilingual Language Models (MMLMs). We investigate the reasons behind this setback through a critical study about the existing CSW data sets (68) across language pairs in terms of the collection and preparation (e.g. transcription and annotation) stages. This in-depth analysis reveals that a) most CSW data involves English, ignoring other language pairs/tuples, and b) there are flaws in terms of representativeness in data collection and preparation stages due to ignoring the location-based, socio-demographic and register variation in CSW. In addition, lack of clarity on the data selection and filtering stages shadows the representativeness of CSW data sets. We conclude by providing a short check-list to improve the representativeness for forthcoming studies involving CSW data collection and preparation. Comment: Accepted for EMNLP'23 Findings (to appear on EMNLP'23 Proceedings).

    Super-diversity discourse


    In no uncertain terms : a dataset for monolingual and multilingual automatic term extraction from comparable corpora

    Automatic term extraction is a productive field of research within natural language processing, but it still faces significant obstacles regarding datasets and evaluation, which require manual term annotation. This is an arduous task, made even more difficult by the lack of a clear distinction between terms and general language, which results in low inter-annotator agreement. There is a large need for well-documented, manually validated datasets, especially in the rising field of multilingual term extraction from comparable corpora, which presents a unique new set of challenges. In this paper, a new approach is presented for both monolingual and multilingual term annotation in comparable corpora. The detailed guidelines with different term labels, the domain- and language-independent methodology and the large volumes annotated in three different languages and four different domains make this a rich resource. The resulting datasets are not just suited for evaluation purposes but can also serve as a general source of information about terms and even as training data for supervised methods. Moreover, the gold standard for multilingual term extraction from comparable corpora contains information about term variants and translation equivalents, which allows an in-depth, nuanced evaluation.