Search CORE

807 research outputs found

Added Toxicity Mitigation at Inference Time for Multimodal and Massively Multilingual Translation

Author: Costa-jussà Marta R., Dale David, Elbayad Maha, Yu Bokai
Publication date: 11/11/2023

Added toxicity in the context of translation refers to producing a translation output that contains more toxicity than the input. In this paper, we present MinTox, a novel pipeline that identifies added toxicity and mitigates it at inference time. MinTox uses a toxicity detection classifier that is multimodal (speech and text) and works across languages at scale. The mitigation method is likewise applied across languages at scale and operates directly on text outputs. MinTox is applied to SEAMLESSM4T, which is the latest multimodal and massively multilingual machine translation system. For this system, MinTox achieves significant added toxicity mitigation across domains, modalities and language directions. MinTox filters out approximately 25% to 95% of added toxicity (depending on the modality and domain) while keeping translation quality.
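The detect-then-mitigate idea described in this abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the MinTox implementation: the word list `TOXIC_TERMS`, the helper names, and `toy_translate` are hypothetical, a simple token lookup stands in for the multimodal classifier, and "added toxicity" is simplified to a yes/no check (output flagged, input not flagged).

```python
from typing import Callable, FrozenSet, Set

# Hypothetical toy word list standing in for a real toxicity detector.
TOXIC_TERMS: Set[str] = {"idiot", "stupid"}


def toxic_terms(text: str, lexicon: Set[str]) -> Set[str]:
    """Return the lexicon entries that occur as tokens in `text`."""
    tokens = {tok.strip(".,!?;:").lower() for tok in text.split()}
    return tokens & lexicon


def has_added_toxicity(source: str, translation: str, lexicon: Set[str]) -> bool:
    """Simplified check: the output is flagged but the input is not."""
    return bool(toxic_terms(translation, lexicon)) and not toxic_terms(source, lexicon)


def mitigate(
    source: str,
    translate: Callable[[str, FrozenSet[str]], str],
    lexicon: Set[str],
) -> str:
    """Translate once; if toxicity was added, translate again with the
    offending tokens banned from the output."""
    first_pass = translate(source, frozenset())
    if not has_added_toxicity(source, first_pass, lexicon):
        return first_pass
    banned = frozenset(toxic_terms(first_pass, lexicon))
    return translate(source, banned)


def toy_translate(source: str, banned: FrozenSet[str]) -> str:
    """Hypothetical stand-in translator: first candidate with no banned token."""
    candidates = ["you are an idiot", "you are mistaken"]
    for candidate in candidates:
        if not set(candidate.split()) & banned:
            return candidate
    return candidates[-1]


print(mitigate("you are wrong", toy_translate, TOXIC_TERMS))  # prints: you are mistaken
```

In this sketch the second pass only runs when the first output is flagged and the input is not, so translations that are already clean are returned unchanged.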

arXiv.org e-Print Archive

Automatically generation and evaluation of Stop words list for Chinese Patents

Author: Na Deng, Xu Chen
Publication venue: 'Universitas Ahmad Dahlan'
Publication date: 01/12/2015

The accuracy of stop word elimination, an important preprocessing step in information retrieval and information processing, directly influences the final results of retrieval and mining. In information retrieval, stop word elimination reduces the storage space of the index; in text mining, it greatly reduces the dimension of the vector space, saves storage space and speeds up computation. However, Chinese patents are legal documents containing technical information, and the general Chinese stop word list is not applicable to them. This paper proposes two methodologies for Chinese patents: one based on word frequency and the other on statistics. Through experiments on real patent data, the accuracy of the two methodologies is compared on several corpora of different scales, and also against the general stop word list. The experimental results indicate that both methodologies can extract stop words suitable for Chinese patents, and that the accuracy of the statistics-based methodology is slightly higher than that of the frequency-based one.
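A frequency-based approach of the kind mentioned in this abstract might look like the following minimal sketch. It is an illustration under stated assumptions, not the paper's method: documents are assumed to be already segmented into tokens, document frequency is used as the frequency signal, the `min_doc_ratio` threshold is a hypothetical parameter, and `accuracy` is taken to mean the share of extracted candidates that appear in a reference list.

```python
from collections import Counter
from typing import List, Set


def stop_word_candidates(docs: List[List[str]], min_doc_ratio: float = 0.8) -> Set[str]:
    """Return tokens that occur in at least `min_doc_ratio` of the documents."""
    doc_freq: Counter = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each token once per document
    total = len(docs)
    return {word for word, df in doc_freq.items() if df / total >= min_doc_ratio}


def accuracy(candidates: Set[str], reference: Set[str]) -> float:
    """Share of extracted candidates that also appear in the reference list."""
    if not candidates:
        return 0.0
    return len(candidates & reference) / len(candidates)


# Toy pre-segmented patent snippets (hypothetical data).
docs = [
    ["所述", "装置", "包括"],
    ["所述", "方法", "包括"],
    ["所述", "电路"],
]
candidates = stop_word_candidates(docs, min_doc_ratio=0.6)
print(sorted(candidates))
print(accuracy(candidates, reference={"所述"}))
```

Lowering `min_doc_ratio` admits more candidates at the cost of more false positives, which is the trade-off a comparison against a reference list would measure.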

Journal of Education and Learning (EduLearn)

TELKOMNIKA (Telecommunication Computing Electronics and Control)

UAD Journal Management System

Toxicity in Multilingual Machine Translation at Scale

Author: Costa-jussà Marta R., Escolano Carlos, Ferrando Javier, Licht Daniel, Maillard Jean, Ropers Christophe, Smith Eric
Publication date: 05/04/2023

Machine Translation systems can produce different types of errors, some of which are characterized as critical or catastrophic due to the specific negative impact that they can have on users. In this paper we focus on one type of critical error: added toxicity. We evaluate and analyze added toxicity when translating a large evaluation dataset (HOLISTICBIAS, over 472k sentences, covering 13 demographic axes) from English into 164 languages. An automatic toxicity evaluation shows that added toxicity across languages varies from 0% to 5%. The output languages with the most added toxicity tend to be low-resource ones, and the demographic axes with the most added toxicity include sexual orientation, gender and sex, and ability. We also perform human evaluation on a subset of 8 translation directions, confirming the prevalence of true added toxicity. We use a measurement of the amount of source contribution to the translation, where a low source contribution implies hallucination, to interpret what causes toxicity. Making use of the input attributions allows us to explain toxicity, because the source contributions significantly correlate with toxicity for 84% of languages studied. Given our findings, our recommendations to reduce added toxicity are to curate training data to avoid mistranslations, mitigate hallucination and check unstable translations

arXiv.org e-Print Archive

Multilingual Machine Translation with Large Language Models: Empirical Results and Analysis

Author: Chen Jiajun, Dong Qingxiu, Huang Shujian, Kong Lingpeng, Li Lei, Liu Hongyi, Xu Jingjing, Zhu Wenhao
Publication date: 10/04/2023

Large language models (LLMs) have demonstrated remarkable potential in handling multilingual machine translation (MMT). In this paper, we systematically investigate the advantages and challenges of LLMs for MMT by answering two questions: 1) How well do LLMs perform in translating a massive number of languages? 2) Which factors affect LLMs' performance in translation? We evaluate popular LLMs, including XGLM, OPT, BLOOMZ, and ChatGPT, on 102 languages. Our empirical results show that even the best model ChatGPT still lags behind the supervised baseline NLLB in 83.33% of translation directions. Through further analysis, we discover that LLMs exhibit new working patterns when used for MMT. First, prompt semantics can surprisingly be ignored when given in-context exemplars, where LLMs still show strong performance even with unreasonable prompts. Second, cross-lingual exemplars can provide better task instruction for low-resource translation than exemplars in the same language pairs. Third, we observe the overestimated performance of BLOOMZ on dataset Flores-101, indicating the potential risk when using public datasets for evaluation

arXiv.org e-Print Archive

Chinese eco-films and their pastoral myth

Author: Zhai Runlei
Publication venue: 'Purdue University (bepress)'
Publication date: 01/01/2015

This dissertation is a cross-cultural study of Chinese ecocinema after 1978. It begins by introducing the Hollywood practice of simplifying the conflicts between garden and machine, anthropocentrism and ecocentrism, and Tityrus and Meliboeus in American pastoralism, then explains why such simplification does not work on the Chinese screen, and finally studies how Chinese filmmakers reconstruct their pastoral myth in three major steps: first, to recognize Chinese social and cultural realities; second, to establish the human-nature connection; and third, to affirm the nature-culture unity. The conclusion is that Chinese ecocinema exists in a hybrid form. While Hollywood influences Chinese ecocinema in terms of production, promotion, and distribution, Chinese ecocinema manages to develop its own voice by reconstructing a pastoral myth that Chinese audiences can understand and appreciate. It differs from the Hollywood version by creating tragic, everyday heroes who may seem powerless to protect or retrieve the pastoral garden, and yet maintain a strong life force and refuse to give up their pastoral faith, which is rooted in both the human-nature connection and the nature-culture unity.

Purdue E-Pubs

Laments and Relational Personhood: Case studies from Duna and Awiakay societies of Papua New Guinea

Author: Gillespie Kirsty, Hoenigman Darja
Publication venue: 'ANU Press'
Publication date: 15/11/2020

The Australian National University

Example based English to Bengali machine translation

Author: Salam Khan Md. Anwarus
Publication venue: BRAC University

This thesis report is submitted in partial fulfillment of the requirements for the degree of Bachelor of Science in Computer Science and Engineering, 2008. Cataloged from PDF version of thesis report. Includes bibliographical references (page 31). In this thesis we propose a new architecture for example-based English to Bengali machine translation. The proposed Example Based Machine Translation (EBMT) system has five steps: 1) tagging; 2) parsing; 3) preparing the chunks of the sentence using sub-sentential EBMT; 4) matching the sentence rule using an efficient adapting scheme; 5) translating from English to Bengali in the chunk and generating output with morphological analysis. We prepared our own tag set for tagging the English sentence. We propose an optimal adapting scheme for choosing the sentence rule from the knowledge base of the EBMT system. Our current system can translate simple sentences. We also define a way to translate a complex sentence using sub-sentential EBMT. Because more rules can be added to the knowledge base, the system can eventually be used for general-purpose English to Bengali machine translation. Khan Md. Anwarus Salam. B. Computer Science and Engineering

BRAC University Institutional Repository