807 research outputs found
Added Toxicity Mitigation at Inference Time for Multimodal and Massively Multilingual Translation
Added toxicity in the context of translation means producing a translation
output that contains more toxicity than the input. In this paper, we present
MinTox, a novel pipeline that identifies added toxicity and mitigates it at
inference time. MinTox uses a toxicity detection classifier that is multimodal
(speech and text) and works across languages at scale. The mitigation method is
likewise applied to languages at scale, directly on text outputs. MinTox is
applied to SEAMLESSM4T, which is the latest multimodal and massively
multilingual machine translation system. For this system, MinTox achieves
significant added toxicity mitigation across domains, modalities and language
directions. MinTox filters out approximately 25% to 95% of added toxicity
(depending on the modality and domain) while keeping translation quality.
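The definition of added toxicity above (more toxicity in the output than in the input) can be illustrated with a minimal sketch. It assumes simple word-list matching as a stand-in for a toxicity detector; this is not the MinTox classifier, and the function names and lexicon arguments are hypothetical.

```python
import re

def count_toxic(text, lexicon):
    """Count tokens of `text` that appear in `lexicon` (a set of lowercase terms)."""
    tokens = re.findall(r"\w+", text.lower())
    return sum(1 for tok in tokens if tok in lexicon)

def has_added_toxicity(source, translation, src_lexicon, tgt_lexicon):
    """Return True when the translation contains more flagged terms than the source.

    Assumption: one lexicon per language, supplied by the caller.
    """
    return count_toxic(translation, tgt_lexicon) > count_toxic(source, src_lexicon)
```

A mitigation step would only be triggered for outputs where `has_added_toxicity` returns True, so translations that faithfully carry over toxicity already present in the input are left alone.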
Automatically generation and evaluation of Stop words list for Chinese Patents
As an important preprocessing step in information retrieval and information processing, the accuracy of stop word elimination directly influences the final results of retrieval and mining. In information retrieval, eliminating stop words compresses the storage space of the index; in text mining, it greatly reduces the dimension of the vector space, saves storage space and speeds up computation. However, Chinese patents are legal documents containing technical information, and the general Chinese stop word list is not applicable to them. This paper proposes two methodologies for Chinese patents: one based on word frequency and the other on statistics. Through experiments on real patent data, the accuracy of the two methodologies is compared on several corpora of different scales, and also against a general stop list. The experimental results indicate that both methodologies can extract stop words suitable for Chinese patents, and that the accuracy of the statistics-based methodology is slightly higher than that of the frequency-based one.
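A frequency-based stop word list can be sketched as follows. The abstract does not give the exact criterion, so this is only one plausible reading: tokens that occur in a large share of documents are ranked by total frequency. The function name, the thresholds, and the assumption that documents are already word-segmented are all illustrative.

```python
from collections import Counter

def frequency_stop_words(docs, top_k=100, min_doc_ratio=0.8):
    """Pick candidate stop words from tokenized documents.

    docs: list of token lists (Chinese text must be segmented beforehand).
    A token qualifies if it appears in at least `min_doc_ratio` of the
    documents; qualifying tokens are ranked by total term frequency.
    """
    term_freq = Counter()
    doc_freq = Counter()
    for tokens in docs:
        term_freq.update(tokens)      # total occurrences
        doc_freq.update(set(tokens))  # number of documents containing the token
    n_docs = len(docs)
    candidates = [t for t in term_freq if doc_freq[t] / n_docs >= min_doc_ratio]
    # Most frequent first; ties broken alphabetically for a stable order.
    candidates.sort(key=lambda t: (-term_freq[t], t))
    return candidates[:top_k]
```

Running the same function on a general corpus and on a patent corpus would show which high-frequency patent terms a general stop list misses.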
Toxicity in Multilingual Machine Translation at Scale
Machine Translation systems can produce different types of errors, some of
which are characterized as critical or catastrophic due to the specific
negative impact that they can have on users. In this paper we focus on one type
of critical error: added toxicity. We evaluate and analyze added toxicity when
translating a large evaluation dataset (HOLISTICBIAS, over 472k sentences,
covering 13 demographic axes) from English into 164 languages. An automatic
toxicity evaluation shows that added toxicity across languages varies from 0%
to 5%. The output languages with the most added toxicity tend to be
low-resource ones, and the demographic axes with the most added toxicity
include sexual orientation, gender and sex, and ability. We also perform human
evaluation on a subset of 8 translation directions, confirming the prevalence
of true added toxicity. We use a measurement of the amount of source
contribution to the translation, where a low source contribution implies
hallucination, to interpret what causes toxicity. Making use of the input
attributions allows us to explain toxicity, because the source contributions
significantly correlate with toxicity for 84% of languages studied. Given our
findings, our recommendations to reduce added toxicity are to curate training
data to avoid mistranslations, mitigate hallucination and check unstable
translations.
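The correlation between source contribution and toxicity mentioned above can be checked with a plain Pearson coefficient. The sketch below assumes one source-contribution score and one 0/1 toxicity label per translation; the function name and the input layout are illustrative, and the abstract does not state which correlation measure was used.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)
```

A negative coefficient would be consistent with the reading that low source contribution (hallucination) goes together with added toxicity.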
Multilingual Machine Translation with Large Language Models: Empirical Results and Analysis
Large language models (LLMs) have demonstrated remarkable potential in
handling multilingual machine translation (MMT). In this paper, we
systematically investigate the advantages and challenges of LLMs for MMT by
answering two questions: 1) How well do LLMs perform in translating a massive
number of languages? 2) Which factors affect LLMs' performance in translation?
We evaluate popular LLMs, including XGLM, OPT, BLOOMZ, and ChatGPT, on 102
languages. Our empirical results show that even the best model ChatGPT still
lags behind the supervised baseline NLLB in 83.33% of translation directions.
Through further analysis, we discover that LLMs exhibit new working patterns
when used for MMT. First, prompt semantics can surprisingly be ignored when
given in-context exemplars, where LLMs still show strong performance even with
unreasonable prompts. Second, cross-lingual exemplars can provide better task
instruction for low-resource translation than exemplars in the same language
pairs. Third, we observe that the performance of BLOOMZ on the Flores-101
dataset is overestimated, indicating the potential risk of using public
datasets for evaluation.
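Translation with in-context exemplars amounts to placing a few source/target pairs before the sentence to translate. The sketch below is illustrative only: the `"{src} = {tgt}"` template and the function name are assumptions, not the prompt format used in the paper.

```python
def build_translation_prompt(exemplars, source_sentence, template="{src} = {tgt}"):
    """Build a few-shot translation prompt.

    exemplars: list of (source, target) pairs shown to the model as examples.
    The final line repeats the template with an empty target for the model
    to complete.
    """
    lines = [template.format(src=src, tgt=tgt) for src, tgt in exemplars]
    lines.append(template.format(src=source_sentence, tgt="").rstrip())
    return "\n".join(lines)
```

Cross-lingual exemplars, as described in the abstract, would simply mean passing pairs from a different language pair than the one being translated.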
Chinese eco-films and their pastoral myth
This dissertation is a cross-cultural study of Chinese eco-cinema after 1978. It begins by introducing the Hollywood practice of simplifying the conflicts between garden and machine, anthropocentrism and ecocentrism, and Tityrus and Meliboeus in American pastoralism, then explains why such simplification does not work on the Chinese screen, and finally studies how Chinese filmmakers reconstruct their pastoral myth in three major steps: first, recognizing Chinese social and cultural realities; second, establishing the human-nature connection; and third, affirming the nature-culture unity. The conclusion is that Chinese eco-cinema exists in a hybrid form. While Hollywood influences Chinese eco-cinema in terms of production, promotion, and distribution, Chinese eco-cinema manages to develop its own voice by reconstructing a pastoral myth that Chinese audiences can understand and appreciate. It differs from the Hollywood version by creating tragic, everyday heroes who may seem powerless to protect or retrieve the pastoral garden, yet maintain a strong life force and refuse to give up their pastoral faith, which is rooted in both the human-nature connection and the nature-culture unity.
Example based English to Bengali machine translation
This thesis report is submitted in partial fulfillment of the requirements for the degree of Bachelor of Science in Computer Science and Engineering, 2008. Cataloged from PDF version of thesis report. Includes bibliographical references (page 31).
In this thesis we propose a new architecture for example based English to Bengali machine
translation. The proposed Example Based Machine Translation (EBMT) system has five
steps: 1) tagging; 2) parsing; 3) preparing the chunks of the sentence using sub-sentential
EBMT; 4) matching the sentence rule using an efficient adapting scheme; 5) translating from
English to Bengali in the chunk and generating output with morphological analysis. We
prepared our tag set for tagging the English sentence. Here we propose an optimal
adapting scheme for choosing the sentence rule from the knowledge base of the EBMT
system. Our current system can translate simple sentences. We also defined a way to
translate a complex sentence using sub-sentential EBMT. As more rules can be added to
the knowledge base, the system can eventually be used for general purpose English to
Bengali machine translation.
Khan Md. Anwarus Salam
B. Computer Science and Engineering
- …