Improving Context-aware Neural Machine Translation with Target-side Context
In recent years, several studies on neural machine translation (NMT) have
attempted to use document-level context by using a multi-encoder and two
attention mechanisms to read the current and previous sentences to incorporate
the context of the previous sentences. These studies concluded that the
target-side context is less useful than the source-side context. However, we
consider that the reason the target-side context appears less useful lies in
the architecture used to model these contexts.
Therefore, in this study, we investigate how the target-side context can
improve context-aware neural machine translation. We propose a weight sharing
method wherein NMT saves decoder states and calculates an attention vector
using the saved states when translating a current sentence. Our experiments
show that the target-side context is also useful if we plug it into NMT as the
decoder state when translating a previous sentence. Comment: 12 pages; PACLING 201
ScrumSourcing: Challenges of Collaborative Post-editing for Rugby World Cup 2019
This paper describes challenges facing the ScrumSourcing project to create a neural machine translation (NMT) service aiding interaction between Japanese- and English-speaking fans during Rugby World Cup 2019 in Japan. This is an example of «domain adaptation». The best training data for adapting NMT is large volumes of translated sentences typical of the domain. In reality, however, such parallel data for rugby does not exist. The problem is compounded by a marked asymmetry between the two languages in conventions for post-match reports, and by the almost total absence of in-match commentaries in Japanese. In post-editing the NMT output to incrementally improve quality via retraining, volunteer rugby fans will play a crucial role in determining a new genre in Japanese. To avoid de-motivating the volunteers at the outset, we undertake an initial adaptation of the system using terminological data. This paper describes the compilation of this data and its effects on the quality of the systems’ output.
An Efficient Method for Generating Synthetic Data for Low-Resource Machine Translation – An empirical study of Chinese, Japanese to Vietnamese Neural Machine Translation
Data sparsity is one of the challenges for low-resource language pairs in Neural Machine Translation (NMT). Previous works have presented different approaches to data augmentation, but they mostly require additional resources and yield low-quality synthetic data in low-resource settings. This paper proposes a simple, effective, and novel method for generating synthetic bilingual data without using external resources as in previous approaches. Moreover, recent work has shown that multilingual translation or transfer learning can boost translation quality in low-resource situations. However, for logographic languages such as Chinese or Japanese, this approach is still limited due to the differences in translation units in the vocabularies. Although Japanese texts contain Kanji characters that are derived from Chinese characters, and these are quite similar in shape and meaning, the word orders of sentences in the two languages diverge considerably. Our study investigates these impacts in machine translation. In addition, a combined pre-trained model is leveraged to demonstrate efficacy on translation tasks in a higher-resource scenario. Our experiments show improvements of up to +6.2 and +7.8 BLEU over bilingual baseline systems on two low-resource translation tasks, Chinese to Vietnamese and Japanese to Vietnamese.
Cultural Adaptation of Recipes
Building upon the considerable advances in Large Language Models (LLMs), we
are now equipped to address more sophisticated tasks demanding a nuanced
understanding of cross-cultural contexts. A key example is recipe adaptation,
which goes beyond simple translation to include a grasp of ingredients,
culinary techniques, and dietary preferences specific to a given culture. We
introduce a new task involving the translation and cultural adaptation of
recipes between Chinese and English-speaking cuisines. To support this
investigation, we present CulturalRecipes, a unique dataset comprised of
automatically paired recipes written in Mandarin Chinese and English. This
dataset is further enriched with a human-written and curated test set. In this
intricate task of cross-cultural recipe adaptation, we evaluate the performance
of various methods, including GPT-4 and other LLMs, traditional machine
translation, and information retrieval techniques. Our comprehensive analysis
includes both automatic and human evaluation metrics. While GPT-4 exhibits
impressive abilities in adapting Chinese recipes into English, it still lags
behind human expertise when translating English recipes into Chinese. This
underscores the multifaceted nature of cultural adaptations. We anticipate that
these insights will significantly contribute to future research on
culturally-aware language models and their practical application in culturally
diverse contexts. Comment: Accepted to TAC
Recipe instruction semantics corpus (RISeC) : resolving semantic structure and zero anaphora in recipes
We propose a newly annotated dataset for information extraction on recipes. Unlike previous approaches to machine comprehension of procedural texts, we avoid pre-defining domain-specific predicates to recognize (e.g., the primitive instructions in MILK) and focus on basic understanding of the expressed semantics rather than directly reducing them to a simplified state representation (e.g., ProPara). We thus frame the semantic comprehension of procedural text, such as recipes, as fairly generic NLP subtasks, covering (i) entity recognition (ingredients, tools and actions), (ii) relation extraction (what ingredients and tools are involved in the actions), and (iii) zero anaphora resolution (linking actions to implicit arguments, e.g., results from previous recipe steps). Further, our Recipe Instruction Semantic Corpus (RISeC) dataset includes textual descriptions for the zero anaphora, to facilitate language generation thereof. Besides the dataset itself, we contribute a pipeline neural architecture that addresses entity and relation extraction as well as identification of zero anaphora. These basic building blocks can facilitate more advanced downstream applications (e.g., question answering, conversational agents).
An experimental framework for designing document structure for users' decision making -- An empirical study of recipes
Textual documents need to be of good quality to ensure effective asynchronous
communication in remote areas, especially during the COVID-19 pandemic.
However, defining a preferred document structure (content and arrangement) for
improving lay readers' decision-making is challenging. First, the types of
useful content for various readers cannot be determined simply by gathering
expert knowledge. Second, methodologies to evaluate the document's usefulness
from the user's perspective have not been established. This study proposed an
experimental framework to identify useful contents of documents by aggregating
lay readers' insights. This study used 200 online recipes as research subjects
and recruited 1,340 amateur cooks as lay readers. The proposed framework
identified six useful contents of recipes. Multi-level modeling then showed
that among the six identified contents, suitable ingredients or notes arranged
with a subheading at the end of each cooking step significantly increased
recipes' usefulness. Our framework contributes to communication design via
documents.