25 research outputs found
Data-Driven Approach for Formality-Sensitive Machine Translation: Language-Specific Handling and Synthetic Data Generation
In this paper, we introduce a data-driven approach for Formality-Sensitive
Machine Translation (FSMT) that caters to the unique linguistic properties of
four target languages. Our methodology centers on two core strategies: 1)
language-specific data handling, and 2) synthetic data generation using
large-scale language models and empirical prompt engineering. This approach
demonstrates a considerable improvement over the baseline, highlighting the
effectiveness of data-centric techniques. Our prompt engineering strategy
further improves performance by producing superior synthetic translation
examples.
Comment: Accepted for Data-centric Machine Learning Research (DMLR) Workshop
at ICML 202
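The prompt-engineering strategy described above could be sketched as a template that conditions a large language model on the desired formality level. The template wording and the `build_prompt` helper below are illustrative assumptions, not the authors' actual prompts.

```python
# Illustrative sketch of formality-conditioned prompt construction for
# synthetic FSMT data generation. The template text is hypothetical.

FORMALITY_LEVELS = ("formal", "informal")

def build_prompt(source: str, target_lang: str, formality: str) -> str:
    """Compose a prompt asking an LLM for a formality-controlled translation."""
    if formality not in FORMALITY_LEVELS:
        raise ValueError(f"unknown formality level: {formality}")
    return (
        f"Translate the following English sentence into {target_lang} "
        f"using a {formality} register. Preserve the meaning exactly.\n"
        f"English: {source}\n"
        f"{target_lang}:"
    )

# One prompt per (sentence, formality) pair yields paired synthetic examples.
prompts = [
    build_prompt("Could you send me the report?", "Korean", f)
    for f in FORMALITY_LEVELS
]
```

Generating both registers for the same source sentence gives contrastive pairs, which is one plausible way to obtain the "superior synthetic translation examples" the abstract mentions.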
Alternative Speech: Complementary Method to Counter-Narrative for Better Discourse
We introduce the concept of "Alternative Speech" as a new way to directly
combat hate speech and complement the limitations of counter-narrative. An
alternative speech provides practical alternatives to hate speech in real-world
scenarios by offering speech-level corrections to speakers while considering
the surrounding context and encouraging speakers to reform. Further, an
alternative speech can combat hate speech alongside counter-narratives,
offering a useful tool to address social issues such as racial discrimination
and gender inequality. We propose the new concept and provide detailed
guidelines for constructing the necessary dataset. Through discussion, we
demonstrate that combining alternative speech and counter-narrative can be a
more effective strategy for combating hate speech by complementing the
specificity and guiding capacity of counter-narratives. This paper presents another
perspective for dealing with hate speech, offering viable remedies to
complement the constraints of current approaches to mitigating harmful bias.
Comment: Accepted for The First Workshop on Data-Centric AI (DCAI) at ICDM
202
A Self-Supervised Automatic Post-Editing Data Generation Tool
Data building for automatic post-editing (APE) requires extensive and
expert-level human effort, as it involves an elaborate process of identifying
errors in sentences and providing suitable revisions. Hence, we
develop a self-supervised data generation tool, deployable as a web
application, that minimizes human supervision and constructs personalized APE
data from a parallel corpus for several language pairs with English as the
target language. Data-centric APE research can be conducted using this tool,
involving many language pairs that have not been studied thus far owing to the
lack of suitable data.
Comment: Accepted for DataPerf workshop at ICML 202
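The core of the self-supervised construction might be sketched as follows: the MT hypothesis plays the role of the erroneous draft and the human reference serves as the post-edited target, so no manual annotation is needed. The `machine_translate` stub and the example sentences are hypothetical placeholders, not the tool's actual components.

```python
# Sketch of self-supervised APE triplet construction from a parallel corpus.
# `machine_translate` is a hypothetical stand-in for a real MT system.

def machine_translate(source: str) -> str:
    # Placeholder: a real system would return an imperfect translation here.
    canned = {"Guten Tag": "good day", "Danke": "Thanks"}
    return canned.get(source, source)

def build_ape_triplets(parallel_corpus):
    """Turn (source, reference) pairs into (src, mt, pe) APE triplets."""
    triplets = []
    for source, reference in parallel_corpus:
        hypothesis = machine_translate(source)
        if hypothesis != reference:  # keep only pairs where APE has work to do
            triplets.append({"src": source, "mt": hypothesis, "pe": reference})
    return triplets

corpus = [("Guten Tag", "Good day"), ("Danke", "Thanks")]
triplets = build_ape_triplets(corpus)
```

Filtering out pairs where the hypothesis already equals the reference keeps only examples that actually exercise the post-editing model.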
Toward Practical Automatic Speech Recognition and Post-Processing: a Call for Explainable Error Benchmark Guideline
Automatic speech recognition (ASR) outcomes serve as input for downstream
tasks, substantially impacting the satisfaction level of end-users. Hence, the
diagnosis and mitigation of vulnerabilities in the ASR model are of
significant importance. However, traditional evaluation methodologies for ASR
systems generate a single, composite quantitative metric, which fails to
provide comprehensive insight into specific vulnerabilities. This lack of
detail extends to the post-processing stage, resulting in further obfuscation
of potential weaknesses. Despite an ASR model's ability to recognize utterances
accurately, subpar readability can negatively affect user satisfaction, giving
rise to a trade-off between recognition accuracy and user-friendliness. To
effectively address this, it is imperative to consider both the speech-level,
crucial for recognition accuracy, and the text-level, critical for
user-friendliness. Consequently, we propose the development of an Error
Explainable Benchmark (EEB) dataset. This dataset, which considers both the
speech and text levels, enables a granular understanding of the model's
shortcomings. Our proposition provides a structured pathway for a more
`real-world-centric' evaluation, a marked shift away from abstracted,
traditional methods, allowing for the detection and rectification of nuanced
system weaknesses, ultimately aiming for an improved user experience.
Comment: Accepted for Data-centric Machine Learning Research (DMLR) Workshop
at ICML 202
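One way to picture the speech-level versus text-level split is as a benchmark entry carrying separate error tags for each level. The record layout and tag inventories below are illustrative assumptions, not the EEB dataset's actual taxonomy.

```python
# Hypothetical sketch of an Error Explainable Benchmark (EEB) entry,
# separating speech-level (recognition) from text-level (readability)
# error tags. The tag sets are illustrative, not the dataset's schema.
from dataclasses import dataclass, field

SPEECH_LEVEL_TAGS = {"substitution", "deletion", "insertion"}
TEXT_LEVEL_TAGS = {"punctuation", "casing", "disfluency"}

@dataclass
class EEBEntry:
    reference: str
    hypothesis: str
    speech_errors: list = field(default_factory=list)  # recognition accuracy
    text_errors: list = field(default_factory=list)    # readability

    def validate(self) -> bool:
        """Check that every tag belongs to a known level."""
        return (set(self.speech_errors) <= SPEECH_LEVEL_TAGS
                and set(self.text_errors) <= TEXT_LEVEL_TAGS)

entry = EEBEntry(
    reference="Let's meet at 3 p.m.",
    hypothesis="lets meet at three pm",
    speech_errors=["substitution"],
    text_errors=["punctuation", "casing"],
)
```

Tagging the two levels independently is what would let an evaluation report, say, perfect recognition but poor readability, i.e. the trade-off the abstract describes.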
QUAK: A Synthetic Quality Estimation Dataset for Korean-English Neural Machine Translation
With the recent advance in neural machine translation demonstrating its
importance, research on quality estimation (QE) has been steadily progressing.
QE aims to automatically predict the quality of machine translation (MT) output
without reference sentences. Despite its high utility in the real world, there
remain several limitations concerning manual QE data creation: the
non-trivial costs inevitably incurred by the need for translation experts, and issues
with data scaling and language expansion. To tackle these limitations, we
present QUAK, a Korean-English synthetic QE dataset generated in a fully
automatic manner. It consists of three sub-QUAK datasets: QUAK-M, QUAK-P, and
QUAK-H, produced through three strategies that are relatively free from
language constraints. Since each strategy requires no human effort, which
facilitates scalability, we scale our data up to 1.58M for QUAK-P and QUAK-H and 6.58M
for QUAK-M. As an experiment, we quantitatively analyze word-level QE results
in various ways while performing statistical analysis. Moreover, we show that
datasets scaled in an efficient way also contribute to performance improvements
by observing meaningful performance gains in QUAK-M and QUAK-P when adding data up to
1.58M.
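One plausible way to derive word-level QE tags fully automatically, in the spirit of the strategies above, is to align the MT output against a (pseudo-)reference and mark unmatched MT tokens as BAD. This is an illustrative simplification, not the exact QUAK generation pipeline.

```python
# Sketch: derive word-level OK/BAD QE tags by diffing MT output against a
# reference. Illustrative only; QUAK's actual strategies may differ.
from difflib import SequenceMatcher

def word_level_tags(mt: str, reference: str):
    """Return an OK/BAD tag for each MT token via alignment to the reference."""
    mt_tokens, ref_tokens = mt.split(), reference.split()
    tags = ["BAD"] * len(mt_tokens)
    matcher = SequenceMatcher(a=mt_tokens, b=ref_tokens, autojunk=False)
    for block in matcher.get_matching_blocks():
        for i in range(block.a, block.a + block.size):
            tags[i] = "OK"
    return list(zip(mt_tokens, tags))

tags = word_level_tags("he go to school", "he goes to school")
```

Because the labels come from automatic alignment rather than annotators, this kind of scheme scales to millions of sentence pairs, which is the property the abstract emphasizes.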
Decoding Strategies for Improving Low-Resource Machine Translation
Pre-processing and post-processing are significant aspects of natural language processing (NLP) application software. Pre-processing in neural machine translation (NMT) includes subword tokenization to alleviate the problem of unknown words, parallel corpus filtering that retains only data suitable for training, and data augmentation to ensure that the corpus contains sufficient content. Post-processing includes automatic post-editing and the application of various strategies during decoding in the translation process. Most recent NLP research is based on the Pretrain-Finetuning Approach (PFA). However, when small and medium-sized organizations with insufficient hardware attempt to provide NLP services, throughput and memory problems often occur. These difficulties increase when utilizing PFA to process low-resource languages, as PFA requires large amounts of data, and the data for low-resource languages are often insufficient. Building on the premise that NMT model performance can be enhanced through various pre-processing and post-processing strategies without changing the model, we applied various decoding strategies to Korean–English NMT, which relies on a low-resource language pair. Through comparative experiments, we demonstrated that translation performance can be enhanced without changes to the model. We experimentally examined how performance changed in response to beam size changes and n-gram blocking, and whether performance was enhanced when a length penalty was applied. The results showed that various decoding strategies enhance performance and compare well with previous Korean–English NMT approaches. Therefore, the proposed methodology can improve the performance of NMT models without the use of PFA; this presents a new perspective on improving machine translation performance.
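Two of the decoding strategies named above can be sketched concretely: a length penalty for rescoring beams, and n-gram blocking, which forbids a candidate token that would repeat an n-gram already present in the hypothesis. The constants below follow the common GNMT-style formulation and may differ from the paper's exact setup.

```python
# Sketch of two decoding-time strategies: GNMT-style length penalty and
# n-gram blocking. Constants (5 and 6) follow the common GNMT formulation.

def length_penalty(length: int, alpha: float = 0.6) -> float:
    """GNMT length penalty: ((5 + |Y|) / 6) ** alpha."""
    return ((5 + length) / 6) ** alpha

def rescore(log_prob: float, length: int, alpha: float = 0.6) -> float:
    """Length-normalized beam score: longer hypotheses are penalized less."""
    return log_prob / length_penalty(length, alpha)

def violates_ngram_block(hypothesis, candidate, n=3):
    """True if appending `candidate` would repeat an n-gram in `hypothesis`."""
    if len(hypothesis) < n - 1:
        return False
    new_ngram = tuple(hypothesis[-(n - 1):]) + (candidate,)
    seen = {tuple(hypothesis[i:i + n]) for i in range(len(hypothesis) - n + 1)}
    return new_ngram in seen

hyp = ["the", "cat", "sat", "on", "the", "cat"]
blocked = violates_ngram_block(hyp, "sat", n=3)  # "the cat sat" already seen
```

At decode time, candidates failing the blocking check are masked out, and completed beams are compared by their length-normalized score rather than raw log-probability.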
Who Speaks Like a Style of Vitamin: Towards Syntax-Aware Dialogue Summarization Using Multi-Task Learning
Abstractive dialogue summarization is a challenging task for several reasons.
First, most of the important pieces of information in a conversation are
scattered across utterances through multi-party interactions with different
textual styles. Second, dialogues often have informal structures wherein
different individuals express personal perspectives, unlike text summarization
tasks, which usually target formal documents such as news articles. To address
these issues, we focused on the association between utterances from individual
speakers and unique syntactic structures. Speakers have unique textual styles
that carry linguistic information, much like a voiceprint. Therefore, we
constructed a syntax-aware model by leveraging linguistic information (i.e.,
POS tagging), which alleviates the above issues by inherently distinguishing
sentences uttered by individual speakers. We employed multi-task learning of
both syntax-aware information and dialogue summarization. To the best of our
knowledge, our approach is the first method to apply multi-task learning to the
dialogue summarization task. Experiments on the SAMSum corpus (a large-scale
dialogue summarization corpus) demonstrated that our method improved upon the
vanilla model. We further analyze the costs and benefits of our approach
relative to baseline models.
Comment: This work has been accepted to the IEEE for possible publication.
Copyright may be transferred without notice, after which this version may no
longer be accessible.
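The multi-task objective described above is, at its core, a weighted combination of the summarization loss and the auxiliary POS-tagging loss. A single mixing coefficient `alpha` is an illustrative assumption; the paper may balance the two tasks differently.

```python
# Sketch of a multi-task objective combining the main dialogue summarization
# loss with an auxiliary POS-tagging loss. The single mixing coefficient
# `alpha` is an illustrative assumption.

def multitask_loss(summ_loss: float, pos_loss: float, alpha: float = 0.8) -> float:
    """Combine the losses: alpha * L_summ + (1 - alpha) * L_pos."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * summ_loss + (1 - alpha) * pos_loss

loss = multitask_loss(2.5, 1.0, alpha=0.8)
```

Keeping `alpha` high preserves summarization as the primary task while the POS signal acts as a regularizer that nudges the encoder toward speaker-distinguishing syntax.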
Static Sound Event Localization and Detection Using Bipartite Matching Loss for Emergency Monitoring
In this paper, we propose a method for estimating the classes and directions of static audio objects using stereo microphones in a drone environment. Drones are being increasingly used across various fields, and the integration of sensors such as cameras and microphones is broadening their scope of application. We therefore suggest a method that attaches stereo microphones to drones to detect specific sound events and estimate their direction for emergency monitoring. Specifically, the proposed neural network is configured to estimate a fixed-size set of audio predictions and employs a bipartite matching loss to compare them with the actual audio objects. To train the proposed network, we built an audio dataset of speech and drone sounds in an outdoor environment. The proposed technique for identifying and localizing sound events based on the bipartite matching loss performs better than the methods of the other teams in our group.
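The bipartite matching step can be sketched as choosing the assignment of fixed-size predictions to ground-truth audio objects that minimizes total cost. Brute force over permutations is used below for clarity; practical implementations use the Hungarian algorithm, and the toy cost matrix is an assumption, not the paper's loss values.

```python
# Sketch of bipartite matching between a fixed-size set of predictions and
# ground-truth audio objects: pick the permutation minimizing total cost.
# Brute force for clarity; real code would use the Hungarian algorithm.
from itertools import permutations

def best_matching(cost):
    """cost[i][j]: cost of assigning prediction i to ground-truth object j.
    Returns (assignment, total_cost) minimizing the sum over a permutation."""
    n = len(cost)
    best_perm, best_cost = None, float("inf")
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best_cost:
            best_perm, best_cost = perm, total
    return best_perm, best_cost

# Toy 3x3 cost matrix (e.g. combined class + direction error per pairing).
cost = [
    [0.1, 0.9, 0.8],
    [0.7, 0.2, 0.9],
    [0.8, 0.7, 0.1],
]
assignment, total = best_matching(cost)
```

The matched pairs then define which prediction is penalized against which ground-truth object, so the loss is invariant to the ordering of the fixed-size prediction set.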
Analysis of the Effectiveness of Model, Data, and User-Centric Approaches for Chat Application: A Case Study of BlenderBot 2.0
BlenderBot 2.0 represents a significant advancement in open-domain chatbots by incorporating real-time information and retaining user information across multiple sessions through an internet search module. Despite its innovations, there are still areas for improvement. This paper examines BlenderBot 2.0’s limitations and errors from three perspectives: model, data, and user interaction. From the data perspective, we highlight the challenges associated with the crowdsourcing process, including unclear guidelines for workers, insufficient measures for filtering hate speech, and the lack of a robust process for verifying the accuracy of internet-sourced information. From the user perspective, we identify nine types of limitations and conduct a thorough investigation into their causes. For each perspective, we propose practical methods for improvement and discuss potential directions for future research. Additionally, we extend our analysis to include perspectives in the era of large language models (LLMs), further broadening our understanding of the challenges and opportunities present in current AI technologies. This multifaceted analysis not only sheds light on BlenderBot 2.0’s current limitations but also charts a path forward for the development of more sophisticated and reliable open-domain chatbots within the broader context of LLM advancements.
AI Student: A Machine Reading Comprehension System for the Korean College Scholastic Ability Test
Machine reading comprehension is a question answering mechanism in which a machine reads, understands, and answers questions from a given text. These reasoning skills can be grafted onto the Korean College Scholastic Ability Test (CSAT) to bring about new scientific and educational advances. In this paper, we propose a novel Korean CSAT Question and Answering (KCQA) model and effectively utilize four easy data augmentation strategies together with round-trip translation to augment the insufficient training data. To evaluate the effectiveness of KCQA, 30 students took the test under the same conditions as the proposed model. Our qualitative and quantitative analyses, along with the experimental results, revealed that KCQA outperformed the human participants, achieving an F1 score 3.86 points higher.
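Two of the four "easy data augmentation" (EDA) operations the abstract mentions, random swap and random deletion, can be sketched directly; synonym replacement, random insertion, and round-trip translation need external resources (a thesaurus or an MT system) and are omitted here. The example sentence and seeding are illustrative.

```python
# Sketch of two EDA operations: random swap and random deletion.
# Synonym replacement, random insertion, and round-trip translation are
# omitted because they require external resources.
import random

def random_swap(tokens, rng):
    """Swap two randomly chosen token positions."""
    tokens = list(tokens)
    i, j = rng.sample(range(len(tokens)), 2)
    tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens

def random_deletion(tokens, rng, p=0.2):
    """Drop each token with probability p, keeping at least one token."""
    kept = [t for t in tokens if rng.random() > p]
    return kept if kept else [rng.choice(list(tokens))]

rng = random.Random(0)  # seeded for reproducibility of this illustration
sentence = "the exam tests reading comprehension skills".split()
augmented = [random_swap(sentence, rng), random_deletion(sentence, rng)]
```

Each operation yields a perturbed copy of the original sentence, multiplying the effective size of a small training set at negligible cost.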