Toward Practical Automatic Speech Recognition and Post-Processing: a Call for Explainable Error Benchmark Guideline
Automatic speech recognition (ASR) outcomes serve as input for downstream tasks, substantially impacting end-user satisfaction. Hence, diagnosing and addressing the vulnerabilities of an ASR model is of significant importance. However, traditional evaluation methodologies for ASR systems produce a single composite quantitative metric, which fails to provide comprehensive insight into specific vulnerabilities. This lack of detail extends to the post-processing stage, further obscuring potential weaknesses. Even when an ASR model recognizes utterances accurately, subpar readability can negatively affect user satisfaction, giving rise to a trade-off between recognition accuracy and user-friendliness. To address this effectively, it is imperative to consider both the speech level, crucial for recognition accuracy, and the text level, critical for user-friendliness. Consequently, we propose the development of an Error Explainable Benchmark (EEB) dataset. By covering both the speech and text levels, this dataset enables a granular understanding of a model's shortcomings. Our proposal provides a structured pathway toward a more 'real-world-centric' evaluation, a marked shift away from abstracted, traditional methods, allowing nuanced system weaknesses to be detected and rectified and ultimately aiming for an improved user experience.
Comment: Accepted for the Data-centric Machine Learning Research (DMLR) Workshop at ICML 202
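A minimal sketch of the "single composite metric" problem the abstract describes; it is not from the paper. It assumes the jiwer Python library for word error rate (WER), and the sentences are illustrative. Two hypotheses with very different failure modes, one speech-level recognition error and one text-level readability error, receive the same WER, which is exactly the kind of information loss the proposed EEB dataset is meant to expose.

```python
# Illustrative only: a single composite metric (WER) hides which kind of error occurred.
import jiwer  # assumed third-party library: pip install jiwer

reference = "please schedule the meeting for nine thirty tomorrow"

# Hypothesis A: recognition error on a content word (speech-level weakness).
hyp_recognition_error = "please schedule the meeting for fine thirty tomorrow"

# Hypothesis B: numeral formatting hurts readability (text-level weakness).
hyp_readability_error = "please schedule the meeting for 9 thirty tomorrow"

print(jiwer.wer(reference, hyp_recognition_error))  # 1 substitution out of 8 words
print(jiwer.wer(reference, hyp_readability_error))  # also 1 substitution out of 8 words
```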
Towards Harnessing the Most of ChatGPT for Korean Grammatical Error Correction
In this study, we conduct a pioneering and comprehensive examination of ChatGPT’s (GPT-3.5 Turbo) capabilities in Korean Grammatical Error Correction (K-GEC). Given the Korean language’s agglutinative nature and rich linguistic intricacies, accurately correcting errors while preserving Korean-specific sentiments is notably challenging. Using a systematic categorization of Korean grammatical errors, we perform a meticulous, case-specific analysis to identify the strengths and limitations of a ChatGPT-based correction system. We also critically assess influential parameters such as temperature and specific error criteria, illuminating potential strategies to enhance ChatGPT’s efficacy in K-GEC tasks. Our findings offer valuable contributions to the expanding domain of NLP research centered on the Korean language.
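A minimal sketch of what a ChatGPT-based K-GEC call might look like, assuming the OpenAI chat completions API. The model name, prompt wording, and example sentence are illustrative assumptions, not the paper's exact configuration; the temperature argument is included because it is one of the parameters the study examines.

```python
# Illustrative sketch, not the paper's setup: prompting GPT-3.5 Turbo for Korean GEC.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def correct_korean_sentence(sentence: str, temperature: float = 0.0) -> str:
    """Ask the model to return only the corrected Korean sentence."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=temperature,  # lower values make corrections more deterministic
        messages=[
            {"role": "system",
             "content": "You are a Korean grammar corrector. Return only the corrected sentence."},
            {"role": "user", "content": sentence},
        ],
    )
    return response.choices[0].message.content.strip()

# Hypothetical example with a particle error.
print(correct_korean_sentence("나는 학교에 간다를 좋아한다."))
```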
Uncovering the Risks and Drawbacks Associated With the Use of Synthetic Data for Grammatical Error Correction
In the data-centric AI paradigm, model performance is enhanced without altering the model architecture, as evidenced by demonstrations on real-world and benchmark datasets. With the advancement of large language models (LLMs), it has become increasingly feasible to generate high-quality synthetic data, and constructing fully synthetic datasets is attractive when real-world data contains large amounts of personal information. However, the solely synthetic data setting has yet to be validated in depth, despite the increasing likelihood that models trained exclusively on fully synthetic data will emerge in the future. Therefore, we examined the question, “Do data quality control techniques (known to positively impact data-centric AI) consistently aid models trained exclusively on synthetic datasets?” To explore this question, we performed detailed analyses using synthetic datasets generated for speech recognition post-processing with the BackTranScription (BTS) approach. Our study primarily addressed the potential adverse effects of data quality control measures (e.g., noise injection and balanced data) and training strategies in the context of synthetic-only experiments. We observed that the data-centric methodology degrades performance by up to 44.03 points in the fully synthetic data setting.
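A minimal sketch of character-level noise injection, one of the data quality control techniques the abstract names. The corruption operations, rates, and example sentence below are illustrative assumptions, not the paper's exact procedure.

```python
# Illustrative sketch: inject character-level noise into synthetic training text.
import random

def inject_noise(sentence: str, noise_rate: float = 0.1, seed: int | None = None) -> str:
    """Randomly delete, duplicate, or swap adjacent characters to simulate noisy text."""
    rng = random.Random(seed)
    chars = list(sentence)
    out = []
    i = 0
    while i < len(chars):
        if rng.random() < noise_rate:
            op = rng.choice(["delete", "duplicate", "swap"])
            if op == "delete":
                i += 1          # drop this character
                continue
            if op == "duplicate":
                out.append(chars[i])  # emit it twice
            elif op == "swap" and i + 1 < len(chars):
                out.append(chars[i + 1])
                out.append(chars[i])
                i += 2
                continue
        out.append(chars[i])
        i += 1
    return "".join(out)

print(inject_noise("음성 인식 후처리를 위한 합성 데이터", noise_rate=0.15, seed=0))
```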
A Survey on Evaluation Metrics for Machine Translation
The success of the Transformer architecture has spurred increased interest in machine translation (MT). The translation quality of neural network-based MT surpasses that of translations produced with statistical methods. This growth in MT research has driven the development of accurate automatic evaluation metrics that allow us to track the performance of MT systems. However, automatically evaluating and comparing MT systems is a challenging task. Several studies have shown that traditional metrics (e.g., BLEU, TER) perform poorly at capturing semantic similarity between MT outputs and human reference translations. To date, various evaluation metrics based on the Transformer architecture have been proposed to improve performance. However, a systematic and comprehensive literature review of these metrics is still missing. Therefore, it is necessary to survey the existing automatic evaluation metrics for MT so that both established and new researchers can quickly understand the trends in MT evaluation over the past few years. In this survey, we present the trends in automatic evaluation metrics. To better understand developments in the field, we provide a taxonomy of the automatic evaluation metrics. We then explain the key contributions and shortcomings of the metrics. In addition, we select representative metrics from the taxonomy and conduct experiments to analyze related problems. Finally, we discuss the limitations of current automatic metric studies based on our experiments and offer suggestions for further research to improve automatic evaluation metrics.
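A minimal sketch of the weakness the survey attributes to traditional surface-level metrics, assuming the sacrebleu Python library; the sentences and scores are illustrative, not drawn from the survey's experiments. A semantically equivalent paraphrase is penalized by BLEU and TER even though its meaning matches the reference, while an exact copy scores perfectly.

```python
# Illustrative sketch: surface n-gram metrics penalize a valid paraphrase.
import sacrebleu  # pip install sacrebleu

references = [["The cat is sitting on the mat."]]      # one reference stream
paraphrase = ["A cat sits on the mat."]                 # same meaning, different wording
copy       = ["The cat is sitting on the mat."]         # exact copy of the reference

print(sacrebleu.corpus_bleu(paraphrase, references).score)  # lower score despite identical meaning
print(sacrebleu.corpus_bleu(copy, references).score)        # 100.0 for the exact match
print(sacrebleu.corpus_ter(paraphrase, references).score)   # nonzero edit rate for the paraphrase
```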