Mask the Correct Tokens: An Embarrassingly Simple Approach for Error Correction
Text error correction aims to correct the errors in text sequences such as
those typed by humans or generated by speech recognition models. Previous error
correction methods usually take the source (incorrect) sentence as encoder
input and generate the target (correct) sentence through the decoder. Since the
error rate of the incorrect sentence is usually low (e.g., 10%), the
correction model learns to correct only the few error tokens while trivially
copying most of the (correct) tokens, which harms the effective
training of error correction. In this paper, we argue that the correct tokens
should be better utilized to facilitate effective training and then propose a
simple yet effective masking strategy to achieve this goal. Specifically, we
randomly mask out a part of the correct tokens in the source sentence and let
the model learn to not only correct the original error tokens but also predict
the masked tokens based on their context information. Our method enjoys several
advantages: 1) it alleviates trivial copy; 2) it leverages effective training
signals from correct tokens; 3) it is a plug-and-play module and can be applied
to different models and tasks. Experiments on spelling error correction and
speech recognition error correction on Mandarin datasets and grammar error
correction on English datasets with both autoregressive and non-autoregressive
generation models show that our method improves the correction accuracy
consistently.
Comment: main track of EMNLP 202
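The masking strategy this abstract describes is easy to sketch. Below is a minimal illustration, assuming token-aligned source/target pairs (true for spelling correction; grammar correction would need an alignment step first); the mask symbol and masking rate are illustrative choices, not the paper's exact settings:

```python
import random

MASK = "<mask>"

def mask_correct_tokens(source_tokens, target_tokens, mask_prob=0.15, seed=0):
    """Randomly mask a fraction of the *correct* source tokens (those that
    already match the target) so the model must predict them from context
    instead of trivially copying them. Error tokens are always kept."""
    rng = random.Random(seed)
    masked = []
    for src, tgt in zip(source_tokens, target_tokens):
        if src == tgt and rng.random() < mask_prob:
            masked.append(MASK)   # correct token: sometimes masked
        else:
            masked.append(src)    # error token, or correct token left intact
    return masked

# The model is then trained to output the full correct sentence, i.e. to
# fix the error token and to recover the masked tokens from context.
print(mask_correct_tokens(["I", "hvae", "a", "dog"],
                          ["I", "have", "a", "dog"], mask_prob=1.0))
```

With `mask_prob=1.0` every correct token is masked while the error token `hvae` is preserved, which is how the strategy turns correct tokens into extra training signal rather than copy targets.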
DPCSpell: A Transformer-based Detector-Purificator-Corrector Framework for Spelling Error Correction of Bangla and Resource Scarce Indic Languages
Spelling error correction is the task of identifying and rectifying
misspelled words in text. It is an active research topic in Natural Language
Processing owing to its numerous applications in human language understanding.
Phonetically or visually similar yet semantically distinct characters make it
an arduous task in any language. Earlier efforts on spelling
error correction in Bangla and resource-scarce Indic languages focused on
rule-based, statistical, and machine learning-based methods which we found
rather inefficient. In particular, machine learning-based approaches, which
exhibit superior performance to rule-based and statistical methods, are
ineffective as they correct each character regardless of its appropriateness.
In this work, we propose a novel detector-purificator-corrector framework based
on denoising transformers that addresses these issues. Moreover, we present a
method for large-scale corpus creation from scratch which in turn resolves the
resource limitation problem of any left-to-right scripted language. The
empirical outcomes demonstrate the effectiveness of our approach that
outperforms previous state-of-the-art methods by a significant margin for
Bangla spelling error correction. The models and corpus are publicly available
at https://tinyurl.com/DPCSpell.
Comment: 23 pages, 4 figures, and 7 tables
A Survey of Natural Language Generation
This paper offers a comprehensive review of the research on Natural Language
Generation (NLG) over the past two decades, especially in relation to
data-to-text generation and text-to-text generation deep learning methods, as
well as new applications of NLG technology. This survey aims to (a) give the
latest synthesis of deep learning research on the NLG core tasks, as well as
the architectures adopted in the field; (b) detail meticulously and
comprehensively various NLG tasks and datasets, and draw attention to the
challenges in NLG evaluation, focusing on different evaluation methods and
their relationships; (c) highlight promising future directions and relatively
recent research issues that arise from the increasing synergy between NLG and
other artificial intelligence areas, such as computer vision, text, and
computational creativity.
Comment: Accepted by ACM Computing Surveys (CSUR) 202
Evaluating Factual Consistency of Conditional Text Generation Systems
Doctoral dissertation -- Seoul National University Graduate School: College of Engineering, Department of Electrical and Computer Engineering, August 2022. Advisor: Kyomin Jung.
Despite the recent advances of conditional text generation systems leveraging pre-trained language models, the factual consistency of these systems is still not sufficient. Moreover, the widely used n-gram similarity metrics are ill-suited to evaluating factual consistency. Hence, in order to develop a factually consistent text generation system, an automatic factuality metric is needed first. In this dissertation, we propose four metrics that show much higher correlations with human judgments than previous metrics in evaluating factual consistency, for diverse conditional text generation systems. To build these metrics, we utilize (1) auxiliary tasks and (2) data augmentation methods.
First, we focus on the keywords and keyphrases that are critical for evaluating factual consistency and propose two factual consistency metrics based on two different auxiliary tasks. We first integrate a keyphrase-weight prediction task into previous metrics to propose the KPQA (Keyphrase Prediction for Question Answering) metric for generative QA. We also apply question generation and answering to develop a captioning metric, QACE (Question Answering for Captioning Evaluation), which generates questions about the keywords of the candidate caption and checks factual consistency by comparing the answers to these questions for the source image and the caption.
Second, instead of using auxiliary tasks, we directly train metrics in a data-driven way and propose two further metrics. Specifically, we train a model to distinguish augmented inconsistent texts from consistent ones. We first modify original reference captions into inconsistent captions using several rule-based methods, such as substituting keywords, to propose UMIC (Unreferenced Metric for Image Captioning). As a next step, we introduce the MFMA (Mask-and-Fill with Masked Article) metric, which generates inconsistent summaries using a masked source and a masked summary. Finally, as an extension of developing data-driven factual consistency metrics, we also propose a fast post-editing system that can fix factual errors in system outputs.
1 Introduction
2 Background
2.1 Text Evaluation Metrics
2.1.1 N-gram Similarity Metrics
2.1.2 Embedding Similarity Metrics
2.1.3 Auxiliary Task Based Metrics
2.1.4 Entailment Based Metrics
2.2 Evaluating Automated Metrics
3 Integrating Keyphrase Weights for Factual Consistency Evaluation
3.1 Related Work
3.2 Proposed Approach: KPQA-Metric
3.2.1 KPQA
3.2.2 KPQA Metric
3.3 Experimental Setup and Dataset
3.3.1 Dataset
3.3.2 Implementation Details
3.4 Empirical Results
3.4.1 Comparison with Other Methods
3.4.2 Analysis
3.5 Conclusion
4 Question Generation and Question Answering for Factual Consistency Evaluation
4.1 Related Work
4.2 Proposed Approach: QACE
4.2.1 Question Generation
4.2.2 Question Answering
4.2.3 Abstractive Visual Question Answering
4.2.4 QACE Metric
4.3 Experimental Setup and Dataset
4.3.1 Dataset
4.3.2 Implementation Details
4.4 Empirical Results
4.4.1 Comparison with Other Methods
4.4.2 Analysis
4.5 Conclusion
5 Rule-Based Inconsistent Data Augmentation for Factual Consistency Evaluation
5.1 Related Work
5.2 Proposed Approach: UMIC
5.2.1 Modeling
5.2.2 Negative Samples
5.2.3 Contrastive Learning
5.3 Experimental Setup and Dataset
5.3.1 Dataset
5.3.2 Implementation Details
5.4 Empirical Results
5.4.1 Comparison with Other Methods
5.4.2 Analysis
5.5 Conclusion
6 Inconsistent Data Augmentation with Masked Generation for Factual Consistency Evaluation
6.1 Related Work
6.2 Proposed Approach: MFMA and MSM
6.2.1 Mask-and-Fill with Masked Article
6.2.2 Masked Summarization
6.2.3 Training Factual Consistency Checking Model
6.3 Experimental Setup and Dataset
6.3.1 Dataset
6.3.2 Implementation Details
6.4 Empirical Results
6.4.1 Comparison with Other Methods
6.4.2 Analysis
6.5 Conclusion
7 Factual Error Correction for Improving Factual Consistency
7.1 Related Work
7.2 Proposed Approach: RFEC
7.2.1 Problem Formulation
7.2.2 Training Dataset Construction
7.2.3 Evidence Sentence Retrieval
7.2.4 Entity Retrieval Based Factual Error Correction
7.3 Experimental Setup and Dataset
7.3.1 Dataset
7.3.2 Implementation Details
7.4 Empirical Results
7.4.1 Comparison with Other Methods
7.4.2 Analysis
7.5 Conclusion
8 Conclusion
Abstract (In Korean)
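The rule-based negative-sample generation the dissertation describes for UMIC (turning a consistent reference caption into an inconsistent one, e.g. by substituting keywords) can be sketched as follows; the substitution table and the single-edit policy are illustrative assumptions, not the thesis's exact rules:

```python
import random

def make_inconsistent(caption, entity_swaps, rng=None):
    """Turn a (consistent) reference caption into a factually inconsistent
    negative sample by substituting one keyword with a plausible but wrong
    alternative, one of the rule-based augmentation styles used for UMIC."""
    rng = rng or random.Random(0)
    tokens = caption.split()
    candidates = [i for i, t in enumerate(tokens) if t in entity_swaps]
    if not candidates:
        return None   # no rule applies; skip this caption
    i = rng.choice(candidates)
    tokens[i] = entity_swaps[tokens[i]]
    return " ".join(tokens)

# Illustrative substitution table; a real one would be mined from the corpus.
SWAPS = {"dog": "cat", "red": "blue", "man": "woman"}
```

A metric model is then trained to score the original caption above such negatives, e.g. with the contrastive objective the table of contents mentions.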
Reinforcement Learning for Machine Translation: from Simulations to Real-World Applications
If a machine translation is wrong, how can we tell the underlying model to fix it? Answering this question requires (1) a machine learning algorithm to define update rules, (2) an interface for feedback to be submitted, and (3) expertise on the side of the human who gives the feedback. This thesis investigates solutions for machine learning updates, the suitability of feedback interfaces, and the dependency on reliability and expertise for different types of feedback.
We start with an interactive online learning scenario where a machine translation (MT) system receives bandit feedback (i.e. only once per source) instead of references for learning. Policy gradient algorithms for statistical and neural MT are developed to learn from absolute and pairwise judgments. Our experiments on domain adaptation with simulated online feedback show that the models can largely improve under weak feedback, with variance reduction techniques being very effective.
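A REINFORCE-style update of the kind developed here might be sketched as below for a log-linear policy over candidate translations, with a scalar baseline as a simple variance-reduction device; the feature representation and hyperparameters are illustrative assumptions, not the thesis's exact algorithm:

```python
import numpy as np

def bandit_policy_gradient_step(theta, features, chosen, reward, baseline, lr=0.1):
    """One policy-gradient update from bandit feedback: the system showed a
    single candidate translation, received one scalar reward for it, and
    moves the log-linear policy toward (or away from) that choice.
    `features` is a (num_candidates, dim) matrix of feature vectors."""
    scores = features @ theta
    scores = scores - scores.max()                 # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    # gradient of log pi(chosen) = phi(chosen) - E_pi[phi]
    grad_log_pi = features[chosen] - probs @ features
    # subtracting a baseline from the reward reduces gradient variance
    return theta + lr * (reward - baseline) * grad_log_pi
```

Since only the sampled output receives feedback, no reference translation is needed, which is exactly the bandit setting described above.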
In production environments offline learning is often preferred over online learning. We evaluate algorithms for counterfactual learning from human feedback in a study on eBay product title translations. Feedback is either collected via explicit star ratings from users, or implicitly from the user interaction with cross-lingual product search. Leveraging implicit feedback turns out to be more successful due to lower levels of noise. We compare the reliability and learnability of absolute Likert-scale ratings with pairwise preferences in a smaller user study, and find that absolute ratings are overall more effective for improvements in down-stream tasks. Furthermore, we discover that error markings provide a cheap and practical alternative to error corrections.
In a generalized interactive learning framework we propose a self-regulation approach, where the learner, guided by a regulator module, decides which type of feedback to choose for each input. The regulator is reinforced to find a good trade-off between supervision effect and cost. In our experiments, it discovers strategies that are more efficient than active learning and standard fully supervised learning.
Automated Identification of Severe Errors in Speech to Text Transcripts
In this thesis we explore how problematic misplaced words can be automatically identified in speech-to-text transcripts. Automatic Speech Recognition (ASR) systems automatically generate text from human speech. Because natural speech is complex, with dialects, variations in speaking speed, and differences between how people talk and the training data, such systems can introduce errors. Sometimes these errors are so severe that they become problematic. Post-processing of an ASR system means finding such errors after the text has been generated. We want to find out to what degree word probabilities computed with pre-trained language models can be used to solve this problem, and to what degree these probabilities can be used to build a classifier that detects problematic words. We present our solution, in which we synthetically introduce problematic words into text documents. We then compute probabilities of both problematic and non-problematic words in these documents to investigate whether the models treat them differently. We show that the models generally assign lower probabilities to problematic words and higher probabilities to good words. We train a logistic regression classifier on these probabilities to classify words. Our results show that, using probabilities from NorBERT1 and NorBERT2, a logistic regression classifier can accurately detect problematic words. We also show that NB-BERT performs worse than a baseline bigram model.
Master's thesis in Information Science.
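The core of this setup (fit a classifier on language-model word probabilities, then flag low-probability words) can be sketched with a one-feature logistic regression trained by plain gradient descent; the synthetic log-probabilities, learning rate, and 0.5 decision threshold are illustrative assumptions, not the thesis's configuration:

```python
import numpy as np

def train_problem_word_classifier(log_probs, labels, lr=0.1, steps=3000):
    """Fit a one-feature logistic regression by gradient descent: predict
    whether a word is 'problematic' (label 1) from the log-probability a
    pre-trained LM assigns to it. Problematic words tend to receive lower
    probabilities, so the learned weight should come out negative."""
    x = np.asarray(log_probs, dtype=float)
    y = np.asarray(labels, dtype=float)
    w, b = 0.0, 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(w * x + b)))   # predicted P(problematic)
        w -= lr * np.mean((p - y) * x)           # log-loss gradient w.r.t. w
        b -= lr * np.mean(p - y)                 # log-loss gradient w.r.t. b
    return w, b

def is_problematic(log_prob, w, b):
    """Flag a word whose predicted probability of being problematic exceeds 0.5."""
    return 1.0 / (1.0 + np.exp(-(w * log_prob + b))) > 0.5
```

In practice the thesis uses probabilities from several pre-trained models (NorBERT1, NorBERT2, NB-BERT); here a single synthetic feature stands in for them.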
Data augmentation and subword segmentation for spell-checking in amazonian languages
In Peru, 48 native languages have been identified, according to information
from the official Database of Indigenous or Native Peoples (BDPI). These
languages are of oral tradition [BDPI, 2020], so there was no official way of
teaching them. The Summer Institute of Linguistics (ILV) compiled and
documented several native languages [Faust, 1973] as a first attempt at a
formal document for teaching a native language. Later, the Peruvian
government, through its social inclusion strategy "Incluir para crecer",
created an official guide for teaching native languages in an attempt to
normalize their use [Jara Males, Gonzales Acer, 2015].
As mentioned in [Forcada, 2016], the use of language technologies brings
normalization, growth of literature, standardization, and greater visibility.
In the case of Peru, there have been initiatives such as morphological
analyzers [Pereira-Noriega, et al., 2017] and spell checkers [Alva, Oncevay,
2017], focused on native languages with scarce computational resources, which
aim to support revitalization efforts, indigenous education, and language
documentation [Zariquiey et al., 2019].
Focusing on Amazonian languages, a previous project used neural networks to
develop a spell checker for native languages, with good results in terms of
precision [Lara, 2020]. In that work, since little data was available,
synthetic data was generated with a random method; when evaluated with the
CharacTER [Wang, et al., 2016] and BLEU [Papineni, et al., 2002] metrics,
this data obtained rather low scores. Moreover, since Amazonian languages are
morphologically rich and have an extensive vocabulary, it is difficult to
represent out-of-vocabulary words, so it is advisable to use subwords as a
middle ground [Wu, Zhao, 2018].
The present project develops several data generation methods, different from
the random one, that are more robust because they consider errors closer to
those found in reality. In addition, to reduce the computational cost while
keeping the ability to handle an open vocabulary, neural networks are trained
that take as input subwords such as syllables and segments produced by byte
pair encoding (BPE). Finally, from the experiments we conclude that the
proposed methods and segmentation yielded improvements, and that more
computational resources are now available for our Amazonian languages.
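One way to make synthetic errors "closer to reality" than uniform random edits is to bias generation toward plausible operations such as keyboard-adjacent substitutions, deletions, and neighbour swaps; the tiny adjacency table and single-edit policy below are illustrative assumptions, not the project's actual generation methods:

```python
import random

# Tiny keyboard-adjacency excerpt; a real table would cover the full
# alphabet (and any language-specific characters) of the target language.
ADJACENT = {"a": "sq", "e": "wr", "o": "ip", "n": "bm"}

def corrupt(word, rng=None):
    """Introduce one plausible error into `word`: an adjacent-key
    substitution, a deletion, or a swap of neighbouring characters."""
    rng = rng or random.Random(0)
    i = rng.randrange(len(word))
    op = rng.choice(["substitute", "delete", "swap"])
    if op == "substitute" and word[i] in ADJACENT:
        return word[:i] + rng.choice(ADJACENT[word[i]]) + word[i + 1:]
    if op == "delete" and len(word) > 2:
        return word[:i] + word[i + 1:]
    if op == "swap" and i + 1 < len(word):
        return word[:i] + word[i + 1] + word[i] + word[i + 2:]
    return word   # no applicable edit; return the word unchanged
```

Pairs of (clean, corrupted) words then serve as training data for the corrector, with both sides optionally re-segmented into syllables or BPE units as the project proposes.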
A Survey of Learning-based Automated Program Repair
Automated program repair (APR) aims to fix software bugs automatically and
plays a crucial role in software development and maintenance. With the recent
advances in deep learning (DL), an increasing number of APR techniques have
been proposed to leverage neural networks to learn bug-fixing patterns from
massive open-source code repositories. Such learning-based techniques usually
treat APR as a neural machine translation (NMT) task, where buggy code snippets
(i.e., source language) are translated into fixed code snippets (i.e., target
language) automatically. Benefiting from the powerful capability of DL to learn
hidden relationships from previous bug-fixing datasets, learning-based APR
techniques have achieved remarkable performance. In this paper, we provide a
systematic survey to summarize the current state-of-the-art research in the
learning-based APR community. We illustrate the general workflow of
learning-based APR techniques and detail the crucial components, including
fault localization, patch generation, patch ranking, patch validation, and
patch correctness phases. We then discuss the widely-adopted datasets and
evaluation metrics and outline existing empirical studies. We discuss several
critical aspects of learning-based APR techniques, such as repair domains,
industrial deployment, and the open science issue. We highlight several
practical guidelines on applying DL techniques for future APR studies, such as
exploring explainable patch generation and utilizing code features. Overall,
our paper can help researchers gain a comprehensive understanding of the
achievements of existing learning-based APR techniques and promote the
practical application of these techniques. Our artifacts are publicly available
at https://github.com/QuanjunZhang/AwesomeLearningAPR
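The workflow the survey outlines (fault localization, then patch generation, ranking, and validation) can be sketched as a generic pipeline; the `Patch` type and the function signatures are illustrative scaffolding, not any specific APR tool's API:

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Patch:
    location: int        # index of the suspicious line
    replacement: str     # candidate fixed line
    score: float = 0.0   # ranking score

def repair(buggy_lines: List[str],
           localize: Callable[[List[str]], List[int]],
           generate: Callable[[str], List[str]],
           rank: Callable[[Patch], float],
           validate: Callable[[List[str]], bool]) -> Optional[List[str]]:
    """Generic APR pipeline: localize suspicious lines, generate candidate
    patches for each, rank them, and return the first patch that validates
    (e.g. passes the test suite)."""
    candidates = []
    for loc in localize(buggy_lines):              # 1. fault localization
        for fix in generate(buggy_lines[loc]):     # 2. patch generation
            candidates.append(Patch(loc, fix))
    for p in candidates:                           # 3. patch ranking
        p.score = rank(p)
    for p in sorted(candidates, key=lambda p: -p.score):
        patched = list(buggy_lines)
        patched[p.location] = p.replacement
        if validate(patched):                      # 4. patch validation
            return patched
    return None
```

In a learning-based instantiation, `generate` would be an NMT-style model translating buggy snippets into fixed ones, and `rank` a learned patch-correctness score, as the survey describes.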