Finite-context Indexing of Restricted Output Space for NLP Models Facing Noisy Input
NLP models excel on tasks with clean inputs but are less accurate with noisy
inputs. In particular, character-level noise such as human-written typos and
adversarially engineered, realistic-looking misspellings often appears in text
and can easily trip up NLP models. Prior solutions to character-level noise
often alter the content of the inputs (low fidelity) and thus inadvertently
lower model accuracy on clean inputs. We propose FiRo, an approach that boosts
NLP model performance on noisy inputs without sacrificing performance on clean
inputs. FiRo sanitizes the input text while preserving its fidelity by
inferring the noise-free form of each token in the input. FiRo uses
finite-context aggregation to obtain contextual embeddings, which are then used
to find the noise-free form within a restricted output space. The output space
is restricted to a small cluster of probable candidates so that the noise-free
tokens can be predicted more accurately. Although the clusters are small,
FiRo's effective vocabulary (the union of all clusters) can be scaled up to
better preserve the input content. Experimental results show that NLP models
using FiRo outperform baselines on six classification tasks and one sequence
labeling task at various degrees of noise.
Comment: Accepted at IJCNLP-AACL 202
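
A minimal sketch of the restricted-output-space idea, assuming a toy vocabulary; surface similarity via difflib stands in for FiRo's learned finite-context contextual scoring, and all identifiers are illustrative, not from the paper:

    from difflib import SequenceMatcher

    VOCAB = ["the", "movie", "was", "great", "terrible", "plot", "acting"]

    def candidate_cluster(token, vocab=VOCAB, k=3):
        """Restrict the output space to the k vocabulary words most
        similar to the (possibly noisy) input token."""
        return sorted(vocab,
                      key=lambda w: SequenceMatcher(None, token, w).ratio(),
                      reverse=True)[:k]

    def sanitize(tokens, threshold=0.6):
        """Infer a noise-free form for each token from its restricted
        cluster. FiRo would score candidates with contextual embeddings;
        this stub falls back to surface similarity alone."""
        out = []
        for tok in tokens:
            best = candidate_cluster(tok)[0]
            sim = SequenceMatcher(None, tok, best).ratio()
            out.append(best if sim >= threshold else tok)  # keep token if no close match
        return out

    print(sanitize("the moive was graet".split()))
    # ['the', 'movie', 'was', 'great']

Scoring within a small cluster rather than over the full vocabulary is what keeps the prediction tractable; scaling the effective vocabulary then only means growing the union of clusters, not each cluster.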
Training Datasets for Machine Reading Comprehension and Their Limitations
Neural networks are a powerful model class for learning machine Reading Comprehension (RC), yet they crucially depend on the availability of suitable training datasets. In this thesis we describe methods for data collection, evaluate the performance of established models, and examine a number of model behaviours and dataset limitations.

We first describe the creation of a data resource for the science exam QA domain and compare existing models on the resulting dataset. The collected questions are plausible – non-experts can distinguish them from real exam questions with 55% accuracy – and using them as additional training data leads to improved model scores on real science exam questions.

Second, we describe and apply a distant supervision dataset construction method for multi-hop RC across documents. We identify and mitigate several dataset assembly pitfalls – a lack of unanswerable candidates, label imbalance, and spurious correlations between documents and particular candidates – which often leave shallow predictive cues for the answer. Furthermore, we demonstrate that selecting relevant document combinations is a critical performance bottleneck on the datasets created. We thus investigate Pseudo-Relevance Feedback, which leads to improvements over TF-IDF-based document combination selection in both retrieval metrics and answer accuracy.

Third, we investigate model undersensitivity: model predictions do not change when given adversarially altered questions in SQuAD 2.0 and NewsQA, even though they should. We characterise the affected samples and show that the phenomenon is related to a lack of structurally similar but unanswerable samples during training: data augmentation reduces the adversarial error rate, e.g. from 51.7% to 20.7% for a BERT model on SQuAD 2.0, and also improves robustness in other settings.

Finally, we explore efficient formal model verification via Interval Bound Propagation (IBP) to measure and address model undersensitivity, and show that an IBP-derived auxiliary loss can improve verification rates, e.g. from 2.8% to 18.4% on the SNLI test set.
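
The Pseudo-Relevance Feedback step lends itself to a small illustration. The sketch below is a rough approximation, not the thesis's implementation: the top document from a first TF-IDF retrieval pass is appended to the query so that its terms can surface the second hop. The corpus, query, and expansion depth are invented for the example.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    docs = [
        "The Nile flows through Egypt into the Mediterranean.",
        "Cairo is the capital of Egypt.",
        "The Amazon is the largest river by discharge.",
    ]
    query = "Which river flows through the country whose capital is Cairo?"

    vec = TfidfVectorizer().fit(docs + [query])
    D = vec.transform(docs)

    # First-pass retrieval: rank documents by TF-IDF similarity to the query.
    q = vec.transform([query])
    first_pass = cosine_similarity(q, D)[0].argsort()[::-1]

    # PRF: assume the top-ranked document is relevant, append it to the
    # query, and re-rank; terms from the feedback document ("Egypt")
    # now help retrieve the second hop.
    expanded = vec.transform([query + " " + docs[first_pass[0]]])
    second_pass = cosine_similarity(expanded, D)[0].argsort()[::-1]
    print([docs[i] for i in second_pass[:2]])

The key point is that the bridging entity never appears in the question itself, so plain TF-IDF cannot connect the two hops; feedback terms supply that connection.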
Can NMT Understand Me? Towards Perturbation-based Evaluation of NMT Models for Code Generation
Neural Machine Translation (NMT) has reached a level of maturity at which it
is recognized as the premier method for translation between different
languages, and it has aroused interest in several research areas, including
software engineering. A key step in validating the robustness of NMT models
is evaluating their performance on adversarial inputs, i.e., inputs obtained
from the original ones by adding small amounts of perturbation. However, for
the specific task of code generation (i.e., generating code from a
description in natural language), no approach to validating the robustness of
NMT models has yet been defined. In this work, we address the problem by
identifying a set of perturbations and metrics tailored to the robustness
assessment of such models. We present a preliminary experimental evaluation,
showing which types of perturbations affect the models the most and deriving
useful insights for future directions.
Comment: Paper accepted for publication in the proceedings of The 1st Intl.
Workshop on Natural Language-based Software Engineering (NLBSE) to be held
with ICSE 202
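
To make the evaluation idea concrete, here is a hedged sketch of a perturbation-based probe: a character-swap operator applied to the natural-language description, with robustness measured as the fraction of perturbed inputs whose generated code differs from the clean output. The operator and metric are illustrative stand-ins, not the paper's actual perturbation set, and demo_model is a stub for any NL-to-code NMT system.

    import random

    random.seed(0)

    def swap_adjacent(text: str) -> str:
        """Character-level perturbation: swap two adjacent letters in one word."""
        words = text.split()
        i = random.randrange(len(words))
        w = words[i]
        if len(w) > 3:
            j = random.randrange(len(w) - 1)
            words[i] = w[:j] + w[j + 1] + w[j] + w[j + 2:]
        return " ".join(words)

    def robustness_gap(model, description: str, n: int = 20) -> float:
        """Fraction of perturbed descriptions whose generated code differs
        from the code generated for the clean description (lower is more
        robust)."""
        reference = model(description)
        perturbed = [swap_adjacent(description) for _ in range(n)]
        changed = sum(model(p) != reference for p in perturbed)
        return changed / n

    # Trivial stand-in model: echoes the description, so nearly every
    # perturbation changes the output.
    demo_model = lambda s: f"print('{s.lower()}')"
    print(robustness_gap(demo_model, "Print hello world to the console"))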