What If We Simply Swap the Two Text Fragments? A Straightforward yet Effective Way to Test the Robustness of Methods to Confounding Signals in Natural Language Inference Tasks
Natural language inference (NLI) is the predictive task of determining the inference relationship between a pair of natural language sentences. With the increasing popularity of NLI, many state-of-the-art predictive models have been proposed with impressive performance. However, several works have noticed statistical irregularities in the collected NLI data sets that may result in over-estimated performance of these models, and have proposed remedies. In this paper, we further investigate these statistical irregularities, which we refer to as confounding factors, in NLI data sets. Based on the belief that some NLI labels should be preserved under a swapping operation, we propose a simple yet effective way of evaluating NLI predictive models (swapping the two text fragments) that naturally mitigates the observed problems. Further, we continue to train the predictive models on swapped inputs and propose to use the deviation of a model's evaluation performance under different percentages of swapped training text fragments to describe the robustness of a predictive model. Our evaluation metric leads to some interesting insights into recently published NLI methods. Finally, we also apply the swapping operation to NLI models to assess the effectiveness of this straightforward method in mitigating confounding-factor problems when training generic sentence embeddings for other NLP transfer tasks.
Comment: 8 pages, to appear at AAAI 1
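A minimal sketch of the swapping evaluation and robustness measure described in this abstract, assuming a classifier with a `predict(premise, hypothesis)` interface and treating "contradiction" as the label preserved under swapping; the helper names and label handling are illustrative assumptions, not the authors' implementation:

```python
import random

# Labels assumed to be preserved when premise and hypothesis are swapped
# (contradiction is symmetric; entailment generally is not). This choice is
# an assumption made for the sketch.
SYMMETRIC_LABELS = {"contradiction"}

def swap_fraction(pairs, fraction, seed=0):
    """Swap premise and hypothesis for a random `fraction` of label-preserving examples."""
    rng = random.Random(seed)
    swapped = []
    for premise, hypothesis, label in pairs:
        if label in SYMMETRIC_LABELS and rng.random() < fraction:
            swapped.append((hypothesis, premise, label))
        else:
            swapped.append((premise, hypothesis, label))
    return swapped

def accuracy(model, pairs):
    """Accuracy of a model exposing predict(premise, hypothesis) -> label."""
    correct = sum(model.predict(p, h) == y for p, h, y in pairs)
    return correct / len(pairs)

def robustness_deviation(model, pairs, fractions=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Spread of accuracy across swap percentages: smaller spread means more robust."""
    scores = [accuracy(model, swap_fraction(pairs, f)) for f in fractions]
    return max(scores) - min(scores)
```

A model that relies on confounding signals tied to which fragment appears first will show a large spread across swap fractions, while a robust model's scores stay nearly constant.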
Documentation-Guided Fuzzing for Testing Deep Learning API Functions
Widely-used deep learning (DL) libraries demand reliability. Thus, it is integral to test DL libraries’ API functions. Despite the effectiveness of fuzz testing, there are few techniques that are specialized in fuzzing API functions of DL libraries. To fill this gap, we design and implement a fuzzing technique called DocTer for API functions of DL libraries. Fuzzing DL API functions is challenging because many API functions expect structured inputs that follow DL-specific constraints. If a fuzzer is (1) unaware of these constraints or (2) incapable of using these constraints to fuzz, it is practically impossible to generate valid inputs, i.e., inputs that follow these DL-specific constraints, to explore deeply and test the core functionality of API functions. DocTer extracts DL-specific constraints from API documents and uses these constraints to guide the fuzzing to generate valid inputs automatically. DocTer also generates inputs that violate these constraints to test the input validity checking code. To reduce manual effort, DocTer applies a sequential pattern mining technique on API documents to help DocTer users create rules to extract constraints from API documents automatically. Our evaluation on three popular DL libraries (TensorFlow, PyTorch, and MXNet) shows that DocTer’s accuracy in extracting input constraints is 82.2-90.5%. DocTer detects 46 bugs, while a baseline fuzzer without input constraints detects only 19 bugs. Most (33) of the 46 bugs are previously unknown, 26 of which have been fixed or confirmed by developers after we report them. In addition, DocTer detects 37 inconsistencies within documents, including 25 fixed or confirmed after we report them.
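To make the constraint-guided generation concrete, here is a minimal sketch in the spirit of the abstract, not DocTer's actual code: the constraint schema, the use of tf.nn.softmax as the function under test, and the rank/dtype rules are assumptions chosen for illustration.

```python
import numpy as np
import tensorflow as tf

# Illustrative constraint of the kind mined from documentation, e.g.
# "logits must be a float tensor with rank >= 2". The schema is an assumption.
constraint = {"dtype": np.float32, "min_rank": 2, "max_rank": 4, "max_dim": 8}

def gen_valid(c, rng):
    """Generate an input that satisfies the documented constraints."""
    rank = int(rng.integers(c["min_rank"], c["max_rank"] + 1))
    shape = rng.integers(1, c["max_dim"] + 1, size=rank)
    return rng.standard_normal(shape).astype(c["dtype"])

def gen_invalid(c, rng):
    """Deliberately violate constraints (too-low rank, wrong dtype) to hit validity checks."""
    rank = max(1, c["min_rank"] - 1)
    shape = rng.integers(1, c["max_dim"] + 1, size=rank)
    return rng.standard_normal(shape).astype(np.int64)

rng = np.random.default_rng(0)
for make_input in (gen_valid, gen_invalid):
    x = make_input(constraint, rng)
    try:
        tf.nn.softmax(x)  # example API function under test (illustrative choice)
    except Exception as e:  # distinguish clean rejections from crashes in a real fuzzer
        print(type(e).__name__, e)
```

Valid inputs drive execution past the argument checks into the core kernel code, while constraint-violating inputs exercise the input-validation paths; a fuzzer without the mined constraints rarely produces the former.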