2,751 research outputs found
Fault Detection Effectiveness of Metamorphic Relations Developed for Testing Supervised Classifiers
In machine learning, supervised classifiers are used to obtain predictions
for unlabeled data by inferring prediction functions using labeled data.
Supervised classifiers are widely applied in domains such as computational
biology, computational physics and healthcare to make critical decisions.
However, it is often hard to test supervised classifiers since the expected
answers are unknown. This is commonly known as the \emph{oracle problem} and
metamorphic testing (MT) has been used to test such programs. In MT,
metamorphic relations (MRs) are developed from intrinsic characteristics of the
software under test (SUT). These MRs are used to generate test data and to
verify the correctness of the test results without the presence of a test
oracle. Effectiveness of MT heavily depends on the MRs used for testing. In
this paper we have conducted an extensive empirical study to evaluate the fault
detection effectiveness of MRs that have been used in multiple previous studies
to test supervised classifiers. Our study uses a total of 709 reachable mutants
generated by multiple mutation engines and uses data sets with varying
characteristics to test the SUT. Our results reveal that only 14.8\% of these
mutants are detected using the MRs and that the fault detection effectiveness
of these MRs do not scale with the increased number of mutants when compared to
what was reported in previous studies.Comment: 8 pages, AITesting 201
The Oracle Problem in Software Testing: A Survey
Testing involves examining the behaviour of a system in order to discover potential faults. Given an input for a system, the challenge of distinguishing the corresponding desired, correct behaviour from potentially incorrect behavior is called the “test oracle problem”. Test oracle automation is important to remove a current bottleneck that inhibits greater overall test automation. Without test oracle automation, the human has to determine whether observed behaviour is correct. The literature on test oracles has introduced techniques for oracle automation, including modelling, specifications, contract-driven development and metamorphic testing. When none of these is completely adequate, the final source of test oracle information remains the human, who may be aware of informal specifications, expectations, norms and domain specific information that provide informal oracle guidance. All forms of test oracles, even the humble human, involve challenges of reducing cost and increasing benefit. This paper provides a comprehensive survey of current approaches to the test oracle problem and an analysis of trends in this important area of software testing research and practice
The Earth is Flat? Unveiling Factual Errors in Large Language Models
Large Language Models (LLMs) like ChatGPT are foundational in various
applications due to their extensive knowledge from pre-training and
fine-tuning. Despite this, they are prone to generating factual and commonsense
errors, raising concerns in critical areas like healthcare, journalism, and
education to mislead users. Current methods for evaluating LLMs' veracity are
limited by test data leakage or the need for extensive human labor, hindering
efficient and accurate error detection. To tackle this problem, we introduce a
novel, automatic testing framework, FactChecker, aimed at uncovering factual
inaccuracies in LLMs. This framework involves three main steps: First, it
constructs a factual knowledge graph by retrieving fact triplets from a
large-scale knowledge database. Then, leveraging the knowledge graph,
FactChecker employs a rule-based approach to generates three types of questions
(Yes-No, Multiple-Choice, and WH questions) that involve single-hop and
multi-hop relations, along with correct answers. Lastly, it assesses the LLMs'
responses for accuracy using tailored matching strategies for each question
type. Our extensive tests on six prominent LLMs, including text-davinci-002,
text-davinci-003, ChatGPT~(gpt-3.5-turbo, gpt-4), Vicuna, and LLaMA-2, reveal
that FactChecker can trigger factual errors in up to 45\% of questions in these
models. Moreover, we demonstrate that FactChecker's test cases can improve
LLMs' factual accuracy through in-context learning and fine-tuning (e.g.,
llama-2-13b-chat's accuracy increase from 35.3\% to 68.5\%). We are making all
code, data, and results available for future research endeavors
Perfect is the enemy of test oracle
Automation of test oracles is one of the most challenging facets of software
testing, but remains comparatively less addressed compared to automated test
input generation. Test oracles rely on a ground-truth that can distinguish
between the correct and buggy behavior to determine whether a test fails
(detects a bug) or passes. What makes the oracle problem challenging and
undecidable is the assumption that the ground-truth should know the exact
expected, correct, or buggy behavior. However, we argue that one can still
build an accurate oracle without knowing the exact correct or buggy behavior,
but how these two might differ. This paper presents SEER, a learning-based
approach that in the absence of test assertions or other types of oracle, can
determine whether a unit test passes or fails on a given method under test
(MUT). To build the ground-truth, SEER jointly embeds unit tests and the
implementation of MUTs into a unified vector space, in such a way that the
neural representation of tests are similar to that of MUTs they pass on them,
but dissimilar to MUTs they fail on them. The classifier built on top of this
vector representation serves as the oracle to generate "fail" labels, when test
inputs detect a bug in MUT or "pass" labels, otherwise. Our extensive
experiments on applying SEER to more than 5K unit tests from a diverse set of
open-source Java projects show that the produced oracle is (1) effective in
predicting the fail or pass labels, achieving an overall accuracy, precision,
recall, and F1 measure of 93%, 86%, 94%, and 90%, (2) generalizable, predicting
the labels for the unit test of projects that were not in training or
validation set with negligible performance drop, and (3) efficient, detecting
the existence of bugs in only 6.5 milliseconds on average.Comment: Published in ESEC/FSE 202
Automated metamorphic testing of variability analysis tools
Variability determines the capability of software applications to be configured and customized. A common
need during the development of variability–intensive systems is the automated analysis of their underlying
variability models, e.g. detecting contradictory configuration options. The analysis operations that are
performed on variability models are often very complex, which hinders the testing of the corresponding
analysis tools and makes difficult, often infeasible, to determine the correctness of their outputs, i.e.
the well–known oracle problem in software testing. In this article, we present a generic approach for
the automated detection of faults in variability analysis tools overcoming the oracle problem. Our work
enables the generation of random variability models together with the exact set of valid configurations
represented by these models. These test data are generated from scratch using step–wise transformations
and assuring that certain constraints (a.k.a. metamorphic relations) hold at each step. To show the feasibility
and generalizability of our approach, it has been used to automatically test several analysis tools in three
variability domains: feature models, CUDF documents and Boolean formulas. Among other results, we
detected 19 real bugs in 7 out of the 15 tools under test.CICYT TIN2012-32273CICYT IPT-2012- 0890-390000Junta de Andalucía TIC-5906Junta de Andalucía P12-TIC- 186
Automating test oracles generation
Software systems play a more and more important role in our everyday life. Many relevant human activities nowadays involve the execution of a piece of software. Software has to be reliable to deliver the expected behavior, and assessing the quality of software is of primary importance to reduce the risk of runtime errors. Software testing is the most common quality assessing technique for software. Testing consists in running the system under test on a finite set of inputs, and checking the correctness of the results. Thoroughly testing a software system is expensive and requires a lot of manual work to define test inputs (stimuli used to trigger different software behaviors) and test oracles (the decision procedures checking the correctness of the results). Researchers have addressed the cost of testing by proposing techniques to automatically generate test inputs. While the generation of test inputs is well supported, there is no way to generate cost-effective test oracles: Existing techniques to produce test oracles are either too expensive to be applied in practice, or produce oracles with limited effectiveness that can only identify blatant failures like system crashes. Our intuition is that cost-effective test oracles can be generated using information produced as a byproduct of the normal development activities. The goal of this thesis is to create test oracles that can detect faults leading to semantic and non-trivial errors, and that are characterized by a reasonable generation cost. We propose two ways to generate test oracles, one derives oracles from the software redundancy and the other from the natural language comments that document the source code of software systems. We present a technique that exploits redundant sequences of method calls encoding the software redundancy to automatically generate test oracles named CCOracles. We describe how CCOracles are automatically generated, deployed, and executed. We prove the effectiveness of CCOracles by measuring their fault-finding effectiveness when combined with both automatically generated and hand-written test inputs. We also present Toradocu, a technique that derives executable specifications from Javadoc comments of Java constructors and methods. From such specifications, Toradocu generates test oracles that are then deployed into existing test suites to assess the outputs of given test inputs. We empirically evaluate Toradocu, showing that Toradocu accurately translates Javadoc comments into procedure specifications. We also show that Toradocu oracles effectively identify semantic faults in the SUT. CCOracles and Toradocu oracles stem from independent information sources and are complementary in the sense that they check different aspects of the system undertest
Metamorphic Testing for Software Libraries and Graphics Compilers
Metamorphic Testing is a testing technique which mutates existing test cases in semantically equivalent forms, by making use of metamorphic relations, while avoiding the oracle problem. However, these required relations are not readily available for a given system under test. Defining effective metamorphic relations is difficult, and arguably the main obstacle towards adoption of metamorphic testing in production-level software development. One example application is testing graphics compilers, where the approximate and under-specified nature of the domain makes it hard to apply more traditional techniques. We propose an approach with a lower barrier of entry to applying metamorphic testing for a software library. The user must still identify relations that hold over their particular library, but can do so within a development-like environment. We apply methods from the domains of metamorphic testing and fuzzing to produce complex test cases. We consider the user interaction a bonus, as they can control what parts of the target codebase is tested, potentially focusing on less-tested or critical sections of the codebase. We implement our proposed approach in a tool, MF++, which synthesises C++ test cases for a C++ library, defined by user-provided ingredients. We applied MF++ to 7 libraries in the domains of satisfiability modulo theories and Presburger arithmetic,. Our evaluation of MF++ was able to identify 21 bugs in these tools. We additionally provide an automatic reducer for tests generated by MF++, named MF++R. In addition to minimising tests exposing issues, MF++R can also be used to identify incorrect user-provided relations. Additionally, we investigate the combined use of MF++ and MF++R in order to augment code coverage of library test suites. We assess the utility of this application by contributing 21 tests aimed at improving coverage across 3 libraries.Open Acces
Automated inference of likely metamorphic relations for model transformations
Model transformations play a cornerstone role in Model-Driven Engineering (MDE) as they provide the essential mechanisms for
manipulating and transforming models. Checking whether the output of a model transformation is correct is a manual and errorprone
task, referred to as the oracle problem. Metamorphic testing alleviates the oracle problem by exploiting the relations among
di erent inputs and outputs of the program under test, so-called metamorphic relations (MRs). One of the main challenges in
metamorphic testing is the automated inference of likely MRs.
This paper proposes an approach to automatically infer likely MRs for ATL model transformations, where the tester does not
need to have any knowledge of the transformation. The inferred MRs aim at detecting faults in model transformations in three
application scenarios, namely regression testing, incremental transformations and migrations among transformation languages. In
the experiments performed, the inferred likely MRs have proved to be quite accurate, with a precision of 96.4% from a total of 4101
true positives out of 4254 MRs inferred. Furthermore, they have been useful for identifying mutants in regression testing scenarios,
with a mutation score of 93.3%. Finally, our approach can be used in conjunction with current approaches for the automatic
generation of test cases.Comisión Interministerial de Ciencia y Tecnología TIN2015-70560-RJunta de Andalucía P12-TIC-186
- …