Will My Tests Tell Me If I Break This Code?
Automated tests play an important role in software evolution because they can
rapidly detect faults introduced during changes. In practice, code-coverage
metrics are often used as criteria to evaluate the effectiveness of test suites
with focus on regression faults. However, code coverage only expresses which
portion of a system has been executed by tests, but not how effective the tests
actually are in detecting regression faults. Our goal was to evaluate the
validity of code coverage as a measure for test effectiveness. To do so, we
conducted an empirical study in which we applied an extreme mutation testing
approach to analyze the tests of open-source projects written in Java. We
assessed the ratio of pseudo-tested methods (those tested in a way such that
faults would not be detected) to all covered methods and judged their impact on
the software project. The results show that the ratio of pseudo-tested methods
is acceptable for unit tests but not for system tests (that execute large
portions of the whole system). Therefore, we conclude that the coverage metric
is only a valid effectiveness indicator for unit tests.
Comment: 7 pages, 3 figures
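The idea of a pseudo-tested method can be illustrated with a minimal sketch. It is written in Python rather than Java for brevity and is not the study's tooling: the dictionary-based "module", the helper names `is_pseudo_tested` and `pseudo_tested_ratio`, and the toy functions are all hypothetical. The extreme mutant here simply replaces a whole function body with a stub that returns `None`, which is a simplification of what a real extreme mutation tool would do.

```python
from typing import Callable, Dict, List

Module = Dict[str, Callable]      # hypothetical stand-in for the code under test
Check = Callable[[Module], None]  # a test; it raises an exception on failure


def is_pseudo_tested(module: Module, name: str, tests: List[Check]) -> bool:
    """Return True if every test still passes after the body of `name` is removed."""
    original = module[name]
    # Extreme mutant (simplified): the whole body is replaced by a no-op stub.
    module[name] = lambda *args, **kwargs: None
    try:
        for test in tests:
            try:
                test(module)
            except Exception:
                return False  # some test noticed the missing body
        return True
    finally:
        module[name] = original  # always restore the real implementation


def pseudo_tested_ratio(module: Module, covered: List[str], tests: List[Check]) -> float:
    """Ratio of pseudo-tested methods to all covered methods."""
    if not covered:
        return 0.0
    flagged = [n for n in covered if is_pseudo_tested(module, n, tests)]
    return len(flagged) / len(covered)


# Toy data (assumed for illustration only).
def add(a, b):
    return a + b


def log(message):
    message.upper()  # executed by a test, but its effect is never checked


def check_add(m):
    assert m["add"](2, 3) == 5


def check_log(m):
    m["log"]("hi")  # covers log() without asserting anything


toy = {"add": add, "log": log}
ratio = pseudo_tested_ratio(toy, ["add", "log"], [check_add, check_log])
```

In this toy case both functions are covered, yet only `add` is actually checked, so half of the covered methods are pseudo-tested.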
Comparing the effectiveness of testing methods in improving programs: the effect of variations in program quality
We compare the efficacy of different testing methods for improving the reliability of software. Specifically, we use modelling to compare “operational” testing, in which test cases are chosen according to their probability of occurring in actual use of the software, against “debug” testing methods, in which the testers look for test cases which they consider likely to cause failure, or that satisfy some coverage criterion. We base our comparisons on the reliability reached by the program at the end of testing. Unlike previous studies, we consider the probability distribution of the achieved reliability, and thus the probability of satisfying specific requirements, rather than just the average reliability achieved. We take account of two sources of variation: the variation between the actual test histories that are possible for a given program and a given test method, and the fact that different programs start testing with different faults and initial reliability levels. By necessity, we use very simplified models of reality. Yet, we can show some interesting conclusions with important practical consequences. In general, there are stronger arguments in favor of operational testing than previous studies have shown.
Code coverage of adaptive random testing
Random testing is a basic software testing technique that can be used to assess software reliability as well as to detect software failures. Adaptive random testing has been proposed to enhance the failure-detection capability of random testing. Previous studies have shown that adaptive random testing can use fewer test cases than random testing to detect the first software failure. In this paper, we evaluate and compare the performance of adaptive random testing and random testing from another perspective, that of code coverage. As shown in various investigations, higher code coverage not only brings a higher failure-detection capability, but also improves the effectiveness of software reliability estimation. We conduct a series of experiments based on two categories of code coverage criteria: structure-based coverage and fault-based coverage. Adaptive random testing can achieve higher code coverage than random testing with the same number of test cases. Our experimental results imply that, in addition to having a better failure-detection capability than random testing, adaptive random testing also delivers a higher effectiveness in assessing software reliability, and a higher confidence in the reliability of the software under test even when no failure is detected.
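The abstract does not say which adaptive random testing algorithm was evaluated. As a rough illustration of the general idea (spreading test inputs evenly over the input domain), the sketch below uses the fixed-size-candidate-set variant on a one-dimensional numeric domain. The function name `fscs_art`, the candidate-set size `k`, and the input range are assumptions made for this example.

```python
import random
from typing import List


def fscs_art(n_tests: int, k: int = 10, low: float = 0.0, high: float = 1.0,
             seed: int = 0) -> List[float]:
    """Generate n_tests inputs; each new input is the candidate farthest
    from all previously chosen inputs (fixed-size-candidate-set ART)."""
    if n_tests <= 0:
        return []
    rng = random.Random(seed)
    executed = [rng.uniform(low, high)]  # the first test is purely random
    while len(executed) < n_tests:
        candidates = [rng.uniform(low, high) for _ in range(k)]
        # Keep the candidate whose nearest already-executed input is farthest away.
        best = max(candidates, key=lambda c: min(abs(c - e) for e in executed))
        executed.append(best)
    return executed


inputs = fscs_art(20)
```

Plain random testing corresponds to `k=1`, where the single candidate is always accepted; larger values of `k` trade generation time for a more even spread of inputs.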
Predicting Test Suite Effectiveness for Java Programs
The coverage of a test suite is often used as a proxy for its effectiveness. However, previous studies that investigated the influence of code coverage on test suite effectiveness have failed to reach a consensus about the nature and strength of the relationship between these test suite characteristics. Moreover, many of the studies were done with small or synthetic programs, making it unclear that their results generalize to larger programs. In addition, some of the studies did not account for the confounding influence of test suite size. We have extended these studies by evaluating the relationship between test suite size, block coverage, and effectiveness for large Java programs.
Our test subjects were four Java programs from different application domains: Apache POI, HSQLDB, JFreeChart, and Joda Time. All four are actively developed open source programs; they range from 80,000 to 284,000 source lines of code. For each test subject, we generated between 5,000 and 7,000 test suites by randomly selecting test methods from the program's entire test suite. The suites ranged in size from 3 to 3,000 methods. We used the coverage tool Emma to measure the block coverage of each suite and the mutation testing tool Javalanche to evaluate the effectiveness of each suite.
We found that there is a low correlation between block coverage and effectiveness when the number of tests in the suite is controlled for. This suggests that block coverage, while useful for identifying under-tested parts of a program, should not be used as a quality target because it is not a good indicator of test suite effectiveness.
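Controlling for suite size can be sketched as follows: sample many random suites of one fixed size, record each suite's coverage and the number of mutants it kills, and correlate the two series. This is only an illustration of the experimental design, not the study's scripts. The data layout (a dictionary mapping each test name to the blocks it covers and the mutants it kills), the function names, and the use of Pearson's coefficient are all assumptions; the abstract does not state which correlation measure was used.

```python
import math
import random
from typing import Dict, List, Set, Tuple

# Hypothetical layout: test name -> (blocks covered, mutants killed)
TestData = Dict[str, Tuple[Set[int], Set[int]]]


def sample_fixed_size_suites(tests: TestData, size: int, n_suites: int,
                             seed: int = 0) -> List[Tuple[int, int]]:
    """Return (blocks covered, mutants killed) for n_suites random suites of equal size."""
    rng = random.Random(seed)
    names = sorted(tests)
    points = []
    for _ in range(n_suites):
        suite = rng.sample(names, size)
        blocks = set().union(*(tests[t][0] for t in suite))
        killed = set().union(*(tests[t][1] for t in suite))
        points.append((len(blocks), len(killed)))
    return points


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation coefficient; NaN if either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return float("nan")
    return cov / (sx * sy)


# Toy data (assumed for illustration only).
toy_tests: TestData = {
    "t1": ({1, 2}, {10}),
    "t2": ({2, 3}, {10, 11}),
    "t3": ({4}, set()),
    "t4": ({1, 4, 5}, {12}),
}
points = sample_fixed_size_suites(toy_tests, size=2, n_suites=50)
r = pearson([p[0] for p in points], [p[1] for p in points])
```

Because every sampled suite has the same number of tests, any correlation that remains cannot be explained by suite size alone.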
The Effectiveness of t-Way Test Data Generation
Modern society is increasingly dependent on the correct functioning of software and increasingly so in areas that are considered safety related or safety critical. Therefore, there is an increasing need to be able to verify and validate that the software is in fact correct and will perform its intended function. Many approaches to this problem have been proposed; however, none seems likely to supplant the role of testing in the near future.
If we accept that there is, and will be, a continuing need to be able to test software then the question becomes one of how can this be done effectively, both in terms of ability to detect errors and in terms of cost. One avenue of research that offers prospects of improving both of these aspects is the automatic generation of test data.
There has recently been a large amount of work conducted in this area. One particularly promising direction has been the application of ideas from the field of experimental design and in particular, the field of t-way adequate factorial designs.
The area, however, is not without issues; there is evidence that the technique is capable of detecting errors but that evidence is not unequivocal. Moreover, as with almost all work in the area of automatic test generation, there has been very little work comparing the technique with other test data generation techniques. Worse, there has been effectively no work done that compares any automatic test data generation technique with the effectiveness of tests generated by humans. Another major issue with the technique is the number of tests that applying the technique can result in. This implies that there is a need for an automated oracle if the technique is to be successfully applied. The flaw with this is of course that in most situations the oracle is the human that is conducting the tests, a point often ignored in testing research.
The work presented here addresses both of these points. To do this I have used a code base taken from an industrial engine control system that has an existing set of high quality unit tests developed by hand. To complement this, several other techniques for automatically generating test data have been applied, namely random testing, random experimental designs and a technique for generating single factor experiments. To address the issue of being able to compare the error detection ability of all of the sets of test vectors, rather than the usual effectiveness surrogates of code coverage, I have used mutation analysis on the code base to directly measure the ability of each set of test vectors to discover common coding errors. The results presented here show that test data generation techniques based on t-way factorial designs are at least as effective as hand-generated tests and superior to random testing and the factor experimental technique.
The oracle problem associated with the factorial design techniques was addressed using a test set minimisation approach. The mutation tool monitored which vectors could “kill” which code mutants. After a subset of the test vectors had been run, the most effective vectors were retained and the rest discarded. Likewise, mutants that were killed were removed from further consideration and the process repeated. Experimental results show that this minimisation procedure is effective at reducing computational overhead and is capable of producing final sets of test vectors that are comparable in size with the sets of hand-generated tests and so amenable to final hand checking.
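The minimisation loop described above can be sketched roughly as follows. The abstract does not define "most effective", so this sketch assumes a greedy rule: within each batch, keep the vector that kills the most still-live mutants, drop those mutants, and repeat. The function name `minimise`, the batch structure, and the `kills` mapping are hypothetical.

```python
from typing import Dict, Hashable, List, Set, Tuple


def minimise(batches: List[List[str]],
             kills: Dict[str, Set[Hashable]]) -> Tuple[List[str], Set[Hashable]]:
    """Retain vectors that kill live mutants, batch by batch.

    batches: test vector ids, grouped into the subsets that are run together
    kills:   vector id -> set of mutants that vector kills
    Returns the retained vectors and the mutants still alive at the end.
    """
    # Only mutants killable by at least one vector are tracked here.
    live: Set[Hashable] = set().union(*kills.values())
    retained: List[str] = []
    for batch in batches:
        remaining = list(batch)
        while remaining and live:
            # Assumed effectiveness rule: most live mutants killed.
            best = max(remaining, key=lambda v: len(kills.get(v, set()) & live))
            gained = kills.get(best, set()) & live
            if not gained:
                break  # the rest of this batch adds nothing and is discarded
            retained.append(best)
            live -= gained  # killed mutants are removed from further consideration
            remaining.remove(best)
    return retained, live


# Toy data (assumed for illustration only).
toy_kills = {"v1": {1, 2}, "v2": {2}, "v3": {3}, "v4": {1, 3}}
kept, alive = minimise([["v1", "v2"], ["v3", "v4"]], toy_kills)
```

On the toy data, four vectors shrink to two while every killable mutant is still killed, which is the property that keeps the final set small enough for hand checking.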
Model based test suite minimization using metaheuristics
Software testing is one of the most widely used methods for quality assurance and fault detection purposes. However, it is one of the most expensive, tedious and time consuming activities in the software development life cycle. Code-based and specification-based testing has been going on for almost four decades. Model-based testing (MBT) is a relatively new approach to software testing where the software models, as opposed to other artifacts (i.e. source code), are used as the primary source of test cases. Models are simplified representations of a software system and are cheaper to execute than the original or deployed system. The main objective of the research presented in this thesis is the development of a framework for improving the efficiency and effectiveness of test suites generated from UML models. It focuses on three activities: transformation of an Activity Diagram (AD) model into a Colored Petri Net (CPN) model, generation and evaluation of an AD based test suite, and optimization of an AD based test suite. Unified Modeling Language (UML) is a de facto standard for software system analysis and design. UML models can be categorized into structural and behavioral models. AD is a behavioral type of UML model and since the major revision in UML version 2.x it has had new Petri-net-like semantics. It has a wide application scope including embedded, workflow and web-service systems. For this reason this thesis concentrates on AD models. The informal semantics of UML in general, and of AD in particular, is a major challenge in the development of UML based verification and validation tools. One solution to this challenge is transforming a UML model into an executable formal model. In the thesis, a three step transformation methodology is proposed for resolving ambiguities in an AD model and then transforming it into a CPN representation, which is a well-known formal language with extensive tool support. Test case generation is one of the most critical and labor intensive activities in testing processes.
The flow-oriented semantics of AD suit modeling both sequential and concurrent systems. The thesis presents a novel technique to generate test cases from AD using a stochastic algorithm. In order to determine if the generated test suite is adequate, two test suite adequacy analysis techniques based on structural coverage and mutation have been proposed. In terms of structural coverage, two separate coverage criteria are also proposed to evaluate the adequacy of the test suite from both perspectives, sequential and concurrent. Mutation analysis is a fault-based technique to determine if the test suite is adequate for detecting particular types of faults. Four categories of mutation operators are defined to seed specific faults into the mutant model. Another focus of the thesis is to improve the test suite efficiency without compromising its effectiveness. One way of achieving this is identifying and removing the redundant test cases. It has been shown that test suite minimization by removing redundant test cases is a combinatorial optimization problem. An evolutionary computation based test suite minimization technique is developed to address the test suite minimization problem and its performance is empirically compared with other well-known heuristic algorithms. Additionally, statistical analysis is performed to characterize the fitness landscape of test suite minimization problems. The proposed test suite minimization solution is extended to include multi-objective minimization. As the redundancy is contextual, different criteria and their combination can significantly change the solution test suite. Therefore, the last part of the thesis describes an investigation into multi-objective test suite minimization and optimization algorithms. The proposed framework is demonstrated and evaluated using prototype tools and case study models.
Empirical results have shown that the techniques developed within the framework are effective in model based test suite generation and optimization.
Redefining and Evaluating Coverage Criteria Based on the Testing Scope
Test coverage information can help testers in deciding when to stop testing and in augmenting their test suites when the measured coverage is not deemed sufficient. Since the notion of a test criterion was introduced in the 70’s, research on coverage testing has been very active with much effort dedicated to the definition of new, more cost-effective, coverage criteria or to the adaptation of existing ones to a different domain. All these studies share the premise that after defining the entity to be covered (e.g., branches), one cannot consider a program to be adequately tested if some of its entities have never been exercised by any input data. However, it is not the case that all entities are of interest in every context. This is particularly true for several paradigms that emerged in the last decade (e.g., component-based development, service-oriented architecture). In such cases, traditional coverage metrics might not always provide meaningful information. In this thesis we address this situation and we redefine coverage criteria so as to focus on the program parts that are relevant to the testing scope. We instantiate this general notion of scope-based coverage by introducing three coverage criteria and we demonstrate how they could be applied to different testing contexts. When applied to the context of software reuse, our approach proved to be useful for supporting test case prioritization, selection and minimization. Our studies showed that for prioritization we can improve the average rate of faults detected. For test case selection and minimization, we can considerably reduce the test suite size with small to no extra impact on fault detection effectiveness. When the source code is not available, such as in the service-oriented architecture paradigm, we propose an approach that customizes coverage, measured on invocations at the service interface, based on data from similar users.
We applied this approach to a real-world application and, in our study, we were able to predict the entities that would be of interest for a given user with high precision. Finally, we introduce the first-of-its-kind coverage criterion for operational profile based testing that exploits program spectra obtained from usage traces. Our study showed that it is better correlated than traditional coverage with the probability that the next test input will fail, which implies that our approach can provide a better stopping rule. Promising results were also observed for test case selection. Our redefinition of coverage criteria approaches the topic of coverage testing from a completely different angle. Such a novel perspective paves the way for new avenues of research towards improving the cost-effectiveness of testing, yet all to be explored.