
    Fault Revealing Test Oracles, Are We There Yet? Evaluating The Effectiveness Of Automatically Generated Test Oracles On Manually-Written And Automatically Generated Unit Tests

    Master's thesis, Informatics Engineering, 2022, Universidade de Lisboa, Faculdade de Ciências. Automated test suite generation tools have been used in real development scenarios and proven to be able to detect real faults. These tools, however, do not know the expected behavior of the system: they generate tests that execute the faulty behavior but fail to identify the fault due to poor test oracles. To solve this problem, researchers have developed several approaches to automatically generate test oracles that resemble manually-written ones. However, there remain some questions regarding the use of these tools in real development scenarios. In particular, how effective are automatically generated test oracles at revealing real faults? How long do these tools require to generate an oracle? To answer these questions, we applied a recent and promising test oracle generation approach (T5) to all fault-revealing test cases in the DEFECTS4J collection and investigated how effective the generated test oracles are at detecting real faults, as well as the time the tool requires to generate them. Our results show that: (1) out of the box, oracles generated by T5 do not compile; (2) after a simple procedure, out of the 1696 test oracles, only 466 compile and 58 of them manage to correctly identify the fault; (3) when considering the 835 bugs in DEFECTS4J, T5 was able to detect 27, i.e., 3.23% of the bugs. Moreover, T5 required, on average, 401.3 seconds to generate a test oracle. The approaches and datasets presented in this thesis bring automated test oracle generation one step closer to being used in real software, by providing insight into current problems of several tools and by introducing a way to evaluate automated test oracle generation tools under development with respect to their effectiveness at detecting real software faults.
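
    The evaluation described above is essentially a pipeline over Defects4J test cases. The sketch below is a simplified, hypothetical rendering of that pipeline, not the thesis's actual scripts: generate_oracle, compile_test, run_test, and with_oracle are placeholder callables standing in for the T5 model and the Defects4J tooling, and an oracle is treated here as fault-revealing only if the test it is spliced into compiles, passes on the fixed version, and fails on the buggy one.

```python
# Hypothetical sketch of the oracle evaluation loop described above; the
# helper callables stand in for the T5 model and the Defects4J tooling.
import time

def evaluate_oracles(test_cases, generate_oracle, compile_test, run_test):
    """For each fault-revealing test, generate an oracle, then record whether
    it compiles and whether it exposes the fault."""
    stats = {"generated": 0, "compiled": 0, "fault_revealing": 0, "times": []}
    for test in test_cases:
        start = time.time()
        oracle = generate_oracle(test)        # e.g., a T5-style model
        stats["times"].append(time.time() - start)
        stats["generated"] += 1

        candidate = test.with_oracle(oracle)  # splice the assertion into the test
        if not compile_test(candidate):
            continue                          # many generated oracles do not compile as-is
        stats["compiled"] += 1

        # Here an oracle is treated as identifying the fault if the test
        # passes on the fixed program version but fails on the buggy one.
        if run_test(candidate, version="fixed") and not run_test(candidate, version="buggy"):
            stats["fault_revealing"] += 1

    stats["avg_generation_time"] = sum(stats["times"]) / max(len(stats["times"]), 1)
    return stats
```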

    Automated Unit Testing of Evolving Software

    As software programs evolve, developers need to ensure that new changes do not affect the originally intended functionality of the program. To increase their confidence, developers commonly write unit tests along with the program and execute them after a change is made. However, manually writing these unit tests is difficult and time-consuming, and as their number increases, so does the cost of executing and maintaining them. Automated test generation techniques have been proposed in the literature to assist developers in the endeavour of writing these tests. However, it remains an open question how well these tools can help with fault finding in practice, and maintaining these automatically generated tests may require extra effort compared to human-written ones. This thesis evaluates the effectiveness of a number of existing automatic unit test generation techniques at detecting real faults, and explores how these techniques can be improved. In particular, we present a novel multi-objective search-based approach for generating tests that reveal changes across two versions of a program. We then investigate whether these tests can be used such that no maintenance effort is necessary. Our results show that, overall, state-of-the-art test generation tools can indeed be effective at detecting real faults: collectively, the tools revealed more than half of the bugs we studied. We also show that our proposed alternative technique, which is better suited to the problem of revealing changes, can detect more faults and does so more frequently. However, we also find that for a majority of object-oriented programs, even a random search can achieve good results. Finally, we show that such change-revealing tests can be generated on demand in practice, without requiring them to be maintained over time.
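
    The core idea of a change-revealing test admits a compact illustration. The sketch below is a minimal, hypothetical rendering of that idea, not the thesis's multi-objective search itself: run_on_old_version and run_on_new_version are placeholder callables, and a test is kept if its observable outcome differs across the two versions, which is also why such tests need no manually written assertions.

```python
# Minimal sketch (not the thesis's implementation) of what it means for a
# generated test to be "change-revealing": the same test is executed against
# both program versions and the observable outcomes are compared.
def reveals_change(test, run_on_old_version, run_on_new_version):
    """Return True if the test's observable outcome differs across versions.

    The run_on_*_version arguments are placeholder callables that execute the
    test in a sandbox and return a comparable outcome (return value,
    exception, printed output).
    """
    return run_on_old_version(test) != run_on_new_version(test)


def filter_change_revealing(candidate_tests, run_old, run_new):
    # Because the oracle is simply "behaviour differs between versions",
    # no hand-written assertions are needed, so the kept tests require
    # no maintenance over time.
    return [t for t in candidate_tests if reveals_change(t, run_old, run_new)]
```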

    A detailed investigation of the effectiveness of whole test suite generation

    © 2016 The Author(s). A common application of search-based software testing is to generate test cases for all goals defined by a coverage criterion (e.g., lines, branches, mutants). Rather than generating one test case at a time for each of these goals individually, whole test suite generation optimizes entire test suites towards satisfying all goals at the same time. There is evidence that the overall coverage achieved with this approach is superior to that of targeting individual coverage goals. Nevertheless, there remains some uncertainty on (a) whether the results generalize beyond branch coverage, (b) whether the whole test suite approach might be inferior to a more focused search for some particular coverage goals, and (c) whether generating whole test suites could be optimized by only targeting coverage goals not already covered. In this paper, we perform an in-depth analysis to study these questions. An empirical study on 100 Java classes using three different coverage criteria reveals that there are indeed some testing goals that are only covered by the traditional approach, although their number is very small in comparison with those that are exclusively covered by the whole test suite approach. We find that keeping an archive of already covered goals, along with the tests covering them, and focusing the search on uncovered goals overcomes this small drawback on larger classes, leading to an improved overall effectiveness of whole test suite generation.
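
    The archive mechanism evaluated in the paper can be sketched in a few lines. The code below is a hypothetical simplification, not EvoSuite's implementation: goals are assumed to expose a distance function, covered_by is a placeholder that reports which goals a test covers, and random_suite/mutate stand in for the usual search operators. The key point is that covered goals (and the tests covering them) move into an archive, and the fitness function only rewards progress on goals that are still uncovered.

```python
# Simplified, hypothetical sketch of archive-based whole test suite
# generation (not EvoSuite's actual implementation).
import random

def whole_suite_search(goals, random_suite, mutate, covered_by, budget):
    archive = {}  # covered goal -> a test that covers it
    population = [random_suite() for _ in range(50)]

    def fitness(suite):
        # Only goals not yet in the archive contribute; lower is better.
        return sum(g.distance(suite) for g in goals if g not in archive)

    for _ in range(budget):
        # Move newly covered goals (and their covering tests) into the archive.
        for suite in population:
            for test in suite:
                for goal in covered_by(test):
                    archive.setdefault(goal, test)
        # Evolve: keep the better half, refill with mutants of survivors.
        population.sort(key=fitness)
        survivors = population[: len(population) // 2]
        population = survivors + [mutate(random.choice(survivors)) for _ in survivors]

    # The final suite is assembled from the archive, one test per covered goal.
    return list(dict.fromkeys(archive.values()))
```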

    Test smells 20 years later: detectability, validity, and reliability

    Acquired under the Swiss National Licences (http://www.nationallizenzen.ch). Test smells aim to capture design issues in test code that reduce its maintainability. These have been extensively studied and generally found to be quite prevalent in both human-written and automatically generated test cases. However, most evidence of prevalence is based on specific static detection rules. Although those are based on the original, conceptual definitions of the various test smells, recent empirical studies indicate that developers perceive warnings raised by detection tools as overly strict and non-representative of the maintainability and quality of test suites. This leads us to re-assess the detection accuracy of test smell detection tools and investigate the prevalence and detectability of test smells more broadly. Specifically, we constructed a hand-annotated dataset spanning hundreds of test suites, both written by developers and generated by two test generation tools (EvoSuite and JTExpert), and performed a multi-stage, cross-validated manual analysis to identify the presence of six types of test smells in these suites. We then use this manual labeling to benchmark the performance and external validity of two test smell detection tools – one widely used in prior work and one recently introduced with the express goal of matching developer perceptions of test smells. Our results primarily show that the current vocabulary of test smells is highly mismatched to real concerns: multiple smells were ubiquitous in developer-written tests but virtually never correlated with semantic or maintainability flaws; machine-generated tests often scored better, but in reality suffered from a host of problems not well captured by current test smells. Current test smell detection strategies poorly characterized the issues in these automatically generated test suites; in particular, the older tool's detection strategies misclassified over 70% of test smells, both missing real instances (false negatives) and marking many smell-free tests as smelly (false positives). We identify common patterns in these tests that can be used to improve the tools, refine and update the definition of certain test smells, and highlight as yet uncharacterized issues. Our findings suggest the need for (i) more appropriate metrics to match development practice, and (ii) more accurate detection strategies, to be evaluated primarily in industrial contexts.
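
    The kind of static detection rule the study benchmarks is purely syntactic, which is also why it can be stricter than what developers perceive as a problem. The sketch below is an illustrative, hypothetical rule only (the benchmarked tools target Java tests): a simplistic "Assertion Roulette" check for Python unittest code that flags a test containing several assertions, none of which carries an explanatory message.

```python
# Hypothetical sketch of a syntactic test smell detection rule, here a
# simplistic "Assertion Roulette" check: several assertions in one test,
# none with an explanatory message.
import ast

def has_assertion_roulette(test_function_source, threshold=2):
    tree = ast.parse(test_function_source)
    unexplained = 0
    for node in ast.walk(tree):
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute):
            if node.func.attr.startswith("assert"):
                # unittest assertions accept an optional 'msg' argument
                if not any(kw.arg == "msg" for kw in node.keywords):
                    unexplained += 1
    return unexplained > threshold

smelly = """
def test_account(self):
    self.assertEqual(account.balance(), 100)
    self.assertTrue(account.is_open())
    self.assertIsNone(account.last_error())
"""
print(has_assertion_roulette(smelly))  # True: three assertions, no messages
```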

    Effective Test Generation Using Pre-trained Large Language Models and Mutation Testing

    One of the critical phases in software development is software testing. Testing helps with identifying potential bugs and reducing maintenance costs. The goal of automated test generation tools is to ease the development of tests by suggesting efficient bug-revealing tests. Recently, researchers have leveraged Large Language Models (LLMs) of code to generate unit tests. While the code coverage of generated tests is usually assessed, the literature has acknowledged that coverage is weakly correlated with the efficiency of tests in bug detection. To address this limitation, in this paper we introduce MuTAP, which improves the effectiveness of test cases generated by LLMs in terms of revealing bugs by leveraging mutation testing. Our goal is achieved by augmenting prompts with surviving mutants, as those mutants highlight the limitations of test cases in detecting bugs. MuTAP is capable of generating effective test cases in the absence of natural language descriptions of the Program Under Test (PUT). We employ different LLMs within MuTAP and evaluate their performance on different benchmarks. Our results show that our proposed method is able to detect up to 28% more faulty human-written code snippets. Among these, 17% remained undetected by both the current state-of-the-art fully automated test generation tool (i.e., Pynguin) and zero-shot/few-shot learning approaches on LLMs. Furthermore, MuTAP achieves a Mutation Score (MS) of 93.57% on synthetic buggy code, outperforming all other approaches in our evaluation. Our findings suggest that although LLMs can serve as a useful tool to generate test cases, they require specific post-processing steps to enhance the effectiveness of the generated test cases, which may suffer from syntactic or functional errors and may be ineffective in detecting certain types of bugs and in testing corner cases of the PUT.
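
    The mutation-guided prompt augmentation described above can be summarized as a short feedback loop. The sketch below is hedged and hypothetical: llm_generate, run_mutation_testing, and the prompt wording are placeholders rather than MuTAP's actual interface, but the loop reflects the idea stated in the abstract of feeding surviving mutants back into the prompt.

```python
# Hedged sketch of a mutation-guided prompt augmentation loop; the helper
# callables and prompt format are placeholders, not MuTAP's actual interface.
def generate_effective_tests(put_source, llm_generate, run_mutation_testing, max_rounds=3):
    """Iteratively strengthen LLM-generated tests using surviving mutants."""
    prompt = f"Write unit tests for the following function:\n{put_source}\n"
    tests = llm_generate(prompt)

    for _ in range(max_rounds):
        result = run_mutation_testing(put_source, tests)
        if not result.surviving_mutants:
            break  # every mutant is killed; the tests are as strong as the mutant set
        # Surviving mutants show exactly which behavioural changes the
        # current tests fail to detect, so they are added to the prompt.
        survivors = "\n".join(m.diff for m in result.surviving_mutants)
        prompt = (
            "The following tests do not detect these code changes (mutants):\n"
            f"{survivors}\n"
            f"Original function:\n{put_source}\n"
            f"Current tests:\n{tests}\n"
            "Write improved tests that also detect the changes above.\n"
        )
        tests = llm_generate(prompt)

    return tests
```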

    Revisiting test smells in automatically generated tests: limitations, pitfalls, and opportunities

    © 2020 IEEE. Test smells attempt to capture design issues in test code that reduce its maintainability. Previous work found such smells to be highly common in automatically generated test cases, but based this result on specific static detection rules; although these are based on the original definition of "test smells", a recent empirical study showed that developers perceive them as overly strict and non-representative of the maintainability and quality of test suites. This leads us to investigate how effective such test smell detection tools are on automatically generated test suites. In this paper, we build a dataset of 2,340 test cases automatically generated by EVOSUITE for 100 Java classes. We performed a multi-stage, cross-validated manual analysis to identify six types of test smells and label their instances. We benchmark the performance of two test smell detection tools: one widely used in prior work, and one recently introduced with the express goal of matching developer perceptions of test smells. Our results show that these test smell detection strategies poorly characterized the issues in automatically generated test suites; the older tool's detection strategies, especially, misclassified over 70% of test smells, both missing real instances (false negatives) and marking many smell-free tests as smelly (false positives). We identify common patterns in these tests that can be used to improve the tools, refine and update the definition of certain test smells, and highlight as yet uncharacterized issues. Our findings suggest the need for (i) more appropriate metrics to match development practice; and (ii) more accurate detection strategies, to be evaluated primarily in industrial contexts.
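
    Benchmarking a detection tool against the hand-labeled dataset reduces to comparing tool warnings with manual labels per (test, smell) pair. The sketch below is a hypothetical scoring helper, not the paper's analysis scripts; it reports precision, recall, and the two kinds of misclassification the abstract mentions.

```python
# Hypothetical sketch of scoring a test smell detector against manual labels.
# Both inputs are sets of (test_id, smell_type) pairs.
def benchmark(tool_warnings, manual_labels):
    true_positives = tool_warnings & manual_labels
    false_positives = tool_warnings - manual_labels   # smell-free tests flagged as smelly
    false_negatives = manual_labels - tool_warnings   # real smell instances the tool missed

    precision = len(true_positives) / len(tool_warnings) if tool_warnings else 0.0
    recall = len(true_positives) / len(manual_labels) if manual_labels else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "false_positives": len(false_positives),
        "false_negatives": len(false_negatives),
    }
```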

    From software failure to explanation

    “Why does my program crash?” This ever-recurring question drives the developer, both when trying to reconstruct a failure that happened in the field and during the analysis and debugging of the test case that captures the failure. This is the question this thesis attempts to answer. To that end, I present two approaches which, when combined, start off with only a dump of the memory at the moment of the crash (a core dump) and eventually give a full explanation of the failure in terms of important runtime features of the program, such as critical branches, state predicates, or any other execution aspect that is deemed helpful for understanding the underlying problem. The first approach (called RECORE) takes a core dump of a crash and, by means of search-based test case generation, comes up with a small, self-contained, and easy-to-understand unit test that resembles the tests attached to bug reports and reproduces the failure. This test case can serve as a starting point for analysis and manual debugging. Our evaluation shows that in five out of seven real cases, the resulting test captures the essence of the failure. This failing test case can also serve as the starting point for the second approach (called BUGEX). BUGEX is a universal debugging framework that applies the scientific method and can be implemented for arbitrary runtime features (called facts). First, it observes those facts during the execution of the failing test case. Using state-of-the-art statistical debugging, these facts are then correlated to the failure, forming a hypothesis. Then it performs experiments: it generates additional executions to challenge these facts and, from these additional observations, refines the hypothesis. The result is a correlation of critical execution aspects to the failure with unprecedented accuracy, which instantaneously points the developer to the problem. This general debugging framework can be implemented for any runtime aspect; for evaluation purposes, I implemented it for branches and state predicates. The evaluation shows that in six out of seven real cases, the resulting facts pinpoint the failure. Both approaches are independent from one another, and each automates a tedious and error-prone task. When combined, they automate a large part of the debugging process; the remaining manual task, fixing the defect, can never be fully automated.
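
    The statistical-debugging step of BUGEX, correlating observed facts with failure, can be illustrated with a short ranking sketch. The abstract does not name a specific correlation metric, so the code below uses the well-known Ochiai formula as one example; the data layout is an assumption made for illustration.

```python
# Hedged sketch of ranking runtime "facts" (branches taken, state predicates)
# by how strongly they correlate with failing executions, using the Ochiai
# metric as one example of a statistical debugging formula.
import math

def rank_facts(runs):
    """runs: list of (observed_facts: set, failed: bool) pairs."""
    total_failed = sum(1 for _, failed in runs if failed)
    failed_with, passed_with = {}, {}  # fact -> count of failing / passing runs observing it
    for facts, failed in runs:
        bucket = failed_with if failed else passed_with
        for fact in facts:
            bucket[fact] = bucket.get(fact, 0) + 1

    def ochiai(fact):
        ef = failed_with.get(fact, 0)   # failing runs that observed the fact
        ep = passed_with.get(fact, 0)   # passing runs that observed the fact
        denom = math.sqrt(total_failed * (ef + ep))
        return ef / denom if denom else 0.0

    # The top-ranked facts are the candidates presented to the developer,
    # after additional generated executions confirm or refute them.
    return sorted(set(failed_with) | set(passed_with), key=ochiai, reverse=True)
```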

    Random or Evolutionary Search for Object-Oriented Test Suite Generation?

    An important aim in software testing is constructing a test suite with high structural code coverage – that is, ensuring that most, if not all, of the code under test has been executed by the test cases comprising the test suite. Several search-based techniques have proved successful at automatically generating tests that achieve high coverage. However, despite the well-established arguments behind using evolutionary search algorithms (e.g., genetic algorithms) in preference to random search, it remains an open question whether the benefits can actually be observed in practice when generating unit test suites for object-oriented classes. In this paper, we report an empirical study on the effects of using evolutionary algorithms (including a genetic algorithm and chemical reaction optimization) to generate test suites, compared with generating test suites incrementally with random search. We apply the EVOSUITE unit test suite generator to 1,000 classes randomly selected from the SF110 corpus of open source projects. Surprisingly, the results show that the difference is much smaller than one might expect: while evolutionary search covers more branches of the type where standard fitness functions provide guidance, in practice the vast majority of branches do not provide any guidance to the search. These results suggest that, although evolutionary algorithms are more effective at covering complex branches, a random search may suffice to achieve high coverage of most object-oriented classes.
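
    The distinction between branches that guide the search and branches that do not can be made concrete with two tiny fitness examples. The sketch below is illustrative only (not EvoSuite's implementation): for a numeric comparison, the usual branch-distance idea yields a gradient an evolutionary algorithm can follow, whereas a boolean "flag" condition yields a flat distance, so evolution has no advantage over random sampling there.

```python
# Illustrative branch-distance sketches (not EvoSuite's implementation).

def branch_distance_numeric(x, target=42):
    # For a branch like 'if x == target': the distance shrinks smoothly as x
    # approaches target, so the search receives guidance.
    return abs(x - target)

def branch_distance_flag(flag):
    # For a branch like 'if obj.is_valid()': the condition is either satisfied
    # or not, with nothing in between to guide the search.
    return 0 if flag else 1
```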

    Demystifying security and compatibility issues in Android Apps

    Never before has any OS been as popular as Android. Today's mobile phones are not simply devices for making phone calls and receiving SMS messages, but powerful communication and entertainment platforms for web surfing, social networking, and more. Even though the Android OS offers powerful communication and application execution capabilities, it is riddled with defects (e.g., security risks and compatibility issues), new vulnerabilities come to light daily, and bugs cost the economy tens of billions of dollars annually. For example, malicious apps (e.g., back-doors, fraud apps, ransomware, spyware, etc.) are reported [Google, 2022] to exhibit malicious behaviours, including stealing private data and installing unwanted programs. To counteract these threats, many works have been proposed that rely on static analysis techniques to detect such issues. However, static techniques are not sufficient on their own to detect such defects precisely: they are likely to yield false positives, as static analysis has to make trade-offs when handling complicated cases (e.g., object-sensitive vs. object-insensitive analysis). In addition, static analysis techniques are also likely to suffer from soundness issues, because some complicated features (e.g., reflection, obfuscation, and hardening) are difficult to handle [Sun et al., 2021b, Samhi et al., 2022].