
    Identifying and Explaining Safety-critical Scenarios for Autonomous Vehicles via Key Features

    Ensuring the safety of autonomous vehicles (AVs) is of utmost importance and testing them in simulated environments is a safer option than conducting in-field operational tests. However, generating an exhaustive test suite to identify critical test scenarios is computationally expensive as the representation of each test is complex and contains various dynamic and static features, such as the AV under test, road participants (vehicles, pedestrians, and static obstacles), environmental factors (weather and light), and the road's structural features (lanes, turns, road speed, etc.). In this paper, we present a systematic technique that uses Instance Space Analysis (ISA) to identify the significant features of test scenarios that affect their ability to reveal the unsafe behaviour of AVs. ISA identifies the features that best differentiate safety-critical scenarios from normal driving and visualises the impact of these features on test scenario outcomes (safe/unsafe) in 2D. This visualisation helps to identify untested regions of the instance space and provides an indicator of the quality of the test suite in terms of the percentage of feature space covered by testing. To test the predictive ability of the identified features, we train five Machine Learning classifiers to classify test scenarios as safe or unsafe. The high precision, recall, and F1 scores indicate that our proposed approach is effective in predicting the outcome of a test scenario without executing it and can be used for test generation, selection, and prioritization. Comment: 28 pages, 6 figures.
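    The classification step described above lends itself to a compact illustration. The sketch below is a minimal example, not the authors' implementation: it trains one such classifier on a table of scenario features and reports precision, recall, and F1. The file name scenarios.csv and the column names are hypothetical placeholders.

    ```python
    # Sketch: predict test-scenario outcomes (safe/unsafe) from static and
    # dynamic scenario features. File name and feature columns are
    # hypothetical placeholders, not the paper's actual feature set.
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import f1_score, precision_score, recall_score
    from sklearn.model_selection import train_test_split

    # Hypothetical columns: num_pedestrians, num_vehicles, weather, light,
    # num_turns, road_speed, ..., outcome (0 = safe, 1 = unsafe).
    df = pd.read_csv("scenarios.csv")
    X = pd.get_dummies(df.drop(columns=["outcome"]))  # one-hot encode categoricals
    y = df["outcome"]

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=0
    )

    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    clf.fit(X_train, y_train)
    pred = clf.predict(X_test)

    print("precision:", precision_score(y_test, pred))
    print("recall:   ", recall_score(y_test, pred))
    print("F1:       ", f1_score(y_test, pred))
    ```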

    Towards Reliable AI: Adequacy Metrics for Ensuring the Quality of System-level Testing of Autonomous Vehicles

    AI-powered systems have gained widespread popularity in various domains, including Autonomous Vehicles (AVs). However, ensuring their reliability and safety is challenging due to their complex nature. Conventional test adequacy metrics, designed to evaluate the effectiveness of traditional software testing, are often insufficient or impractical for these systems. White-box metrics, which are specifically designed for these systems, leverage neuron coverage information. These coverage metrics necessitate access to the underlying AI model and training data, which may not always be available. Furthermore, the existing adequacy metrics exhibit weak correlations with the ability to detect faults in the generated test suite, creating a gap that we aim to bridge in this study. In this paper, we introduce a set of black-box test adequacy metrics called "Test suite Instance Space Adequacy" (TISA) metrics, which can be used to gauge the effectiveness of a test suite. The TISA metrics offer a way to assess both the diversity and coverage of the test suite and the range of bugs detected during testing. Additionally, we introduce a framework that permits testers to visualise the diversity and coverage of the test suite in a two-dimensional space, facilitating the identification of areas that require improvement. We evaluate the efficacy of the TISA metrics by examining their correlation with the number of bugs detected in system-level simulation testing of AVs. A strong correlation, coupled with the short computation time, indicates their effectiveness and efficiency in estimating the adequacy of testing AVs. Comment: 12 pages, 7 figures.
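    As a rough sketch of how a black-box adequacy metric of this kind can be computed and evaluated, one can project each test's feature vector into 2D, measure how much of the resulting space the suite occupies, and correlate that score with the number of bugs revealed. The projection, grid size, and data below are illustrative assumptions, not the paper's exact TISA definitions.

    ```python
    # Sketch: estimate 2D coverage of a test suite in a projected instance
    # space and correlate it with bugs found. Illustrative only; not the
    # paper's exact TISA metric definitions.
    import numpy as np
    from scipy.stats import spearmanr
    from sklearn.decomposition import PCA

    def coverage_2d(features, grid=20):
        """Fraction of grid cells occupied by the suite's instances after
        projecting their feature vectors into 2D."""
        pts = PCA(n_components=2).fit_transform(features)
        # Normalise each axis to [0, 1], then bin into a grid x grid lattice.
        pts = (pts - pts.min(axis=0)) / (np.ptp(pts, axis=0) + 1e-12)
        cells = set(map(tuple, np.floor(pts * grid).astype(int).clip(0, grid - 1)))
        return len(cells) / (grid * grid)

    rng = np.random.default_rng(0)
    # Hypothetical data: feature matrices for four test suites and the number
    # of bugs each suite revealed in simulation-based testing.
    suites = [rng.normal(size=(n, 8)) for n in (30, 60, 120, 240)]
    bugs = [2, 5, 9, 14]

    scores = [coverage_2d(f) for f in suites]
    rho, p = spearmanr(scores, bugs)
    print("coverage scores:", [round(s, 3) for s in scores])
    print(f"Spearman rho = {rho:.2f} (p = {p:.3f})")
    ```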

    Closing the Loop for Software Remodularisation -- REARRANGE: An Effort Estimation Approach for Software Clustering-based Remodularisation

    Software remodularization through clustering is a common practice to improve internal software quality. However, the true benefit of software clustering is only realized if developers follow through with the recommended refactoring operations, which are often complex and time-consuming; simply producing clustering results is not enough to realize the benefits of remodularization. Comment: Accepted for publication at the ICSE23 Poster Track.
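    The abstract above does not spell out the effort model, so the following is only a speculative sketch of one simple proxy: counting the move-class operations needed to turn the current module structure into the recommended clustering. The class and module names are made up; this is not REARRANGE's actual approach.

    ```python
    # Sketch: a naive proxy for remodularisation effort, counted as the number
    # of classes that would have to move to match the recommended clustering.
    # Hypothetical data; not REARRANGE's actual effort model.
    from collections import Counter

    current = {            # class -> current module
        "Parser": "core", "Lexer": "core", "Cache": "core",
        "HttpClient": "net", "Retry": "net",
    }
    recommended = {        # class -> recommended cluster (from clustering)
        "Parser": "lang", "Lexer": "lang", "Cache": "infra",
        "HttpClient": "net", "Retry": "infra",
    }

    # Group classes by recommended cluster, pick the current module most of a
    # cluster's members already live in, and count the classes that must move.
    members = {}
    for cls, cluster in recommended.items():
        members.setdefault(cluster, []).append(cls)

    moves = 0
    for cluster, classes in members.items():
        home = Counter(current[c] for c in classes).most_common(1)[0][0]
        moves += sum(1 for c in classes if current[c] != home)

    print(f"estimated move-class operations: {moves}")
    ```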

    Instance Space Analysis of Search-Based Software Testing

    Search-based software testing (SBST) is now a mature area, with numerous techniques developed to tackle the challenging task of software testing. SBST techniques have shown promising results and have been successfully applied in the industry to automatically generate test cases for large and complex software systems. Their effectiveness, however, is problem-dependent. In this paper, we revisit the problem of objective performance evaluation of SBST techniques considering recent methodological advances -- in the form of Instance Space Analysis (ISA) -- enabling the strengths and weaknesses of SBST techniques to be visualized and assessed across the broadest possible space of problem instances (software classes) from common benchmark datasets. We identify features of SBST problems that explain why a particular instance is hard for an SBST technique, reveal areas of hard and easy problems in the instance space of existing benchmark datasets, and identify the strengths and weaknesses of state-of-the-art SBST techniques. In addition, we examine the diversity and quality of common benchmark datasets used in experimental evaluations.
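    One way to make the "which features explain hardness" idea concrete is sketched below: fit a shallow decision tree that separates instances an SBST technique handles well from those it does not, and rank the instance features by importance. The feature names and the synthetic data are assumptions for illustration, not the study's actual feature set or method.

    ```python
    # Sketch: rank problem-instance features by how well they separate classes
    # that are easy vs. hard for an SBST technique, using a shallow decision
    # tree. Feature names and data are hypothetical.
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(1)
    feature_names = ["num_branches", "cyclomatic_complexity", "num_params",
                     "nesting_depth", "string_comparisons"]
    X = rng.normal(size=(200, len(feature_names)))  # per-class instance features
    # Hypothetical ground truth: instances with deep nesting and many string
    # comparisons are hard for the technique under study.
    hard = (X[:, 3] + X[:, 4] + rng.normal(scale=0.5, size=200)) > 1.0

    tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, hard)
    ranked = sorted(zip(feature_names, tree.feature_importances_),
                    key=lambda kv: -kv[1])
    for name, importance in ranked:
        print(f"{name:24s} importance = {importance:.2f}")
    ```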

    The Patch Overfitting Problem in Automated Program Repair: Practical Magnitude and a Baseline for Realistic Benchmarking

    Automated program repair (APR) techniques aim to generate patches for software bugs, mainly relying on testing to check their validity. The generation of a large number of such plausible yet incorrect patches is widely believed to hinder wider application of APR in practice, which has motivated research in automated patch assessment. We reflect on the validity of this motivation and carry out an empirical study to analyse the extent to which 10 APR tools suffer from the overfitting problem in practice. We observe that the number of plausible patches generated by any of the APR tools analysed for a given bug from the Defects4J dataset is remarkably low, a median of 2, indicating that in most cases a developer only needs to consider two patches to find a fix or confirm that none exists. This study unveils that the overfitting problem might not be as bad as previously thought. We reflect on current evaluation strategies of automated patch assessment techniques and propose a Random Selection baseline to assess whether and when using such techniques is beneficial for reducing human effort. We advocate that future work should evaluate the benefit arising from patch overfitting assessment against this random baseline.
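    A Random Selection baseline of the kind proposed here can be approximated with a few lines of simulation: inspect the plausible patches for a bug in a uniformly random order and count how many are examined before a correct one turns up. The numbers below are illustrative, not the paper's results.

    ```python
    # Sketch: a Random Selection baseline for patch assessment. Inspect the
    # plausible patches for a bug in a uniformly random order and count how
    # many are examined before a correct one is found (or all are exhausted).
    import random
    from statistics import mean

    def inspections_until_correct(n_plausible, n_correct, trials=10_000, seed=0):
        """Average number of patches inspected under a random order."""
        rng = random.Random(seed)
        costs = []
        for _ in range(trials):
            patches = [i < n_correct for i in range(n_plausible)]  # True = correct
            rng.shuffle(patches)
            cost = next((i + 1 for i, ok in enumerate(patches) if ok), n_plausible)
            costs.append(cost)
        return mean(costs)

    # With a median of 2 plausible patches per bug (as reported above), random
    # inspection is already cheap; effort grows with larger plausible sets.
    print(inspections_until_correct(n_plausible=2, n_correct=1))
    print(inspections_until_correct(n_plausible=10, n_correct=1))
    ```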