63 research outputs found
Identifying and Explaining Safety-critical Scenarios for Autonomous Vehicles via Key Features
Ensuring the safety of autonomous vehicles (AVs) is of utmost importance and
testing them in simulated environments is a safer option than conducting
in-field operational tests. However, generating an exhaustive test suite to
identify critical test scenarios is computationally expensive as the
representation of each test is complex and contains various dynamic and static
features, such as the AV under test, road participants (vehicles, pedestrians,
and static obstacles), environmental factors (weather and light), and the
road's structural features (lanes, turns, road speed, etc.). In this paper, we
present a systematic technique that uses Instance Space Analysis (ISA) to
identify the significant features of test scenarios that affect their ability
to reveal the unsafe behaviour of AVs. ISA identifies the features that best
differentiate safety-critical scenarios from normal driving and visualises the
impact of these features on test scenario outcomes (safe/unsafe) in 2D. This
visualisation helps to identify untested regions of the instance space and
provides an indicator of the quality of the test suite in terms of the
percentage of feature space covered by testing. To test the predictive ability
of the identified features, we train five Machine Learning classifiers to
classify test scenarios as safe or unsafe. The high precision, recall, and F1
scores indicate that our proposed approach is effective in predicting the
outcome of a test scenario without executing it and can be used for test
generation, selection, and prioritization.
Comment: 28 pages, 6 figures
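As a rough illustration of the classification step described above, and not
the paper's actual pipeline, the sketch below trains five common scikit-learn
classifiers on pre-extracted scenario features with a binary safe/unsafe
label and reports precision, recall, and F1. The file name, feature layout,
and the particular five classifiers are assumptions made for the example.

    # Minimal sketch, assuming scenario features are already extracted into a
    # CSV with a binary "unsafe" label; file name, features, and classifier
    # choices are illustrative, not the paper's exact setup.
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.svm import SVC
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.metrics import precision_score, recall_score, f1_score

    df = pd.read_csv("scenario_features.csv")         # hypothetical file
    X, y = df.drop(columns=["unsafe"]), df["unsafe"]  # 1 = unsafe, 0 = safe
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                              stratify=y, random_state=0)
    classifiers = {
        "RandomForest": RandomForestClassifier(random_state=0),
        "GradientBoosting": GradientBoostingClassifier(random_state=0),
        "DecisionTree": DecisionTreeClassifier(random_state=0),
        "SVM": SVC(),
        "kNN": KNeighborsClassifier(),
    }
    for name, clf in classifiers.items():
        clf.fit(X_tr, y_tr)
        pred = clf.predict(X_te)
        print(f"{name}: precision={precision_score(y_te, pred):.2f} "
              f"recall={recall_score(y_te, pred):.2f} "
              f"f1={f1_score(y_te, pred):.2f}")
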
Towards Reliable AI: Adequacy Metrics for Ensuring the Quality of System-level Testing of Autonomous Vehicles
AI-powered systems have gained widespread popularity in various domains,
including Autonomous Vehicles (AVs). However, ensuring their reliability and
safety is challenging due to their complex nature. Conventional test adequacy
metrics, designed to evaluate the effectiveness of traditional software
testing, are often insufficient or impractical for these systems. White-box
metrics, which are specifically designed for these systems, leverage neuron
coverage information. These coverage metrics necessitate access to the
underlying AI model and training data, which may not always be available.
Furthermore, the existing adequacy metrics exhibit weak correlations with the
ability to detect faults in the generated test suite, creating a gap that we
aim to bridge in this study.
In this paper, we introduce a set of black-box test adequacy metrics called
"Test suite Instance Space Adequacy" (TISA) metrics, which can be used to gauge
the effectiveness of a test suite. The TISA metrics offer a way to assess both
the diversity and coverage of the test suite and the range of bugs detected
during testing. Additionally, we introduce a framework that permits testers to
visualise the diversity and coverage of the test suite in a two-dimensional
space, facilitating the identification of areas that require improvement.
We evaluate the efficacy of the TISA metrics by examining their correlation
with the number of bugs detected in system-level simulation testing of AVs. A
strong correlation, coupled with the short computation time, indicates their
effectiveness and efficiency in estimating the adequacy of testing AVs.
Comment: 12 pages, 7 figures
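The abstract does not define the TISA metrics formally, so the following is
only a plausible sketch of 2D instance-space adequacy measures: coverage as
the fraction of occupied grid cells within the instance-space bounds, and
diversity as the mean pairwise distance between projected test cases. Both
definitions are assumptions made for illustration.

    # Illustrative sketch only; the grid-cell coverage and mean pairwise
    # distance below are assumed stand-ins for the TISA metrics.
    import numpy as np

    def coverage_2d(points, bounds, grid=10):
        """Fraction of grid cells in the instance-space bounding box that
        contain at least one projected test case."""
        lo, hi = bounds[:, 0], bounds[:, 1]
        cells = np.floor((points - lo) / (hi - lo + 1e-12) * grid)
        cells = cells.clip(0, grid - 1).astype(int)
        return len({tuple(c) for c in cells}) / (grid * grid)

    def diversity_2d(points):
        """Mean pairwise Euclidean distance between projected test cases."""
        diffs = points[:, None, :] - points[None, :, :]
        dists = np.sqrt((diffs ** 2).sum(-1))
        n = len(points)
        return dists.sum() / (n * (n - 1)) if n > 1 else 0.0

    # Example: 200 test cases projected into a [0, 1] x [0, 1] instance space.
    rng = np.random.default_rng(0)
    pts = rng.random((200, 2))
    bounds = np.array([[0.0, 1.0], [0.0, 1.0]])
    print(coverage_2d(pts, bounds), diversity_2d(pts))
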
Closing the Loop for Software Remodularisation -- REARRANGE: An Effort Estimation Approach for Software Clustering-based Remodularisation
Software remodularization through clustering is a common practice to improve
internal software quality. However, the true benefit of software clustering is
only realized if developers follow through with the recommended refactoring
suggestions, which can be complex and time-consuming; simply producing
clustering results is not enough to realize the benefits of remodularization.
Comment: Accepted for publication at ICSE23 Poster Track
Instance Space Analysis of Search-Based Software Testing
Search-based software testing (SBST) is now a mature area, with numerous
techniques developed to tackle the challenging task of software testing. SBST
techniques have shown promising results and have been successfully applied in
the industry to automatically generate test cases for large and complex
software systems. Their effectiveness, however, is problem-dependent. In this
paper, we revisit the problem of objective performance evaluation of SBST
techniques considering recent methodological advances -- in the form of
Instance Space Analysis (ISA) -- enabling the strengths and weaknesses of SBST
techniques to be visualized and assessed across the broadest possible space of
problem instances (software classes) from common benchmark datasets. We
identify features of SBST problems that explain why a particular instance is
hard for an SBST technique, reveal areas of hard and easy problems in the
instance space of existing benchmark datasets, and identify the strengths and
weaknesses of state-of-the-art SBST techniques. In addition, we examine the
diversity and quality of common benchmark datasets used in experimental
evaluations.
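ISA relies on its own optimised 2D projection, so the sketch below is only a
simplified stand-in: it uses PCA to project per-class feature vectors and
marks where a hypothetical SBST technique reaches a branch-coverage target,
conveying the visual idea rather than reproducing the actual ISA methodology.
The features, coverage values, and 0.8 threshold are synthetic assumptions.

    # Simplified stand-in for the ISA plot: synthetic features and coverage,
    # PCA instead of the optimised projection ISA actually computes.
    import numpy as np
    from sklearn.decomposition import PCA
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(1)
    features = rng.random((300, 8))   # e.g. class size, branches, nesting depth
    branch_cov = rng.random(300)      # coverage achieved by one SBST technique
    easy = branch_cov >= 0.8          # assumed threshold for "easy" instances

    xy = PCA(n_components=2).fit_transform(features)
    plt.scatter(xy[~easy, 0], xy[~easy, 1], c="red", s=10, label="hard")
    plt.scatter(xy[easy, 0], xy[easy, 1], c="green", s=10, label="easy")
    plt.xlabel("Z1"); plt.ylabel("Z2"); plt.legend()
    plt.savefig("instance_space.png")
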
The Patch Overfitting Problem in Automated Program Repair: Practical Magnitude and a Baseline for Realistic Benchmarking
Automated program repair (APR) techniques aim to generate patches for
software bugs, mainly relying on testing to check their validity. The
generation of large numbers of plausible yet incorrect patches is widely
believed to hinder wider application of APR in practice, which has motivated
research in automated patch assessment. We reflect on the validity of this
motivation and carry out an empirical study to analyse the extent to which 10
APR tools suffer from the overfitting problem in practice. We observe that
the number of plausible patches generated by any of the APR tools analysed
for a given bug from the Defects4J dataset is remarkably low, a median of 2,
indicating that in most cases a developer only needs to consider 2 patches to
be confident of finding a fix or confirming that none exists. This study
reveals that the overfitting problem might not be as bad as previously
thought. We reflect on current evaluation strategies for automated patch
assessment techniques and propose a Random Selection baseline to assess
whether and when using such techniques is beneficial for reducing human
effort. We advocate that future work evaluate the benefit of patch
overfitting assessment against this random baseline.
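To make the Random Selection baseline concrete, the back-of-the-envelope
sketch below estimates, for a few hypothetical bugs, how many plausible
patches a developer would inspect on average when reviewing them in random
order until a correct one is found. The patch lists are invented for
illustration and are not data from the study.

    # Back-of-the-envelope sketch of a random-order review baseline; the
    # per-bug patch lists are made up, not results from the paper.
    import random
    import statistics

    def inspections_until_correct(patches, rng):
        """patches: True = correct, False = plausible but overfitting."""
        order = patches[:]
        rng.shuffle(order)
        for i, is_correct in enumerate(order, start=1):
            if is_correct:
                return i
        return len(order)  # no correct patch: every patch must be reviewed

    rng = random.Random(0)
    # Hypothetical bugs, each with its list of plausible patches:
    bugs = [[False, True], [True], [False, False, True, False], [False, False]]
    avg_effort = [statistics.mean(inspections_until_correct(b, rng)
                                  for _ in range(1000)) for b in bugs]
    print([round(e, 2) for e in avg_effort])
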
- …