Boundary State Generation for Testing and Improvement of Autonomous Driving Systems
Recent advances in Deep Neural Networks (DNNs) and sensor technologies are
enabling autonomous driving systems (ADSs) with an ever-increasing level of
autonomy. However, assessing their dependability remains a critical concern.
State-of-the-art ADS testing approaches modify the controllable attributes of a
simulated driving environment until the ADS misbehaves. Such approaches have
two main drawbacks: (1) modifications to the simulated environment might not be
easily transferable to the in-field test setting (e.g., changing the road
shape); (2) environment instances in which the ADS is successful are discarded,
despite the possibility that they could contain hidden driving conditions in
which the ADS may misbehave.
In this paper, we present GenBo (GENerator of BOundary state pairs), a novel
test generator for ADS testing. GenBo mutates the driving conditions of the ego
vehicle (position, velocity and orientation), collected in a failure-free
environment instance, and efficiently generates challenging driving conditions
at the behavior boundary (i.e., where the model starts to misbehave) in the
same environment. We use such boundary conditions to augment the initial
training dataset and retrain the DNN model under test. Our evaluation results
show that the retrained model achieves a success rate up to 16% higher on a
separate set of evaluation tracks than the original DNN model.
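The core idea lends itself to a compact sketch. The following is a minimal, illustrative rendition of the boundary search, assuming a hypothetical `runs_without_failure` oracle in place of a real driving simulator; it is not the authors' implementation.

```python
# Minimal sketch of GenBo-style boundary search (illustrative only).
import random

def runs_without_failure(state):
    # Hypothetical stand-in for a simulator run; here the ADS is assumed
    # to fail when the ego vehicle starts too fast or too rotated.
    x, y, velocity, yaw = state
    return velocity < 12.0 and abs(yaw) < 0.5

def mutate(state, scale):
    # Perturb position, velocity and orientation of the ego vehicle.
    x, y, v, yaw = state
    return (x + random.uniform(-1, 1) * scale,
            y + random.uniform(-1, 1) * scale,
            v + random.uniform(0, 2) * scale,
            yaw + random.uniform(-0.1, 0.1) * scale)

def boundary_state_pair(seed_state, max_steps=100):
    """Walk from a failure-free state until a failing mutant is found,
    then return the (passing, failing) pair straddling the boundary."""
    current = seed_state
    for _ in range(max_steps):
        candidate = mutate(current, scale=1.0)
        if runs_without_failure(candidate):
            current = candidate          # still on the safe side: keep walking
        else:
            return current, candidate    # boundary pair: pass vs. fail
    return None

print(boundary_state_pair((0.0, 0.0, 8.0, 0.0)))
```

The returned pair brackets the behavior boundary: the passing member can seed further mutations, while the failing member augments the retraining dataset.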
Testing of Deep Reinforcement Learning Agents with Surrogate Models
Deep Reinforcement Learning (DRL) has received a lot of attention from the
research community in recent years. As the technology moves away from game
playing to practical contexts, such as autonomous vehicles and robotics, it is
crucial to evaluate the quality of DRL agents. In this paper, we propose a
search-based approach to test such agents. Our approach, implemented in a tool
called Indago, trains a classifier on the failure and non-failure (i.e., pass)
environment configurations resulting from the DRL training process. The
classifier is used at testing time as a surrogate model for the DRL agent
execution in the environment, predicting the extent to which a given
environment configuration induces a failure of the DRL agent under test. The
failure prediction acts as a fitness function, guiding the generation towards
failure environment configurations, while saving computation time by deferring
the execution of the DRL agent in the environment to those configurations that
are more likely to expose failures. Experimental results show that our
search-based approach finds 50% more failures of the DRL agent than
state-of-the-art techniques. Moreover, such failures are, on average, 78% more
diverse; similarly, the behaviors of the DRL agent induced by failure
configurations are 74% more diverse.
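To make the surrogate idea concrete, here is a minimal sketch, assuming a hypothetical `run_agent` execution and toy two-dimensional environment configurations; a scikit-learn classifier stands in for whatever model Indago actually trains.

```python
# Illustrative sketch of surrogate-guided test generation (not Indago's code).
import random
from sklearn.ensemble import RandomForestClassifier

random.seed(0)

# Environment configurations seen during DRL training, with failure labels.
train_configs = [[random.random(), random.random()] for _ in range(200)]
train_labels = [int(c[0] + c[1] > 1.4) for c in train_configs]  # toy failures

surrogate = RandomForestClassifier().fit(train_configs, train_labels)

def fitness(config):
    # Predicted failure probability replaces a costly agent execution.
    return surrogate.predict_proba([config])[0][1]

# Random search guided by the surrogate: keep the most failure-prone
# configurations and execute the agent only on those.
candidates = [[random.random(), random.random()] for _ in range(500)]
suspicious = sorted(candidates, key=fitness, reverse=True)[:10]

def run_agent(config):  # hypothetical expensive simulation
    return config[0] + config[1] > 1.4  # True = failure observed

failures = [c for c in suspicious if run_agent(c)]
print(f"{len(failures)} failures found with only {len(suspicious)} executions")
```

The expensive `run_agent` call is deferred to the few configurations the surrogate deems most likely to fail, which is the source of the reported savings in computation time.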
Hypertesting of Programs: Theoretical Foundation and Automated Test Generation
Hyperproperties are used to define correctness requirements that involve relations between multiple program executions. This makes it possible, for instance, to model security and concurrency requirements, which cannot be expressed by means of trace properties. In this paper, we propose a novel systematic approach for the automated testing of hyperproperties. Our contribution is both foundational and practical. On the foundational side, we define a hypertesting framework, which includes a novel hypercoverage adequacy criterion designed to guide the synthesis of test cases for hyperproperties. On the practical side, we instantiate this framework by implementing HyperFuzz and HyperEvo, two test generators targeting the Non-Interference security requirement that rely on fuzzing and search algorithms, respectively. Experimental results show that the proposed hypercoverage adequacy criterion correlates with the capability of a hypertest to expose hyperproperty violations, and that both HyperFuzz and HyperEvo achieve high hypercoverage and high vulnerability exposure with no false alarms (by construction). While both outperform the state-of-the-art dynamic taint analysis tool Phosphor, HyperEvo is more effective than HyperFuzz on some benchmark programs.
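As a concrete illustration of what a hypertest for Non-Interference checks, the following toy sketch pairs two executions that agree on public inputs but differ on secrets; the names are hypothetical and do not reflect the HyperFuzz/HyperEvo implementation.

```python
# Minimal sketch of a hypertest for Non-Interference (illustrative only).
import random

def program(public, secret):
    # Toy program with an information leak: the secret influences
    # the public output on one branch.
    if public > 100:
        return public + secret  # leak!
    return public * 2

def non_interference_hypertest(public, secret_a, secret_b):
    """A hypertest relates two executions that agree on public inputs
    but differ on secrets; equal public outputs mean no observable leak."""
    return program(public, secret_a) == program(public, secret_b)

# Random hypertesting loop: search for a pair of executions that
# violates the hyperproperty.
for _ in range(1000):
    pub = random.randint(0, 200)
    s1, s2 = random.randint(0, 9), random.randint(0, 9)
    if s1 != s2 and not non_interference_hypertest(pub, s1, s2):
        print(f"violation: public={pub}, secrets=({s1}, {s2})")
        break
```

Because a violation requires exhibiting two concrete executions, any reported vulnerability is a true positive by construction, which matches the "no false alarms" claim above.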
Search based path and input data generation for web application testing
Test case generation for web applications aims at ensuring full coverage of the navigation structure. Existing approaches resort to crawling and manual/random input generation, with or without a preliminary construction of the navigation model. However, crawlers might be unable to reach some parts of the web application, and random input generation might not receive enough guidance to produce the inputs needed to cover a given path. In this paper, we take advantage of the navigation structure implicitly specified by developers when they write the page objects used for web testing, and we define a novel set of genetic operators that support the joint generation of test inputs and feasible navigation paths. On a case study, our tool Subweb achieved higher coverage of the navigation model than crawling-based approaches, thanks to its intrinsic ability to generate inputs for feasible paths and to discard likely infeasible paths.
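The joint encoding of paths and inputs can be sketched as follows; the navigation model and operators below are illustrative assumptions, not Subweb's actual representation.

```python
# Illustrative sketch of joint path/input evolution (not Subweb's code).
import random

# Navigation model implied by page objects: page -> available actions,
# where each action consumes an input and leads to a next page.
NAV = {
    "Login":   [("do_login", "Home")],
    "Home":    [("search", "Results"), ("open_cart", "Cart")],
    "Results": [("open_item", "Item")],
    "Cart":    [],
    "Item":    [],
}

def random_path(start="Login", max_len=4):
    """A chromosome pairs each navigation step with a concrete input."""
    page, path = start, []
    for _ in range(max_len):
        if not NAV[page]:
            break
        action, nxt = random.choice(NAV[page])
        path.append((action, f"input-{random.randint(0, 99)}"))
        page = nxt
    return path

def mutate(path):
    # Genetic operator: regenerate the input of one randomly chosen step,
    # keeping the (feasible) navigation structure intact.
    if not path:
        return path
    i = random.randrange(len(path))
    action, _ = path[i]
    return path[:i] + [(action, f"input-{random.randint(0, 99)}")] + path[i+1:]

individual = random_path()
print(individual, "->", mutate(individual))
```

Because every chromosome is built by walking the page-object navigation model, only feasible paths are ever generated, which is where the guidance over pure random input generation comes from.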
Generating and Detecting True Ambiguity: A Forgotten Danger in DNN Supervision Testing
Deep Neural Networks (DNNs) are becoming a crucial component of modern
software systems, but they are prone to fail under conditions that are
different from the ones observed during training (out-of-distribution inputs)
or on inputs that are truly ambiguous, i.e., inputs that admit multiple classes
with nonzero probability in their labels. Recent work proposed DNN supervisors
to detect high-uncertainty inputs before their possible misclassification leads
to any harm. To test and compare the capabilities of DNN supervisors,
researchers proposed test generation techniques, to focus the testing effort on
high-uncertainty inputs that should be recognized as anomalous by supervisors.
However, existing test generators aim to produce out-of-distribution inputs. No
existing model- and supervisor-independent technique targets the generation of
truly ambiguous test inputs, i.e., inputs that admit multiple classes according
to expert human judgment.
In this paper, we propose a novel way to generate ambiguous inputs to test
DNN supervisors and use it to empirically compare several existing supervisor
techniques. In particular, we propose AmbiGuess to generate ambiguous samples
for image classification problems. AmbiGuess is based on gradient-guided
sampling in the latent space of a regularized adversarial autoencoder.
Moreover, we conducted what is -- to the best of our knowledge -- the most
extensive comparative study of DNN supervisors, considering their capabilities
to detect 4 distinct types of high-uncertainty inputs, including truly
ambiguous ones. We find that the tested supervisors' capabilities are
complementary: those best suited to detect true ambiguity perform worse on
invalid, out-of-distribution, and adversarial inputs, and vice versa.
Assessment of Source Code Obfuscation Techniques
Obfuscation techniques are a general category of software protections widely
adopted to prevent malicious tampering of the code by making applications more
difficult to understand and thus harder to modify. Obfuscation techniques are
divided into code and data obfuscation, depending on the protected asset. While
preliminary empirical studies have been conducted to determine the impact of
code obfuscation, our work aims at assessing the effectiveness and efficiency
in preventing attacks of a specific data obfuscation technique - VarMerge. We
conducted an experiment with student participants performing two attack tasks
on clear and obfuscated versions of two applications written in C. The
experiment showed a significant effect of data obfuscation on both the time
required to complete an attack and the efficiency of successful attacks. An
application protected with VarMerge suffers six times fewer successful attacks
per unit of time. This outcome provides a practical clue that can be used when
applying software protections based on data obfuscation.
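For readers unfamiliar with VarMerge, the following toy sketch illustrates the general idea of merging two variables into one, obscuring their individual roles; real VarMerge transforms C code at protection time, and this Python rendition is only an analogy.

```python
# Toy illustration of the variable-merging idea (simplified analogy;
# not the actual VarMerge transformation, which operates on C code).

# Before obfuscation: two independent 16-bit counters.
items, errors = 7, 2

# After obfuscation: a single merged variable holds both values.
merged = (items << 16) | errors

def get_items(m):   # accessors replace direct reads of the originals
    return m >> 16

def get_errors(m):
    return m & 0xFFFF

merged = (get_items(merged) + 1) << 16 | get_errors(merged)  # items += 1
print(get_items(merged), get_errors(merged))  # 8 2
```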
Using multi-locators to increase the robustness of web test cases
The main reason for the fragility of web test cases is the inability of web element locators to work correctly when the web page DOM evolves. Web element locators are used in web test cases to identify all the GUI objects to operate upon and, eventually, to retrieve the web page content that is compared against some oracle to decide whether the test case has passed or not. Hence, web element locators play an extremely important role in web testing, and when a web element locator gets broken, developers have to spend substantial time and effort to repair it. While algorithms exist to produce robust web element locators for web test scripts, no algorithm is perfect, and different algorithms are exposed to different fragilities when the software evolves. Based on this observation, we propose a new type of locator, named multi-locator, which selects the best locator among a candidate set of locators produced by different algorithms. The selection is based on a voting procedure that assigns different voting weights to different locator generation algorithms. Experimental results obtained on six web applications, for which a subsequent release was available, show that the multi-locator is more robust than single locators (about 30% fewer broken locators than the most robust kind of single locator) and that the execution overhead of the multiple queries made with different locators is negligible (2-3% at most).
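The voting procedure can be sketched compactly; the weights and the toy DOM lookup below are illustrative assumptions, not the paper's calibrated values.

```python
# Sketch of the multi-locator voting idea (illustrative only).

def multi_locate(driver_find, candidates):
    """candidates: list of (locator, weight) pairs produced by different
    locator-generation algorithms (id-, XPath-, CSS-based, ...).
    Each locator votes, with its weight, for the element it resolves to;
    the element with the largest total weight wins."""
    votes = {}
    for locator, weight in candidates:
        element = driver_find(locator)      # None if the locator is broken
        if element is not None:
            votes[element] = votes.get(element, 0.0) + weight
    return max(votes, key=votes.get) if votes else None

# Toy DOM lookup standing in for Selenium's driver.find_element; note
# the stale XPath locator now resolves to the wrong element.
dom = {"id=login": "btn-1", "css=.login": "btn-1", "xpath=//button[3]": "btn-9"}
element = multi_locate(dom.get, [("id=login", 0.5),
                                 ("css=.login", 0.3),
                                 ("xpath=//button[3]", 0.2)])
print(element)  # btn-1 wins with total weight 0.8
```

Even when one algorithm's locator breaks or drifts after a page change, the remaining locators outvote it, which is the source of the robustness gain.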
Deep Reinforcement Learning for Black-Box Testing of Android Apps
The state space of Android apps is huge and its thorough exploration during
testing remains a major challenge. In fact, the best exploration strategy is
highly dependent on the features of the app under test. Reinforcement Learning
(RL) is a machine learning technique that learns the optimal strategy to solve
a task by trial and error, guided by positive or negative reward, rather than
by explicit supervision. Deep RL is a recent extension of RL that takes
advantage of the learning capabilities of neural networks. Such capabilities
make Deep RL suitable for complex exploration spaces such as the one of Android
apps. However, state-of-the-art, publicly available tools only support basic,
tabular RL. We have developed ARES, a Deep RL approach for black-box testing of
Android apps. Experimental results show that it achieves higher coverage and
fault revelation than the baselines, which include state-of-the-art RL-based
tools such as TimeMachine and Q-Testing. We also qualitatively investigated the
reasons behind this performance and identified the presence of chained and
blocking activities as the key features of Android apps that make Deep RL
particularly effective on them.
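The essence of reward-driven exploration can be sketched with a toy app model; the random policy below is a placeholder where ARES plugs in a deep RL agent, and the interface is hypothetical.

```python
# Conceptual sketch of RL-driven app exploration (illustrative interface;
# the real tool drives Android apps and uses deep RL, not a random policy).
import random

class AppEnv:
    """Toy app with chained activities: some screens are only reachable
    through specific action sequences, which is where RL pays off."""
    TRANSITIONS = {("Main", "tap_login"): "Login",
                   ("Login", "fill_form"): "Form",
                   ("Form", "submit"): "Dashboard"}
    ACTIONS = ["tap_login", "fill_form", "submit", "back"]

    def reset(self):
        self.screen, self.visited = "Main", {"Main"}
        return self.screen

    def step(self, action):
        self.screen = self.TRANSITIONS.get((self.screen, action), "Main")
        reward = 1.0 if self.screen not in self.visited else 0.0  # novelty
        self.visited.add(self.screen)
        return self.screen, reward

env = AppEnv()
best = 0
for episode in range(200):          # a DRL agent would replace this policy
    state = env.reset()
    for _ in range(10):
        state, r = env.step(random.choice(env.ACTIONS))
    best = max(best, len(env.visited))
print("max screens covered:", best)
```

The novelty reward encourages the agent to learn the action chains that unlock deep activities, the very feature identified above as making Deep RL effective.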
Why Creating Web Page Objects Manually if It Can Be Done Automatically?
Page Object is a design pattern aimed at making web test scripts more readable, robust, and maintainable. The effort to manually create the page objects needed for a web application may be substantial and, unfortunately, existing tools do not help web developers in this task. In this paper we present APOGEN, a tool for the automatic generation of page objects for web applications. Our tool automatically derives a testing model by reverse engineering the target web application and uses a combination of dynamic and static analysis to generate Java page objects for the popular Selenium WebDriver framework. Our preliminary evaluation shows that around 3/4 of the automatically generated page object methods can be used as they are, while the remaining 1/4 need only minor modifications.
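A page object of the kind APOGEN derives might look as follows; the tool itself emits Java for Selenium WebDriver, so this Python sketch of the same pattern, with a hypothetical login page, is purely illustrative.

```python
# Page Object pattern sketched in Python with Selenium WebDriver
# (APOGEN generates Java; the page structure here is hypothetical).
from selenium.webdriver.common.by import By

class LoginPage:
    """Encapsulates locators and actions of the login page, so test
    scripts depend on this API rather than on the raw DOM."""
    def __init__(self, driver):
        self.driver = driver

    def login(self, username, password):
        self.driver.find_element(By.ID, "username").send_keys(username)
        self.driver.find_element(By.ID, "password").send_keys(password)
        self.driver.find_element(By.CSS_SELECTOR, "button[type=submit]").click()
        return HomePage(self.driver)   # navigation encoded in return types

class HomePage:
    def __init__(self, driver):
        self.driver = driver

    def title(self):
        return self.driver.find_element(By.TAG_NAME, "h1").text
```

Returning the next page object from each action encodes the navigation structure directly in the page-object API, which is exactly what Subweb (above) exploits for test generation.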
Simulation-based testing of unmanned aerial vehicles with Aerialist
Simulation-based testing is crucial for ensuring the safety and reliability of unmanned aerial vehicles (UAVs), especially as they become more autonomous and are increasingly used in commercial scenarios. The complexity and automated nature of UAVs require sophisticated simulation environments to effectively test their safety requirements, and the challenges of setting up these environments pose significant barriers to the practical, widespread adoption of UAVs. We address this issue by introducing Aerialist (unmanned AERIAL vehIcle teST bench), a novel UAV test bench, built on top of the PX4 firmware, that facilitates or automates all the necessary steps of defining, generating, executing, and analyzing system-level UAV test cases in simulation environments. Moreover, it supports parallel and scalable execution and analysis of test cases on Kubernetes clusters. This makes Aerialist a unique platform for the research and development of test generation approaches for UAVs. To evaluate Aerialist’s support for UAV developers in defining, generating, and executing UAV test cases, we implemented a search-based approach that generates realistic simulation-based test cases from real-world UAV flight logs, and we confirmed its effectiveness in improving the realism and representativeness of simulation-based UAV tests.
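A search-based generator built on such a test bench can be sketched as a simple loop; the `simulate` function below is a hypothetical stand-in for executing a PX4 mission through Aerialist, not its actual API.

```python
# Sketch of a search loop for simulation-based UAV test generation
# (the `simulate` call is a hypothetical stand-in for a PX4 mission run).
import random

def simulate(obstacle_x):
    # Hypothetical: returns the UAV's minimum distance to an obstacle
    # placed at obstacle_x while flying a fixed mission past it.
    return abs(obstacle_x - 4.2) + 0.3   # toy response surface

def fitness(obstacle_x):
    # Smaller minimum distance = more safety-critical test case.
    return simulate(obstacle_x)

# (1+1) evolutionary search over the obstacle position.
best = random.uniform(0.0, 10.0)
for _ in range(300):
    candidate = min(10.0, max(0.0, best + random.gauss(0, 0.5)))
    if fitness(candidate) < fitness(best):
        best = candidate

print(f"most challenging obstacle position: x={best:.2f}, "
      f"min distance={fitness(best):.2f} m")
```

Seeding the search with states extracted from real flight logs, as the evaluation above does, keeps the generated tests close to realistic operating conditions.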