268 research outputs found
Fairness Testing: Testing Software for Discrimination
This paper defines software fairness and discrimination and develops a
testing-based method for measuring if and how much software discriminates,
focusing on causality in discriminatory behavior. Evidence of software
discrimination has been found in modern software systems that recommend
criminal sentences, grant access to financial products, and determine who is
allowed to participate in promotions. Our approach, Themis, generates efficient
test suites to measure discrimination. Given a schema describing valid system
inputs, Themis generates discrimination tests automatically and does not
require an oracle. We evaluate Themis on 20 software systems, 12 of which come
from prior work with explicit focus on avoiding discrimination. We find that
(1) Themis is effective at discovering software discrimination, (2)
state-of-the-art techniques for removing discrimination from algorithms fail in
many situations, at times discriminating against as much as 98% of an input
subdomain, (3) Themis optimizations are effective at producing efficient test
suites for measuring discrimination, and (4) Themis is more efficient on
systems that exhibit more discrimination. We thus demonstrate that fairness
testing is a critical aspect of the software development cycle in domains with
possible discrimination and provide initial tools for measuring software
discrimination.Comment: Sainyam Galhotra, Yuriy Brun, and Alexandra Meliou. 2017. Fairness
Testing: Testing Software for Discrimination. In Proceedings of 2017 11th
Joint Meeting of the European Software Engineering Conference and the ACM
SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE),
Paderborn, Germany, September 4-8, 2017 (ESEC/FSE'17).
https://doi.org/10.1145/3106237.3106277, ESEC/FSE, 201
Angels and monsters: An empirical investigation of potential test effectiveness and efficiency improvement from strongly subsuming higher order mutation
We study the simultaneous test effectiveness and efficiency improvement achievable by Strongly Subsuming Higher Order Mutants (SSHOMs), constructed from 15,792 first order mutants in four Java programs. Using SSHOMs in place of the first order mutants they subsume yielded a 35%-45% reduction in the number of mutants required, while simultaneously improving test efficiency by 15% and effectiveness by between 5.6% and 12%. Trivial first order faults often combine to form exceptionally non-trivial higher order faults; apparently innocuous angels can combine to breed monsters. Nevertheless, these same monsters can be recruited to improve automated test effectiveness and efficiency
Threats to the validity of mutation-based test assessment
Much research on software testing and test techniques relies on experimental studies based on mutation testing. In this paper we reveal that such studies are vulnerable to a potential threat to validity, leading to possible Type I errors; incorrectly rejecting the Null Hypothesis. Our findings indicate that Type I errors occur, for arbitrary experiments that fail to take countermeasures, approximately 62% of the time. Clearly, a Type I error would potentially compromise any scientific conclusion. We show that the problem derives from such studies’ combined use of both subsuming and subsumed mutants. We collected articles published in the last two years at three leading software engineering conferences. Of those that use mutation-based test assessment, we found that 68% are vulnerable to this threat to validity
Automated Repair of Feature Interaction Failures in Automated Driving Systems
In the past years, several automated repair strategies have been
proposed to fix bugs in individual software programs without any
human intervention. There has been, however, little work on how
automated repair techniques can resolve failures that arise at the
system-level and are caused by undesired interactions among different
system components or functions. Feature interaction failures
are common in complex systems such as autonomous cars that are
typically built as a composition of independent features (i.e., units
of functionality). In this paper, we propose a repair technique to
automatically resolve undesired feature interaction failures in automated
driving systems (ADS) that lead to the violation of system
safety requirements. Our repair strategy achieves its goal by (1) localizing
faults spanning several lines of code, (2) simultaneously
resolving multiple interaction failures caused by independent faults,
(3) scaling repair strategies from the unit-level to the system-level,
and (4) resolving failures based on their order of severity. We have
evaluated our approach using two industrial ADS containing four
features. Our results show that our repair strategy resolves the
undesired interaction failures in these two systems in less than 16h
and outperforms existing automated repair techniques
Combining Model Checking and Testing
Abstract Model checking and testing have a lot in common. Over the last two decades, significant progress has been made on how to broaden the scope of model checking from finite-state abstractions to actual software implementations. One way to do this consists of adapting model checking into a form of systematic testing that is applicable to industrial-size software. This chapter presents an overview of this strand of software model checking
- …