JUGE: An Infrastructure for Benchmarking Java Unit Test Generators
Researchers and practitioners have designed and implemented various automated
test case generators to support effective software testing. Such generators
exist for various languages (e.g., Java, C#, or Python) and for various
platforms (e.g., desktop, web, or mobile applications). Such generators exhibit
varying effectiveness and efficiency, depending on the testing goals they aim
to satisfy (e.g., unit-testing of libraries vs. system-testing of entire
applications) and the underlying techniques they implement. In this context,
practitioners need to be able to compare different generators to identify the
most suited one for their requirements, while researchers seek to identify
future research directions. This can be achieved through the systematic
execution of large-scale evaluations of different generators. However, the
execution of such empirical evaluations is not trivial and requires a
substantial effort to collect benchmarks, set up the evaluation infrastructure,
and collect and analyse the results. In this paper, we present our JUnit
Generation benchmarking infrastructure (JUGE) supporting generators (e.g.,
search-based, random-based, symbolic execution, etc.) seeking to automate the
production of unit tests for various purposes (e.g., validation, regression
testing, fault localization, etc.). The primary goal is to reduce the overall
effort, ease the comparison of several generators, and enhance the knowledge
transfer between academia and industry by standardizing the evaluation and
comparison process. Since 2013, eight editions of a unit testing tool
competition, co-located with the Search-Based Software Testing Workshop, have
taken place and used and updated JUGE. As a result, an increasing number of
tools (over ten) from both academia and industry have been evaluated on JUGE,
matured over the years, and allowed the identification of future research
directions.
Statistical Significance Testing in Information Retrieval: An Empirical Analysis of Type I, Type II and Type III Errors
Statistical significance testing is widely accepted as a means to assess how
well a difference in effectiveness reflects an actual difference between
systems, as opposed to random noise because of the selection of topics.
According to recent surveys on SIGIR, CIKM, ECIR and TOIS papers, the t-test is
the most popular choice among IR researchers. However, previous work has
suggested computer intensive tests like the bootstrap or the permutation test,
based mainly on theoretical arguments. On empirical grounds, others have
suggested non-parametric alternatives such as the Wilcoxon test. Indeed, the
question of which tests we should use has accompanied IR and related fields for
decades now. Previous theoretical studies on this matter were limited in that
we know that test assumptions are not met in IR experiments, and empirical
studies were limited in that we do not have the necessary control over the null
hypotheses to compute actual Type I and Type II error rates under realistic
conditions. Therefore, not only is it unclear which test to use, but also how
much trust we should put in them. In contrast to past studies, in this paper we
employ a recent simulation methodology from TREC data to work around these
limitations. Our study comprises over 500 million p-values computed for a range
of tests, systems, effectiveness measures, topic set sizes and effect sizes,
and for both the 2-tail and 1-tail cases. Having such a large supply of IR
evaluation data with full knowledge of the null hypotheses, we are finally in a
position to evaluate how well statistical significance tests really behave with
IR data, and make sound recommendations for practitioners.
Comment: 10 pages, 6 figures, SIGIR 201
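As an illustration of one of the computer-intensive tests mentioned above, the following is a minimal sketch of a two-tailed paired permutation (sign-flip) test on per-topic scores. The function name, the resample count, and the example scores are assumptions made for illustration; they are not taken from the paper.

```python
import random
from statistics import mean


def paired_permutation_test(scores_a, scores_b, n_resamples=10000, seed=0):
    """Two-tailed sign-flip permutation test on per-topic score differences.

    Hypothetical helper: under the null hypothesis the sign of each
    per-topic difference is arbitrary, so signs are flipped at random and
    the resampled mean differences are compared with the observed one.
    """
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(mean(diffs))
    extreme = 0
    for _ in range(n_resamples):
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(mean(flipped)) >= observed:
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (extreme + 1) / (n_resamples + 1)


# Made-up per-topic effectiveness scores for two systems over 20 topics.
system_a = [0.50 + 0.01 * i for i in range(20)]
system_b = [score - 0.20 for score in system_a]
p_value = paired_permutation_test(system_a, system_b)
```

With a consistent per-topic gap like the one above, the estimated p-value is very small; with identical score lists it is 1.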
Empirical evaluation of the fault detection effectiveness and test effort efficiency of the automated AOP testing approaches.
The testing process is a time-consuming, expensive, and labor-intensive activity in any software setting, including aspect-oriented programming (AOP). To reduce testing costs and human effort, and to improve both the quality and productivity of AOP, it is desirable to automate the testing of aspect-oriented programs as much as possible. In recent years, much research effort has been devoted to testing aspect-oriented programs, but less to automated AOP testing. This indicates that current research on automated AOP testing is insufficient and still in its infancy. To advance the state of research in this area and to give testers of AOP-based projects a basis for comparison, a thorough experimental evaluation of the current automated AOP testing approaches is required. The objective of this paper is to provide such an evaluation. We carry out an empirical study based on mutation analysis to examine four existing automated AOP testing approaches (Wrasp, Aspectra, Raspect, and EAT), particularly their underlying test input generation and selection strategies, with regard to fault detection effectiveness. In addition, the approaches are compared in terms of the effort required to detect faults, as part of an efficiency evaluation. The experimental results and comparison provide insights into the effectiveness and efficiency of the automated AOP testing approaches, with their respective strengths and weaknesses. The results show that EAT is more effective than the other automated AOP testing approaches, although the difference is not significant in every case: EAT was significantly better than Wrasp at the 95% confidence level (i.e., p<0.05), but not significantly better than Aspectra or Raspect.
Concerning test effort efficiency, Wrasp was significantly (p<0.05) more efficient, requiring the lowest amount of test effort of all the approaches, whereas EAT was the least efficient, recording the highest amount of test effort. This implies that EAT may currently be the most effective automated AOP testing approach but perhaps a less efficient one. More generally, search-based testing (the underlying strategy of EAT) might achieve better effectiveness, but at the cost of greater test effort, compared to random testing (the underlying strategy of the other approaches).
Machine Learning Data Suitability and Performance Testing Using Fault Injection Testing Framework
Creating resilient machine learning (ML) systems has become necessary to
ensure production-ready ML systems that acquire user confidence seamlessly. The
quality of the input data and the model highly influence the successful
end-to-end testing in data-sensitive systems. However, the testing approaches
of input data are not as systematic and are few compared to model testing. To
address this gap, this paper presents the Fault Injection for Undesirable
Learning in input Data (FIUL-Data) testing framework that tests the resilience
of ML models to multiple intentionally-triggered data faults. Data mutators
explore vulnerabilities of ML systems against the effects of different fault
injections. The proposed framework is designed based on three main ideas: The
mutators are not random; one data mutator is applied at an instance of time,
and the selected ML models are optimized beforehand. This paper evaluates the
FIUL-Data framework using data from analytical chemistry, comprising retention
time measurements of anti-sense oligonucleotide. Empirical evaluation is
carried out in a two-step process in which the responses of selected ML models
to data mutation are analyzed individually and then compared with each other.
The results show that the FIUL-Data framework allows the evaluation of the
resilience of ML models. In most experimental cases, ML models show higher
resilience with larger training datasets, whereas gradient boost performed better
than support vector regression on smaller training sets. Overall, the mean
squared error metric is useful in evaluating the resilience of models due to
its higher sensitivity to data mutation.
Comment: 18 pages
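The idea of applying one non-random data mutator at a time and measuring the change in mean squared error can be sketched as below. This is not the FIUL-Data implementation: the mutator `shift_targets`, the synthetic data, and the simple least-squares line standing in for an optimized ML model are all assumptions made for illustration.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def mse(model, xs, ys):
    """Mean squared error of a fitted line on the given data."""
    slope, intercept = model
    return mean((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys))


def shift_targets(ys, every=4, offset=5.0):
    """Hypothetical deterministic mutator: offset every `every`-th target."""
    return [y + offset if i % every == 0 else y for i, y in enumerate(ys)]


# Synthetic, noise-free data (an assumption; the paper uses retention times).
train_x = [float(i) for i in range(40)]
train_y = [2.0 * x + 1.0 for x in train_x]
test_x = [x + 0.5 for x in train_x]
test_y = [2.0 * x + 1.0 for x in test_x]

# Train once on clean data and once on mutated data; evaluate both on the
# same clean test set so the MSE difference reflects the injected fault.
baseline_mse = mse(fit_line(train_x, train_y), test_x, test_y)
mutated_mse = mse(fit_line(train_x, shift_targets(train_y)), test_x, test_y)
```

Comparing `mutated_mse` with `baseline_mse` for each mutator and each model, one mutator at a time, gives a simple per-model resilience profile.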
Cause-Effect Inference in Location-Scale Noise Models: Maximum Likelihood vs. Independence Testing
A fundamental problem of causal discovery is cause-effect inference, learning
the correct causal direction between two random variables. Significant progress
has been made through modelling the effect as a function of its cause and a
noise term, which allows us to leverage assumptions about the generating
function class. The recently introduced heteroscedastic location-scale noise
functional models (LSNMs) combine expressive power with identifiability
guarantees. LSNM model selection based on maximizing likelihood achieves
state-of-the-art accuracy, when the noise distributions are correctly
specified. However, through an extensive empirical evaluation, we demonstrate
that the accuracy deteriorates sharply when the form of the noise distribution
is misspecified by the user. Our analysis shows that the failure occurs mainly
when the conditional variance in the anti-causal direction is smaller than that
in the causal direction. As an alternative, we find that causal model selection
through residual independence testing is much more robust to noise
misspecification and misleading conditional variance.
Comment: preprint
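For orientation, a location-scale noise model is commonly written as below. The symbols $f$, $g$, and $N$ are notation assumed here rather than taken from the abstract.

```latex
% Location-scale noise model for the causal direction X -> Y:
% f is the location function, g > 0 the scale function, N the noise term.
\[
  Y = f(X) + g(X)\,N, \qquad g(x) > 0, \qquad N \perp\!\!\!\perp X .
\]
% Residual-independence view: in the causal direction, the standardized
% residual recovers the noise and should be independent of the cause.
\[
  \hat{N} = \frac{Y - f(X)}{g(X)}, \qquad \hat{N} \perp\!\!\!\perp X .
\]
```

Under this reading, likelihood-based selection depends on the assumed density of $N$, whereas the independence criterion only checks whether the standardized residual is independent of the putative cause.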
Empirical Evaluation of Mutation-based Test Prioritization Techniques
We propose a new test case prioritization technique that combines both
mutation-based and diversity-based approaches. Our diversity-aware
mutation-based technique relies on the notion of mutant distinguishment, which
aims to distinguish one mutant's behavior from another, rather than from the
original program. We empirically investigate the relative cost and
effectiveness of the mutation-based prioritization techniques (i.e., using both
the traditional mutant kill and the proposed mutant distinguishment) with 352
real faults and 553,477 developer-written test cases. The empirical evaluation
considers both the traditional and the diversity-aware mutation criteria in
various settings: single-objective greedy, hybrid, and multi-objective
optimization. The results show that there is no single dominant technique
across all the studied faults. To this end, we show when and why each one of
the mutation-based prioritization criteria performs poorly,
using a graphical model called Mutant Distinguishment Graph (MDG) that
demonstrates the distribution of the fault detecting test cases with respect to
mutant kills and distinguishment.
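The single-objective greedy setting with the traditional mutant-kill criterion can be sketched as follows. The kill matrix, the test and mutant names, and the alphabetical tie-breaking rule are hypothetical; the distinguishment-based and multi-objective variants studied in the paper are not shown.

```python
def greedy_prioritize(kills):
    """Order tests by how many not-yet-killed mutants each one kills.

    `kills` maps a test name to the set of mutants it kills (hypothetical
    input format). Ties are broken alphabetically by test name.
    """
    remaining = dict(kills)
    killed = set()
    order = []
    while remaining:
        # max() returns the first maximal element, so iterating over the
        # sorted names yields alphabetical tie-breaking.
        best = max(sorted(remaining), key=lambda t: len(remaining[t] - killed))
        order.append(best)
        killed |= remaining.pop(best)
    return order


# Made-up kill matrix for four tests and five mutants.
kills = {
    "t1": {"m1", "m2"},
    "t2": {"m2", "m3", "m4"},
    "t3": {"m1"},
    "t4": {"m5"},
}
order = greedy_prioritize(kills)
# order == ["t2", "t1", "t4", "t3"]
```

Here `t2` is scheduled first because it kills the most mutants, and `t3` comes last because `t1` has already killed its only mutant.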