25 research outputs found
Test them all, is it worth it? Assessing configuration sampling on the JHipster Web development stack
Many approaches for testing configurable software systems start from the same assumption: it is impossible to test all configurations. This has motivated the definition of variability-aware abstractions and sampling techniques to cope with large configuration spaces. Yet, there is no theoretical barrier that prevents the exhaustive testing of all configurations by simply enumerating them, if the effort required to do so remains acceptable. Not only this: we believe there is a lot to be learned by systematically and exhaustively testing a configurable system. In this case study, we report on the first-ever endeavour to test all possible configurations of the industry-strength, open-source configurable software system JHipster, a popular code generator for web applications. We built a testing scaffold for the 26,000+ configurations of JHipster using a cluster of 80 machines over 4 nights, for a total of 4,376 hours (182 days) of CPU time. We find that 35.70% of configurations fail, and we identify the feature interactions that cause the errors. We show that sampling strategies (like dissimilarity and 2-wise) (1) are more effective at finding faults than the 12 default configurations used in the JHipster continuous integration, and (2) can be too costly and exceed the available testing budget. We cross this quantitative analysis with the qualitative assessment of JHipster's lead developers.
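The 2-wise (pairwise) strategy mentioned in the abstract can be illustrated with a short sketch: greedily pick configurations until every pair of option values is covered. The feature names and the brute-force enumeration below are illustrative toys, not JHipster's actual feature model or the sampling tools used in the study.

```python
# Illustrative sketch of greedy 2-wise (pairwise) sampling: repeatedly pick
# the configuration that covers the most not-yet-covered feature-value pairs.
# The toy feature space below is hypothetical, not JHipster's actual model.
from itertools import combinations, product

FEATURES = ["maven", "docker", "mongodb", "websocket"]  # hypothetical options

# Enumerate all boolean configurations (real tools would respect constraints).
configurations = [dict(zip(FEATURES, values))
                  for values in product([False, True], repeat=len(FEATURES))]

def pairs(config):
    """All (feature, value) pairs of a configuration, taken two at a time."""
    items = sorted(config.items())
    return set(combinations(items, 2))

def greedy_pairwise_sample(configs):
    """Greedily select configurations until every value pair is covered."""
    uncovered = set().union(*(pairs(c) for c in configs))
    sample = []
    while uncovered:
        best = max(configs, key=lambda c: len(pairs(c) & uncovered))
        sample.append(best)
        uncovered -= pairs(best)
    return sample

sample = greedy_pairwise_sample(configurations)
print(f"{len(sample)} of {len(configurations)} configurations cover all pairs")
```

In practice the sample is far smaller than the full space, which is exactly the trade-off the study quantifies against exhaustive enumeration.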
Empirical assessment of generating adversarial configurations for software product lines
Software product line (SPL) engineering allows the derivation of products tailored to stakeholders' needs through the setting of a large number of configuration options. Unfortunately, options and their interactions create a huge configuration space which is either intractable or too costly to explore exhaustively. Instead of covering all products, machine learning (ML) approximates the set of acceptable products (e.g., successful builds, passing tests) out of a training set (a sample of configurations). However, ML techniques can make prediction errors, yielding non-acceptable products that waste time, energy, and other resources. We apply adversarial machine learning techniques to the world of SPLs and craft new configurations that appear to be acceptable but are not, and vice versa. This makes it possible to diagnose prediction errors and take appropriate actions. We develop two adversarial configuration generators, built on top of state-of-the-art attack algorithms, that are capable of synthesizing configurations that are adversarial and yet conform to logical constraints. We empirically assess our generators within two case studies: an industrial video synthesizer (MOTIV) and an industry-strength, open-source Web-app configurator (JHipster). For the two cases, our attacks yield (up to) a 100% misclassification rate without sacrificing the logical validity of adversarial configurations. This work lays the foundations of a quality assurance framework for ML-based SPLs.
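As a rough illustration of the idea (not the paper's actual attack algorithms, which build on state-of-the-art adversarial ML), one can start from a configuration and greedily flip single options until the classifier's prediction changes, discarding flips that violate the feature model. The classifier, labels, and constraint below are hypothetical toys.

```python
# Minimal illustration of crafting an adversarial configuration: flip one
# option at a time so the predicted class changes while the configuration
# stays logically valid. Ground truth and constraint are hypothetical toys.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 5))                 # 200 random 5-option configs
y = ((X[:, 0] == 1) & (X[:, 2] == 0)).astype(int)     # toy "acceptable" labels

clf = RandomForestClassifier(random_state=0).fit(X, y)

def is_valid(config):
    """Hypothetical feature-model constraint: option 3 requires option 1."""
    return not (config[3] == 1 and config[1] == 0)

def adversarial_flip(config):
    """Greedily flip one option so the prediction changes but the
    configuration remains valid; return None if no single flip works."""
    original = clf.predict([config])[0]
    for i in range(len(config)):
        candidate = config.copy()
        candidate[i] ^= 1
        if is_valid(candidate) and clf.predict([candidate])[0] != original:
            return candidate
    return None

seed = X[0].copy()
print("seed:", seed, "-> adversarial:", adversarial_flip(seed))
```

Each adversarial configuration found this way pinpoints an input near the classifier's decision boundary, which is what makes such configurations useful for diagnosing prediction errors.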
A Rule-Learning Approach for Detecting Faults in Highly Configurable Software Systems from Uniform Random Samples
Software systems tend to become more and more configurable to satisfy the demands of their increasingly varied customers. Exhaustively testing the correctness of highly configurable software is infeasible in most cases because the space of possible configurations is typically colossal. This paper proposes addressing this challenge by (i) working with a representative sample of the configurations, i.e., a "uniform" random sample, and (ii) processing the results of testing the sample with a rule induction system that extracts the faults that cause the tests to fail. The paper (i) gives a concrete implementation of the approach, (ii) compares the performance of the rule learning algorithms AQ, CN2, LEM2, PART, and RIPPER, and (iii) provides empirical evidence supporting our procedure.
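A minimal sketch of the procedure, with a decision tree standing in for the rule learners the paper actually compares (AQ, CN2, LEM2, PART, RIPPER): fit a model on a uniform random sample of (configuration, pass/fail) outcomes and read failure-predicting rules off the tree. The option names and the injected fault are hypothetical.

```python
# Sketch of rule extraction from test outcomes. A decision tree stands in
# for the dedicated rule learners compared in the paper; its failure paths
# play the role of the induced fault-describing rules.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(1)
features = ["gradle", "h2", "oauth2", "elasticsearch"]  # hypothetical options
X = rng.integers(0, 2, size=(300, len(features)))       # uniform random sample

# Toy fault: the interaction oauth2 & elasticsearch makes the tests fail.
y = ((X[:, 2] == 1) & (X[:, 3] == 1)).astype(int)       # 1 = test failed

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(tree, feature_names=features))        # rules as text paths
```

With enough samples, the printed paths recover the injected interaction (oauth2 = 1 and elasticsearch = 1 => fail), which mirrors how the induced rules point engineers at the faulty feature combinations.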
Uniform and scalable SAT-sampling for configurable systems
Several relevant analyses on configurable software systems remain intractable because they require examining vast and highly-constrained configuration spaces. Those analyses could be addressed through statistical inference, i.e., working with a much more tractable sample that later supports generalizing the results obtained to the entire configuration space. To make this possible, the laws of statistical inference impose an indispensable requirement: each member of the population must be equally likely to be included in the sample, i.e., the sampling process needs to be "uniform". Various SAT-samplers have been developed for generating uniform random samples at a reasonable computational cost. Unfortunately, there is a lack of experimental validation on large configuration models showing whether the samplers indeed produce genuinely uniform samples. This paper (i) presents a new statistical test to verify to what extent samplers accomplish uniformity and (ii) reports the evaluation of four state-of-the-art samplers: Spur, QuickSampler, Unigen2, and Smarch. According to our experimental results, only Spur satisfies both scalability and uniformity.
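The principle behind exact uniform samplers can be sketched as counting-based sampling: fix variables one at a time, choosing each value with probability proportional to the number of models it leaves. The toy below counts models by brute force purely for illustration; real samplers rely on efficient model counting via #SAT solvers or BDDs.

```python
# Sketch of counting-based uniform sampling over a toy propositional formula.
# Brute-force counting here is only for illustration; it is exponential and
# real samplers replace it with #SAT- or BDD-based counting.
import random
from itertools import product

VARS = ["a", "b", "c", "d"]

def formula(m):
    """A toy constraint: a -> b, and at least one of c, d."""
    return (not m["a"] or m["b"]) and (m["c"] or m["d"])

def count_models(partial):
    """Number of total assignments extending `partial` satisfying formula."""
    free = [v for v in VARS if v not in partial]
    return sum(formula({**partial, **dict(zip(free, vals))})
               for vals in product([False, True], repeat=len(free)))

def uniform_sample():
    """Draw one satisfying assignment, each with equal probability."""
    partial = {}
    for v in VARS:
        n_false = count_models({**partial, v: False})
        n_true = count_models({**partial, v: True})
        partial[v] = random.random() < n_true / (n_false + n_true)
    return partial

print(uniform_sample())
```

Because every branch is taken with probability proportional to its model count, each satisfying assignment is drawn with probability exactly 1/(total number of models), which is the uniformity guarantee these samplers aim for.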
Uniform and scalable sampling of highly configurable systems
Many analyses on configurable software systems are intractable when confronted with colossal and highly-constrained configuration spaces. These analyses could instead use statistical inference, where a tractable sample accurately predicts results for the entire space. To do so, the laws of statistical inference require each member of the population to be equally likely to be included in the sample, i.e., the sampling process needs to be "uniform". SAT-samplers have been developed to generate uniform random samples at a reasonable computational cost. However, there is a lack of experimental validation over colossal spaces to show whether the samplers indeed produce uniform samples or not. This paper (i) proposes a new sampler named BDDSampler, (ii) presents a new statistical test to verify sampler uniformity, and (iii) reports the evaluation of BDDSampler and five other state-of-the-art samplers: KUS, QuickSampler, Smarch, Spur, and Unigen2. Our experimental results show that only BDDSampler satisfies both scalability and uniformity.
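The uniformity check itself can be approximated with a standard goodness-of-fit test (a simplified stand-in for the statistical test the paper defines): draw many samples, count how often each solution appears, and test the counts against the uniform distribution. `draw_sample` is assumed to be the sampler under test and must return hashable solutions.

```python
# Simplified stand-in for a sampler-uniformity check using a chi-square
# goodness-of-fit test. Assumes the solution space is small enough that
# every solution shows up at least once among the draws; with few draws,
# enumerate the solution space and include zero counts instead.
from collections import Counter
from scipy.stats import chisquare

def check_uniformity(draw_sample, n_draws=10000, alpha=0.01):
    counts = Counter(draw_sample() for _ in range(n_draws))
    # H0: all observed solutions are equally likely (default expected
    # frequencies in chisquare are uniform over the categories).
    _, p_value = chisquare(list(counts.values()))
    return p_value >= alpha  # True: uniformity is not rejected

# Example with the counting-based sampler sketched earlier (solutions must
# be hashable, hence the frozen tuple):
# print(check_uniformity(lambda: tuple(sorted(uniform_sample().items()))))
```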
Uniform Sampling of SAT Solutions for Configurable Systems: Are We There Yet?
Uniform or near-uniform generation of solutions for large satisfiability formulas is a problem of theoretical and practical interest for the testing community. Recent works proposed two algorithms (namely UniGen and QuickSampler) for reaching a good compromise between execution time and uniformity guarantees, with empirical evidence on SAT benchmarks. In the context of highly-configurable software systems (e.g., Linux), it is unclear whether UniGen and QuickSampler can scale and sample uniform software configurations. In this paper, we perform a thorough experiment on 128 real-world feature models. We find that UniGen is unable to produce SAT solutions out of such feature models. Furthermore, we show that QuickSampler does not generate uniform samples and that some features are either never part of the sample or too frequently present. Finally, using a case study, we characterize the impact of these results on the ability to find bugs in a configurable system. Overall, our results suggest that we are not there yet: more research is needed to explore the cost-effectiveness of uniform sampling when testing large configurable systems.
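The per-feature frequency observation can be checked as sketched below: compare how often each feature occurs in the sampler's output with its exact marginal probability (the fraction of models containing that feature). This reuses the toy `VARS` and `count_models` from the earlier counting-based sampling sketch; a real study would compute marginals with a #SAT solver over the feature model.

```python
# Compare a sampler's observed feature frequencies against exact marginals.
# Reuses the toy VARS / count_models / uniform_sample defined earlier; a
# feature sampled far above or below its marginal signals non-uniformity.
def feature_frequencies(draw_sample, n_draws=5000):
    totals = {v: 0 for v in VARS}
    for _ in range(n_draws):
        sample = draw_sample()
        for v in VARS:
            totals[v] += sample[v]
    return {v: totals[v] / n_draws for v in totals}

def exact_marginals():
    total = count_models({})
    return {v: count_models({v: True}) / total for v in VARS}

# A uniform sampler's observed frequencies should track the exact marginals:
# print(feature_frequencies(uniform_sample))
# print(exact_marginals())
```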
Towards quality assurance of software product lines with adversarial configurations
Software product line (SPL) engineers put a lot of effort into ensuring that, through the setting of a large number of possible configuration options, products are acceptable and well-tailored to customers' needs. Unfortunately, options and their mutual interactions create a huge configuration space which is intractable to explore exhaustively. Instead of testing all products, machine learning is increasingly employed to approximate the set of acceptable products out of a small training sample of configurations. Machine learning (ML) techniques can refine a software product line through learned constraints and a priori prevent non-acceptable products from being derived. In this paper, we use adversarial ML techniques to generate adversarial configurations that fool ML classifiers and pinpoint incorrect classifications of products (videos) derived from an industrial video generator. Our attacks yield (up to) a 100% misclassification rate and a drop in accuracy of 5%. We discuss the implications these results have for SPL quality assurance.