12 research outputs found

    Systematically Covering Input Structure

    Grammar-based testing uses a given grammar to produce syntactically valid inputs. To cover program features, it is necessary to also cover input features, say, all URL variants for a URL parser. Our k-path algorithm for grammar production systematically covers syntactic elements as well as their combinations. In our evaluation, we show that this results in significantly higher code coverage than the state of the art.

    From Input Coverage to Code Coverage: Systematically Covering Input Structure with k-Paths

    Grammar-based testing uses a given grammar to produce syntactically valid inputs. Intuitively, to cover program features, it is necessary to also cover input features. We present a measure of input coverage called k-path coverage, which takes into account the coverage of individual syntactic elements as well as their combinations up to a given depth k. A k-path coverage with k = 1 prescribes that all individual symbols be covered; k-path coverage with k = 2 dictates that all symbols in the context of all their parents be covered; and so on. Using the k-path measure, we make a number of contributions. (1) We provide an algorithm for grammar-based production that constructively covers a given k-path measure. In our evaluation, using k-path during production results in significantly higher code coverage than state-of-the-art approaches that ignore input coverage. (2) We show on a selection of real-world subjects that coverage of input elements, as measured by k-path, correlates with code coverage. As a consequence, k-path coverage can also be used to predict code coverage. (3) We show that one can learn associations between individual k-path features and coverage of specific locations: "Method `distrule` is invoked whenever both `+` and `*` occur in an expression." Developers can interpret these associations to create suitable inputs that focus on selected methods, or have tools generate inputs that immediately target these methods. The above approaches have been implemented in the tribble and codeine prototypes, and evaluated on a number of processors for JSON, CSV, URLs, and Markdown. All tools and data are available as open source.
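
    As a rough illustration of the k-path measure, the following Python sketch extracts every chain of k nested symbols from derivation trees and measures how many distinct chains a test set covers. The (symbol, children) tree encoding and the toy expression trees are assumptions made for this example, not the tribble implementation.

```python
# A sketch of k-path coverage over derivation trees. Trees are encoded as
# (symbol, children) tuples; this encoding and the helper names are
# assumptions for illustration, not the tribble implementation.

def k_paths(tree, k):
    """Collect every chain of k nested symbols occurring in the tree."""
    paths = set()

    def walk(node, ancestors):
        symbol, children = node
        chain = ancestors + (symbol,)
        if len(chain) >= k:          # record the window ending at this node
            paths.add(chain[-k:])
        for child in children:
            walk(child, chain)

    walk(tree, ())
    return paths

def k_path_coverage(trees, k):
    """Union of k-paths covered by a set of derivation trees."""
    covered = set()
    for tree in trees:
        covered |= k_paths(tree, k)
    return covered

# Two derivations of a toy expression grammar: "1" and "2+3".
t1 = ("<expr>", [("<term>", [("<digit>", [("1", [])])])])
t2 = ("<expr>", [("<expr>", [("<term>", [("<digit>", [("2", [])])])]),
                 ("+", []),
                 ("<term>", [("<digit>", [("3", [])])])])

print(len(k_path_coverage([t1], 2)), "vs", len(k_path_coverage([t1, t2], 2)))
```

    With k = 1, the same procedure reduces to counting individual symbols, matching the definition above.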

    Grammar-based fuzzing of data integration parsers in computational materials science

    Context: Computational materials science (CMS) focuses on in silico experiments to compute the properties of known and novel materials, where many software packages are used in the community. The NOMAD Laboratory (Draxl C, Scheffler) offers to store the input and output files in its FAIR data repository. Since the file formats of these software packages are non-standardized, parsers are used to provide the results in a normalized format. Objective: The main goal of this article is to report experience and findings from using grammar-based fuzzing on these parsers. Method: We have constructed an input grammar for four common software packages in the CMS domain and performed an experimental evaluation of the capabilities of grammar-based fuzzing to detect failures in the Novel Materials Discovery (NOMAD) parsers. Results: With our approach, we were able to identify three unique critical bugs concerning service availability, as well as several additional syntactic, semantic, logical, and downstream bugs in the investigated NOMAD parsers. We reported all issues to the developer team prior to publication. Conclusion: Based on the experience gained, we recommend grammar-based fuzzing for other research software packages as well, to improve trust in the correctness of the produced results.
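
    For readers unfamiliar with the technique, the following Python sketch shows the general grammar-based fuzzing loop applied in the study: derive random inputs from a context-free grammar and feed them to the parser under test. The toy grammar and the parse_under_test stub are placeholders, not the actual NOMAD file-format grammars or parsers.

```python
# A sketch of the general grammar-based fuzzing loop: derive random inputs
# from a context-free grammar and feed them to the parser under test. The
# toy grammar and parse_under_test stand in for the real NOMAD file-format
# grammars and parsers.
import random

GRAMMAR = {
    "<file>":   [["<header>", "\n", "<values>"]],
    "<header>": [["energy", " ", "<number>"]],
    "<values>": [["<number>"], ["<number>", " ", "<values>"]],
    "<number>": [["<digit>"], ["<digit>", "<number>"]],
    "<digit>":  [[d] for d in "0123456789"],
}

def produce(symbol="<file>", depth=0, max_depth=10):
    """Randomly expand a nonterminal into a concrete input string."""
    if symbol not in GRAMMAR:
        return symbol                              # terminal symbol
    expansions = GRAMMAR[symbol]
    if depth >= max_depth:                         # force short expansions
        expansions = [min(expansions, key=len)]
    rule = random.choice(expansions)
    return "".join(produce(s, depth + 1, max_depth) for s in rule)

def parse_under_test(text):
    """Placeholder for a parser being fuzzed."""
    header, values = text.split("\n")
    return float(header.split()[1]), [int(v) for v in values.split()]

for _ in range(100):
    data = produce()
    try:
        parse_under_test(data)
    except Exception as exc:                       # log failures, keep fuzzing
        print(f"failure on {data!r}: {exc}")
```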

    Do Judge a Test by its Cover: Combining Combinatorial and Property-Based Testing

    Property-based testing uses randomly generated inputs to validate high-level program specifications. It can be shockingly effective at finding bugs, but it often requires generating a very large number of inputs to do so. In this paper, we apply ideas from combinatorial testing, a powerful and widely studied testing methodology, to modify the distributions of our random generators so as to find bugs with fewer tests. The key concept is combinatorial coverage, which measures the degree to which a given set of tests exercises every possible choice of values for every small combination of input features. In its “classical” form, combinatorial coverage only applies to programs whose inputs have a very particular shape: essentially, a Cartesian product of finite sets. We generalize combinatorial coverage to the richer world of algebraic data types by formalizing a class of sparse test descriptions based on regular tree expressions. This new definition of coverage inspires a novel combinatorial thinning algorithm for improving the coverage of random test generators, requiring many fewer tests to catch bugs. We evaluate this algorithm on two case studies, a typed evaluator for System F terms and a Haskell compiler, showing significant improvements in both.
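
    The “classical” notion of combinatorial coverage can be computed directly when inputs are tuples over finite domains. The Python sketch below does so for pairs (t = 2); it is only a baseline illustration and does not attempt the paper's generalization to algebraic data types.

```python
# A sketch of classical t-way combinatorial coverage for inputs drawn from a
# Cartesian product of finite domains (the setting the paper generalizes).
# The domains, tests, and t are illustrative.
from itertools import combinations, product

def t_way_coverage(tests, domains, t=2):
    """Fraction of all t-way value combinations exercised by the tests."""
    positions = range(len(domains))
    total = 0
    for pos_combo in combinations(positions, t):
        total += len(set(product(*(domains[p] for p in pos_combo))))
    covered = set()
    for test in tests:
        for pos_combo in combinations(positions, t):
            covered.add((pos_combo, tuple(test[p] for p in pos_combo)))
    return len(covered) / total

# Two boolean flags and a mode field; these two tests cover half of all pairs.
domains = [(True, False), (True, False), ("fast", "safe")]
tests = [(True, True, "fast"), (False, True, "safe")]
print(t_way_coverage(tests, domains))   # 0.5
```

    A thinning step in the spirit of the abstract would then bias the random generator toward tests that add so-far-uncovered combinations.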

    Astraea: Grammar-based Fairness Testing

    Software often produces biased outputs. In particular, machine learning (ML) based software is known to produce erroneous predictions when processing discriminatory inputs. Such unfair program behavior can be caused by societal bias. In the last few years, Amazon, Microsoft, and Google have provided software services that produce unfair outputs, mostly due to societal bias (e.g., gender or race). In such events, developers are saddled with the task of conducting fairness testing. Fairness testing is challenging; developers are tasked with generating discriminatory inputs that reveal and explain biases. We propose a grammar-based fairness testing approach (called ASTRAEA) which leverages context-free grammars to generate discriminatory inputs that reveal fairness violations in software systems. Using probabilistic grammars, ASTRAEA also provides fault diagnosis by isolating the cause of observed software bias. ASTRAEA's diagnoses facilitate the improvement of ML fairness. ASTRAEA was evaluated on 18 software systems that provide three major natural language processing (NLP) services. In our evaluation, ASTRAEA generated fairness violations at a rate of ~18%. ASTRAEA generated over 573K discriminatory test cases and found over 102K fairness violations. Furthermore, ASTRAEA improves software fairness by ~76% via model retraining.
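
    The core testing idea can be pictured as generating input pairs from a grammar that differ only in a protected attribute and flagging diverging outputs. The grammar, attribute values, and classify_sentiment stub in the Python sketch below are invented for illustration and do not reflect ASTRAEA's actual grammars or subject systems.

```python
# A sketch of grammar-based fairness testing: generate sentence pairs that
# differ only in a protected attribute and flag diverging outputs. The
# grammar, attribute values, and classify_sentiment are invented stand-ins.
import random

GRAMMAR = {
    "<sentence>":   [["<name>", " is a ", "<occupation>", "."]],
    "<occupation>": [["doctor"], ["nurse"], ["teacher"]],
}
PROTECTED = {"<name>": ["Alice", "Bob"]}           # sensitive attribute values

def derive(symbol, binding):
    """Expand a symbol, substituting the bound protected-attribute value."""
    if symbol in PROTECTED:
        return binding[symbol]
    if symbol not in GRAMMAR:
        return symbol
    return "".join(derive(s, binding) for s in random.choice(GRAMMAR[symbol]))

def classify_sentiment(text):
    """Stand-in for the NLP service under test (deliberately biased)."""
    return "positive" if "Alice" in text or "doctor" in text else "neutral"

random.seed(0)
for _ in range(10):
    seed = random.randrange(10**6)                 # same expansions per pair
    outputs = []
    for value in PROTECTED["<name>"]:
        random.seed(seed)
        text = derive("<sentence>", {"<name>": value})
        outputs.append((text, classify_sentiment(text)))
    if len({label for _, label in outputs}) > 1:
        print("fairness violation:", outputs)
```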

    Learning the language of apps

    To explore the functionality of an app, automated test generators systematically identify and interact with its user interface (UI) elements. A key challenge is to synthesize inputs which effectively and efficiently cover app behavior. To do so, a test generator has to choose not only which elements to interact with, but also which interactions to perform on each element and which input values to type. In summary, to better test apps, a test generator should know the app's language, that is, the language of its graphical interactions and the language of its textual inputs. In this work, we show how a test generator can learn the language of apps and how this knowledge is modeled to create tests. We demonstrate how to learn the language of the graphical input prior to testing by combining machine learning and static analysis, and how to refine this knowledge during testing using reinforcement learning. In our experiments, statically learned models resulted in 50% fewer ineffective actions and an average increase in test (code) coverage of 19%, while refining these models through reinforcement learning resulted in an additional test (code) coverage increase of up to 20%. We learn the language of textual inputs by identifying the semantics of input fields in the UI and querying the web for real-world values. In our experiments, real-world values increase test (code) coverage by ~10%. Finally, we show how to use context-free grammars to integrate both languages into a single representation (UI grammar), giving back control to the user. This representation can then be mined from existing tests, associated with the app source code, and used to produce new tests. 82% of the test cases produced by fuzzing our UI grammar can reach a UI element within the app, and 70% of them can reach a specific code location.
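
    The UI grammar mentioned at the end of the abstract can be pictured as a context-free grammar whose derivations are sequences of UI actions and textual inputs. The Python sketch below fuzzes such a grammar into test scripts; the screens, widget ids, and values are made up for illustration.

```python
# A sketch of a UI grammar: a context-free grammar whose derivations are
# sequences of UI actions plus textual inputs. Screens, widget ids, and
# values are made up for illustration.
import random

UI_GRAMMAR = {
    "<test>":     [["<login>", "<browse>"]],
    "<login>":    [["type user_field <username>", "click login_button"]],
    "<browse>":   [["click menu", "click item_0"],
                   ["swipe list", "<browse>"]],
    "<username>": [["alice@example.com"], ["bob@example.com"]],
}

def derive(symbol, depth=0):
    """Expand a symbol into a flat list of UI actions."""
    if symbol not in UI_GRAMMAR:
        return [symbol]
    rules = UI_GRAMMAR[symbol]
    rule = rules[0] if depth > 5 else random.choice(rules)   # bound recursion
    actions = []
    for token in rule:
        if token.startswith("<"):                # a nested action sequence
            actions += derive(token, depth + 1)
        else:                                    # fill textual placeholders
            actions.append(" ".join(derive(p, depth + 1)[0]
                                    if p.startswith("<") else p
                                    for p in token.split()))
    return actions

random.seed(1)
print(derive("<test>"))   # e.g. ['type user_field bob@example.com', ...]
```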

    Evidence-driven testing and debugging of software systems

    Program debugging is the process of testing, exposing, reproducing, diagnosing and fixing software bugs. Many techniques have been proposed to aid developers during software testing and debugging. However, researchers have found that developers hardly use or adopt the proposed techniques in software practice. Evidently, this is because there is a gap between proposed methods and the state of software practice. Most methods fail to address the actual needs of software developers. In this dissertation, we pose the following scientific question: How can we bridge the gap between software practice and state-of-the-art automated testing and debugging techniques? To address this challenge, we put forward the following thesis: Software testing and debugging should be driven by empirical evidence collected from software practice. In particular, we posit that the feedback from software practice should shape and guide (the automation of) testing and debugging activities. In this thesis, we focus on gathering evidence from software practice by conducting several empirical studies on software testing and debugging activities in the real world. We then build tools and methods that are well-grounded in and driven by the empirical evidence obtained from these experiments. Firstly, we conduct an empirical study on the state of debugging in practice using a survey and a human study. In this study, we ask developers about their debugging needs and observe the tools and strategies employed by developers while testing, diagnosing and repairing real bugs. Secondly, we evaluate the effectiveness of state-of-the-art automated fault localization (AFL) methods on real bugs and programs. Thirdly, we conduct an experiment to evaluate the causes of invalid inputs in software practice. Lastly, we study how to learn input distributions from real-world sample inputs, using probabilistic grammars. To bridge the gap between software practice and the state of the art in software testing and debugging, we proffer the following empirical results and techniques: (1) We collect evidence on the state of practice in program debugging and indeed, we find that there is a chasm between (available) debugging tools and developer needs. We elicit the actual needs and concerns of developers when testing and diagnosing real faults and provide a benchmark (called DBGBench) to aid the automated evaluation of debugging and repair tools. (2) We provide empirical evidence on the effectiveness of several state-of-the-art AFL techniques (such as statistical debugging formulas and dynamic slicing). Building on the obtained empirical evidence, we provide a hybrid approach that outperforms the state-of-the-art AFL techniques. (3) We evaluate the prevalence and causes of invalid inputs in software practice, and we draw on the lessons learned from this experiment to build a general-purpose algorithm (called ddmax) that automatically diagnoses and repairs real-world invalid inputs. (4) We provide a method to learn the distribution of input elements in software practice using probabilistic grammars, and we further employ the learned distribution to drive the generation of test inputs that are similar (or dissimilar) to sample inputs found in the wild. In summary, we propose an evidence-driven approach to software testing and debugging, which is based on collecting empirical evidence from software practice to guide and direct software testing and debugging.
In our evaluation, we found that our approach is effective at improving several debugging activities in practice. In particular, using our evidence-driven approach, we elicit the actual debugging needs of developers, improve the effectiveness of several automated fault localization techniques, effectively debug and repair invalid inputs, and generate test inputs that are (dis)similar to real-world inputs. Our proposed methods are built on empirical evidence, and they improve over the state-of-the-art techniques in testing and debugging.
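
    The last contribution, learning input distributions with probabilistic grammars, can be sketched as counting how often each expansion occurs in sample derivation trees and then sampling new inputs with the same weights. The encoding and names in the Python sketch below are assumptions for illustration, not the dissertation's actual implementation.

```python
# A sketch of learning a probabilistic grammar from samples: tally expansion
# frequencies in sample derivation trees, then generate new inputs with the
# same weights. Trees are (symbol, children) tuples; everything here is
# illustrative.
import random
from collections import Counter, defaultdict

def count_expansions(tree, counts):
    """Tally which expansion each nonterminal used in a sample derivation."""
    symbol, children = tree
    if children:
        counts[symbol][tuple(child[0] for child in children)] += 1
        for child in children:
            count_expansions(child, counts)

def learn(trees):
    counts = defaultdict(Counter)
    for tree in trees:
        count_expansions(tree, counts)
    return counts

def generate(symbol, counts, depth=0):
    """Sample an expansion proportionally to its frequency in the samples."""
    if symbol not in counts:
        return symbol                               # terminal
    if depth > 20:
        return ""                                   # crude recursion bound
    expansions, weights = zip(*counts[symbol].items())
    choice = random.choices(expansions, weights=weights)[0]
    return "".join(generate(s, counts, depth + 1) for s in choice)

samples = [("<expr>", [("<digit>", [("7", [])])]),
           ("<expr>", [("<digit>", [("7", [])])]),
           ("<expr>", [("<digit>", [("3", [])])])]
counts = learn(samples)
print([generate("<expr>", counts) for _ in range(5)])   # mostly "7"s
```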