FairFuzz: Targeting Rare Branches to Rapidly Increase Greybox Fuzz Testing Coverage
In recent years, fuzz testing has proven itself to be one of the most
effective techniques for finding correctness bugs and security vulnerabilities
in practice. One particular fuzz testing tool, American Fuzzy Lop or AFL, has
become popular thanks to its ease-of-use and bug-finding power. However, AFL
remains limited in the depth of program coverage it achieves, in particular
because it does not consider which parts of program inputs should not be
mutated in order to maintain deep program coverage. We propose an approach,
FairFuzz, that helps alleviate this limitation in two key steps. First,
FairFuzz automatically prioritizes inputs exercising rare parts of the program
under test. Second, it automatically adjusts the mutation of inputs so that the
mutated inputs are more likely to exercise these same rare parts of the
program. We evaluate FairFuzz on real-world programs against
state-of-the-art versions of AFL, repeating experiments thoroughly to obtain
reliable measures of variability. We find that on certain benchmarks FairFuzz
achieves significantly higher coverage after 24 hours than state-of-the-art
versions of AFL, while on others it reaches high program coverage at a
significantly faster rate.
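The two steps can be illustrated with a minimal Python sketch. Here `run` is a hypothetical stand-in for executing the instrumented target and reporting the branches an input exercises; this is an illustration of the idea, not FairFuzz's actual implementation:

```python
def rarest_branch(hit_counts):
    """Step 1: prioritize the branch exercised by the fewest inputs so far."""
    return min(hit_counts, key=hit_counts.get)

def mutation_mask(seed, target_branch, run):
    """Step 2: find byte positions that can be mutated without losing
    the target branch. `run(data)` stands in for executing the
    instrumented program and returning the branches it exercises."""
    mask = []
    for i in range(len(seed)):
        flipped = seed[:i] + bytes([seed[i] ^ 0xFF]) + seed[i + 1:]
        if target_branch in run(flipped):
            mask.append(i)  # safe to mutate here
    return mask

# Toy target: the "rare" branch requires the input to start with b"MAGIC".
def run(data):
    branches = {"common"}
    if data.startswith(b"MAGIC"):
        branches.add("rare")
    return branches

seed = b"MAGICpayload"
mask = mutation_mask(seed, "rare", run)  # positions 5..11: the header is off-limits
```

Subsequent mutations are then restricted to the masked positions, so mutated inputs keep exercising the rare branch.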
Protecting Systems From Exploits Using Language-Theoretic Security
Any computer program processing input from the user or network must validate that input. Input-handling vulnerabilities occur in programs when the software component responsible for filtering malicious input---the parser---does not perform validation adequately. Consequently, parsers are among the most targeted components, since they defend the rest of the program from malicious input. This thesis adopts the Language-Theoretic Security (LangSec) principle to understand what tools and research are needed to prevent exploits that target parsers. LangSec proposes specifying the syntactic structure of the input format as a formal grammar. We then build a recognizer for this formal grammar to validate any input before the rest of the program acts on it. To ensure that these recognizers represent the data format, programmers often rely on parser generator or parser combinator tools to build the parsers. This thesis propels several sub-fields in LangSec by proposing new techniques to find bugs in implementations, novel categorizations of vulnerabilities, and new parsing algorithms and tools to handle practical data formats. To this end, the thesis comprises five parts that tackle various tenets of LangSec. First, I categorize input-handling vulnerabilities and exploits using two frameworks: the mismorphisms framework, which helps us reason about the root causes leading to various vulnerabilities, and a categorization framework we built around LangSec anti-patterns such as parser differentials and insufficient input validation. We then built a catalog of more than 30 popular vulnerabilities to demonstrate the two frameworks. Second, I built parsers for various Internet of Things and power grid network protocols and the iccMAX file format using parser combinator libraries.
The parsers I built for power grid protocols were deployed and tested on power grid substation networks as an intrusion detection tool. The parser I built for the iccMAX file format led to several corrections and modifications to the iccMAX specifications and reference implementations. Third, I present SPARTA, a novel tool I built that generates Rust code to type-check Portable Document Format (PDF) files. The type checker I helped build strictly enforces the constraints in the PDF specification to find deviations. Our checker has contributed to at least four significant clarifications and corrections to the PDF 2.0 specification and various open-source PDF tools. In addition to our checker, we also built a practical tool, PDFFixer, to dynamically patch type errors in PDF files. Fourth, I present ParseSmith, a tool for building verified parsers for real-world data formats. Most parsing tools available for data formats are insufficient to handle practical formats or have not been verified for correctness. I built a verified parsing tool in Dafny that builds on ideas from attribute grammars, data-dependent grammars, and parsing expression grammars to tackle constructs commonly seen in network formats. I prove that our parsers run in linear time and always terminate for well-formed grammars. Finally, I provide the first systematic comparison of various data description languages (DDLs) and their parser generation tools. DDLs are used to describe and parse commonly used data formats, such as image formats. I conducted an expert elicitation qualitative study to derive metrics with which to compare the DDLs, and I systematically compared the DDLs on the sample data descriptions available with each, checking for correctness and resilience.
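The LangSec recipe of deriving a recognizer from a grammar and validating the entire input before the program acts on it can be sketched with a toy parser-combinator library. The combinator names (`tag`, `seq`, `alt`) and the message format are invented for illustration and are not taken from the thesis:

```python
def tag(s):
    """Combinator: match a literal string at position i."""
    def p(inp, i):
        return i + len(s) if inp.startswith(s, i) else None
    return p

def seq(*ps):
    """Combinator: run sub-parsers in sequence."""
    def p(inp, i):
        for q in ps:
            i = q(inp, i)
            if i is None:
                return None
        return i
    return p

def alt(*ps):
    """Combinator: try alternatives in order."""
    def p(inp, i):
        for q in ps:
            j = q(inp, i)
            if j is not None:
                return j
        return None
    return p

# Recognizer for a toy message format: ("GET" | "PUT") " /".
msg = seq(alt(tag("GET"), tag("PUT")), tag(" /"))

def recognize(inp):
    """LangSec-style validation: the whole input must match the
    grammar before the rest of the program may act on it."""
    return msg(inp, 0) == len(inp)
```

Inputs that the recognizer rejects, such as an unknown verb or trailing junk, never reach the program's core logic.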
The construction of oracles for software testing
Software testing is important throughout the software life cycle. Testing is the part of the software development process in which a computer program is subjected to specific conditions to show that the program meets its intended design. Building a testing oracle is one part of software testing. An oracle is an external mechanism that can be used to check test output for correctness. The characteristics of the available oracles have a dominating influence on the cost and quality of software testing. In this thesis, methods of constructing oracles are investigated and classified. Three kinds of method are identified: the pseudo-oracle approach, oracles using attribute grammars, and oracles based on formal specification. This thesis develops a method for constructing an oracle based on the Z specification language. A specification language can describe the correct syntax and semantics of software: the contextual part of a specification describes all the legal inputs to the program, and the semantic part describes the meaning of the given input data. Based on this idea, an oracle is constructed and a prototype is implemented according to the method proposed in the thesis.
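The idea of deriving an oracle from a specification's two parts, a contextual part describing legal inputs and a semantic part describing what the output must mean, can be sketched as follows. The integer-square-root specification is a made-up example, not one from the thesis:

```python
import math

def make_oracle(precondition, postcondition):
    """Build an oracle from a specification's two parts: the
    contextual part (which inputs are legal) and the semantic part
    (what the output must satisfy for a given input)."""
    def oracle(program, x):
        if not precondition(x):
            return True  # outside the spec's domain: nothing to check
        return postcondition(x, program(x))
    return oracle

# Toy spec for integer square root: the input must be a non-negative
# integer, and the result r must satisfy r*r <= x < (r+1)*(r+1).
isqrt_oracle = make_oracle(
    lambda x: isinstance(x, int) and x >= 0,
    lambda x, r: r * r <= x < (r + 1) * (r + 1),
)
```

A correct implementation such as `math.isqrt` passes the oracle on any legal input, while a buggy one like `lambda x: x // 2` is flagged.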
Semantic Fuzzing with Zest
Programs expecting structured inputs often consist of both a syntactic
analysis stage, which parses raw input, and a semantic analysis stage, which
conducts checks on the parsed input and executes the core logic of the program.
Generator-based testing tools in the lineage of QuickCheck are a promising way
to generate random syntactically valid test inputs for these programs. We
present Zest, a technique which automatically guides QuickCheck-like
random-input generators to better explore the semantic analysis stage of test
programs. Zest converts random-input generators into deterministic parametric
generators. We present the key insight that mutations in the untyped parameter
domain map to structural mutations in the input domain. Zest leverages program
feedback in the form of code coverage and input validity to perform
feedback-directed parameter search. We evaluate Zest against AFL and QuickCheck
on five Java programs: Maven, Ant, BCEL, Closure, and Rhino. Zest covers
1.03x-2.81x as many branches within the benchmarks' semantic analysis stages as
baseline techniques. Further, we find 10 new bugs in the semantic analysis
stages of these benchmarks. Zest is the most effective technique in finding
these bugs reliably and quickly, requiring at most 10 minutes on average to
find each bug. To appear in Proceedings of the 28th ACM SIGSOFT International
Symposium on Software Testing and Analysis (ISSTA'19).
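Zest's key move, replacing a generator's random choices with reads from an untyped parameter sequence so that byte-level mutations become structural mutations, can be illustrated with a toy expression generator. The `Params` class and the grammar here are hypothetical, not Zest's actual API:

```python
class Params:
    """An untyped parameter sequence; the generator consumes it in
    place of random choices, making generation deterministic."""
    def __init__(self, data):
        self.data, self.i = data, 0
    def choose(self, n):
        b = self.data[self.i % len(self.data)]
        self.i += 1
        return b % n

def gen_expr(p, depth=0):
    """QuickCheck-style generator for arithmetic expressions,
    reparameterized over `p` instead of a random source."""
    if depth >= 2 or p.choose(2) == 0:
        return str(p.choose(10))               # leaf: a digit
    op = "+-*"[p.choose(3)]
    return f"({gen_expr(p, depth + 1)}{op}{gen_expr(p, depth + 1)})"

seed = bytes([1, 5, 0, 7, 2, 3])
expr = gen_expr(Params(seed))                  # "(7*3)"
# Flipping one parameter byte yields a structurally different,
# yet still syntactically valid, expression.
mutated = gen_expr(Params(bytes([0]) + seed[1:]))  # "5"
```

Because every output is syntactically valid by construction, parameter-level search can focus coverage feedback on the semantic analysis stage rather than the parser.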
Grammar-based fuzzing using input features
In grammar-based fuzz testing, a formal grammar is used to produce test inputs that are syntactically valid, in order to reach the business logic of a program under test. In this setting, it is advantageous to ensure a high diversity of inputs so as to test more of the program's behavior. How can we characterize the features that make inputs diverse, and associate them with the execution of particular parts of the program? Previous work does not answer this question satisfactorily: most attempts consider only superficial features defined by the structure of the grammar, such as the presence of production rules or terminal symbols, regardless of their context. We present a measure of input coverage called k-path coverage, which takes into account combinations of grammar entities up to a given context depth k, and makes it possible to efficiently express, assess, and achieve input diversity. In a series of experiments, we demonstrate and evaluate how to systematically attain k-path coverage and how it correlates with code coverage, so that it can be used as a predictor of code coverage. By automatically inferring explicit associations between k-path features and the coverage of individual methods, we further show how to generate inputs that specifically target the execution of given code locations. We expect the presented instrument of k-paths to prove useful in numerous additional applications, such as assessing the quality of grammars, serving as an adequacy criterion for input test suites, enabling test case prioritization, facilitating program comprehension, and perhaps beyond.
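One way to read the k-path measure: collect every length-k sequence of grammar entities along root-to-leaf paths of a derivation tree. A minimal sketch, with an invented `(symbol, children)` tree encoding rather than the paper's actual data structures:

```python
def k_paths(tree, k):
    """All length-k symbol sequences along root-to-leaf paths of a
    derivation tree, encoded as (symbol, [children])."""
    paths = set()
    def walk(node, prefix):
        sym, children = node
        prefix = (prefix + (sym,))[-k:]   # sliding window of depth k
        if len(prefix) == k:
            paths.add(prefix)
        for c in children:
            walk(c, prefix)
    walk(tree, ())
    return paths

# Derivation tree for "1+2" under a toy expression grammar.
tree = ("expr", [
    ("term", [("digit", [])]),
    ("plus", []),
    ("term", [("digit", [])]),
])
cov = k_paths(tree, 2)
# 2-paths: (expr, term), (expr, plus), (term, digit)
```

Input-set diversity can then be scored as the fraction of the grammar's possible k-paths that the generated inputs collectively cover.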
Token-Level Fuzzing
Fuzzing has become a commonly used approach to identifying bugs in complex,
real-world programs. However, interpreters are notoriously difficult to fuzz
effectively, as they expect highly structured inputs, which are rarely produced
by most fuzzing mutations. For this class of programs, grammar-based fuzzing
has been shown to be effective. Tools based on this approach can find bugs in
the code that is executed after parsing the interpreter inputs, by following
language-specific rules when generating and mutating test cases. Unfortunately,
grammar-based fuzzing is often unable to discover subtle bugs associated with
the parsing and handling of the language syntax. Additionally, if the grammar
provided to the fuzzer is incomplete, or does not match the implementation
completely, the fuzzer will fail to exercise important parts of the available
functionality. In this paper, we propose a new fuzzing technique, called
Token-Level Fuzzing. Instead of applying mutations either at the byte level or
at the grammar level, Token-Level Fuzzing applies mutations at the token level.
Evolutionary fuzzers can leverage this technique to both generate inputs that
are parsed successfully and generate inputs that do not conform strictly to the
grammar. As a result, the proposed approach can find bugs that neither
byte-level fuzzing nor grammar-based fuzzing can find. We evaluated Token-Level
Fuzzing by modifying AFL and fuzzing four popular JavaScript engines, finding
29 previously unknown bugs, several of which could not be found with
state-of-the-art byte-level and grammar-based fuzzers.
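The core operation, tokenizing the input and mutating whole tokens rather than individual bytes, can be sketched as follows. The tokenizer and token pool are simplified stand-ins for what the modified AFL would use:

```python
import re

# A crude lexer: words (identifiers/keywords/numbers) or single punctuation.
TOKEN = re.compile(r"\w+|[^\w\s]")

def tokenize(src):
    return TOKEN.findall(src)

def mutate_tokens(tokens, pos, pool, pick):
    """Replace the token at `pos` with one drawn from `pool`;
    `pick` stands in for the fuzzer's random choice."""
    out = list(tokens)
    out[pos] = pool[pick % len(pool)]
    return out

src = "var x = 1 ;"
tokens = tokenize(src)                         # ['var', 'x', '=', '1', ';']
pool = ["let", "const", "null", "typeof"]
mutated = " ".join(mutate_tokens(tokens, 0, pool, 2))
# "null x = 1 ;" -- every piece is a valid token, but the sequence need
# not conform to the grammar, which is exactly the point.
```

Mutants like this usually pass lexing, so they probe the parser and later stages, yet they are free to violate the grammar in ways a grammar-based fuzzer never would.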
Revisiting Neural Program Smoothing for Fuzzing
Testing with randomly generated inputs (fuzzing) has gained significant
traction due to its capacity to expose program vulnerabilities automatically.
Fuzz testing campaigns generate large amounts of data, making them ideal for
the application of machine learning (ML). Neural program smoothing (NPS), a
specific family of ML-guided fuzzers, aims to use a neural network as a smooth
approximation of the program target for new test case generation.
In this paper, we conduct the most extensive evaluation of NPS fuzzers
against standard gray-box fuzzers (>11 CPU years and >5.5 GPU years), and make
the following contributions: (1) We find that the original performance claims
for NPS fuzzers do not hold; a gap we relate to fundamental, implementation,
and experimental limitations of prior works. (2) We contribute the first
in-depth analysis of the contribution of machine learning and gradient-based
mutations in NPS. (3) We implement Neuzz++, which shows that addressing the
practical limitations of NPS fuzzers improves performance, but that standard
gray-box fuzzers almost always surpass NPS-based fuzzers. (4) As a consequence,
we propose new guidelines targeted at benchmarking fuzzing based on machine
learning, and present MLFuzz, a platform with GPU access for easy and
reproducible evaluation of ML-based fuzzers. Neuzz++, MLFuzz, and all our data
are public. Accepted as a conference paper at ESEC/FSE 2023.
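The NPS idea of a differentiable surrogate whose input gradients point at bytes worth mutating can be sketched with a single sigmoid in place of a trained network. The weights here are invented; real NPS fuzzers learn a full neural model over coverage bitmaps:

```python
import math

def surrogate(w, b, x):
    """A smooth stand-in for one program branch: a sigmoid over the
    input bytes, as NPS fuzzers approximate with a neural network."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 / (1 + math.exp(-z))

def hot_bytes(w, b, x, n=2):
    """Gradient of the surrogate w.r.t. each input byte; the largest
    magnitudes mark positions worth mutating."""
    y = surrogate(w, b, x)
    grad = [abs(wi * y * (1 - y)) for wi in w]   # d(sigmoid)/dx_i = w_i * y * (1 - y)
    ranked = sorted(range(len(x)), key=lambda i: -grad[i])
    return ranked[:n]

# Toy learned weights: this branch mostly depends on bytes 1 and 3,
# so those positions get the largest gradients.
w, b = [0.0, 0.9, 0.1, -0.8], 0.0
positions = hot_bytes(w, b, [10, 20, 30, 40])   # [1, 3]
```

The paper's evaluation suggests that in practice this gradient signal rarely beats the cheap mutation scheduling of standard gray-box fuzzers, which is precisely what the proposed benchmarking guidelines are meant to surface.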