SoK: Prudent Evaluation Practices for Fuzzing
Fuzzing has proven to be a highly effective approach to uncover software bugs over the past decade. After AFL popularized the groundbreaking concept of lightweight coverage feedback, the field of fuzzing has seen a vast amount of scientific work proposing new techniques, improving methodological aspects of existing strategies, or porting existing methods to new domains. All such work must demonstrate its merit by showing its applicability to a problem, measuring its performance, and often showing its superiority over existing works in a thorough, empirical evaluation. Yet, fuzzing is highly sensitive to its target, environment, and circumstances, e.g., randomness in the testing process. After all, relying on randomness is one of the core principles of fuzzing, governing many aspects of a fuzzer's behavior. Combined with an environment that is often difficult to control, this makes the reproducibility of experiments a crucial concern that requires a prudent evaluation setup. To address these threats to validity, several works, most notably Evaluating Fuzz Testing by Klees et al., have outlined how a carefully designed evaluation setup should be implemented, but it remains unknown to what extent their recommendations have been adopted in practice. In this work, we systematically analyze the evaluation of 150 fuzzing papers published at the top venues between 2018 and 2023. We study how existing guidelines are implemented and observe potential shortcomings and pitfalls. We find a surprising disregard of the existing guidelines regarding statistical tests and systematic errors in fuzzing evaluations. For example, when investigating reported bugs, we find that the search for vulnerabilities in real-world software leads to authors requesting and receiving CVEs of questionable quality. Extending our literature analysis to the practical domain, we attempt to reproduce claims of eight fuzzing papers. These case studies allow us to assess the practical reproducibility of fuzzing research and identify archetypal pitfalls in the evaluation design. Unfortunately, our reproduced results reveal several deficiencies in the studied papers, and we are unable to fully support and reproduce the respective claims. To help the field of fuzzing move toward a scientifically reproducible evaluation strategy, we propose updated guidelines for conducting a fuzzing evaluation that future work should follow.
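For concreteness, the kind of statistical evaluation these guidelines recommend can be sketched in a few lines. The following is a minimal illustration, not a procedure taken from the paper: the coverage numbers are hypothetical placeholders, and the Mann-Whitney U test combined with the Vargha-Delaney A12 effect size is one commonly recommended choice for comparing repeated fuzzing trials.

```python
# Compare two fuzzers over repeated trials with a Mann-Whitney U test and a
# Vargha-Delaney A12 effect size (hypothetical coverage values, 10 trials each).
from scipy.stats import mannwhitneyu

fuzzer_a = [10231, 10498, 10102, 10877, 10340, 10655, 10210, 10904, 10388, 10590]
fuzzer_b = [9870, 10011, 9933, 10150, 9801, 10098, 9950, 10200, 9899, 10042]

stat, p_value = mannwhitneyu(fuzzer_a, fuzzer_b, alternative="two-sided")

# A12: probability that a random trial of fuzzer A beats one of fuzzer B.
wins = sum(a > b for a in fuzzer_a for b in fuzzer_b)
ties = sum(a == b for a in fuzzer_a for b in fuzzer_b)
a12 = (wins + 0.5 * ties) / (len(fuzzer_a) * len(fuzzer_b))

print(f"Mann-Whitney U p-value: {p_value:.4f}, A12 effect size: {a12:.2f}")
```

Reporting both a significance test and an effect size over many repetitions is precisely the kind of practice the surveyed papers often omit.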
Investigating and Testing Performance Issues in Deep Learning Frameworks
Machine Learning (ML) and Deep Learning (DL) applications are becoming more popular due to the availability of DL frameworks such as PyTorch, Keras, and TensorFlow. Therefore, the quality of DL frameworks is essential to ensure DL/ML application quality. Given the computationally expensive nature of DL tasks (e.g., training), performance is a critical aspect of DL frameworks. However, optimizing DL frameworks poses its own unique challenges due to the peculiarities of DL (e.g., hardware integration and the nature of the computation).
In this thesis, we first aim to better understand performance bugs in DL frameworks by conducting an empirical study. We conduct our study on PyTorch and TensorFlow by mining and studying their performance and non-performance bug reports from their respective GitHub repositories. We find that 1) the proportion of newly reported performance bugs increases faster than that of fixed performance bugs, and the ratio of performance bugs among all bugs increases over time; 2) performance bugs take more time to fix, have larger fix sizes, and attract more community engagement (e.g., discussion) compared to non-performance bugs; and 3) by studying all performance bug fixes, we manually derive a taxonomy of 12 categories and 19 sub-categories of the root causes of performance bugs in DL frameworks.
We then aim to investigate the potential of differential testing as a viable technique to detect and prevent performance bugs in DL frameworks. To do so, we train and evaluate two state-of-the-art CNN and RNN architectures (i.e., the LeNet-5 architecture on the MNIST dataset and the LSTM architecture on the IMDB movie review dataset), using different DL frameworks (i.e., PyTorch, Keras, and TensorFlow) and different configurations (i.e., the training dataset sample size, the batch size, the number of epochs, the weight initialization technique, the data type, the hardware used, the learning rate, and the dropout rate). To assess the performance of the DL models, we use a variety of performance metrics (i.e., training/inference time, hardware (CPU or GPU) usage during training/inference, and memory (RAM or GPU VRAM) usage during training/inference). Then, we compare the performance of the DL models across the DL frameworks. We train and evaluate 21,870 LeNet-5 models and 21,870 LSTM models across the DL frameworks, for a grand total of 43,740 models; our experiments took over 42 days. We find that 1) differences in performance between DL frameworks, for the same task, may be indicative of a performance optimization opportunity or a performance bug; and 2) our approach remains viable when training and evaluating a smaller number of DL models, which makes it more accessible for developers.
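The differential-testing idea at the core of this work can be sketched as follows. This is a minimal, self-contained illustration under stated assumptions: the two train_* functions are hypothetical stand-ins for real framework-specific training runs (e.g., a LeNet-5 job in PyTorch vs. TensorFlow), and the 1.5x threshold is an arbitrary placeholder rather than a value from the thesis.

```python
# Time the "same" training job under two frameworks and flag large deviations
# as candidate performance bugs / optimization opportunities.
import time
import tracemalloc

def train_framework_a():  # hypothetical stand-in for a PyTorch training run
    return sum(i * i for i in range(2_000_000))

def train_framework_b():  # hypothetical stand-in for a TensorFlow training run
    return sum(i * i for i in range(2_500_000))

def measure(train_fn):
    """Return (wall-clock seconds, peak traced memory in bytes) for one run."""
    tracemalloc.start()
    start = time.perf_counter()
    train_fn()
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return elapsed, peak

time_a, _ = measure(train_framework_a)
time_b, _ = measure(train_framework_b)

if max(time_a, time_b) > 1.5 * min(time_a, time_b):  # arbitrary threshold
    print(f"candidate performance gap: {time_a:.2f}s vs {time_b:.2f}s")
```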
Finally, we outline potential avenues for future work on further studying performance bugs in DL frameworks.
Evaluation Methodologies in Software Protection Research
Man-at-the-end (MATE) attackers have full control over the system on which
the attacked software runs, and try to break the confidentiality or integrity
of assets embedded in the software. Both companies and malware authors want to
prevent such attacks. This has driven an arms race between attackers and
defenders, resulting in a plethora of different protection and analysis
methods. However, it remains difficult to measure the strength of protections
because MATE attackers can reach their goals in many different ways and a
universally accepted evaluation methodology does not exist. This survey
systematically reviews the evaluation methodologies of papers on obfuscation, a
major class of protections against MATE attacks. For 572 papers, we collected
113 aspects of their evaluation methodologies, ranging from sample set types
and sizes, through sample treatment, to the measurements performed. We provide
detailed insights into how the academic state of the art evaluates both the
protections and analyses thereon. In summary, there is a clear need for better
evaluation methodologies. We identify nine challenges for software protection
evaluations, which represent threats to the validity, reproducibility, and
interpretation of research results in the context of MATE attacks.
Guiding Quality Assurance Through Context-Aware Learning
Software testing is a quality control activity that, in addition to finding flaws or bugs, provides confidence in the software's correctness. The quality of the developed software depends on the strength of its test suite. Mutation testing has been shown to effectively guide the improvement of a test suite's strength. Mutation is a test adequacy criterion in which test requirements are represented by mutants. Mutants are slight syntactic modifications of the original program that aim to introduce semantic deviations (from the original program), requiring testers to design tests that kill these mutants, i.e., that distinguish the observable behavior of a mutant from that of the original program. This process of designing tests to kill a mutant is performed iteratively for the entire mutant set, which augments the test suite and hence improves its strength.
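As a toy illustration of these definitions (not an example drawn from the dissertation), consider a relational-operator mutant and the boundary input that kills it:

```python
def price_original(qty, unit):
    # bulk discount applies from 10 items onward
    total = qty * unit
    return total * 0.9 if qty >= 10 else total

def price_mutant(qty, unit):
    # relational-operator mutant: `>=` was changed to `>`
    total = qty * unit
    return total * 0.9 if qty > 10 else total

# qty == 10 is the boundary input on which the behaviors diverge, so a test
# asserting price(10, 1.0) == 9.0 kills this mutant; a suite without such a
# boundary test would let the mutant survive.
assert price_original(10, 1.0) == 9.0
assert price_mutant(10, 1.0) == 10.0
```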
Although mutation testing is empirically validated, a key issue is that its application is expensive due to the large number of low-utility mutants that it introduces. Some mutants cannot even be killed, as they are functionally equivalent to the original program. To reduce the application cost, it is imperative to limit the number of mutants to those that are actually useful. Since identifying such mutants requires manual analysis and test executions, an effective solution to this problem has been lacking. Hence, it remains unclear how to mutate and test code efficiently.
On the other hand, with the advancement of deep learning, several recent works in the literature have focused on applying it to source code to automate many non-trivial tasks, including bug fixing, producing code comments, code completion, and program repair. The increasing utilization of deep learning is due to a combination of factors. The first is the vast availability of data to learn from, specifically source code in open-source repositories. The second is the availability of inexpensive hardware able to efficiently run deep learning infrastructures. The third, and most compelling, is its ability to automatically learn the categorization of data by learning the code context through its hidden-layer architecture, making it especially proficient at identifying features. Thus, we explore the possibility of employing deep learning to identify only useful mutants, in order to achieve a good trade-off between the invested effort and test effectiveness.
Hence, as our first contribution, this dissertation proposes Cerebro, a deep learning approach to statically select subsuming mutants based on the mutants’ surrounding code context. As subsuming mutants reside at the top of the subsumption hierarchy, test cases designed to only kill this minimal subset of mutants kill all the remaining mutants. Our evaluation of Cerebro demonstrates that it preserves the mutation testing benefits while limiting the application cost, i.e., reducing all cost factors such as equivalent mutants, mutant executions, and the mutants requiring analysis.
Apart from improving test suite strength, mutation testing has proven useful in inferring software specifications. Software specifications aim at describing the software's intended behavior and can be used to distinguish correct from incorrect software behaviors. Specification inference techniques aim at inferring assertions by generating and filtering candidate assertions through dynamic test executions and mutation testing. Due to the large number of mutants introduced during mutation testing, such techniques are also computationally expensive, hence establishing a need for selecting the mutants that fit best for assertion inference. We refer to such mutants as Assertion Inferring Mutants. In our analysis, we find that assertion inferring mutants are significantly different from subsuming mutants. Thus, we explored the employability of deep learning to identify Assertion Inferring Mutants. Hence, as our second contribution, this dissertation proposes Seeker, a deep learning approach to statically select Assertion Inferring Mutants. Our evaluation demonstrates that Seeker enables an assertion inference capability comparable to the full mutation analysis while significantly limiting the execution cost.
In addition to testing software in general, a few works in the literature attempt to employ mutation testing to tackle security-related issues, due to the fault-based nature of the technique. These works propose mutation operators to convert non-vulnerable code into vulnerable code by mimicking common security bugs. However, these pattern-based approaches have two major limitations. Firstly, the design of security-specific mutation operators is not trivial: it requires manual analysis and comprehension of the vulnerability classes. Secondly, these mutation operators can alter the program semantics in a manner that is not convincing for developers and is perceived as unrealistic, thereby hindering the usability of the method. On the other hand, with the release of powerful language models trained on large code corpora, e.g., CodeBERT, a new family of mutation testing tools has arisen with the promise of generating natural mutants. We study the extent to which the mutants produced by language models can semantically mimic the behavior of vulnerabilities, i.e., Vulnerability-mimicking Mutants. Test cases designed to kill these mutants will also tackle the mimicked vulnerabilities. In our analysis, we found that only a very small subset of mutants is vulnerability-mimicking; however, this set mimics more than half of the vulnerabilities in our dataset. Due to the absence of any defined features to identify vulnerability-mimicking mutants, as our third contribution, this dissertation introduces Mystique, a deep learning approach that automatically extracts features to identify vulnerability-mimicking mutants. Despite their scarcity, Mystique predicts vulnerability-mimicking mutants with high prediction performance, demonstrating that their features can be automatically learned by deep learning models to statically predict these mutants without the need to invest any effort in defining features.
Since our vulnerability-mimicking mutants cannot mimic all the vulnerabilities, we perceive that these mutants are not a complete representation of all the vulnerabilities and that there exists a need for actual vulnerability prediction approaches. Although many such approaches exist in the literature, their performance is limited by a few factors. Firstly, vulnerabilities are fewer in number than software bugs, limiting the information one can learn from, which affects prediction performance. Secondly, the existing approaches learn from both vulnerable and supposedly non-vulnerable components. This introduces unavoidable noise into the training data, i.e., components with no reported vulnerability are considered non-vulnerable during training, and hence results in existing approaches performing poorly. We employed deep learning to automatically capture features related to vulnerabilities and explored whether we can avoid learning on supposedly non-vulnerable components. Hence, as our final contribution, this dissertation proposes TROVON, a deep learning approach that learns only on components known to be vulnerable, thereby making no assumptions and bypassing the key problem faced by previous techniques. Our comparison of TROVON with existing techniques on security-critical open-source systems with historical vulnerabilities reported in the National Vulnerability Database (NVD) demonstrates that its prediction capability significantly outperforms the existing techniques.
SplITS: Split Input-to-State Mapping for Effective Firmware Fuzzing
The ability to test firmware on embedded devices is critical to discovering
vulnerabilities prior to their adversarial exploitation. State-of-the-art
automated testing methods rehost firmware in emulators and attempt to
facilitate inputs from a diversity of methods (interrupt-driven, status
polling) and a plethora of devices (such as modems and GPS units). Despite
recent progress to tackle peripheral input generation challenges in rehosting,
a firmware's expectation of multi-byte magic values supplied from peripheral
inputs for string operations still poses a significant roadblock. We solve the
impediment posed by multi-byte magic strings in monolithic firmware. We propose
feedback mechanisms for input-to-state mapping and retaining seeds for targeted
replacement mutations with an efficient method to solve multi-byte comparisons.
The feedback allows an efficient search over a combinatorial solution-space. We
evaluate our prototype implementation, SplITS, with a diverse set of 21
real-world monolithic firmware binaries used in prior works, and 3 new binaries
from popular open source projects. Compared to the state of the art, SplITS
automatically solves 497% more multi-byte magic strings guarding further
execution, uncovering new code and bugs. In 11 of the 12 real-world firmware
binaries with string comparisons, including those extensively analyzed by prior
works, SplITS achieved statistically significant improvements. We observed up
to a 161% increase in blocks covered and discovered 6 new bugs that remained
guarded by string comparisons. Notably, deep and difficult-to-reproduce bugs
guarded by comparisons, identified in prior work, were found consistently. To
facilitate future research in the field, we release SplITS, the new firmware
data sets, and bug analysis at https://github.com/SplITS-Fuzzer
Comment: Accepted ESORICS 202
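The replacement-mutation idea behind input-to-state mapping can be sketched as follows. This is an illustrative simplification in the spirit of SplITS-like techniques, not its actual implementation: the observed_comparisons hook is hypothetical, standing in for the instrumentation a rehosted firmware would provide.

```python
def observed_comparisons(input_bytes):
    """Hypothetical hook: report (observed, expected) operand pairs from
    string comparisons executed while processing this input."""
    return [(input_bytes[0:4], b"AT+C")]  # e.g., a strcmp against "AT+C"

def replacement_mutations(input_bytes):
    """Yield mutants where an observed operand found in the input is replaced
    by the expected operand, directly solving the magic-string check."""
    for observed, expected in observed_comparisons(input_bytes):
        pos = input_bytes.find(observed)
        while pos != -1:
            yield input_bytes[:pos] + expected + input_bytes[pos + len(observed):]
            pos = input_bytes.find(observed, pos + 1)

seed = b"XXXX rest-of-peripheral-input"
print(list(replacement_mutations(seed)))  # [b'AT+C rest-of-peripheral-input']
```

Mapping input bytes to the comparison operands they reach keeps the search over candidate replacements tractable rather than combinatorial.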
SHAPFUZZ: Efficient Fuzzing via Shapley-Guided Byte Selection
Mutation-based fuzzing is popular and effective in discovering unseen code
and exposing bugs. However, only a few studies have concentrated on quantifying
the importance of input bytes, which refers to the degree to which a byte
contributes to the discovery of new code. They often focus on obtaining the
relationship between input bytes and path constraints, ignoring the fact that
not all constraint-related bytes can discover new code. In this paper, we
conduct Shapley analysis to understand the effect of byte positions on fuzzing
performance, and find that some byte positions contribute more than others and
this property often holds across seeds. Based on this observation, we propose a
novel fuzzing solution, ShapFuzz, to guide byte selection and mutation.
Specifically, ShapFuzz updates Shapley values (importance) of bytes when each
input is tested during fuzzing, at low overhead, and utilizes a contextual
multi-armed bandit to trade off between mutating high-Shapley-value bytes and
infrequently chosen bytes. We implement a prototype of this solution, ShapFuzz,
on top of AFL++. We evaluate ShapFuzz against ten state-of-the-art
fuzzers, including five byte schedule-reinforced fuzzers and five commonly used
fuzzers. Compared with byte schedule-reinforced fuzzers, ShapFuzz discovers
more edges and exposes more bugs than the best baseline on three different sets
of initial seeds. Compared with commonly used fuzzers, ShapFuzz exposes 20 more
bugs than the best comparison fuzzer, and discovers 6 more CVEs than the best
baseline on MAGMA. Furthermore, ShapFuzz discovers 11 new bugs on the latest
versions of programs, and 3 of them are confirmed by vendors.
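The byte-scheduling loop can be pictured with the following sketch. It is deliberately simplified and not ShapFuzz's actual update rule or bandit: new coverage is credited uniformly to the mutated positions as a crude stand-in for an incremental Shapley estimate, and the explore/exploit split stands in for the contextual multi-armed bandit.

```python
import random

class ByteScheduler:
    def __init__(self, length):
        self.value = [0.0] * length  # running importance score per byte position
        self.picks = [0] * length    # how often each position was mutated

    def update(self, mutated_positions, new_edges):
        # Credit newly found edges equally to the positions mutated in this
        # input (a crude stand-in for a Shapley value update).
        share = new_edges / max(len(mutated_positions), 1)
        for p in mutated_positions:
            self.value[p] += share

    def select(self, k=4, explore=0.2):
        # Exploit high-value bytes most of the time, occasionally explore
        # rarely chosen positions (a stand-in for the contextual bandit).
        if random.random() < explore:
            ranked = sorted(range(len(self.picks)), key=lambda p: self.picks[p])
        else:
            ranked = sorted(range(len(self.value)), key=lambda p: -self.value[p])
        chosen = ranked[:k]
        for p in chosen:
            self.picks[p] += 1
        return chosen

sched = ByteScheduler(length=64)
positions = sched.select()            # byte positions to mutate next
sched.update(positions, new_edges=3)  # pretend the mutant found 3 new edges
```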
Learning to Represent Patches
Patch representation is crucial in automating various software engineering
tasks, like determining patch accuracy or summarizing code changes. While
recent research has employed deep learning for patch representation, focusing
on token sequences or Abstract Syntax Trees (ASTs), these approaches often miss
the change's semantic intent and the context of modified lines. To bridge this
gap,
we introduce a novel method, Patcherizer. It delves into the intentions of
context and structure, merging the surrounding code context with two innovative
representations. These capture the intention in code changes and the intention
in AST structural modifications pre- and post-patch. This holistic
representation aptly captures a patch's underlying intentions. Patcherizer
employs graph convolutional neural networks for structural intention graph
representation and transformers for intention sequence representation. We
evaluated the versatility of Patcherizer's embeddings in three areas: (1) Patch
description generation, (2) Patch accuracy prediction, and (3) Patch intention
identification. Our experiments demonstrate the representation's efficacy
across all tasks, outperforming state-of-the-art methods. For example, in patch
description generation, Patcherizer excels, showing an average boost of 19.39%
in BLEU, 8.71% in ROUGE-L, and 34.03% in METEOR scores.
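The two-encoder design can be pictured with a minimal PyTorch sketch. All shapes, layer counts, and the pooling/fusion choices below are hypothetical simplifications, not Patcherizer's published architecture: a transformer encodes the token-level change intention, a single GCN-style layer encodes the AST-level structural intention, and the two are fused into one patch embedding.

```python
import torch
import torch.nn as nn

class PatchEncoder(nn.Module):
    def __init__(self, vocab=5000, dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.seq_enc = nn.TransformerEncoder(layer, num_layers=2)  # sequence intention
        self.gcn_w = nn.Linear(dim, dim)                           # graph intention
        self.out = nn.Linear(2 * dim, dim)

    def forward(self, tokens, node_feats, adj):
        # tokens: (B, T) patch token ids; node_feats: (B, N, dim) AST node
        # features; adj: (B, N, N) row-normalized adjacency of the AST diff.
        seq = self.seq_enc(self.embed(tokens)).mean(dim=1)            # (B, dim)
        graph = torch.relu(adj @ self.gcn_w(node_feats)).mean(dim=1)  # (B, dim)
        return self.out(torch.cat([seq, graph], dim=-1))  # fused patch embedding

enc = PatchEncoder()
emb = enc(torch.randint(0, 5000, (2, 16)),
          torch.randn(2, 8, 128),
          torch.softmax(torch.randn(2, 8, 8), dim=-1))
print(emb.shape)  # torch.Size([2, 128])
```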
Qualitative Analysis for Validating IEC 62443-4-2 Requirements in DevSecOps
Validation of conformance to cybersecurity standards for industrial
automation and control systems is an expensive and time-consuming process which
can delay the time to market. It is therefore crucial to introduce conformance
validation stages into the continuous integration/continuous delivery pipeline
of products. However, designing such conformance validation in an automated
fashion is a highly non-trivial task that requires expert knowledge and depends
upon the available security tools, ease of integration into the DevOps
pipeline, as well as support for IT and OT interfaces and protocols.
This paper addresses the aforementioned problem, focusing on the automated
validation of ISA/IEC 62443-4-2 standard component requirements. We present an
extensive qualitative analysis of the standard requirements and the current
tooling landscape to perform validation. Our analysis demonstrates the coverage
established by the currently available tools and sheds light on current gaps to
achieve full automation and coverage. Furthermore, for every component
requirement, we indicate the CI/CD pipeline stage at which it is recommended to
be tested and the tools to do so.
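Such a requirement-to-stage mapping could be encoded as data, as in the sketch below. The CR identifiers name real IEC 62443-4-2 component requirements, but the stage and tool assignments are invented placeholders for illustration, not the mapping derived in the paper.

```python
# Hypothetical mapping: which CI/CD stage validates which component
# requirement, and with which class of tool.
PIPELINE_MAPPING = {
    "CR 1.1 (human user identification and authentication)": {
        "stage": "integration-test",
        "tools": ["authentication test harness"],
    },
    "CR 3.3 (security functionality verification)": {
        "stage": "system-test",
        "tools": ["vulnerability scanner", "protocol fuzzer"],
    },
    "CR 7.7 (least functionality)": {
        "stage": "build",
        "tools": ["SBOM / configuration auditor"],
    },
}

for req, plan in PIPELINE_MAPPING.items():
    print(f"{req}: validate at '{plan['stage']}' using {', '.join(plan['tools'])}")
```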
Pre-deployment Analysis of Smart Contracts -- A Survey
Smart contracts are programs that execute transactions involving independent
parties and cryptocurrencies. As programs, smart contracts are susceptible to a
wide range of errors and vulnerabilities. Such vulnerabilities can result in
significant losses. Furthermore, by design, smart contract transactions are
irreversible. This creates a need for methods to ensure the correctness and
security of contracts pre-deployment. Recently, there has been substantial
research into such methods. The sheer volume of this research makes
articulating the state of the art a substantial undertaking. To address this
challenge, we present a systematic review of the literature. A key feature of
our presentation is to factor out the relationship between vulnerabilities and
methods through properties. Specifically, we enumerate and classify smart
contract vulnerabilities and methods by the properties they address. The
methods considered include static analysis as well as dynamic analysis methods
and machine learning algorithms that analyze smart contracts before deployment.
Several patterns about the strengths of different methods emerge through this
classification process.
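To make the method class concrete, here is a deliberately crude sketch of one pre-deployment static check of the kind the survey classifies: a line-based reentrancy heuristic that flags state writes occurring after an external call. Real tools operate on the AST or bytecode; this regex scan is only illustrative.

```python
import re

SOLIDITY_SNIPPET = """
function withdraw(uint amount) public {
    require(balances[msg.sender] >= amount);
    (bool ok, ) = msg.sender.call{value: amount}("");
    require(ok);
    balances[msg.sender] -= amount;
}
"""

def flag_reentrancy(source):
    """Yield (line number, line) for state writes after an external call."""
    call_seen = False
    for lineno, line in enumerate(source.splitlines(), 1):
        if re.search(r"\.call\{", line):
            call_seen = True
        elif call_seen and re.search(r"\w+\[[^\]]+\]\s*[-+]?=", line):
            yield lineno, line.strip()

for lineno, line in flag_reentrancy(SOLIDITY_SNIPPET):
    print(f"line {lineno}: state write after external call: {line}")
```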
Analyzing the Unanalyzable: an Application to Android Apps
In general, software is unreliable. Its behavior can deviate from users’ expectations because of bugs, vulnerabilities, or even malicious code. Manually vetting software is a challenging, tedious, and highly costly task that does not scale. To alleviate excessive costs and analysts’ burdens, automated static analysis techniques have been proposed by both the research and practitioner communities, making static analysis a central topic in software engineering. In the meantime, mobile apps have considerably grown in importance. Today, most humans carry software in their pockets, with the Android operating system leading the market. Millions of apps have been proposed to the public so far, targeting a wide range of activities such as games, health, banking, GPS, etc. Hence, Android apps collect and manipulate a considerable amount of sensitive information, which puts users’ security and privacy at risk. Consequently, it is paramount to ensure that apps distributed through public channels (e.g., Google Play) are free from malicious code. Hence, the research and practitioner communities have put much effort into devising new automated techniques to vet Android apps against malicious activities over the last decade.
Analyzing Android apps is, however, challenging. On the one hand, the Android framework offers constructs that can be used to evade dynamic analysis by triggering the malicious code only under certain circumstances, e.g., if the device is not an emulator and is currently connected to power. Hence, dynamic analyses can easily be fooled by malicious developers who make some code fragments difficult to reach. On the other hand, static analyses are challenged by Android-specific constructs that limit the coverage of off-the-shelf static analyzers. The research community has already addressed some of these constructs, including inter-component communication and lifecycle methods. However, other constructs, such as implicit calls (i.e., when the Android framework asynchronously triggers a method in the app code), make some app code fragments unreachable to static analyzers, even though these fragments are executed when the app runs. Altogether, many parts of an app’s code are unanalyzable: they are either not reachable by dynamic analyses or not covered by static analyzers.
In this manuscript, we describe our contributions to the research effort from two angles: ① statically detecting malicious code that is difficult for dynamic analyzers to access because it is triggered only under specific circumstances; and ② statically analyzing code not accessible to existing static analyzers to improve the comprehensiveness of app analyses. More precisely, in Part I, we first present a replication study of a state-of-the-art static logic bomb detector to better show its limitations. We then introduce a novel hybrid approach for detecting suspicious hidden sensitive operations towards triaging logic bombs. We finally detail the construction of a dataset of Android apps automatically infected with logic bombs. In Part II, we present our work to improve the comprehensiveness of Android apps’ static analysis. More specifically, we first show how we contributed to accounting for atypical inter-component communication in Android apps. Then, we present a novel approach to unify both the bytecode and native code in Android apps to account for the multi-language trend in app development. Finally, we present our work to resolve conditional implicit calls in Android apps to improve static and dynamic analyzers.
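As a toy illustration of the first angle (not the detectors built in the manuscript), a trigger-hunting analysis can start from nothing more than pattern matching over decompiled code for environment checks that dynamic analysis would rarely satisfy:

```python
import re

DECOMPILED = """
if (!Build.FINGERPRINT.startsWith("generic")
        && batteryManager.isCharging()) {
    exfiltrateContacts();
}
"""

# Conditions commonly used to gate hidden behavior (illustrative list only).
TRIGGER_PATTERNS = [
    r"Build\.FINGERPRINT",   # emulator-detection check
    r"isCharging\(\)",       # connected-to-power check
    r"getSimOperator\(\)",   # SIM / geography check
]

hits = [p for p in TRIGGER_PATTERNS if re.search(p, DECOMPILED)]
if hits:
    print("suspicious trigger conditions:", hits)
```

Real detectors reason over control and data flow rather than raw text, which is exactly why the constructs discussed above make comprehensive static coverage hard.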