Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation
Program synthesis has long been studied, with recent approaches focusing on directly using the power of Large Language Models (LLMs) to generate code according to user intent written in natural language. Code evaluation datasets, containing curated synthesis problems with input/output test-cases, are used to measure the performance of various LLMs on code synthesis. However, the test-cases in these datasets can be limited in both quantity and quality for fully assessing the functional correctness of the generated code. This limitation in existing benchmarks begs the question: in the era of LLMs, is the generated code really correct? To answer this, we propose EvalPlus, a code synthesis benchmarking framework to rigorously evaluate the functional correctness of LLM-synthesized code. In short, EvalPlus takes in a base evaluation dataset and uses an automatic input generation step to produce and diversify large amounts of new test inputs, using both LLM-based and mutation-based input generators, to further validate the synthesized code. We extend the popular HUMANEVAL benchmark and build HUMANEVAL+ with 81x newly generated tests. Our extensive evaluation across 14 popular LLMs demonstrates that HUMANEVAL+ catches significant amounts of previously undetected wrong code synthesized by LLMs, reducing pass@k by 15.1% on average. Moreover, we even found several incorrect ground-truth implementations in HUMANEVAL. Our work not only shows that prior popular code synthesis evaluation results do not accurately reflect the true performance of LLMs for code synthesis, but also opens up a new direction for improving programming benchmarks through automated test input generation.
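To make the core loop concrete, here is a minimal Python sketch, not EvalPlus's actual implementation: seed inputs are mutated, each new input is run through the ground-truth solution and the LLM-generated candidate, and any output mismatch flags the candidate as wrong. The mutation rules, the single-argument calling convention, and the function names are illustrative assumptions.

    import random

    def mutate(x):
        """Illustrative type-aware mutations on a single test input."""
        if isinstance(x, int):
            return x + random.choice([-1, 1, random.randint(-100, 100)])
        if isinstance(x, list):
            y = list(x)
            if y and random.random() < 0.5:
                y.pop(random.randrange(len(y)))          # drop an element
            else:
                y.insert(random.randrange(len(y) + 1),   # insert an element
                         random.randint(-10, 10))
            return y
        if isinstance(x, str):
            i = random.randrange(len(x) + 1)
            return x[:i] + random.choice("abcxyz") + x[i:]
        return x

    def differential_test(ground_truth, candidate, seeds, budget=1000):
        """Mutate seed inputs; compare the candidate against the oracle."""
        failures, pool = [], list(seeds)
        for _ in range(budget):
            inp = mutate(random.choice(pool))
            pool.append(inp)
            try:
                expected = ground_truth(inp)
            except Exception:
                continue  # skip inputs the ground truth itself rejects
            try:
                assert candidate(inp) == expected
            except Exception:
                failures.append(inp)  # wrong output or crash
        return failures

A call such as differential_test(sorted, llm_sorted, [[3, 1, 2]]), where llm_sorted stands in for a synthesized solution, would surface inputs on which the candidate diverges from the oracle.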
Applying Formal Methods to Networking: Theory, Techniques and Applications
Despite its great importance, modern network infrastructure is remarkable for the lack of rigor in its engineering. The Internet, which began as a research experiment, was never designed to handle the users and applications it hosts today. The lack of formalization of the Internet architecture meant limited abstractions and modularity, especially for the control and management planes, so every new need required a new protocol built from scratch. This led to an unwieldy, ossified Internet architecture resistant to any attempt at formal verification, and an Internet culture where expediency and pragmatism are favored over formal correctness. Fortunately, recent work in the space of clean-slate Internet design, especially the software-defined networking (SDN) paradigm, offers the Internet community another chance to develop the right kind of architecture and abstractions. This has also led to a great resurgence of interest in applying formal methods to the specification, verification, and synthesis of networking protocols and applications. In this paper, we present a self-contained tutorial of the formidable amount of work that has been done in formal methods, and present a survey of its applications to networking.
Comment: 30 pages, submitted to IEEE Communications Surveys and Tutorials
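As a toy illustration of the kind of data-plane check such methods enable (our own example, not one from the survey), the sketch below models per-switch forwarding tables for a fixed destination as a graph and checks two classic properties, reachability and loop freedom, by simple traversal. The topology and the "edge" egress marker are assumptions.

    # Forwarding state: switch -> next hop for one destination prefix.
    fwd = {"s1": "s2", "s2": "s3", "s3": "edge"}  # "edge" = egress

    def trace(start, fwd):
        """Follow next hops; return (path, verdict)."""
        path, seen, node = [start], {start}, start
        while node in fwd:
            node = fwd[node]
            if node in seen:
                return path + [node], "forwarding loop"
            path.append(node)
            seen.add(node)
        return path, "delivered" if node == "edge" else "black hole"

    print(trace("s1", fwd))  # (['s1', 's2', 's3', 'edge'], 'delivered')

Production verifiers reason over symbolic packet headers rather than one concrete destination, but the properties being checked are the same.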
Computer Aided Verification
This open access two-volume set LNCS 13371 and 13372 constitutes the refereed proceedings of the 34th International Conference on Computer Aided Verification, CAV 2022, which was held in Haifa, Israel, in August 2022. The 40 full papers presented together with 9 tool papers and 2 case studies were carefully reviewed and selected from 209 submissions. The papers were organized in the following topical sections: Part I: invited papers; formal methods for probabilistic programs; formal methods for neural networks; software verification and model checking; hyperproperties and security; formal methods for hardware, cyber-physical, and hybrid systems. Part II: probabilistic techniques; automata and logic; deductive verification and decision procedures; machine learning; synthesis and concurrency. This is an open access book.
Tools and Algorithms for the Construction and Analysis of Systems
This open access book constitutes the proceedings of the 28th International Conference on Tools and Algorithms for the Construction and Analysis of Systems, TACAS 2022, which was held during April 2-7, 2022, in Munich, Germany, as part of the European Joint Conferences on Theory and Practice of Software, ETAPS 2022. The 46 full papers and 4 short papers presented in this volume were carefully reviewed and selected from 159 submissions. The proceedings also contain 16 tool papers of the affiliated competition SV-COMP and 1 paper consisting of the competition report. TACAS is a forum for researchers, developers, and users interested in rigorously based tools and algorithms for the construction and analysis of systems. The conference aims to bridge the gaps between different communities with this common interest and to support them in their quest to improve the utility, reliability, flexibility, and efficiency of tools and algorithms for building computer-controlled systems.
Selective Dynamic Analysis of Virtualized Whole-System Guest Environments
Dynamic binary analysis is a prevalent and indispensable technique in program analysis. While several dynamic binary analysis tools and frameworks have been proposed, all suffer from one or more of the following: prohibitive performance degradation, a semantic gap between the analysis code and the execution under analysis, architecture/OS specificity, user-mode-only operation, and a lack of flexibility and extensibility. This dissertation describes the design of the Dynamic Executable Code Analysis Framework (DECAF), a virtual machine-based, multi-target, whole-system dynamic binary analysis framework. In short, DECAF seeks to address the shortcomings of existing whole-system dynamic analysis tools and extend the state of the art by utilizing a combination of novel techniques to provide rich analysis functionality without crippling amounts of execution overhead. DECAF extends the mature QEMU whole-system emulator, a type-2 hypervisor capable of emulating every instruction that executes within a complete guest system environment.
DECAF provides a novel, hardware event-based method of just-in-time virtual machine introspection (VMI) to address the semantic gap problem. It also implements a novel instruction-level taint tracking engine at a bitwise level of granularity, ensuring that taint propagation is sound and highly precise throughout the guest environment. A formal analysis of the taint propagation rules is provided to verify that most instructions introduce neither false positives nor false negatives. DECAF's design also provides a plugin architecture with a simple-to-use, event-driven programming interface that makes it both flexible and extensible for a variety of analysis tasks.
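As a rough illustration of what bit-level precision buys (a simplified model, not DECAF's actual engine), the sketch below propagates a per-bit taint mask through a few ALU operations. Note how ANDing against a concretely zero, untainted bit clears taint on that bit, a distinction that byte-level tracking would miss. The rule formulations here are standard bitwise taint semantics, stated as an assumption rather than as DECAF's exact rules.

    def taint_mov(t_src):
        return t_src  # a copy propagates taint bits unchanged

    def taint_and(a, t_a, b, t_b):
        """dst = a & b. A result bit stays tainted only if it can still
        vary: an untainted, concretely-zero operand bit forces it to 0."""
        return (t_a & (b | t_b)) | (t_b & (a | t_a))

    def taint_xor(t_a, t_b):
        return t_a | t_b  # any tainted operand bit can flip the result bit

    # b = 0x0F is untainted, so only the low nibble of a's taint survives.
    t = taint_and(a=0xFF, t_a=0xFF, b=0x0F, t_b=0x00)
    assert t == 0x0F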
The implementation of DECAF consists of 9550 lines of C++ code and 10270 lines of C code. Its performance is evaluated using the SPEC CPU2006 benchmarks, which show an average overhead of 605% for system-wide tainting and 12% for VMI. Three platform-neutral DECAF plugins (Instruction Tracer, Keylogger Detector, and API Tracer) are described and evaluated in this dissertation to demonstrate the ease of use and effectiveness of DECAF in writing cross-platform and system-wide analysis tools.
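To convey the shape of such an event-driven plugin interface (an illustrative Python analogue; DECAF's real API is a C interface and differs in its specifics), a Keylogger-Detector-style plugin might register one callback to taint keystrokes at their hardware source and another to flag tainted memory writes:

    class AnalysisFramework:
        """Minimal event bus standing in for the hypervisor's callbacks."""
        def __init__(self):
            self.handlers = {}
        def register(self, event, fn):
            self.handlers.setdefault(event, []).append(fn)
        def emit(self, event, **kw):
            for fn in self.handlers.get(event, []):
                fn(**kw)

    framework = AnalysisFramework()

    def on_keystroke(scancode):
        print(f"tainting keystroke scancode {scancode:#x}")

    def on_mem_write(addr, tainted):
        if tainted:
            print(f"tainted data written to {addr:#x} -> possible keylogger")

    framework.register("keystroke", on_keystroke)
    framework.register("mem_write", on_mem_write)
    framework.emit("keystroke", scancode=0x1e)
    framework.emit("mem_write", addr=0x1000, tainted=True)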
This dissertation also presents the Virtual Device Fuzzer (VDF), a scalable fuzz testing framework for discovering bugs within the virtual devices implemented as part of QEMU. Such bugs could be used by malicious software executing within a guest under analysis by DECAF, so the discovery, reproduction, and diagnosis of such bugs helps to protect DECAF against attack while improving QEMU and any analysis platforms built upon QEMU. VDF uses selective instrumentation to perform targeted fuzz testing, which explores only the branches of execution belonging to virtual devices under analysis. By leveraging record and replay of memory-mapped I/O activity, VDF quickly cycles virtual devices through an arbitrarily large number of states without requiring a guest OS to be booted or present. Once a test case is discovered that triggers a bug, VDF reduces the test case to the minimum number of reads/writes required to trigger the bug and generates source code suitable for reproducing the bug during debugging and analysis.
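The reduction step can be pictured as a greedy one-at-a-time minimization over the recorded I/O trace (an illustrative sketch with a stand-in triggers_bug oracle; VDF's actual algorithm may differ): drop each recorded operation in turn and keep the shorter sequence whenever the bug still reproduces.

    def minimize(ops, triggers_bug):
        """Greedily remove recorded MMIO ops while the bug still triggers.

        ops: list of (kind, offset, value) records, e.g. ("write", 0x4, 0xFF).
        triggers_bug: replays a sequence against the virtual device and
        returns True if it still crashes or hangs (stand-in oracle here).
        """
        assert triggers_bug(ops)
        changed = True
        while changed:
            changed = False
            i = 0
            while i < len(ops):
                trial = ops[:i] + ops[i + 1:]
                if triggers_bug(trial):
                    ops, changed = trial, True  # op i was irrelevant; drop it
                else:
                    i += 1                      # op i is needed; keep it
        return ops

    # Toy oracle: the device wedges only if both "unlock" writes appear.
    needed = {("write", 0x0, 0xAA), ("write", 0x4, 0x55)}
    oracle = lambda seq: needed <= set(seq)
    recorded = [("write", 0x0, 0xAA), ("read", 0x8, 0),
                ("write", 0x4, 0x55), ("write", 0xC, 0x01)]
    print(minimize(recorded, oracle))  # -> only the two essential writes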
VDF is evaluated by fuzz testing eighteen QEMU virtual devices, generating 1014 crash or hang test cases that reveal bugs in six of the tested devices. Over 80% of the crashes and hangs were discovered within the first day of testing. VDF covered an average of 62.32% of virtual device branches during testing, and the average test case was minimized to 18.57% of its original size.
Contract-driven data structure repair: a novel approach for error recovery
Software systems are now pervasive throughout our world. The reliability of these systems is an urgent necessity. A large share of research effort on increasing software reliability is dedicated to requirements, architecture, design, implementation, and testing, activities that are performed before system deployment. While such approaches have become substantially more advanced, software remains buggy and failures remain expensive. We take a radically different approach to reliability: contract-driven data structure repair for runtime error recovery, in which erroneous executions of deployed software are corrected on the fly using rich behavioral contracts. Our key insight is to transform the software contract, which gives a high-level description of the expected behavior, into an efficient implementation that repairs the erroneous data structures in the program state upon an error. To improve the efficiency, scalability, and effectiveness of repair, in addition to rich behavioral contracts, we leverage the current erroneous state and the dynamic behavior of the program, as well as repair history and abstraction. A core technical problem our approach to repair addresses is the construction of structurally complex data that satisfies desired properties. We present a novel structure generation technique based on dynamic programming, a classic optimization approach, that exploits the recursive nature of the structures. We use our technique for constraint-based testing; it provides better scalability than previous work, and applying it to widely used web browsers found both known and previously unknown bugs. Our use of dynamic programming in structure generation opens a new direction for tackling the scalability problem of data structure repair. This research advances our ability to develop correct programs. For programs that already have contracts, error recovery using our approach can come at a low cost. The same contracts can be used for systematically testing code before deployment using existing techniques as well as our new ones. Thus, we enable a novel unification of software verification and error recovery.
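As a tiny illustration of the contract-to-repair idea (our own simplified sketch, not the dissertation's implementation), take a singly linked list whose contract requires acyclicity and ascending order; on a violation, a repair routine mutates the erroneous fields in place until the contract holds again. Real contract-driven repair searches for a minimal mutation using the current state and repair history; re-linking all reachable nodes in sorted order is just the simplest correct fallback.

    class Node:
        def __init__(self, value, nxt=None):
            self.value, self.nxt = value, nxt

    def rep_ok(head):
        """Contract: list is acyclic and values ascend along nxt links."""
        seen, node = set(), head
        while node is not None:
            if id(node) in seen:
                return False  # cycle
            seen.add(id(node))
            if node.nxt is not None and node.nxt.value < node.value:
                return False  # order violated
            node = node.nxt
        return True

    def repair(head):
        """On a contract violation, rebuild the nxt fields in place."""
        if rep_ok(head):
            return head
        nodes, seen, node = [], set(), head
        while node is not None and id(node) not in seen:
            nodes.append(node)
            seen.add(id(node))
            node = node.nxt
        nodes.sort(key=lambda n: n.value)
        for a, b in zip(nodes, nodes[1:]):
            a.nxt = b
        nodes[-1].nxt = None
        return nodes[0]

    # The corrupt list 3 -> 1 -> 2 is repaired to 1 -> 2 -> 3.
    head = repair(Node(3, Node(1, Node(2))))
    assert rep_ok(head) and head.value == 1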