
    Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation

    Program synthesis has been long studied, with recent approaches focused on directly using the power of Large Language Models (LLMs) to generate code according to user intent written in natural language. Code evaluation datasets, containing curated synthesis problems with input/output test-cases, are used to measure the performance of various LLMs on code synthesis. However, the test-cases in these datasets can be limited in both quantity and quality for fully assessing the functional correctness of the generated code. This limitation of existing benchmarks raises the following question: in the era of LLMs, is the generated code really correct? To answer this, we propose EvalPlus, a code synthesis benchmarking framework that rigorously evaluates the functional correctness of LLM-synthesized code. In short, EvalPlus takes in a base evaluation dataset and uses an automatic input generation step to produce and diversify large numbers of new test inputs, using both LLM-based and mutation-based input generators, to further validate the synthesized code. We extend the popular HUMANEVAL benchmark and build HUMANEVAL+ with 81x additional generated tests. Our extensive evaluation across 14 popular LLMs demonstrates that HUMANEVAL+ catches significant amounts of previously undetected wrong code synthesized by LLMs, reducing pass@k by 15.1% on average. Moreover, we even found several incorrect ground-truth implementations in HUMANEVAL. Our work not only indicates that prior popular code synthesis evaluation results do not accurately reflect the true performance of LLMs for code synthesis, but also opens up a new direction for improving programming benchmarks through automated test input generation.
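
    To make the pass@k metric referenced above concrete, the following is a minimal sketch of the standard unbiased estimator popularized by the Codex evaluation (Chen et al., 2021), which benchmarks like HUMANEVAL+ report; the sample counts in the example are illustrative, not results from the paper:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: the probability that at least one of k samples,
    drawn without replacement from n generated programs of which c are
    functionally correct, passes all tests: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative numbers: 200 samples per problem, 35 still pass under a
# stricter test suite such as HUMANEVAL+.
print(f"pass@10 = {pass_at_k(200, 35, 10):.4f}")
```

    A stricter test suite lowers c for problems whose solutions only appeared correct, which is exactly how HUMANEVAL+ reduces the reported pass@k.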

    Applying Formal Methods to Networking: Theory, Techniques and Applications

    Despite its great importance, modern network infrastructure is remarkable for the lack of rigor in its engineering. The Internet, which began as a research experiment, was never designed to handle the users and applications it hosts today. The lack of formalization of the Internet architecture meant limited abstractions and modularity, especially for the control and management planes, thus requiring a new protocol to be built from scratch for every new need. This led to an unwieldy, ossified Internet architecture resistant to any attempts at formal verification, and to an Internet culture where expediency and pragmatism are favored over formal correctness. Fortunately, recent work in the space of clean-slate Internet design, especially the software-defined networking (SDN) paradigm, offers the Internet community another chance to develop the right kind of architecture and abstractions. This has also led to a great resurgence of interest in applying formal methods to the specification, verification, and synthesis of networking protocols and applications. In this paper, we present a self-contained tutorial on the formidable amount of work that has been done in formal methods, and present a survey of its applications to networking.
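
    To give a flavor of the data-plane properties such verification targets, here is a small, hypothetical sketch (not taken from the survey) that exhaustively checks reachability over a static forwarding table, detecting black holes and forwarding loops; real verification tools decide the same questions symbolically over all packet headers:

```python
# Hypothetical forwarding state: switch -> {destination: next hop}.
FIB = {
    "s1": {"h2": "s2"},
    "s2": {"h2": "s3"},
    "s3": {"h2": "h2"},  # last hop delivers to host h2
}

def reachable(src: str, dst: str) -> bool:
    """Follow forwarding entries for dst hop by hop, returning False on
    a black hole (missing entry) or a forwarding loop (revisited node)."""
    seen, node = set(), src
    while node not in seen:
        if node == dst:
            return True
        seen.add(node)
        hop = FIB.get(node, {}).get(dst)
        if hop is None:
            return False  # black hole: no forwarding entry
        node = hop
    return False  # forwarding loop detected

assert reachable("s1", "h2") and not reachable("s2", "h9")
```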

    Metamorphic Testing of Datalog Engines


    Computer Aided Verification

    This open access two-volume set, LNCS 13371 and 13372, constitutes the refereed proceedings of the 34th International Conference on Computer Aided Verification, CAV 2022, which was held in Haifa, Israel, in August 2022. The 40 full papers presented together with 9 tool papers and 2 case studies were carefully reviewed and selected from 209 submissions. The papers are organized in the following topical sections: Part I: invited papers; formal methods for probabilistic programs; formal methods for neural networks; software verification and model checking; hyperproperties and security; formal methods for hardware, cyber-physical, and hybrid systems. Part II: probabilistic techniques; automata and logic; deductive verification and decision procedures; machine learning; synthesis and concurrency. This is an open access book.

    Tools and Algorithms for the Construction and Analysis of Systems

    This open access book constitutes the proceedings of the 28th International Conference on Tools and Algorithms for the Construction and Analysis of Systems, TACAS 2022, which was held during April 2-7, 2022, in Munich, Germany, as part of the European Joint Conferences on Theory and Practice of Software, ETAPS 2022. The 46 full papers and 4 short papers presented in this volume were carefully reviewed and selected from 159 submissions. The proceedings also contain 16 tool papers from the affiliated competition SV-COMP and 1 competition report. TACAS is a forum for researchers, developers, and users interested in rigorously based tools and algorithms for the construction and analysis of systems. The conference aims to bridge the gaps between different communities with this common interest and to support them in their quest to improve the utility, reliability, flexibility, and efficiency of tools and algorithms for building computer-controlled systems.

    Selective Dynamic Analysis of Virtualized Whole-System Guest Environments

    Dynamic binary analysis is a prevalent and indispensable technique in program analysis. While several dynamic binary analysis tools and frameworks have been proposed, all suffer from one or more of the following: prohibitive performance degradation, a semantic gap between the analysis code and the execution under analysis, architecture/OS specificity, restriction to user mode, and a lack of flexibility and extensibility. This dissertation describes the design of the Dynamic Executable Code Analysis Framework (DECAF), a virtual machine-based, multi-target, whole-system dynamic binary analysis framework. In short, DECAF seeks to address the shortcomings of existing whole-system dynamic analysis tools and extend the state of the art by utilizing a combination of novel techniques to provide rich analysis functionality without crippling amounts of execution overhead. DECAF extends the mature QEMU whole-system emulator, a type-2 hypervisor capable of emulating every instruction that executes within a complete guest system environment. DECAF provides a novel, hardware event-based method of just-in-time virtual machine introspection (VMI) to address the semantic-gap problem. It also implements a novel instruction-level taint tracking engine at bit-level granularity, ensuring that taint propagation is sound and highly precise throughout the guest environment. A formal analysis of the taint propagation rules is provided to verify that most instructions introduce neither false positives nor false negatives. DECAF's design also provides a plugin architecture with a simple-to-use, event-driven programming interface that makes it both flexible and extensible for a variety of analysis tasks. The implementation of DECAF consists of 9550 lines of C++ code and 10270 lines of C code. Its performance is evaluated using SPEC CPU2006 benchmarks, which show an average overhead of 605% for system-wide tainting and 12% for VMI. Three platform-neutral DECAF plugins - Instruction Tracer, Keylogger Detector, and API Tracer - are described and evaluated in this dissertation to demonstrate the ease of use and effectiveness of DECAF in writing cross-platform, system-wide analysis tools.

    This dissertation also presents the Virtual Device Fuzzer (VDF), a scalable fuzz testing framework for discovering bugs within the virtual devices implemented as part of QEMU. Such bugs could be exploited by malicious software executing within a guest under analysis by DECAF, so their discovery, reproduction, and diagnosis help protect DECAF against attack while improving QEMU and any analysis platforms built upon it. VDF uses selective instrumentation to perform targeted fuzz testing, exploring only the branches of execution belonging to the virtual devices under analysis. By leveraging record and replay of memory-mapped I/O activity, VDF quickly cycles virtual devices through an arbitrarily large number of states without requiring a guest OS to be booted or present. Once a test case is discovered that triggers a bug, VDF reduces it to the minimum number of reads/writes required to trigger the bug and generates source code suitable for reproducing the bug during debugging and analysis. VDF is evaluated by fuzz testing eighteen QEMU virtual devices, generating 1014 crash or hang test cases that reveal bugs in six of the tested devices. Over 80% of the crashes and hangs were discovered within the first day of testing. VDF covered an average of 62.32% of virtual device branches during testing, and the average test case was minimized to 18.57% of its original size.
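
    The bit-level taint propagation described above can be illustrated with a minimal sketch. The per-bit rules below are the generic textbook formulation for AND and OR (the kind of rules DECAF's formal analysis covers); they are written for illustration and are not DECAF's actual implementation:

```python
def taint_and(va: int, ta: int, vb: int, tb: int) -> int:
    """Per-bit taint for AND: an output bit stays attacker-controlled
    (tainted) unless the other operand holds an untainted 0 there,
    which forces the result bit to 0."""
    return (ta & tb) | (ta & vb) | (tb & va)

def taint_or(va: int, ta: int, vb: int, tb: int) -> int:
    """Per-bit taint for OR: an untainted 1 bit in either operand
    forces the result bit to 1, clearing taint in that position."""
    return (ta & tb) | (ta & ~vb) | (tb & ~va)

# Masking a fully tainted byte with the untainted constant 0x0F keeps
# taint only on the low four bits.
value, taint = 0xAB, 0xFF        # tainted input byte (all bits tainted)
mask, mask_taint = 0x0F, 0x00    # untainted constant mask
assert taint_and(value, taint, mask, mask_taint) == 0x0F
```

    Whole-value tainting would mark the entire result tainted here; a per-bit rule tracks exactly which bits remain influenced by tainted input, which is the precision the dissertation's soundness analysis is concerned with.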