Is GitHub's Copilot as Bad as Humans at Introducing Vulnerabilities in Code?
Several advances in deep learning have been successfully applied to the
software development process. Of recent interest is the use of neural language
models to build tools, such as Copilot, that assist in writing code. In this
paper we perform a comparative empirical analysis of Copilot-generated code
from a security perspective. The aim of this study is to determine if Copilot
is as bad as human developers: we investigate whether Copilot is just as
likely to introduce the same software vulnerabilities that human developers
did. Using a dataset of C/C++ vulnerabilities, we prompt Copilot to generate
suggestions in scenarios that previously led to the introduction of
vulnerabilities by human developers. The suggestions are inspected and
categorized in a 2-stage process based on whether the original vulnerability or
the fix is reintroduced. We find that Copilot replicates the original
vulnerable code ~33% of the time while replicating the fixed code at a ~25%
rate. However, this behavior is not consistent: Copilot is more susceptible to
introducing some types of vulnerability than others and is more likely to
generate vulnerable code in response to prompts that correspond to older
vulnerabilities than newer ones. Overall, given that in a substantial
proportion of instances Copilot did not generate code with the same
vulnerabilities that human developers had introduced previously, we conclude
that Copilot is not as bad as human developers at introducing vulnerabilities
in code.
An Empirical Study of Goto in C Code from GitHub Repositories
It is nearly 50 years since Dijkstra argued that goto obscures the flow of control in program execution and urged programmers to abandon the goto statement. While past research has shown that goto is still in use, little is known about whether goto is used in the unrestricted manner that Dijkstra feared, and whether it is 'harmful' enough to be a part of a post-release bug. We therefore conduct a two-part empirical study: (1) we qualitatively analyze a statistically representative sample of 384 files from a population of almost 250K C programming language files collected from over 11K GitHub repositories and find that developers use goto in C files for error handling (80.21 ± 5%) and cleaning up resources at the end of a procedure (40.36 ± 5%); an
Towards an Understanding of Large Language Models in Software Engineering Tasks
Large Language Models (LLMs) have drawn widespread attention and research due
to their astounding performance in tasks such as text generation and reasoning.
Derivative products, like ChatGPT, have been extensively deployed and highly
sought after. Meanwhile, the evaluation and optimization of LLMs in software
engineering tasks, such as code generation, have become a research focus.
However, there is still a lack of systematic research on the application and
evaluation of LLMs in the field of software engineering. Therefore, this paper
is the first to comprehensively investigate and collate the research and
products combining LLMs with software engineering, aiming to answer two
questions: (1) What are the current integrations of LLMs with software
engineering? (2) Can LLMs effectively handle software engineering tasks? To
find the answers, we have collected related literature as extensively as
possible from seven mainstream databases, and selected 123 papers for analysis.
We have categorized these papers in detail and reviewed the current research
status of LLMs from the perspective of seven major software engineering tasks,
hoping this will help researchers better grasp the research trends and address
the issues when applying LLMs. Meanwhile, we have also organized and presented
papers with evaluation content to reveal the performance and effectiveness of
LLMs in various software engineering tasks, providing guidance for researchers
and developers to optimize
Fine-Grain Parallelism
Computer hardware is at the beginning of the multi-core revolution. While hardware at the commodity level is capable of running concurrent software, most software does not take advantage of this fact because parallel software development is difficult. This project addressed potential remedies to these difficulties by investigating graphical programming and fine-grain parallelism. A prototype system taking advantage of both of these concepts was implemented and evaluated in terms of real-world applications.
Identifying Graphs from Noisy Observational Data
There is a growing amount of data describing networks -- examples include social networks, communication networks, and biological networks. As the amount of available data increases, so does our interest in analyzing the properties and characteristics of these networks. However, in most cases the data is noisy, incomplete, and the result of passively acquired observational data; naively analyzing these networks without taking these errors into account can result in inaccurate and misleading conclusions. In my dissertation, I study the tasks of entity resolution, link prediction, and collective classification to address these deficiencies. I describe these tasks in detail and discuss my own work on each of these tasks. For entity resolution, I develop a method for resolving the identities of name mentions in email communications. For link prediction, I develop a method for inferring subordinate-manager relationships between individuals in an email communication network. For collective classification, I propose an adaptive active surveying method to address node labeling in a query-driven setting on network data.
In many real-world settings, however, these deficiencies are not found in isolation and all need to be addressed to infer the desired complete and accurate network. Furthermore, because of the dependencies typically found in these tasks, the tasks are inherently inter-related and must be performed jointly. I define the general problem of graph identification, which simultaneously performs these tasks, removing the noise and missing values in the observed input network and inferring the complete and accurate output network. I present a novel approach to graph identification using a collection of Coupled Collective Classifiers, C3, which, in addition to capturing the variety of features typically used for each task, can capture the intra- and inter-dependencies required to correctly infer nodes, edges, and labels in the output network.
I discuss variants of C3 using different learning and inference paradigms and show the superior performance of C3, in terms of both prediction quality and runtime performance, over various previous approaches. I then conclude by presenting the Graph Alignment, Identification, and Analysis (GAIA) open-source software library, which provides not only an implementation of C3 but also algorithms for various tasks in network data such as entity resolution, link prediction, collective classification, clustering, active learning, data generation, and analysis.
Election Security Is Harder Than You Think
Recent years have seen the rise of nation-state interference in elections
across the globe, making the ever-present need for more secure elections all
the more dire. While certain common-sense approaches have been a typical
response in the past, e.g. "don't connect voting machines to the Internet"
and "use a voting system with a paper trail", known-good solutions to
improving election security have languished in relative obscurity for decades.
These techniques are only now finally being implemented at scale, and that
implementation has brought the intricacies of sophisticated approaches to
election security into full relief.
This dissertation argues that while approaches to improve election security
like paper ballots and post-election audits seem straightforward, in reality
there are significant practical barriers to sufficient implementation.
Overcoming these barriers is a necessary condition for an election to be
secure, and while doing so is possible, it requires significant refinement of
existing techniques. In order to better understand how election security
technology can be improved, I first develop what it means for an election to be
secure. I then delve into experimental results regarding voter-verified paper,
discussing the challenges presented by paper ballots as well as some strategies
to improve the security they can deliver. I examine the post-election audit
ecosystem and propose a manifest improvement to audit workload analysis
through parallelization. Finally, I show that even when all of these conditions
are met (as in a vote-by-mail scenario), there are still wrinkles that must be
addressed for an election to be truly secure.
PhD, Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies
http://deepblue.lib.umich.edu/bitstream/2027.42/163272/1/matber_1.pd
Automatically Optimizing Tree Traversal Algorithms
Many domains in computer science, from data-mining to graphics to computational astrophysics, focus heavily on irregular applications. In contrast to regular applications, which operate over dense matrices and arrays, irregular programs manipulate and traverse complex data structures like trees and graphs. As irregular applications operate on ever larger datasets, their performance suffers from poor locality and parallelism. Programmers are burdened with the arduous task of manually tuning such applications for better performance. Generally applicable techniques to optimize irregular applications are highly desired, yet scarce.
In this dissertation, we argue that, for an important subset of irregular programs which arises in many domains, namely, tree traversal algorithms like Barnes-Hut, nearest neighbor and ray tracing, there exist general techniques to enhance performance. We investigate two sources of performance improvement: locality enhancement and vectorization. Furthermore we demonstrate that these techniques can be automatically applied by an optimizing compiler, relieving programmers of manual, error-prone, application-specific effort.
Achieving high performance in many applications requires achieving good locality of reference. We propose two novel transformations called point blocking and traversal splicing, inspired by the classic tiling loop transformation, and show that they can substantially enhance temporal locality in tree traversals. We then present a transformation framework called TreeSplicer that automatically applies these transformations and uses autotuning techniques to determine appropriate parameters for them. For six benchmark algorithms, we show that a combination of point blocking and traversal splicing can deliver single-thread speedups of up to 8.71 (geometric mean: 2.48), just from better locality.
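The general idea behind point blocking can be illustrated with a small sketch. This is not the TreeSplicer implementation: the tree layout, the range-count query, and every name below (`Node`, `build`, `count_single`, `blocked_counts`) are assumptions made purely for illustration. Rather than letting each query point walk the whole tree on its own, a block of points descends together, so each node is touched once per block instead of once per point.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    lo: float                       # smallest value stored under this node
    hi: float                       # largest value stored under this node
    value: Optional[float] = None   # set only on leaves
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def build(values: List[float]) -> Node:
    """Build a balanced binary tree over a non-empty, sorted list of values."""
    if len(values) == 1:
        return Node(values[0], values[0], value=values[0])
    mid = len(values) // 2
    left, right = build(values[:mid]), build(values[mid:])
    return Node(left.lo, right.hi, left=left, right=right)


def pruned(node: Node, q: float, r: float) -> bool:
    """True if no value under `node` can lie within distance r of q."""
    return node.hi < q - r or node.lo > q + r


def count_single(node: Node, q: float, r: float) -> int:
    """Baseline: one query point traverses the tree by itself."""
    if pruned(node, q, r):
        return 0
    if node.value is not None:
        return 1
    return count_single(node.left, q, r) + count_single(node.right, q, r)


def _count_block(node: Node, block: List[int], queries: List[float],
                 r: float, counts: List[int]) -> None:
    """Blocked traversal: `block` holds indices of the points still active."""
    # Only points that are not pruned at this node continue downward.
    active = [i for i in block if not pruned(node, queries[i], r)]
    if not active:
        return
    if node.value is not None:
        for i in active:
            counts[i] += 1
        return
    _count_block(node.left, active, queries, r, counts)
    _count_block(node.right, active, queries, r, counts)


def blocked_counts(root: Node, queries: List[float], r: float) -> List[int]:
    """Count, for every query, the stored values within distance r."""
    counts = [0] * len(queries)
    _count_block(root, list(range(len(queries))), queries, r, counts)
    return counts
```

For example, with `root = build(sorted([1.0, 2.0, 3.0, 5.0, 8.0, 13.0]))`, calling `blocked_counts(root, [2.0, 6.0, 20.0], 1.5)` gives the same per-point counts as calling `count_single` once per query. The results are identical; only the order in which nodes and points are visited changes, which is where a locality benefit would come from in a compiled setting.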
Modern commodity processors support SIMD instructions, and using these instructions to process multiple traversals at once has the potential to provide substantial performance improvements. Unfortunately, tree algorithms often feature highly diverging traversals which inhibit efficient SIMD utilization, to the point that other, less profitable sources of vectorization must be exploited instead. We propose a dynamic reordering of traversals based on their previous behavior, building on the insight that traversals which have behaved similarly so far are likely to behave similarly in the future, and show that this reordering can dramatically improve the SIMD utilization of diverging traversals, bringing it close to ideal utilization. We present a transformation framework, SIMTree, which facilitates vectorization of tree algorithms, and demonstrate speedups of up to 6.59 (geometric mean: 2.78). Furthermore, our techniques can effectively SIMDize algorithms that prior, manual vectorization attempts could not.