Semantic Source Code Models Using Identifier Embeddings
The emergence of online open source repositories in recent years has led
to an explosion in the volume of openly available source code, coupled with
metadata that relate to a variety of software development activities. As a
result, in line with recent advances in machine learning research, software
maintenance activities are switching from symbolic formal methods to
data-driven methods. In this context, the rich semantics hidden in source code
identifiers provide opportunities for building semantic representations of code
which can assist code search and reuse tasks. To this end, we deliver
distributed code representations, in the form of pretrained vector space models, for
six popular programming languages, namely, Java, Python, PHP, C, C++, and C#.
The models are produced using fastText, a state-of-the-art library for learning
word representations. Each model is trained on data from a single programming
language; the code mined for producing all models amounts to over 13,000
repositories. We indicate dissimilarities between natural language and source
code, as well as variations in coding conventions between the different
programming languages we processed. We describe how these heterogeneities
guided the data preprocessing decisions we took and the selection of the
training parameters in the released models. Finally, we propose potential
applications of the models and discuss their limitations.
Comment: 16th International Conference on Mining Software Repositories (MSR
2019): Data Showcase Track
Commands as AI Conversations
Developers and data scientists often struggle to write command-line inputs,
even though graphical interfaces or tools like ChatGPT can assist. The
solution? "ai-cli," an open-source system inspired by GitHub Copilot that
converts natural language prompts into executable commands for various Linux
command-line tools. By tapping into OpenAI's API, which allows interaction
through JSON HTTP requests, "ai-cli" transforms user queries into actionable
command-line instructions. However, integrating AI assistance across multiple
command-line tools, especially in open source settings, can be complex.
Historically, operating systems could mediate, but individual tool
functionality and the lack of a unified approach have made centralized
integration challenging. The "ai-cli" tool, by bridging this gap through
dynamic loading and linking with each program's Readline library API, makes
command-line interfaces smarter and more user-friendly, opening avenues for
further enhancement and cross-platform applicability.
Comment: 5 pages
Detecting Missing Dependencies and Notifiers in Puppet Programs
Puppet is a popular computer system configuration management tool. It
provides abstractions that enable administrators to set up their computer
systems declaratively. Its use suffers from two potential pitfalls. First, if
ordering constraints are not specified whenever an abstraction depends on
another, the non-deterministic application of abstractions can lead to race
conditions. Second, if a service is not tied to its resources through
notification constructs, the system may operate in a stale state whenever a
resource gets modified. Such faults can degrade a computing infrastructure's
availability and functionality.
We have developed an approach that identifies these issues through the
analysis of a Puppet program and its system call trace. Specifically, we
present a formal model for traces, which allows us to capture the interactions
of Puppet abstractions with the file system. By analyzing these interactions we
identify (1) abstractions that are related to each other (e.g., operate on the
same file), and (2) abstractions that should act as notifiers so that changes
are correctly propagated. We then check the relationships from the trace's
analysis against the program's dependency graph: a representation containing
all the ordering constraints and notifications declared in the program. If a
mismatch is detected, our system reports a potential fault.
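
The comparison between trace-derived relationships and the declared dependency graph can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the input shapes, and the simplification that any two abstractions touching the same file must be ordered are all assumptions made for illustration, and the notifier check is left out.

```python
from collections import defaultdict


def reachable(graph, src, dst):
    """Return True if dst can be reached from src along declared edges."""
    stack, seen = [src], set()
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(graph.get(node, ()))
    return False


def missing_orderings(file_accesses, declared_edges):
    """Report pairs of abstractions that touch the same file but are unordered.

    file_accesses: hypothetical mapping of abstraction name -> set of paths
                   observed in the system call trace.
    declared_edges: (before, after) pairs taken from the program's
                    dependency graph.
    """
    graph = defaultdict(set)
    for before, after in declared_edges:
        graph[before].add(after)

    # Group abstractions by the file they were observed to touch.
    by_path = defaultdict(set)
    for resource, paths in file_accesses.items():
        for path in paths:
            by_path[path].add(resource)

    issues = set()
    for path, resources in by_path.items():
        ordered = sorted(resources)
        for i, first in enumerate(ordered):
            for second in ordered[i + 1:]:
                # A mismatch: related in the trace, unordered in the program.
                if not reachable(graph, first, second) and not reachable(graph, second, first):
                    issues.add((first, second, path))
    return sorted(issues)


# Illustrative data only; these resource names are made up.
accesses = {
    "Package[nginx]": {"/etc/nginx/nginx.conf"},
    "File[nginx.conf]": {"/etc/nginx/nginx.conf"},
    "Service[nginx]": {"/etc/nginx/nginx.conf"},
}
declared = [("File[nginx.conf]", "Service[nginx]")]
issues = missing_orderings(accesses, declared)
for first, second, path in issues:
    print(f"potential missing ordering: {first} <-> {second} on {path}")
```

In this toy input, only the file and the service are ordered, so the package is reported as unordered with respect to both of them.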
We have evaluated our method on a large set of Puppet modules, and discovered
57 previously unknown issues in 30 of them. Benchmarking further shows that our
approach can analyze, in minutes, real-world configurations spanning
thousands of lines and millions of system calls.
On the Feasibility of Transfer-learning Code Smells using Deep Learning
Context: A substantial amount of work has been done to detect smells in
source code using metrics-based and heuristics-based methods. Machine learning
methods have been recently applied to detect source code smells; however, the
current practices are considered far from mature. Objective: First, explore the
feasibility of applying deep learning models to detect smells without extensive
feature engineering, just by feeding the source code in tokenized form. Second,
investigate the possibility of applying transfer-learning in the context of
deep learning models for smell detection. Method: We use existing metric-based
state-of-the-art methods for detecting three implementation smells and one
design smell in C# code. Using these results as the annotated gold standard, we
train smell detection models on three different deep learning architectures.
These architectures use Convolutional Neural Networks (CNNs) of one or two
dimensions, or Recurrent Neural Networks (RNNs) as their principal hidden
layers. For the first objective of our study, we perform training and
evaluation on C# samples, whereas for the second objective, we train the models
from C# code and evaluate the models over Java code samples. We perform the
experiments with various combinations of hyper-parameters for each model.
Results: We find it feasible to detect smells using deep learning methods. Our
comparative experiments find that there is no clearly superior method between
CNN-1D and CNN-2D. We also observe that performance of the deep learning models
is smell-specific. Our transfer-learning experiments show that
transfer-learning is definitely feasible for implementation smells with
performance comparable to that of direct-learning. This work opens up a new
paradigm to detect code smells by transfer-learning, especially for
programming languages where comprehensive code smell detection tools are
not available.
Open Source Adoption In Large US Companies
Various organizations increasingly adopt open source software, both on desktop PCs and servers. Since the first movements in open source in the 1960s, its growth has led to new approaches in software development, licensing, and distribution, as well as in software vendors’ business models. The literature includes studies on prospective benefits and business models, as well as case studies. However, the adoption of open source in large, global companies and its relationship with factors such as profitability, revenues and industry sector has not yet been researched. This study aims to address this gap based on data we collected from Fortune 1000 companies and provides a method that can be applied in similar contexts.
Definitions of a Software Smell
Many authors have defined smells from their perspective. This document attempts to provide a consolidated list of such definitions.
Identifying Bugs in Make and JVM-Oriented Builds
Incremental and parallel builds are crucial features of modern build systems.
Parallelism enables fast builds by running independent tasks simultaneously,
while incrementality saves time and computing resources by processing the build
operations that were affected by a particular code change. Writing build
definitions that lead to error-free incremental and parallel builds is a
challenging task. This is mainly because developers are often unable to predict
the effects of build operations on the file system and how different build
operations interact with each other. Faulty build scripts may seriously degrade
the reliability of automated builds, as they cause build failures, and
non-deterministic and incorrect build results.
To reason about arbitrary build executions, we present buildfs, a
generally-applicable model that takes into account the specification (as
declared in build scripts) and the actual behavior (low-level file system
operation) of build operations. We then formally define different types of
faults related to incremental and parallel builds in terms of the conditions
under which a file system operation violates the specification of a build
operation. Our testing approach, which relies on the proposed model, analyzes
the execution of a single full build, translates it into buildfs, and uncovers
faults by checking for corresponding violations.
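
The idea of checking observed file system operations against a build operation's specification can be sketched as follows. This is a simplified illustration, not the buildfs model itself: the task names, the input shapes, and the two violation labels are hypothetical.

```python
def find_violations(declared, observed):
    """Compare observed file accesses with each task's declared specification.

    declared: hypothetical mapping of task -> {"inputs": set, "outputs": set},
              as would be extracted from the build script.
    observed: list of (task, op, path) tuples with op in {"read", "write"},
              as would be extracted from a trace of one full build.
    """
    violations = []
    for task, op, path in observed:
        spec = declared.get(task, {"inputs": set(), "outputs": set()})
        if op == "read" and path not in spec["inputs"] and path not in spec["outputs"]:
            # The task depends on a file the build script does not mention,
            # so an incremental build may skip it when that file changes.
            violations.append((task, "undeclared input", path))
        elif op == "write" and path not in spec["outputs"]:
            # The task produces a file the build script does not mention,
            # which can cause conflicts between tasks in a parallel build.
            violations.append((task, "undeclared output", path))
    return violations


# Illustrative data only.
declared = {"compile": {"inputs": {"main.c"}, "outputs": {"main.o"}}}
observed = [
    ("compile", "read", "main.c"),
    ("compile", "read", "util.h"),
    ("compile", "write", "main.o"),
]
violations = find_violations(declared, observed)
for task, kind, path in violations:
    print(f"{task}: {kind} {path}")
```

Here the compile task reads `util.h` without declaring it, which is the kind of mismatch that would leave an incremental build stale after a header change.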
We evaluate the effectiveness, efficiency, and applicability of our approach
by examining hundreds of Make and Gradle projects. Notably, our method is the
first to handle Java-oriented build systems. The results indicate that our
approach is (1) able to uncover several important issues (245 issues found in
45 open-source projects have been confirmed and fixed by the upstream
developers), and (2) orders of magnitude faster than a state-of-the-art tool
for Make builds.
Global software development in the FreeBSD project
FreeBSD is a sophisticated operating system developed and maintained as open-source software by a team of more than 350 individuals located throughout the world. This study uses developer location data, the configuration management repository, and records from the issue database to examine the extent of global development and its effect on productivity, quality, and developer cooperation. The key findings are that global development allows round-the-clock work, but there are some marked differences between the types of work performed in different regions. The effects of multiple dispersed developers on the quality of code and productivity are negligible. Mentoring appears to be sometimes associated with developers living closer together, but ad-hoc cooperation seems to work fine across continents.
Software engineering for deep learning applications: usage of SWEng and MLops tools in GitHub repositories
The rising popularity of deep learning (DL) methods and techniques has
invigorated interest in the topic of SE4DL, the application of software
engineering (SE) practices on deep learning software. Despite the novel
engineering challenges brought on by the data-driven and non-deterministic
paradigm of DL software, little work has been invested into developing
AI-targeted SE tools. On the other hand, tools tackling more general
engineering issues in DL are actively used and referred to under the umbrella
term of "MLOps tools". Furthermore, the available literature supports the
utility of conventional SE tooling in DL software development. Building upon
previous MSR research on tool usage in open-source software works, we identify
conventional and MLOps tools adopted in popular applied DL projects that use
Python as the main programming language. About 70% of the GitHub repositories
mined contained at least one conventional SE tool. Software configuration
management tools are the most adopted, while the opposite applies to
maintenance tools. Substantially fewer MLOps tools were in use, with only 9
tools out of a sample of 80 used in at least one repository. The majority of
them were open-source rather than proprietary. One of these tools, TensorBoard,
was found to be adopted in about half of the repositories in our study.
Consequently, the use of conventional SE tooling demonstrates its relevance to
DL software. Further research is recommended on the adoption of MLOps tooling
by open-source projects, focusing on the relevance of particular tool types,
the development of required tools, as well as ways to promote the use of
already available tools.