136 research outputs found
A Matching Based Theoretical Framework for Estimating Probability of Causation
The concept of Probability of Causation (PC) is critically important in legal
contexts and can help in many other domains. While it has been around since
1986, current operationalizations can obtain only the minimum and maximum
values of PC, and do not apply to purely observational data. We present a
theoretical framework to estimate the distribution of PC from experimental and
from purely observational data. We illustrate additional problems of the
existing operationalizations and show how our method can be used to address
them. We also provide two illustrative examples of how our method is used and
how factors like sample size or rarity of events can influence the distribution
of PC. We hope this will make the concept of PC more widely usable in practice.
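The "minimum and maximum values" that existing operationalizations obtain can be illustrated with the classical Tian-Pearl bounds on the probability of necessity, computed from experimental outcome rates under exogeneity. This is a minimal sketch of the prior approach the abstract contrasts itself with, not the paper's proposed distributional method; the argument names are notational assumptions.

```python
# Classical bounds on the probability of causation (probability of necessity)
# from experimental data under exogeneity -- the "minimum and maximum" the
# abstract refers to. Not the paper's method, which estimates a distribution.
def pc_bounds(p_y_given_x, p_y_given_not_x):
    """Return (lower, upper) bounds on the probability of causation.

    p_y_given_x:     P(outcome | exposed)
    p_y_given_not_x: P(outcome | not exposed)
    """
    lower = max(0.0, (p_y_given_x - p_y_given_not_x) / p_y_given_x)
    upper = min(1.0, (1.0 - p_y_given_not_x) / p_y_given_x)
    return lower, upper

# Example: 80% of exposed and 20% of unexposed subjects show the outcome.
lo, hi = pc_bounds(0.8, 0.2)
```

With these rates the exposure was necessary for the outcome with probability at least 0.75, but the data cannot pin the value down further, which motivates estimating the full distribution instead.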
Representation of Developer Expertise in Open Source Software
Background: Accurate representation of developer expertise has always been an
important research problem. While a number of studies proposed novel methods of
representing expertise within individual projects, these methods are difficult
to apply at an ecosystem level. However, with the focus of software development
shifting from monolithic to modular, a method of representing developers'
expertise in the context of the entire OSS development becomes necessary when,
for example, a project tries to find new maintainers and looks for developers
with relevant skills. Aim: We aim to address this knowledge gap by proposing
and constructing the Skill Space where each API, developer, and project is
represented and postulate how the topology of this space should reflect what
developers know (and projects need). Method: We use the World of Code
infrastructure to extract the complete set of APIs in the files changed by open
source developers and, based on that data, employ Doc2Vec embeddings for vector
representations of APIs, developers, and projects. We then evaluate if these
embeddings reflect the postulated topology of the Skill Space by predicting
what new APIs/projects developers use/join, and whether or not their pull
requests get accepted. We also check how the developers' representations in the
Skill Space align with their self-reported API expertise. Result: Our results
suggest that the proposed embeddings in the Skill Space appear to satisfy the
postulated topology and we hope that such representations may aid in the
construction of signals that increase trust (and efficiency) of open source
ecosystems at large and may aid investigations of other phenomena related to
developer proficiency and learning.
Comment: Accepted in ICSE 2021 Main Technical Track
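The postulated topology of the Skill Space can be sketched in miniature: a developer's vector is the centroid of the vectors of APIs they have touched, and proximity is measured by cosine similarity. The paper trains the API vectors with Doc2Vec; here the vectors and API names are invented purely to illustrate the geometry.

```python
import math

# Toy Skill Space: hand-made API vectors (the paper learns these with Doc2Vec).
# A developer who used "web" APIs should land closer to web APIs than to
# numerical ones -- the topology the paper evaluates.
api_vec = {
    "requests": [1.0, 0.1, 0.0],
    "flask":    [0.9, 0.2, 0.1],
    "numpy":    [0.0, 1.0, 0.9],
}

def centroid(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Developer represented as the centroid of the APIs they have used.
dev = centroid([api_vec["requests"], api_vec["flask"]])
sim_web = cosine(dev, api_vec["flask"])
sim_num = cosine(dev, api_vec["numpy"])
assert sim_web > sim_num  # closer to APIs in the same skill area
```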
Which Pull Requests Get Accepted and Why? A study of popular NPM Packages
Background: Pull Request (PR) Integrators often face challenges in terms of
multiple concurrent PRs, so the ability to gauge which of the PRs will get
accepted can help them balance their workload. PR creators would benefit from
knowing if certain characteristics of their PRs may increase the chances of
acceptance. Aim: We modeled the probability that a PR will be accepted within a
month after creation using a Random Forest model utilizing 50 predictors
representing properties of the author, the PR, and the project to which the PR is
submitted. Method: 483,988 PRs from 4218 popular NPM packages were analysed and
we selected a subset of 14 predictors sufficient for a tuned Random Forest
model to reach high accuracy. Result: An AUC-ROC value of 0.95 was achieved
predicting PR acceptance. The model excluding PR properties that change after
submission gave an AUC-ROC value of 0.89. We tested the utility of our model in
practical scenarios by training it with historical data for the NPM package
\textit{bootstrap} and predicting if the PRs submitted in future will be
accepted. This gave us an AUC-ROC value of 0.94 with all 14 predictors, and
0.77 excluding PR properties that change after its creation. Conclusion: PR
integrators can use our model for a highly accurate assessment of the quality
of the open PRs and PR creators may benefit from the model by understanding
which characteristics of their PRs may be undesirable from the integrators'
perspective. The model can be implemented as a tool, which we plan to do as
future work.
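The reported AUC-ROC of 0.95 has a direct probabilistic reading: it is the probability that a randomly chosen accepted PR receives a higher model score than a randomly chosen rejected one (ties counting half). A minimal pure-Python computation of that statistic, with invented labels and scores:

```python
# AUC-ROC as a rank statistic: fraction of (accepted, rejected) pairs where
# the accepted PR is scored higher, with ties counted as half a win.
def auc_roc(labels, scores):
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Invented example: 3 accepted PRs (label 1) and 2 rejected (label 0).
labels = [1, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.2]
auc = auc_roc(labels, scores)  # 5 of 6 pairs ranked correctly
```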
Effect of Technical and Social Factors on Pull Request Quality for the NPM Ecosystem
Pull request (PR) based development, which is a norm for the social coding
platforms, entails the challenge of evaluating the contributions of, often
unfamiliar, developers from across the open source ecosystem and, conversely,
submitting a contribution to a project with unfamiliar maintainers. Previous
studies suggest that the decision of accepting or rejecting a PR may be
influenced by a diverging set of technical and social factors, but often focus
on relatively few projects, do not consider ecosystem-wide measures, or the
possible non-monotonic relationships between the predictors and PR acceptance
probability. We aim to shed light on this important decision making process by
testing which measures significantly affect the probability of PR acceptance on
a significant fraction of a large ecosystem, rank them by their relative
importance in predicting PR acceptance, and determine the shape of the
functions that map each predictor to PR acceptance. We proposed seven
hypotheses regarding which technical and social factors might affect PR
acceptance and created 17 measures based on them. Our dataset consisted of
470,925 PRs from 3349 popular NPM packages and 79,128 GitHub users who created
those. We tested which of the measures affect PR acceptance and ranked the
significant measures by their importance in a predictive model. Our predictive
model had an AUC of 0.94, and 15 of the 17 measures were found to matter,
including five novel ecosystem-wide measures. Measures describing the number of
PRs submitted to a repository and what fraction of those get accepted, and
signals about the PR review phase were most significant. We also discovered
that only four predictors have a linear influence on the PR acceptance
probability while others showed a more complicated response.
Comment: arXiv admin note: text overlap with arXiv:2003.01153. Preprint of the paper accepted in the ESEM 2020 conference.
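The non-monotonic relationships this abstract highlights can be probed with a deliberately simple check: bin a predictor, compute the PR acceptance rate per bin, and test whether the binned rates move in one direction. This sketch uses invented data and is not the paper's modeling approach, which ranks predictors within a predictive model.

```python
# Binned acceptance rates: a crude way to see the shape of the function
# mapping a predictor to PR acceptance probability.
def binned_rates(values, accepted, edges):
    rates = []
    for lo, hi in zip(edges, edges[1:]):
        idx = [i for i, v in enumerate(values) if lo <= v < hi]
        rates.append(sum(accepted[i] for i in idx) / len(idx))
    return rates

def is_monotone(rates):
    inc = all(a <= b for a, b in zip(rates, rates[1:]))
    dec = all(a >= b for a, b in zip(rates, rates[1:]))
    return inc or dec

# Invented data: acceptance first rises, then falls with a predictor
# (e.g. PR size) -- a non-monotonic response.
sizes    = [1, 2, 5, 6, 9, 10]
accepted = [0, 1, 1, 1, 0, 0]
rates = binned_rates(sizes, accepted, [0, 4, 8, 12])
```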
Machine-assisted annotation of forensic imagery
Image collections, if critical aspects of image content are exposed, can spur
research and practical applications in many domains. Supervised machine
learning may be the only feasible way to annotate very large collections, but
leading approaches rely on large samples of completely and accurately annotated
images. In the case of the large forensic collection we are aiming to annotate,
neither complete annotation nor large training samples can feasibly be
produced. We therefore investigate ways to assist manual annotation efforts
done by forensic experts. We present a method that can propose both images and
areas within an image likely to contain desired classes. Evaluation of the
method with human annotators showed highly accurate classification that was
strongly helped by transfer learning. The segmentation precision (mAP) was
improved by adding a separate class capturing background, but that did not
affect the recall (mAR). Further work is needed to both increase the accuracy
of segmentation and enhance prediction with additional covariates affecting
decomposition. We hope this effort will be of help in other domains that require
weak segmentation and have limited availability of qualified annotators.
Comment: Submitted to ICIP 201
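The precision (mAP) and recall (mAR) figures mentioned above average per-class precision and recall at fixed overlap thresholds. As a simplified illustration, the following sketch scores detections against ground truth at a single IoU threshold using bounding boxes; the boxes are invented and the real evaluation operates on segmentation masks across classes and thresholds.

```python
# Precision/recall of detections at a fixed IoU threshold.
# Boxes are (x1, y1, x2, y2); a prediction counts as a true positive if it
# overlaps an unmatched ground-truth region with IoU >= thr.
def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def precision_recall(preds, truths, thr=0.5):
    matched = set()
    tp = 0
    for p in preds:
        for j, t in enumerate(truths):
            if j not in matched and iou(p, t) >= thr:
                matched.add(j)
                tp += 1
                break
    return tp / len(preds), tp / len(truths)

preds  = [(0, 0, 10, 10), (20, 20, 30, 30)]   # invented detections
truths = [(1, 1, 10, 10), (50, 50, 60, 60)]   # invented ground truth
prec, rec = precision_recall(preds, truths)    # one hit, one miss each way
```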
Patterns of Effort Contribution and Demand and User Classification based on Participation Patterns in NPM Ecosystem
Background: Open source requires participation of volunteer and commercial
developers (users) in order to deliver functional high-quality components.
Developers both contribute effort in the form of patches and demand effort from
the component maintainers to resolve issues reported against it. Aim: Identify
and characterize patterns of effort contribution and demand throughout the open
source supply chain and investigate if and how these patterns vary with
developer activity; identify different groups of developers; and predict
developers' company affiliation based on their participation patterns. Method:
1,376,946 issues and pull-requests created for 4433 NPM packages with over
10,000 monthly downloads and full (public) commit activity data of the 272,142
issue creators are obtained and analyzed, and dependencies on NPM packages are
identified. The fuzzy c-means clustering algorithm is used to find groups among
the users based on their effort contribution and demand patterns, and Random
Forest is used as the predictive modeling technique to identify their company
affiliations. Result: Users contribute and demand effort primarily from
packages that they depend on directly with only a tiny fraction of
contributions and demand going to transitive dependencies. A significant
portion of demand goes into packages outside the users' respective supply
chains (constructed based on publicly visible version control data). Three and
two different groups of users are observed based on the effort demand and
effort contribution patterns respectively. The Random Forest model used for
identifying the company affiliation of the users gives an AUC-ROC value of 0.68.
Conclusion: Our results give new insights into effort demand and supply at
different parts of the supply chain of the NPM ecosystem and its users and
suggest the need to increase visibility further upstream.
Comment: 10 pages, 5 tables, 2 figures. Accepted in The 15th International Conference on Predictive Models and Data Analytics in Software Engineering 201
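Unlike k-means, the fuzzy c-means algorithm used above assigns each user a degree of membership in every cluster, which suits participation patterns that blend contribution and demand. A compact pure-Python sketch on one-dimensional toy data, with a deterministic initialization rather than the random seeding typically used; the data points are invented.

```python
# Fuzzy c-means on 1-D data. Each point gets a membership in every cluster
# (rows of u sum to 1); centers are membership-weighted means.
def fuzzy_cmeans(points, c, m=2.0, iters=50):
    # Deterministic init for reproducibility: spread centers across the
    # sorted data (requires c >= 2). Real implementations seed randomly.
    srt = sorted(points)
    centers = [srt[i * (len(srt) - 1) // (c - 1)] for i in range(c)]
    for _ in range(iters):
        # Membership update: u_ij = 1 / sum_k (d_ij / d_ik)^(2/(m-1))
        u = []
        for x in points:
            d = [abs(x - cj) or 1e-12 for cj in centers]
            row = [1.0 / sum((d[j] / d[k]) ** (2 / (m - 1)) for k in range(c))
                   for j in range(c)]
            u.append(row)
        # Center update: mean weighted by u_ij^m
        centers = [sum(u[i][j] ** m * points[i] for i in range(len(points))) /
                   sum(u[i][j] ** m for i in range(len(points)))
                   for j in range(c)]
    return centers, u

# Two clear groups of "activity levels" (invented data).
pts = [1.0, 1.2, 0.8, 9.0, 9.5, 8.7]
centers, u = fuzzy_cmeans(pts, 2)
```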
ALFAA: Active Learning Fingerprint Based Anti-Aliasing for Correcting Developer Identity Errors in Version Control Data
Graphs of developer networks are important for software engineering research
and practice. For these graphs to realistically represent the networks,
accurate developer identities are imperative. We aim to identify developer
identity errors from open source software repositories in VCS, investigate the
nature of these errors, design corrective algorithms, and estimate the impact
of the errors on networks inferred from this data. We investigate these
questions using over 1B Git commits with over 23M recorded author identities.
By inspecting the author strings that occur most frequently, we group identity
errors into categories. We then augment the author strings with 3 behavioral
fingerprints: time-zone frequencies, the set of files modified, and a vector
embedding of the commit messages. We create a manually validated set of
identities for a subset of OpenStack developers using an active learning
approach and use it to fit supervised learning models to predict the identities
for the remaining author strings in OpenStack. We compare these predictions
with a commercial effort and a leading research method. Finally, we compare
network measures for file-induced author networks based on corrected and raw
data. We find commits done from different environments, misspellings,
organizational IDs, default values, and anonymous IDs to be the major sources
of errors. We also find that supervised learning methods reduce errors severalfold
compared to existing methods, that the active learning approach is an effective
way to create validated datasets, and that correcting developer identities has a
large impact on the inference of the social network. We believe
that our proposed Active Learning Fingerprint Based Anti-Aliasing (ALFAA)
approach will expedite research progress in the software engineering domain for
applications that depend upon graphs of developers or other social networks.
Comment: 12 pages
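The anti-aliasing idea of combining author-string similarity with behavioral fingerprints can be sketched with a toy heuristic: name similarity plus overlap of time-zone frequency distributions. The records, weights, and the linear scoring rule below are all invented for illustration; the paper instead fits supervised models on several fingerprints, including files modified and commit-message embeddings.

```python
import difflib

# Time-zone fingerprint overlap: shared mass of two normalized frequency
# histograms (1.0 for identical distributions, 0.0 for disjoint ones).
def tz_overlap(a, b):
    zones = set(a) | set(b)
    ta, tb = sum(a.values()), sum(b.values())
    return sum(min(a.get(z, 0) / ta, b.get(z, 0) / tb) for z in zones)

# Hypothetical combined score: half name similarity, half fingerprint overlap.
def same_author_score(id1, id2):
    name_sim = difflib.SequenceMatcher(None, id1["name"].lower(),
                                       id2["name"].lower()).ratio()
    return 0.5 * name_sim + 0.5 * tz_overlap(id1["tz"], id2["tz"])

# Invented author records: a and b are plausibly the same person.
a = {"name": "Jane Doe",  "tz": {"+0200": 90, "+0100": 10}}
b = {"name": "jdoe",      "tz": {"+0200": 80, "+0100": 20}}
c = {"name": "Bob Smith", "tz": {"-0800": 100}}
assert same_author_score(a, b) > same_author_score(a, c)
```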
More Effective Software Repository Mining
Background: Data mining and analyzing of public Git software repositories is
a growing research field. The tools used for studies that investigate a single
project or a group of projects have been refined, but it is not clear whether
the results obtained on such "convenience samples" generalize. Aims: This
paper aims to elucidate the difficulties faced by researchers who would like to
ascertain the generalizability of their findings by introducing an interface
that addresses the issues with obtaining representative samples. Results: To do
that we explore how to exploit the World of Code system to make software
repository sampling and analysis much more accessible. Specifically, we present
a resource for Mining Software Repository researchers that is intended to
simplify data sampling and retrieval workflow and, through that, increase the
validity and completeness of data. Conclusions: This system has the potential
to provide researchers a resource that greatly eases the difficulty of data
retrieval and addresses many of the currently standing issues with data
sampling.
Comment: 5 pages, 3 figures. Submitted to the ESEM 2020 Emerging Results track
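A common remedy for convenience sampling, and one building block of representative repository samples, is stratification: partition the population by a covariate such as project size and sample within each stratum. This generic sketch does not use the World of Code interface the paper presents; the project records and strata are invented.

```python
import random

# Stratified sampling: group projects by a stratum key, then draw the same
# number from each group so small and large projects are both represented.
def stratified_sample(projects, stratum_of, per_stratum, seed=0):
    rng = random.Random(seed)
    groups = {}
    for p in projects:
        groups.setdefault(stratum_of(p), []).append(p)
    return {s: rng.sample(ps, min(per_stratum, len(ps)))
            for s, ps in groups.items()}

# Invented population: 40 projects whose commit counts cycle 1/10/100/1000.
projects = [{"name": f"p{i}", "commits": 10 ** (i % 4)} for i in range(40)]
by_size = lambda p: "large" if p["commits"] >= 100 else "small"
sample = stratified_sample(projects, by_size, per_stratum=3)
```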
An Exploratory Study of Bot Commits
Background: Bots help automate many of the tasks performed by software
developers and are widely used to commit code in various social coding
platforms. At present, it is not clear what types of activities these bots
perform and understanding it may help design better bots, and find application
areas which might benefit from bot adoption. Aim: We aim to categorize the Bot
Commits by the type of change (files added, deleted, or modified), find the
more commonly changed file types, and identify the groups of file types that
tend to get updated together. Method: 12,326,137 commits made by 461 popular
bots (that made at least 1000 commits) were examined to identify the frequency
and the type of files added/deleted/modified by the commits, and association
rule mining was used to identify the types of files modified together. Result:
The majority of bot commits modify an existing file, a few of them add new
files, while deletion of a file is very rare. Commits involving more than one
type of operation are even rarer. Files containing data, configuration, and
documentation are most frequently updated, while HTML is the most common type
in terms of the number of files added, deleted, and modified. Files of the
types "Markdown", "Ignore List", "YAML", and "JSON" were most frequently
updated together with other file types. Conclusion: We observe that the
majority of bot commits involve single-file modifications and that bots
primarily work with data, configuration, and documentation files. Whether this
reflects a limitation of the bots and, if overcome, would lead to different
kinds of bots remains an open question.
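The association-rule step reduces to computing support and confidence over per-commit sets of file types: support is the fraction of commits containing an itemset, and confidence of a rule is the support of both sides divided by the support of the left side. The commits below are invented; the paper mines millions of real bot commits.

```python
# Association-rule basics over per-commit sets of file types.
# Invented transactions: each set is the file types touched by one commit.
commits = [
    {"YAML", "JSON"},
    {"YAML", "JSON", "Markdown"},
    {"Markdown"},
    {"YAML"},
    {"JSON", "Ignore List"},
]

def support(itemset):
    """Fraction of commits containing every type in itemset."""
    return sum(itemset <= c for c in commits) / len(commits)

def confidence(lhs, rhs):
    """P(rhs types present | lhs types present)."""
    return support(lhs | rhs) / support(lhs)

sup = support({"YAML", "JSON"})        # both types in 2 of 5 commits
conf = confidence({"YAML"}, {"JSON"})  # JSON appears in 2 of 3 YAML commits
```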
How are Software Repositories Mined? A Systematic Literature Review of Workflows, Methodologies, Reproducibility, and Tools
With the advent of open source software, a veritable treasure trove of
previously proprietary software development data was made available. This
opened the field of empirical software engineering research to anyone in
academia. Data that is mined from software projects, however, requires
extensive processing and needs to be handled with utmost care to ensure valid
conclusions. Since the software development practices and tools have changed
over two decades, we aim to understand the state-of-the-art research workflows
and to highlight potential challenges. We employ a systematic literature review
by sampling over one thousand papers from leading conferences and by analyzing
the 286 most relevant papers from the perspective of data workflows,
methodologies, reproducibility, and tools. We found that an important part of
the research workflow involving dataset selection was particularly problematic,
which raises questions about the generality of the results in existing
literature. Furthermore, we found a considerable number of papers provide
little or no reproducibility instructions -- a substantial deficiency for a
data-intensive field. In fact, 33% of papers provide no information on how
their data was retrieved. Based on these findings, we propose ways to address
these shortcomings via existing tools and also provide recommendations to
improve research workflows and the reproducibility of research.
Comment: 11 pages