Towards Identifying Paid Open Source Developers - A Case Study with Mozilla Developers
Open source development contains contributions from both hired and volunteer
software developers. Identification of this status is important when we
consider the transferability of research results to the closed source software
industry, which includes no volunteer developers. While many studies have
taken the employment status of developers into account, this information is
often gathered manually due to the lack of accurate automatic methods. In this
paper, we present an initial step towards predicting paid and unpaid open
source development using machine learning and compare our results with
automatic techniques used in prior work. By relying on source code repository
meta-data from Mozilla, and manually collected employment status, we built a
dataset of the most active developers, both volunteer and hired by Mozilla. We
define a set of metrics based on developers' usual commit time pattern and use
different classification methods (logistic regression, classification tree, and
random forest). The results show that our proposed method identifies paid and
unpaid commits with an AUC of 0.75 using random forest, which is higher than
the AUC of 0.64 obtained with the best of the previously used automatic
methods.
Comment: International Conference on Mining Software Repositories (MSR) 201
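To make the AUC figures above concrete, here is a minimal sketch of how AUC can be computed from classifier scores. It is not the authors' code: the function name `auc`, the label encoding (1 for "paid", 0 for "unpaid"), and the toy scores are assumptions for illustration only. AUC is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one.

```python
def auc(scores, labels):
    """Area under the ROC curve, computed by pairwise comparison.

    labels: 1 for the positive class (assumed here to mean "paid"), 0 otherwise.
    A tie between a positive and a negative score counts as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (made up): three of the four positive/negative pairs are ranked correctly.
toy_scores = [0.9, 0.8, 0.3, 0.1]
toy_labels = [1, 0, 1, 0]
toy_auc = auc(toy_scores, toy_labels)
```

On this scale, 0.5 corresponds to random ranking and 1.0 to a perfect separation of the two classes.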
Bayesian Hierarchical Modelling for Tailoring Metric Thresholds
Software is highly contextual. While there are cross-cutting 'global'
lessons, individual software projects exhibit many 'local' properties. This
data heterogeneity makes drawing local conclusions from global data dangerous.
A key research challenge is to construct locally accurate prediction models
that are informed by global characteristics and data volumes. Previous work has
tackled this problem using clustering and transfer learning approaches, which
identify locally similar characteristics. This paper applies a simpler approach
known as Bayesian hierarchical modeling. We show that hierarchical modeling
supports cross-project comparisons, while preserving local context. To
demonstrate the approach, we conduct a conceptual replication of an existing
study on setting software metrics thresholds. Our emerging results show our
hierarchical model reduces model prediction error compared to a global approach
by up to 50%.
Comment: Short paper, published at MSR '18: 15th International Conference on
Mining Software Repositories, May 28-29, 2018, Gothenburg, Swede
A Benchmark Study on Sentiment Analysis for Software Engineering Research
A recent research trend has emerged to identify developers' emotions, by
applying sentiment analysis to the content of communication traces left in
collaborative development environments. Trying to overcome the limitations
posed by using off-the-shelf sentiment analysis tools, researchers recently
started to develop their own tools for the software engineering domain. In this
paper, we report a benchmark study to assess the performance and reliability of
three sentiment analysis tools specifically customized for software
engineering. Furthermore, we offer a reflection on the open challenges, as they
emerge from a qualitative analysis of misclassified texts.
Comment: Proceedings of 15th International Conference on Mining Software
Repositories (MSR 2018
We Don't Need Another Hero? The Impact of "Heroes" on Software Development
A software project has "Hero Developers" when 80% of contributions are
delivered by 20% of the developers. Are such heroes a good idea? Are too many
heroes bad for software quality? Is it better to have more or fewer heroes for
different kinds of projects? To answer these questions, we studied 661 open
source projects from Public open source software (OSS) GitHub and 171 projects
from an Enterprise GitHub.
We find that hero projects are very common. In fact, as projects grow in
size, nearly all projects become hero projects. These findings motivated us to
look more closely at the effects of heroes on software development. Analysis
shows that the frequency of closing issues and bugs is not significantly
affected by project type (Public or Enterprise). Similarly, the
time needed to resolve an issue/bug/enhancement is not affected by heroes or
project type. This is a surprising result since, before looking at the data, we
expected that increasing heroes on a project would slow down how fast that
project reacts to change. However, we do find a statistically significant
association between heroes, project types, and enhancement resolution rates.
Heroes do not affect enhancement resolution rates in Public projects. However,
in Enterprise projects, more heroes increase the rate at which projects
complete enhancements.
In summary, our empirical results call for a revision of a long-held truism
in software engineering. Software heroes are far more common and valuable than
suggested by the literature, particularly for medium to large Enterprise
developments. Organizations should reflect on better ways to find and retain
more of these heroes.
Comment: 8 pages + 1 references, Accepted to International conference on
Software Engineering - Software Engineering in Practice, 201
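The 80/20 definition of a hero project given in the abstract above can be sketched as a small check. This is an illustration under stated assumptions, not the authors' implementation: it assumes contributions are measured as commit counts per developer, that the top 20% of developers is rounded up to a whole number, and the name `is_hero_project` and its parameters are hypothetical.

```python
def is_hero_project(commits_per_developer, dev_pct=20, work_pct=80):
    """Return True if the top dev_pct% of developers made >= work_pct% of commits.

    commits_per_developer: dict mapping developer name to commit count
    (commit counts as the measure of "contribution" is an assumption).
    """
    counts = sorted(commits_per_developer.values(), reverse=True)
    total = sum(counts)
    if total == 0:
        return False
    # Ceiling of dev_pct% of the team size, in integer arithmetic, at least 1.
    top_n = max(1, -(-dev_pct * len(counts) // 100))
    # Compare in integers to avoid floating-point rounding at the threshold.
    return sum(counts[:top_n]) * 100 >= work_pct * total


# Made-up example: one of five developers (20%) made 85 of 100 commits.
skewed = {"ann": 85, "bob": 5, "cat": 4, "dan": 3, "eve": 3}
even = {"ann": 20, "bob": 20, "cat": 20, "dan": 20, "eve": 20}
```

With these toy inputs, `skewed` satisfies the definition and `even` does not.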
Opinion Mining for Software Development: A Systematic Literature Review
Opinion mining, sometimes referred to as sentiment analysis, has gained increasing attention in software engineering (SE) studies.
SE researchers have applied opinion mining techniques in various contexts, such as identifying developers’ emotions expressed in
code comments and extracting users’ criticism of mobile apps. Given the large number of relevant studies available, it can take
considerable time for researchers and developers to figure out which approaches they can adopt in their own studies and what perils
these approaches entail.
We conducted a systematic literature review involving 185 papers. More specifically, we present 1) well-defined categories of opinion
mining-related software development activities, 2) available opinion mining approaches, whether they are evaluated when adopted in
other studies, and how their performance is compared, 3) available datasets for performance evaluation and tool customization, and 4)
concerns or limitations SE researchers might need to take into account when applying/customizing these opinion mining techniques.
The results of our study serve as references to choose suitable opinion mining tools for software development activities, and provide
critical insights for the further development of opinion mining techniques in the SE domain.
WeakSATD: Detecting Weak Self-admitted Technical Debt
Speeding up development may produce technical debt, i.e., not-quite-right code for which the effort to make it right increases with time as a sort of interest. Developers may be aware of the debt, as they admit it in their code comments. The literature reports that such self-admitted technical debt survives for a long time in a program, but its impact on the long-term quality of the code is not yet clear. We argue that self-admitted technical debt contains a number of different weaknesses that may affect the security of a program. Therefore, the longer a debt remains unpaid, the higher the risk that the weaknesses can be exploited. To discuss our claim and raise developers' awareness of the vulnerability of self-admitted technical debt that is not paid back, we explore the self-admitted technical debt in the Chromium C code to detect any known weaknesses. In this preliminary study, we first mine the Common Weakness Enumeration repository to define heuristics for the automatic detection and fixing of weak code. Then, we parse the C code to find self-admitted technical debt and the code block it refers to. Finally, we use the heuristics to find weak code snippets associated with self-admitted technical debt and recommend potential mitigations to developers. Such knowledge can be used to prioritize self-admitted technical debt for repair. A prototype has been developed and applied to the Chromium code. Initial findings report that 55% of self-admitted technical debt code contains weak code of 14 different types.
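The comment-parsing step described above can be illustrated with a rough sketch. It is not the WeakSATD prototype: the marker keywords (TODO, FIXME, HACK, XXX) are an assumption, since the abstract does not list the patterns used, and the regular expression below ignores comment-like text inside string literals for brevity. The function name `find_satd_comments` is hypothetical.

```python
import re

# Assumed marker keywords; the abstract does not say which patterns WeakSATD uses.
SATD_MARKERS = re.compile(r"\b(TODO|FIXME|HACK|XXX)\b", re.IGNORECASE)
# Matches // line comments and /* block comments */ in C source.
C_COMMENT = re.compile(r"//[^\n]*|/\*.*?\*/", re.DOTALL)


def find_satd_comments(c_source):
    """Return (line_number, comment_text) for comments containing a marker."""
    hits = []
    for match in C_COMMENT.finditer(c_source):
        text = match.group(0)
        if SATD_MARKERS.search(text):
            # Line number of the comment start, counting from 1.
            line = c_source.count("\n", 0, match.start()) + 1
            hits.append((line, text))
    return hits


# Made-up C snippet: the flagged comment sits next to an unchecked strcpy.
example_c = (
    "int f(char *s) {\n"
    "  char buf[8];\n"
    "  // FIXME: no bounds check\n"
    "  strcpy(buf, s);\n"
    "  /* normal comment */\n"
    "  return 0;\n"
    "}\n"
)
satd_hits = find_satd_comments(example_c)
```

A fuller pipeline would then map each flagged comment to the code block it refers to and apply the weakness heuristics to that block, which this sketch does not attempt.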
Structured information on state and evolution of dockerfiles on github
Docker containers are standardized, self-contained units of applications, packaged with their dependencies and execution environment. The environment is defined in a Dockerfile that specifies the steps to reach a certain system state as infrastructure code, with the aim of enabling reproducible builds of the container. To lay the groundwork for research on infrastructure code, we collected structured information about the state and the evolution of Dockerfiles on GitHub and release it as a PostgreSQL database archive (over 100,000 unique Dockerfiles in over 15,000 GitHub projects). Our dataset enables answering a multitude of interesting research questions related to different kinds of software evolution behavior in the Docker ecosystem.
RestMule: Enabling resilient clients for remote APIs
Mining data from remote repositories, such as GitHub and StackExchange, involves executing requests that can easily reach the limits imposed by the respective APIs to shield their services from overload and abuse. Data mining clients are therefore left alone to deal with such protective service policies, which usually requires extensive manual implementation effort. In this work we present RestMule, a framework for handling various service policies, such as a limited number of requests within a period of time and multi-page responses, by generating resilient clients that are able to handle request rate limits, network failures, response caching, and paging in a graceful and transparent manner. As a result, RestMule clients generated from OpenAPI specifications (i.e., standardized REST API descriptors) are suitable for intensive data-fetching scenarios. We evaluate our framework by reproducing an existing repository mining use case and comparing the results produced by a popular hand-written client and a RestMule client.
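As an illustration of the kind of logic such a resilient client has to hide from its caller, the sketch below combines paging with waiting and retrying on a rate limit. It is not RestMule's API or generated code: `RateLimited`, `fetch_all`, and the `fetch_page` callback contract are hypothetical names introduced here, and caching and network-failure handling are left out.

```python
import time


class RateLimited(Exception):
    """Raised by the (hypothetical) fetch function when the API rejects a request."""

    def __init__(self, retry_after):
        super().__init__(f"rate limited, retry after {retry_after}s")
        self.retry_after = retry_after


def fetch_all(fetch_page, max_retries=3, sleep=time.sleep):
    """Collect items from every page, waiting and retrying when rate limited.

    fetch_page(page) must return (items, next_page_or_None) or raise RateLimited.
    The sleep function is injectable so the logic can be exercised without waiting.
    """
    items, page = [], 1
    while page is not None:
        for attempt in range(max_retries + 1):
            try:
                batch, page = fetch_page(page)
                break
            except RateLimited as err:
                if attempt == max_retries:
                    raise  # give up after the last allowed retry
                sleep(err.retry_after)
        items.extend(batch)
    return items
```

The caller only sees a flat list of items; the page boundaries and the pause imposed by the service stay inside `fetch_all`.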