    Assessing Code Authorship: The Case of the Linux Kernel

    Code authorship is key information in large-scale open source systems. Among other things, it allows maintainers to assess the division of work and identify key collaborators. Interestingly, open-source communities lack guidelines on how to manage authorship. This could be mitigated by building an empirical body of knowledge on how authorship-related measures evolve in successful open-source communities. As a step in that direction, we perform a case study on the Linux kernel. Our results show that: (a) only a small portion of developers (26%) makes significant contributions to the code base; (b) the distribution of the number of files per author is highly skewed: a small group of top authors (3%) is responsible for hundreds of files, while most authors (75%) are responsible for at most 11 files; (c) most authors (62%) have a specialist profile; (d) authors with a high number of co-authorship connections tend to collaborate with others who have fewer connections. Comment: Accepted at the 13th International Conference on Open Source Systems (OSS). 12 pages

    Is My Project's Truck Factor Low? Theoretical and Empirical Considerations About the Truck Factor Threshold

    The Truck Factor is a simple measure, proposed by the agile community, of how knowledge of a system is distributed across a team of developers. It can be used to highlight potential project problems caused by an inadequate distribution of system knowledge. Despite its relevance, only a few studies have investigated the Truck Factor and proposed ways to measure, evaluate, and use it efficiently. In particular, the effective use of the Truck Factor is limited by the lack of reliable thresholds. In this preliminary paper, we present a theoretical model of the Truck Factor and use it to define the maximum achievable Truck Factor value in a project. This value matters because it supports the definition of a reliable threshold for the Truck Factor. We also document an experiment in which we apply the proposed model to real software projects in order to compare the maximum achievable Truck Factor with the only threshold proposed in the literature. The preliminary outcome we achieved shows that the existing threshold has some limitations and problem

    Improving Code Reviewer Recommendation: Accuracy, Latency, Workload, and Bystanders

    Code review ensures that a peer engineer manually examines the code before it is integrated and released into production. At Meta, we develop a wide range of software at scale, from social networking to software development infrastructure, and from calendar and meeting tools to continuous integration. We are constantly improving our code review system, and in this work we describe a series of experiments conducted across tens of thousands of engineers and hundreds of thousands of reviews. We build upon RevRecV1, the recommender that has been in production since 2018. We found that reviewers were being assigned based on prior authorship of files. We reviewed the literature for successful features and experimented with them in production with RevRecV2. The most important feature in our new model was the familiarity of the author and reviewer; we saw an overall improvement in accuracy of 14 percentage points. Prior research has shown that reviewer workload is skewed. To balance workload, we divide the reviewer score from RevRecV2 by each candidate reviewer's workload. We experimented with multiple types of workload to develop RevRecWL. We find that reranking candidate reviewers by workload often leads to authors selecting reviewers with lower workload. The bystander effect can occur when a team of reviewers is assigned the review. We mitigate the bystander effect by randomly assigning one of the recommended reviewers. Having an individual who is responsible for the review reduces the time taken for reviews by 11%
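
    Illustrative sketch (not from the paper): the workload rebalancing described above, dividing a recommender score by each candidate's workload and then randomly picking one recommended reviewer, could look roughly like this. The function names, the sample scores, the `top_k` parameter, and the choice to floor workload at 1 are all assumptions made for the example.

```python
import random


def rerank_by_workload(scores, workloads):
    """Rank candidate reviewers by score divided by workload, best first."""
    adjusted = {}
    for reviewer, score in scores.items():
        # Assumption: floor workload at 1 so idle reviewers do not divide by zero.
        load = max(workloads.get(reviewer, 0), 1)
        adjusted[reviewer] = score / load
    return sorted(adjusted, key=adjusted.get, reverse=True)


def assign_one(ranked, top_k=3, rng=random):
    """Pick a single responsible reviewer at random from the top candidates."""
    return rng.choice(ranked[:top_k])


# Hypothetical recommender scores and open-review counts.
scores = {"alice": 0.9, "bob": 0.6, "carol": 0.5}
workloads = {"alice": 6, "bob": 2, "carol": 1}

ranked = rerank_by_workload(scores, workloads)
print(ranked)              # candidates ordered by workload-adjusted score
print(assign_one(ranked))  # one individually responsible reviewer
```

    In this toy input the highest-scoring candidate also carries the most open reviews, so the adjusted ranking moves them to the bottom.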

    Refining code ownership with synchronous changes

    When mining software repositories, two distinct sources of information are usually explored: the history log and snapshots of the system. Results of analyses derived from these two sources are biased by the frequency with which developers commit their changes. We argue that the usage of mainstream SCM (software configuration management) systems influences the way that developers work. For example, since it is tedious to resolve conflicts due to parallel commits, developers tend to minimize conflicts by not modifying the same file at the same time. This, however, defeats one of the purposes of such systems. We mine repositories created by our tool Syde, which records changes in a central repository whenever a file is compiled locally in the IDE (integrated development environment) by any developer in a multi-developer project. This new source of information can improve the accuracy of analyses and breaks new ground in terms of how such information can assist developers. We illustrate how the information we mine provides a more refined notion of code ownership than the one inferred from SCM system data. We demonstrate our approach on three case studies, including an industrial one. Ownership models suffer from the assumption that developers have a perfect memory. To account for their imperfect memory, we integrate into our ownership measurement a model of memory retention, to simulate the effect of memory loss over time. We evaluate the characteristics of this model for several strengths of memor
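
    Illustrative sketch (not from the paper): the abstract does not specify the retention model, so the example below assumes an exponential forgetting curve, exp(-age / strength), applied to each recorded change before ownership shares are computed. The function name, the tuple layout, and the sample data are hypothetical.

```python
import math


def retained_ownership(changes, now, strength):
    """Ownership shares where each change is discounted by its age.

    changes:  iterable of (developer, day_of_change, change_size)
    now:      day on which ownership is evaluated
    strength: memory strength in days; larger means slower forgetting
    """
    weights = {}
    for developer, day, size in changes:
        age = now - day
        # Assumed retention model: exponential decay with the age of the change.
        retention = math.exp(-age / strength)
        weights[developer] = weights.get(developer, 0.0) + size * retention
    total = sum(weights.values())
    if total == 0:
        return {}
    return {dev: w / total for dev, w in weights.items()}


# dev_a wrote most of the file long ago; dev_b touched it recently.
changes = [("dev_a", 0, 100), ("dev_b", 90, 40)]

weak_memory = retained_ownership(changes, now=100, strength=30)
strong_memory = retained_ownership(changes, now=100, strength=1e9)
print(max(weak_memory, key=weak_memory.get))
print(max(strong_memory, key=strong_memory.get))
```

    With a short memory strength the recent contributor dominates, while a very large strength approaches the usual perfect-memory ownership based on change size alone.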

    Mitigating Turnover with Code Review Recommendation: Balancing Expertise, Workload, and Knowledge Distribution

    Developer turnover is inevitable on software projects and leads to knowledge loss, a reduction in productivity, and an increase in defects. Mitigation strategies to deal with turnover tend to disrupt and increase workloads for developers. In this work, we suggest that through code review recommendation we can distribute knowledge and mitigate turnover with minimal impact on the development process. We evaluate review recommenders on three measures: ensuring expertise during review (Expertise), reducing the review workload of the core team (CoreWorkload), and reducing the files at risk to turnover (FaR). We find that prior work that assigns reviewers based on file ownership concentrates knowledge in a small group of core developers, increasing the risk of knowledge loss from turnover by up to 65%. We propose learning-aware and retention-aware review recommenders that, when combined, reduce the risk of turnover by 29% but unacceptably reduce the overall expertise during reviews by 26%. We develop the Sophia recommender, which suggests experts when none of the files under review are hoarded by developers but distributes knowledge when files are at risk. In this way, we are able to simultaneously increase expertise during review, with a ΔExpertise of 6%, have a negligible impact on workload, with a ΔCoreWorkload of 0.09%, and reduce the files at risk, with a ΔFaR of -28%. Sophia is integrated into GitHub pull requests, allowing developers to select an appropriate expert or “learner” based on the context of the review. We release the Sophia bot as well as the code and data for replication purposes

    Authorship Attribution of Source Code: A Language-Agnostic Approach and Applicability in Software Engineering

    Authorship attribution of source code has been an established research topic for several decades. State-of-the-art results for the authorship attribution problem look promising for the software engineering field, where they could be applied to detect plagiarized code and prevent legal issues. In this study, we first introduce a language-agnostic approach to authorship attribution of source code. Two machine learning models based on our approach match or improve on state-of-the-art results, originally achieved by language-specific approaches, on existing datasets for code in C++, Python, and Java. We then discuss limitations of existing synthetic datasets for authorship attribution and propose a data collection approach that delivers datasets that better reflect aspects important for potential practical use in software engineering. In particular, we discuss the concept of work context and its importance for authorship attribution. Finally, we demonstrate that the high accuracy of authorship attribution models on existing datasets drops drastically when they are evaluated on more realistic data. We conclude the paper by outlining next steps in the design and evaluation of authorship attribution models that could bring the research efforts closer to practical use. Comment: 12 pages

    Determinando a taxa de autoria dentro de um projeto usando Git (Determining the Authorship Rate Within a Project Using Git)

    Undergraduate final project (Trabalho de Conclusão de Curso), Universidade de Brasília, Faculdade UnB Gama (FGA), Engenharia de Software, 2018. Most of the time, a modern software product is not developed by just one person but by a software development team. Determining the amount of code that each developer added to a project is not a difficult task: it is enough to sum all the lines that developer added. However, it can be laborious, depending on the size of the project in question, and knowing how much code that developer produced over time may not be simple. The goal of this work is to define an approach for assessing the amount of authorship of developers, since such a measure helps analyze questions in several areas of Software Engineering, for example software quality, maintenance improvement, and forensic analysis, among others. In particular, such a measure makes it possible to determine an authorship rate for each developer relative to the entire project
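
    Illustrative sketch (not from the thesis): the simplest reading of the authorship rate described above, summing the lines each developer added and dividing by the project total, might be computed as follows. The function name and the sample data are hypothetical; in practice the (author, lines added) pairs could be collected from version-control history, for instance from the output of `git log --numstat`.

```python
from collections import Counter


def authorship_rates(additions):
    """Share of all added lines attributed to each author.

    additions: iterable of (author, lines_added) pairs, one per change.
    """
    totals = Counter()
    for author, lines_added in additions:
        totals[author] += lines_added
    project_total = sum(totals.values())
    if project_total == 0:
        return {}
    return {author: count / project_total for author, count in totals.items()}


# Hypothetical per-commit additions.
sample = [("ana", 120), ("bruno", 60), ("ana", 20)]
rates = authorship_rates(sample)
for author, rate in sorted(rates.items()):
    print(f"{author}: {rate:.0%}")
```

    This counts only added lines; it ignores deletions and later rewrites, which is part of why tracking authorship over time is the harder problem the abstract points to.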

    Development of Agent-Based Simulation Models for Software Evolution

    Software has become a part of everyday life. With this come increasing demands for adaptability to rapidly changing environments. This evolutionary process is studied by a research area within software engineering called software evolution. Changes to software over time are caused by the work of developers. For this reason, developer contribution behavior is central to analyzing the evolution of a software project. A large number of open source projects is freely available for the analysis of real projects. To simulate software projects, we use multi-agent systems because they allow us to describe the behavior of developers in detail. In this thesis, we develop several agent-based models, each building on the previous one, that cover different aspects of software evolution. We start with a simple model with no dependencies between the agents that can reproduce, in simulation, the growth of a real project based solely on developer contribution behavior. Subsequent models add further agents, such as different developer types and bugs, as well as dependencies between the agents. These extended models can be used to answer, by simulation, different questions about software evolution. One such question is what happens to the quality of the software when the core developer suddenly leaves the project. The most complex model can simulate software refactorings using graph transformations. The simulation outputs a graph that represents the software. This representation is the change coupling graph, extended for the simulation of refactorings; in this thesis it is called the software graph. To parameterize the models, we developed several mining tools. These tools allow us to instantiate a model with project-specific parameters, to instantiate a model with a snapshot of the analyzed project, or to parameterize the transformation rules required to model refactorings. The results of three case studies show, among other things, that agent-based simulation is an appropriate choice for predicting the evolution of software projects. Furthermore, we were able to show that different growth trends of real software can be reproduced in simulation with a suitable choice of parameters. The best results for the simulated software graph are obtained when we start the simulation after an initial phase with a snapshot of the real software. Regarding refactorings, we were able to show that the model based on graph transformations is applicable and that it can slightly improve the simulated growth.