9 research outputs found

    Understanding the Heterogeneity of Contributors in Bug Bounty Programs

    Background: While bug bounty programs are not new in software development, an increasing number of companies, as well as open source projects, rely on external parties to perform the security assessment of their software for reward. However, there is relatively little empirical knowledge about the characteristics of bug bounty program contributors. Aim: This paper aims to understand those contributors by highlighting the heterogeneity among them. Method: We analyzed the histories of 82 bug bounty programs and 2,504 distinct bug bounty contributors, and conducted a quantitative and qualitative survey. Results: We found that there are project-specific and non-specific contributors who have different motivations for contributing to the products and organizations. Conclusions: Our findings provide insights to make bug bounty programs better and for further studies of new software development roles.
    Comment: 6 pages, ESEM 201

    A Dataset and an Approach for Identity Resolution of 38 Million Author IDs extracted from 2B Git Commits

    The data collected from open source projects provide means to model large software ecosystems, but often suffer from data quality issues; specifically, multiple author identification strings in code commits might actually be associated with one developer. While many methods have been proposed for addressing this problem, they are either heuristics requiring manual tweaking, or require too much calculation time to do pairwise comparisons for 38M author IDs in, for example, the World of Code collection. In this paper, we propose a method that finds all author IDs belonging to a single developer in this entire dataset, and share the list of all author IDs that were found to have aliases. To do this, we first create blocks of potentially connected author IDs and then use a machine learning model to predict which of these potentially related IDs belong to the same developer. We processed around 38 million author IDs and found around 14.8 million IDs to have an alias, which belong to 5.4 million different developers, with the median number of aliases being 2 per developer. This dataset can be used to create more accurate models of developer behaviour at the entire OSS ecosystem level and can be used to provide a service to rapidly resolve new author IDs.
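
    The blocking step described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the blocking keys (lowercased name, full email, email user part) and the function names `parse_author`, `blocking_keys`, and `candidate_pairs` are assumptions chosen for illustration, and the machine learning classifier that would score each candidate pair is left out.

```python
import re
from collections import defaultdict
from itertools import combinations


def parse_author(author_id):
    """Split a Git author string 'Name <email>' into (name, email), lowercased."""
    match = re.match(r"\s*(.*?)\s*<([^>]*)>\s*$", author_id)
    if not match:
        # No email part found: treat the whole string as the name.
        return author_id.strip().lower(), ""
    return match.group(1).lower(), match.group(2).lower()


def blocking_keys(author_id):
    """Return the (hypothetical) keys under which an author ID is blocked."""
    name, email = parse_author(author_id)
    keys = set()
    if name:
        keys.add(("name", name))
    if email:
        keys.add(("email", email))
        user = email.split("@")[0]
        if user:
            keys.add(("user", user))
    return keys


def candidate_pairs(author_ids):
    """Group IDs that share any blocking key and emit pairs within each block.

    Only these pairs would be passed to a classifier, which avoids
    comparing every ID against every other ID.
    """
    blocks = defaultdict(set)
    for author_id in author_ids:
        for key in blocking_keys(author_id):
            blocks[key].add(author_id)

    pairs = set()
    for members in blocks.values():
        for a, b in combinations(sorted(members), 2):
            pairs.add((a, b))
    return pairs


if __name__ == "__main__":
    ids = [
        "Jane Doe <jane@example.com>",
        "jane doe <jdoe@corp.example>",
        "J. Doe <jane@users.example>",
        "Bob Roe <bob@example.com>",
    ]
    for pair in sorted(candidate_pairs(ids)):
        print(pair)
```

    In this toy input, the two "Jane Doe" strings share a name key and two IDs share the email user part "jane", so only those pairs become candidates; the unrelated ID is never compared with the others.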

    Who is who in the mailing list? Comparing six disambiguation heuristics to identify multiple addresses of a participant

    Many software projects adopt mailing lists for the communication of developers and users. Researchers have been mining the history of such lists to study communities' behavior, organization, and evolution. A potential threat to this kind of study is that users often use multiple email addresses to interact in a single mailing list. This can affect the results and tools when, for example, extracting social networks. This issue is particularly relevant for popular and long-term Open Source Software (OSS) projects, which attract the participation of thousands of people. Researchers have proposed heuristics to identify multiple email addresses from the same participant; however, there are few studies analyzing the effectiveness of these heuristics. In addition, many studies still do not use any heuristics for authors' disambiguation, which can compromise the results. In this paper, we compare six heuristics from the literature using data from 150 mailing lists from Apache Software Foundation projects. We found that the heuristics proposed by Oliva et al. and a Naïve heuristic outperformed the others in most cases, when considering the F-measure metric. We also found that the time window and the size of the dataset influence the effectiveness of each heuristic. These results may help researchers and tool developers to choose the most appropriate heuristic to use, besides highlighting the necessity of dealing with identity disambiguation, mainly in open source software communities with a large number of participants.
    Igor Scaliante Wiese, José Teodoro da Silva, Igor Steinmacher, Christoph Treude, Marco Aurélio Geros
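
    As a rough illustration of the F-measure mentioned in this abstract, the sketch below scores a disambiguation heuristic by comparing the address pairs it merges against a ground-truth set of pairs. Evaluating over pairs of addresses is an assumption made here for illustration; the abstract does not say how the metric was computed, and the function name `f_measure` and the sample addresses are hypothetical.

```python
def f_measure(predicted, actual, beta=1.0):
    """F-measure of predicted same-person pairs against ground-truth pairs.

    Both arguments are sets of frozensets, each holding two addresses.
    """
    true_positives = len(predicted & actual)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(actual)
    beta_sq = beta * beta
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


if __name__ == "__main__":
    # Pairs a hypothetical heuristic decided belong to the same person.
    predicted = {
        frozenset({"a@x.example", "a@y.example"}),
        frozenset({"b@x.example", "c@x.example"}),
    }
    # Pairs that truly belong to the same person.
    actual = {
        frozenset({"a@x.example", "a@y.example"}),
        frozenset({"a@x.example", "a@z.example"}),
    }
    print(f_measure(predicted, actual))
```

    With one correct pair out of two predicted and two actual pairs, precision and recall are both 0.5, so the balanced F-measure is also 0.5.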

    Methods of Disambiguating and De-anonymizing Authorship in Large Scale Operational Data

    Operational data from software development, social networks and other domains are often contaminated with incorrect or missing values. Examples include misspelled or changed names, multiple emails belonging to the same person and user profiles that vary in different systems. Such digital traces are extensively used in research and practice to study collaborating communities of various kinds. To achieve a realistic representation of the networks that represent these communities, accurate identities are essential. In this work, we aim to identify, model, and correct identity errors in data from open-source software repositories, which include more than 23M developer IDs and nearly 1B Git commits (developer activity records). Our investigation into the nature and prevalence of identity errors in software activity data reveals that they are different and occur at much higher rates than in other domains. Existing techniques relying on string comparisons can only disambiguate Synonyms, but not Homonyms, which are common in software activity traces. Therefore, we introduce measures of behavioral fingerprinting to improve the accuracy of Synonym resolution, and to disambiguate Homonyms. Fingerprints are constructed from the traces of developers’ activities, such as the style of writing in commit messages, the patterns in files modified and projects participated in by developers, and the patterns related to the timing of the developers’ activity. Furthermore, to address the lack of training data necessary for the supervised learning approaches that are used in disambiguation, we design a specific active learning procedure that minimizes the manual effort necessary to create training data in the domain of developer identity matching. We extensively evaluate the proposed approach, using over 16,000 OpenStack developers in 1200 projects, against commercial and most recent research approaches, and further on recent research on a much larger sample of over 2,000,000 IDs. Results demonstrate that our method is significantly better than both the recent research and commercial methods. We also conduct experiments to demonstrate that such erroneous data have significant impact on developer networks. We hope that the proposed approach will expedite research progress in the domain of software engineering, especially in applications for which graphs of social networks are critical.