348 research outputs found

    Automatic Detection of Public Development Projects in Large Open Source Ecosystems: An Exploratory Study on GitHub

    Full text link
    Hosting over 10 million of software projects, GitHub is one of the most important data sources to study behavior of developers and software projects. However, with the increase of the size of open source datasets, the potential threats to mining these datasets have also grown. As the dataset grows, it becomes gradually unrealistic for human to confirm quality of all samples. Some studies have investigated this problem and provided solutions to avoid threats in sample selection, but some of these solutions (e.g., finding development projects) require human intervention. When the amount of data to be processed increases, these semi-automatic solutions become less useful since the effort in need for human intervention is far beyond affordable. To solve this problem, we investigated the GHTorrent dataset and proposed a method to detect public development projects. The results show that our method can effectively improve the sample selection process in two ways: (1) We provide a simple model to automatically select samples (with 0.827 precision and 0.947 recall); (2) We also offer a complex model to help researchers carefully screen samples (with 63.2% less effort than manually confirming all samples, and can achieve 0.926 precision and 0.959 recall).Comment: Accepted by the SEKE2018 Conferenc

    Promises and Perils of Mining Software Package Ecosystem Data

    Full text link
    The use of third-party packages is becoming increasingly popular and has led to the emergence of large software package ecosystems with a maze of inter-dependencies. Since the reliance on these ecosystems enables developers to reduce development effort and increase productivity, it has attracted the interest of researchers: understanding the infrastructure and dynamics of package ecosystems has given rise to approaches for better code reuse, automated updates, and the avoidance of vulnerabilities, to name a few examples. But the reality of these ecosystems also poses challenges to software engineering researchers, such as: How do we obtain the complete network of dependencies along with the corresponding versioning information? What are the boundaries of these package ecosystems? How do we consistently detect dependencies that are declared but not used? How do we consistently identify developers within a package ecosystem? How much of the ecosystem do we need to understand to analyse a single component? How well do our approaches generalise across different programming languages and package ecosystems? In this chapter, we review promises and perils of mining the rich data related to software package ecosystems available to software engineering researchers.Comment: Submitted as a Book Chapte

    We Don't Need Another Hero? The Impact of "Heroes" on Software Development

    Full text link
    A software project has "Hero Developers" when 80% of contributions are delivered by 20% of the developers. Are such heroes a good idea? Are too many heroes bad for software quality? Is it better to have more/less heroes for different kinds of projects? To answer these questions, we studied 661 open source projects from Public open source software (OSS) Github and 171 projects from an Enterprise Github. We find that hero projects are very common. In fact, as projects grow in size, nearly all project become hero projects. These findings motivated us to look more closely at the effects of heroes on software development. Analysis shows that the frequency to close issues and bugs are not significantly affected by the presence of project type (Public or Enterprise). Similarly, the time needed to resolve an issue/bug/enhancement is not affected by heroes or project type. This is a surprising result since, before looking at the data, we expected that increasing heroes on a project will slow down howfast that project reacts to change. However, we do find a statistically significant association between heroes, project types, and enhancement resolution rates. Heroes do not affect enhancement resolution rates in Public projects. However, in Enterprise projects, the more heroes increase the rate at which project complete enhancements. In summary, our empirical results call for a revision of a long-held truism in software engineering. Software heroes are far more common and valuable than suggested by the literature, particularly for medium to large Enterprise developments. Organizations should reflect on better ways to find and retain more of these heroesComment: 8 pages + 1 references, Accepted to International conference on Software Engineering - Software Engineering in Practice, 201

    Promises and Perils of Inferring Personality on GitHub

    Get PDF
    Personality plays a pivotal role in our understanding of human actions and behavior. Today, the applications of personality are widespread, built on the solutions from psychology to infer personality. In software engineering, for instance, one widely used solution to infer personality uses textual communication data. As studies on personality in software engineering continue to grow, it is imperative to understand the performance of these solutions. This paper compares the inferential ability of three widely studied text-based personality tests against each other and the ground truth on GitHub. We explore the challenges and potential solutions to improve the inferential ability of personality tests. Our study shows that solutions for inferring personality are far from being perfect. Software engineering communications data can infer individual developer personality with an average error rate of 41%. In the best case, the error rate can be reduced up to 36% by following our recommendations
    • …
    corecore