14 research outputs found

    Characterizing Issue Management in Runtime Systems

    Full text link
    Modern programming languages like Java require runtime systems to support the implementation and deployment of software applications in diverse computing platforms and operating systems. These runtime systems are normally developed in GitHub-hosted repositories based on close collaboration between large software companies (e.g., IBM, Microsoft) and OSS developers. However, despite their popularity and broad usage; to the best of our knowledge, these repositories have never been studied. We report an empirical study of around 118K issues from 34 runtime system repos in GitHub. We found that issues regarding enhancement, test failure and bug are mostly posted on runtime system repositories and solution related discussion are mostly present on issue discussion. 82.69% issues in the runtime system repositories have been resolved and 0.69% issues are ignored; median of issue close rate, ignore rate and addressing time in these repositories are 76.1%, 2.2% and 58 days respectively. 82.65% issues are tagged with labels while only 28.30% issues have designated assignees and 90.65% issues contain at least one comment; also presence of these features in an issue report can affect issue closure. Based on the findings, we offer six recommenda

    GitHub: Factors Influencing Project Activity Levels

    Get PDF
    Open source software projects typically extend the capabilities of their software by incorporating code contributions from a diverse cross-section of developers. This GitHub structural path modelling study captures the current top 100 JavaScript projects in operation for at least one year or more. It draws on three theories (information integration, planned behavior, and social translucence) to help frame its comparative path approach, and to show ways to speed the collaborative development of GitHub OSS projects. It shows a project’s activity level increases with: (1) greater responder-group collaborative efforts, (2) increased numbers of major critical project version releases, and (3) the generation of further commits. However, the generation of additional forks negatively impacts overall project activity levels

    Automatic Detection of Public Development Projects in Large Open Source Ecosystems: An Exploratory Study on GitHub

    Full text link
    Hosting over 10 million of software projects, GitHub is one of the most important data sources to study behavior of developers and software projects. However, with the increase of the size of open source datasets, the potential threats to mining these datasets have also grown. As the dataset grows, it becomes gradually unrealistic for human to confirm quality of all samples. Some studies have investigated this problem and provided solutions to avoid threats in sample selection, but some of these solutions (e.g., finding development projects) require human intervention. When the amount of data to be processed increases, these semi-automatic solutions become less useful since the effort in need for human intervention is far beyond affordable. To solve this problem, we investigated the GHTorrent dataset and proposed a method to detect public development projects. The results show that our method can effectively improve the sample selection process in two ways: (1) We provide a simple model to automatically select samples (with 0.827 precision and 0.947 recall); (2) We also offer a complex model to help researchers carefully screen samples (with 63.2% less effort than manually confirming all samples, and can achieve 0.926 precision and 0.959 recall).Comment: Accepted by the SEKE2018 Conferenc

    Analysing Big Data Projects Using Github and JavaScript Repositories

    Get PDF
    GitHub open source software developers remain in short supply. Successful GitHub projects offer multiple pathways for developers to contribute into their repositories. This study’s GitHub JavaScript big data is path modelled to provide understanding of the different significant developer contribution pathways towards raising the project’s activity level. Its significant pathways offer the project’s creator benchmark decision making capabilities that can be used to trigger faster project software development through to its next completion point. This approach has behavioural consumptive value connotations that may provide a future pathway towards tapping big data sources and to also delivering real business values

    Open source software GitHub ecosystem: a SEM approach

    Get PDF
    Open source software (OSS) is a collaborative effort. Getting affordable high-quality software with less probability of errors or fails is not far away. Thousands of open-source projects (termed repos) are alternatives to proprietary software development. More than two-thirds of companies are contributing to open source. Open source technologies like OpenStack, Docker and KVM are being used to build the next generation of digital infrastructure. An iconic example of OSS is 'GitHub' - a successful social site. GitHub is a hosting platform that host repositories (repos) based on the Git version control system. GitHub is a knowledge-based workspace. It has several features that facilitate user communication and work integration. Through this thesis I employ data extracted from GitHub, and seek to better understand the OSS ecosystem, and to what extent each of its deployed elements affects the successful development of the OSS ecosystem. In addition, I investigate a repo's growth over different time periods to test the changing behavior of the repo. From our observations developers do not follow one development methodology when developing, and growing their project, and such developers tend to cherry-pick from differing available software methodologies. GitHub API remains the main OSS location engaged to extract the metadata for this thesis's research. This extraction process is time-consuming - due to restrictive access limitations (even with authentication). I apply Structure Equation Modelling (termed SEM) to investigate the relative path relationships between the GitHub- deployed OSS elements, and I determine the path strength contributions of each element to determine the OSS repo's activity level. SEM is a multivariate statistical analysis technique used to analyze structural relationships. This technique is the combination of factor analysis and multiple regression analysis. It is used to analyze the structural relationship between measured variables and/or latent constructs. This thesis bridges the research gap around longitude OSS studies. It engages large sample-size OSS repo metadata sets, data-quality control, and multiple programming language comparisons. Querying GitHub is not direct (nor simple) yet querying for all valid repos remains important - as sometimes illegal, or unrepresentative outlier repos (which may even be quite popular) do arise, and these then need to be removed from each initial OSS's language-specific metadata set. Eight top GitHub programming languages, (selected as the most forked repos) are separately engaged in this thesis's research. This thesis observes these eight metadata sets of GitHub repos. Over time, it measures the different repo contributions of the deployed elements of each metadata set. The number of stars-provided to the repo delivers a weaker contribution to its software development processes. Sometimes forks work against the repo's progress by generating very minor negative total effects into its commit (activity) level, and by sometimes diluting the focus of the repo's software development strategies. Here, a fork may generate new ideas, create a new repo, and then draw some original repo developers off into this new software development direction, thus retarding the original repo's commit (activity) level progression. Multiple intermittent and minor version releases exert lesser GitHub JavaScript repo commit (or activity) changes because they often involve only slight OSS improvements, and because they only require minimal commit/commits contributions. More commit(s) also bring more changes to documentation, and again the GitHub OSS repo's commit (activity) level rises. There are both direct and indirect drivers of the repo's OSS activity. Pulls and commits are the strongest drivers. This suggests creating higher levels of pull requests is likely a preferred prime target consideration for the repo creator's core team of developers. This study offers a big data direction for future work. It allows for the deployment of more sophisticated statistical comparison techniques. It offers further indications around the internal and broad relationships that likely exist between GitHub's OSS big data. Its data extraction ideas suggest a link through to business/consumer consumption, and possibly how these may be connected using improved repo search algorithms that release individual business value components

    Improving Software Project Health Using Machine Learning

    Get PDF
    In recent years, systems that would previously live on different platforms have been integrated under a single umbrella. The increased use of GitHub, which offers pull-requests, issue trackingand version history, and its integration with other solutions such as Gerrit, or Travis, as well as theresponse from competitors, created development environments that favour agile methodologiesby increasingly automating non-coding tasks: automated build systems, automated issue triagingetc. In essence, source-code hosting platforms shifted to continuous integration/continuousdelivery (CI/CD) as a service. This facilitated a shift in development paradigms, adherents ofagile methodology can now adopt a CI/CD infrastructure more easily. This has also created large,publicly accessible sources of source-code together with related project artefacts: GHTorrent andsimilar datasets now offer programmatic access to the whole of GitHub. Project health encompasses traceability, documentation, adherence to coding conventions,tasks that reduce maintenance costs and increase accountability, but may not directly impactfeatures. Overfocus on health can slow velocity (new feature delivery) so the Agile Manifestosuggests developers should travel light — forgo tasks focused on a project health in favourof higher feature velocity. Obviously, injudiciously following this suggestion can undermine aproject’s chances for success. Simultaneously, this shift to CI/CD has allowed the proliferation of Natural Language orNatural Language and Formal Language textual artefacts that are programmatically accessible:GitHub and their competitors allow API access to their infrastructure to enable the creation ofCI/CD bots. This suggests that approaches from Natural Language Processing and MachineLearning are now feasible and indeed desirable. This thesis aims to (semi-)automate tasks forthis new paradigm and its attendant infrastructure by bringing to the foreground the relevant NLPand ML techniques. Under this umbrella, I focus on three synergistic tasks from this domain: (1) improving theissue-pull-request traceability, which can aid existing systems to automatically curate the issuebacklog as pull-requests are merged; (2) untangling commits in a version history, which canaid the beforementioned traceability task as well as improve the usability of determining a faultintroducing commit, or cherry-picking via tools such as git bisect; (3) mixed-text parsing, to allowbetter API mining and open new avenues for project-specific code-recommendation tools

    “Won’t we fix this issue?” : qualitative characterization and automated identification of wontfix issues on GitHub

    Get PDF
    Context: Addressing user requests in the form of bug reports and Github issues represents a crucial task of any successful software project. However, user-submitted issue reports tend to widely differ in their quality, and developers spend a considerable amount of time handling them. Objective: By collecting a dataset of around 6,000 issues of 279 GitHub projects, we observe that developers take significant time (i.e., about five months, on average) before labeling an issue as a wontfix. For this reason, in this paper, we empirically investigate the nature of wontfix issues and methods to facilitate issue management process. Method: We first manually analyze a sample of 667 wontfix issues, extracted from heterogeneous projects, investigating the common reasons behind a “wontfix decision”, the main characteristics of wontfix issues and the potential factors that could be connected with the time to close them. Furthermore, we experiment with approaches enabling the prediction of wontfix issues by analyzing the titles and descriptions of reported issues when submitted. Results and conclusion: Our investigation sheds some light on the wontfix issues’ characteristics, as well as the potential factors that may affect the time required to make a “wontfix decision”. Our results also demonstrate that it is possible to perform prediction of wontfix issues with high average values of precision, recall, and F-measure (90%-93%)

    Supporting Development Decisions with Software Analytics

    Get PDF
    Software practitioners make technical and business decisions based on the understanding they have of their software systems. This understanding is grounded in their own experiences, but can be augmented by studying various kinds of development artifacts, including source code, bug reports, version control meta-data, test cases, usage logs, etc. Unfortunately, the information contained in these artifacts is typically not organized in the way that is immediately useful to developers’ everyday decision making needs. To handle the large volumes of data, many practitioners and researchers have turned to analytics — that is, the use of analysis, data, and systematic reasoning for making decisions. The thesis of this dissertation is that by employing software analytics to various development tasks and activities, we can provide software practitioners better insights into their processes, systems, products, and users, to help them make more informed data-driven decisions. While quantitative analytics can help project managers understand the big picture of their systems, plan for its future, and monitor trends, qualitative analytics can enable developers to perform their daily tasks and activities more quickly by helping them better manage high volumes of information. To support this thesis, we provide three different examples of employing software analytics. First, we show how analysis of real-world usage data can be used to assess user dynamic behaviour and adoption trends of a software system by revealing valuable information on how software systems are used in practice. Second, we have created a lifecycle model that synthesizes knowledge from software development artifacts, such as reported issues, source code, discussions, community contributions, etc. Lifecycle models capture the dynamic nature of how various development artifacts change over time in an annotated graphical form that can be easily understood and communicated. We demonstrate how lifecycle models can be generated and present industrial case studies where we apply these models to assess the code review process of three different projects. Third, we present a developer-centric approach to issue tracking that aims to reduce information overload and improve developers’ situational awareness. Our approach is motivated by a grounded theory study of developer interviews, which suggests that customized views of a project’s repositories that are tailored to developer-specific tasks can help developers better track their progress and understand the surrounding technical context of their working environments. We have created a model of the kinds of information elements that developers feel are essential in completing their daily tasks, and from this model we have developed a prototype tool organized around developer-specific customized dashboards. The results of these three studies show that software analytics can inform evidence-based decisions related to user adoption of a software project, code review processes, and improved developers’ awareness on their daily tasks and activities
    corecore