233 research outputs found

    Software Supply Chain Development and Application

    Motivation: Free/Libre Open Source Software (FLOSS) has become a critical component in numerous devices and applications. Despite its importance, it is not clear why the FLOSS ecosystem works so well or whether it may cease to function. The majority of existing research focuses on a specific software project or a portion of an ecosystem, but FLOSS has not been investigated in its entirety. Such a view is necessary because of the deep and complex technical and social dependencies that extend beyond the core of an individual ecosystem, and the tight inter-dependencies among ecosystems within FLOSS. Aim: We therefore aim to discover the underlying relations within and across FLOSS projects and developers in the open source community, mitigate potential risks induced by the lack of such knowledge, and enable systematic analysis over the entire open source community through the lens of supply chains (SC). Method: We utilize concepts from the field of supply chains to model risks in the FLOSS ecosystem. FLOSS, due to the distributed decision making of software developers, technical dependencies, and copying of code, resembles a traditional supply chain. Unlike a traditional supply chain, where data is proprietary and distributed among players, we aim to measure the open-source software supply chain (OSSC) by operationalizing the supply chain concept in the software domain using traces reconstructed from version control data. Results: We create a very large and frequently updated collection of version control data covering the entire FLOSS ecosystem, named World of Code (WoC), that can completely cross-reference authors, projects, commits, blobs, dependencies, and the history of the FLOSS ecosystems, and provides capabilities to efficiently correct, augment, query, and analyze that data.
    Various research studies and applications (e.g., investigations of software technology adoption) have been successfully implemented by leveraging the combination of WoC and OSSC. Implications: With an SC perspective on FLOSS development and the increased visibility and transparency of the OSSC, our work provides potential opportunities for researchers to conduct wider and deeper studies of OSS over the entire FLOSS community, for developers to build more robust software, and for students to learn technologies more efficiently and improve their programming skills.
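The kind of cross-referencing the abstract describes can be illustrated with a minimal sketch. The data model below (hypothetical author IDs, project names, and hashes) stands in for traces that WoC would reconstruct from git object databases at far larger scale:

```python
# Minimal sketch (hypothetical data): cross-referencing the entities WoC
# links -- authors, commits, projects, and blobs -- with plain dictionaries
# built from version-control traces.
from collections import defaultdict

# Each trace is (author, project, commit_sha, blob_sha); in WoC these are
# reconstructed from git data, not hard-coded.
traces = [
    ("alice <a@x.org>", "proj/a", "c1", "b1"),
    ("alice <a@x.org>", "proj/b", "c2", "b1"),  # same blob copied across projects
    ("bob <b@y.org>",   "proj/b", "c3", "b2"),
]

author_to_projects = defaultdict(set)
blob_to_projects = defaultdict(set)
for author, project, commit, blob in traces:
    author_to_projects[author].add(project)
    blob_to_projects[blob].add(project)

# Cross-ecosystem code copying shows up as a blob appearing in several projects.
copied = {b: ps for b, ps in blob_to_projects.items() if len(ps) > 1}
print(copied)  # blob 'b1' is shared by proj/a and proj/b
```

The same index structure, inverted in each direction, is what makes queries such as "all projects a developer touched" or "all projects containing this blob" efficient.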

    Enhancing Software Project Outcomes: Using Machine Learning and Open Source Data to Employ Software Project Performance Determinants

    Many factors can influence the ongoing management and execution of technology projects. Some of these elements are known a priori during the project planning phase. Others require real-time data gathering and analysis throughout the lifetime of a project. These real-time project data elements are often neglected, misclassified, or otherwise misinterpreted during the project execution phase, resulting in an increased risk of delays, quality issues, and missed business opportunities. The overarching motivation for this research endeavor is to offer reliable improvements in software technology management and delivery. The primary purpose is to discover and analyze the impact, role, and level of influence of various project-related data on the ongoing management of technology projects. The study leverages open source data regarding software performance attributes. The goal is to temper the subjectivity currently used by project managers (PMs) with quantifiable measures when assessing project execution progress. Modern-day PMs who manage software development projects are charged with an arduous task. Often, they obtain their inputs from technical leads who tend to be significantly more technical. When assessing software projects, PMs perform their role subject to the limitations of their capabilities and competencies. PMs are required to contend with the stresses of the business environment, the policies and procedures dictated by their organizations, and resource constraints. The second purpose of this research study is to propose methods by which conventional project assessment processes can be enhanced using quantitative methods that utilize real-time project execution data. Transferability of academic research to industry application is specifically addressed vis-à-vis a delivery framework to provide meaningful data to industry practitioners.
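Tempering subjective assessments with quantifiable measures could take a form like the sketch below. The metric names and weights are illustrative assumptions, not the study's actual performance determinants:

```python
# Hypothetical sketch: a quantitative project-health score computed from
# real-time execution data, meant to complement (not replace) a PM's judgment.
def project_health(metrics, weights=None):
    """Weighted score in [0, 1]; each metric is assumed pre-normalized to [0, 1]."""
    weights = weights or {
        "velocity_ratio": 0.4,      # delivered vs. planned story points
        "defect_escape_rate": 0.3,  # already inverted: 1 - escaped/total defects
        "scope_stability": 0.3,     # 1 - fraction of churned requirements
    }
    return sum(weights[k] * metrics[k] for k in weights)

snapshot = {"velocity_ratio": 0.8, "defect_escape_rate": 0.9, "scope_stability": 0.5}
score = project_health(snapshot)
print(round(score, 2))  # 0.74
```

In practice the weights would be fitted against historical project outcomes rather than chosen by hand.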

    FLOSSSim: Understanding the Free/Libre Open Source Software (FLOSS) Development Process through Agent-Based Modeling

    Free/Libre Open Source Software (FLOSS) is the product of volunteers collaborating to build software in an open, public manner. The large number of FLOSS projects, combined with the data that is inherently archived with this online process, makes studying this phenomenon attractive. Some FLOSS projects are very functional, well-known, and successful, such as Linux, the Apache Web Server, and Firefox. However, for every successful FLOSS project there are hundreds of projects that are unsuccessful. These projects fail to attract sufficient interest from developers and users and become inactive or abandoned before useful functionality is achieved. The goal of this research is to better understand the open source development process and gain insight into why some FLOSS projects succeed while others fail. This dissertation presents an agent-based model of the FLOSS development process. The model is built around the concept that projects must manage to attract contributions from a limited pool of participants in order to progress. In the model, developer and user agents select from a landscape of competing FLOSS projects based on perceived utility. Via the selections that are made and the subsequent contributions, some projects are propelled to success while others remain stagnant and inactive. Findings from a diverse set of empirical studies of FLOSS projects are used to formulate the model, which is then calibrated on empirical data from multiple sources of public FLOSS data. The model is able to reproduce key characteristics observed in the FLOSS domain and is capable of making accurate predictions. The model is used to gain a better understanding of the FLOSS development process, including what it means for FLOSS projects to be successful and what conditions increase the probability of project success. It is shown that FLOSS is a producer-driven process, and project factors that are important for developers selecting projects are identified.
    In addition, it is shown that projects are sensitive to when core developers make contributions, and the exhibited bandwagon effects mean that some projects will be successful regardless of competing projects. Recommendations for improving software engineering in general, based on the positive characteristics of FLOSS, are also presented. Dissertation/Thesis, Ph.D. Computer Science, 201
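The core mechanism the abstract describes, agents choosing among competing projects by perceived utility, with success feeding back into attractiveness, can be sketched in a few lines. The utility function and parameters are invented for illustration and are not FLOSSSim's calibrated model:

```python
# Minimal agent-based sketch of the rich-get-richer dynamic: developers
# repeatedly pick the project with the highest perceived utility and
# contribute to it, so early leads can snowball (a bandwagon effect).
import random

random.seed(42)
NUM_PROJECTS, NUM_DEVELOPERS, STEPS = 5, 20, 50
progress = [0.0] * NUM_PROJECTS

def utility(p):
    # Perceived utility: accumulated progress plus noise (taste differences).
    return progress[p] + random.gauss(0, 1)

for _ in range(STEPS):
    for _dev in range(NUM_DEVELOPERS):
        choice = max(range(NUM_PROJECTS), key=utility)
        progress[choice] += 1.0  # a contribution advances the chosen project

# Contributions concentrate on a few winners; most projects stagnate.
print(sorted(progress, reverse=True))
```

Even this toy version reproduces the qualitative pattern in the abstract: a handful of projects absorb nearly all contributions while the rest remain inactive.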

    CodeDJ: Reproducible Queries over Large-Scale Software Repositories


    A heuristic-based approach to code-smell detection

    Encapsulation and data hiding are central tenets of the object-oriented paradigm. Deciding what data and behaviour to form into a class, and where to draw the line between its public and private details, can make the difference between a class that is an understandable, flexible and reusable abstraction and one which is not. This decision is a difficult one and may easily result in poor encapsulation, which can then have serious implications for a number of system qualities. It is often hard to identify such encapsulation problems within large software systems until they cause a maintenance problem (which is usually too late), and attempting to perform such analysis manually can also be tedious and error-prone. Two of the common encapsulation problems that can arise as a consequence of this decomposition process are data classes and god classes. Typically, these two problems occur together – data classes are lacking in functionality that has typically been sucked into an over-complicated and domineering god class. This paper describes the architecture of a tool, developed as a plug-in for the Eclipse IDE, which automatically detects data and god classes. The technique has been evaluated in a controlled study on two large open source systems, comparing the tool's results to similar work by Marinescu, who employs a metrics-based approach to detecting such features. The study provides some valuable insights into the strengths and weaknesses of the two approaches.
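A heuristic detector of this kind can be sketched as simple rules over class metrics. The thresholds and metric names below are illustrative assumptions, not the paper's actual rules:

```python
# Hedged sketch of metrics-based code-smell heuristics: flag data classes
# (state with almost no behaviour) and god classes (functionality hoarders).
def classify(cls):
    """cls: dict with 'methods', 'accessors' (getters/setters), 'fields',
    and 'coupling' (number of other classes used). Returns a smell label."""
    behaviour = cls["methods"] - cls["accessors"]  # non-accessor methods
    if behaviour <= 1 and cls["accessors"] >= 3:
        return "data class"   # exposes state but does little with it
    if cls["methods"] >= 20 and cls["coupling"] >= 10:
        return "god class"    # over-complicated, domineering class
    return "ok"

print(classify({"methods": 5, "accessors": 4, "fields": 4, "coupling": 1}))    # data class
print(classify({"methods": 42, "accessors": 6, "fields": 15, "coupling": 12})) # god class
```

Real detectors, including Marinescu's metrics-based approach, combine more metrics (e.g. cohesion and access to foreign data) and calibrate thresholds empirically rather than fixing them by hand.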

    Process Models for Learning Patterns in FLOSS Repositories

    Evidence suggests that Free/Libre Open Source Software (FLOSS) environments provide unlimited learning opportunities. Community members engage in a number of activities, both during their interaction with their peers and while making use of these environments’ repositories. To date, numerous studies document the existence of learning processes in FLOSS through surveys or by means of questionnaires filled in by FLOSS project participants. At the same time, there is a surge in developing tools and techniques for extracting and analyzing data from different FLOSS data sources, which has given rise to a new field called Mining Software Repositories (MSR). In spite of these growing tools and techniques for mining FLOSS repositories, there are few, if any, existing approaches that provide empirical evidence of learning processes directly from these repositories. Therefore, in this work we sought to trigger such an initiative by proposing an approach based on Process Mining. With this technique, we aim to trace learning behaviors from FLOSS participants’ trails of activities as recorded in FLOSS repositories. We identify the participants as Novices and Experts. A Novice is defined as any FLOSS member that benefits from a learning experience through acquiring new skills, while the Expert is the provider of these skills. The significance of our work is mainly twofold. First and foremost, we extend the MSR field by showing the potential of mining FLOSS repositories by applying Process Mining techniques. Secondly, our work provides critical evidence that boosts the understanding of learning behavior in FLOSS communities by analyzing the relevant repositories.
    In order to accomplish this, we have proposed and implemented a methodology that follows a seven-step approach: developing an appropriate terminology or ontology for learning processes in FLOSS, contextualizing learning processes through a-priori models, generating Event Logs, generating corresponding process models, interpreting and evaluating the value of process discovery, performing conformance analysis, and verifying a number of formulated hypotheses with regard to tracing learning patterns in FLOSS communities. The implementation of this approach has resulted in the development of the Ontology of Learning in FLOSS environments (OntoLiFLOSS), which defines the terms needed to describe learning processes in FLOSS and provides a visual representation of these processes through Petri net-like Workflow nets. Moreover, another novelty pertains to the mining of FLOSS repositories by defining and describing the preliminaries required for preprocessing FLOSS data before applying Process Mining techniques for analysis. Through a step-by-step process, we detail how the Event Logs are constructed by generating key phrases and making use of Semantic Search. Taking a FLOSS environment called Openstack as our data source, we apply our proposed techniques to identify learning activities based on key-phrase catalogs and classification rules expressed through pseudocode, as well as the appropriate Process Mining tool. We thus produced Event Logs that are based on the semantic content of messages in Openstack’s Mailing archives, Internet Relay Chat (IRC) messages, Reviews, Bug reports and Source code to retrieve the corresponding activities. Considering these repositories in light of the three learning process phases (Initiation, Progression and Maturation), we produced an Event Log for each participant (Novice or Expert) in every phase on the corresponding dataset.
    Hence, we produced 14 Event Logs that helped build 14 corresponding process maps, which are visual representations of the flow of learning activities in FLOSS for each participant. These process maps provide strong indications of the presence of learning processes in the analyzed repositories. The results show that learning activities do occur at a significant rate during message exchange on both Mailing archives and IRC messages. The slight differences between the two datasets can be highlighted in two ways. First, the involvement of Experts is greater on IRC than on Mailing archives, with 7.22% and 0.36% of Expert involvement respectively on IRC forums and Mailing lists. This can be justified by the differences in the length of messages sent on these two datasets: the average length of a sent message is 3261 characters for an email compared to 60 characters for a chat message. The evidence produced from this mining experiment solidifies the finding of the existence of learning processes in FLOSS, as well as the scale at which they occur. While the Initiation phase shows the Novice as the most involved in the start of the learning process, during the Progression phase the involvement of the Expert can be seen to increase significantly. In order to trace the advanced skills in the Maturation phase, we look at repositories that store data about developing and creating code, examining and reviewing the code, and identifying and fixing possible bugs. Therefore, we consider three repositories: Source code, Bug reports and Reviews. The results obtained in this phase largely justify the choice of these three datasets to track learning behavior at this stage. Both the Bug reports and the Source code demonstrate the commitment of the Novice to seek answers and interact as much as possible in strengthening the acquired skills.
    With a participation of 49.22% for the Novice against 46.72% for the Expert, and 46.19% against 42.04%, respectively on Bug reports and Source code, the Novice still engages significantly in learning. On the last dataset, Reviews, we notice an increase in the Expert’s role: the Expert performs 40.36% of the total number of activities against 22.17% for the Novice. The last steps of our methodology steer the comparison of the defined a-priori models with final models that describe how learning processes occur according to the actual behavior from Event Logs. Our attempts at producing process models start with depicting process maps to track the actual behaviour as it occurs in Openstack repositories, before concluding with final Petri net models representative of learning processes in FLOSS as a result of conformance analysis. For every dataset in the corresponding learning phase, we produce 3 process maps, respectively depicting the overall learning behaviour for all FLOSS community members (Novice and Expert together), then the Novice, then the Expert. In total, we produced 21 process maps empirically describing learning behaviour on real data, and 14 process models in the form of Petri nets, one for every participant on each dataset. We make use of Artificial Immune System (AIS) algorithms to merge the 14 Event Logs that uniquely capture the behaviour of every participant on the different datasets in the three phases. We then reanalyze the resulting logs in order to produce 6 global models that inclusively provide a comprehensive depiction of participants’ learning behavior in FLOSS communities. This description hints that the Workflow nets introduced as our a-priori models give a rather simplistic representation of learning processes in FLOSS.
    Nevertheless, our experiments with Event Logs, from process discovery to conformance checking on Openstack repositories, demonstrate that the real learning behaviors are more complete and, most importantly, largely subsume these simplistic a-priori models. Finally, our methodology has proved to be effective both in providing a novel alternative for mining FLOSS repositories and in providing empirical evidence that describes how knowledge is exchanged in FLOSS environments. Moreover, our results enrich the MSR field by providing a reproducible, step-by-step problem-solving approach that can be customized to answer subsequent research questions in FLOSS repositories using Process Mining.
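The Event Log construction step described above can be illustrated with a small sketch. The key phrases, activity labels, and messages below are invented stand-ins for the dissertation's catalogs and Openstack data:

```python
# Illustrative sketch: classify repository messages into learning activities
# via key-phrase matching, then group them per actor into an Event Log.
from collections import defaultdict

KEY_PHRASES = {            # hypothetical catalog: phrase -> learning activity
    "how do i": "ask question",
    "you could try": "give guidance",
    "thanks, that worked": "confirm learning",
}

def to_event_log(messages):
    """messages: list of (timestamp, actor, text) -> {actor: ordered events}."""
    log = defaultdict(list)
    for ts, actor, text in sorted(messages):
        for phrase, activity in KEY_PHRASES.items():
            if phrase in text.lower():
                log[actor].append((ts, activity))
    return dict(log)

irc = [
    (1, "novice1", "How do I configure Nova?"),
    (2, "expert1", "You could try editing nova.conf first."),
    (3, "novice1", "Thanks, that worked!"),
]
print(to_event_log(irc))
```

A log of this shape (case id, activity, timestamp) is exactly what process discovery algorithms consume to produce process maps and Petri nets.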

    Methods of Disambiguating and De-anonymizing Authorship in Large Scale Operational Data

    Operational data from software development, social networks and other domains are often contaminated with incorrect or missing values. Examples include misspelled or changed names, multiple emails belonging to the same person, and user profiles that vary across different systems. Such digital traces are extensively used in research and practice to study collaborating communities of various kinds. To achieve a realistic representation of the networks that represent these communities, accurate identities are essential. In this work, we aim to identify, model, and correct identity errors in data from open-source software repositories, which include more than 23M developer IDs and nearly 1B Git commits (developer activity records). Our investigation into the nature and prevalence of identity errors in software activity data reveals that they are different and occur at much higher rates than in other domains. Existing techniques relying on string comparisons can only disambiguate Synonyms, but not Homonyms, which are common in software activity traces. Therefore, we introduce measures of behavioral fingerprinting to improve the accuracy of Synonym resolution and to disambiguate Homonyms. Fingerprints are constructed from the traces of developers’ activities, such as the style of writing in commit messages, the patterns in files modified and projects participated in by developers, and the patterns related to the timing of the developers’ activity. Furthermore, to address the lack of training data necessary for the supervised learning approaches that are used in disambiguation, we design a specific active learning procedure that minimizes the manual effort necessary to create training data in the domain of developer identity matching. We extensively evaluate the proposed approach, using over 16,000 OpenStack developers in 1200 projects, against commercial and the most recent research approaches, and further against recent research on a much larger sample of over 2,000,000 IDs.
Results demonstrate that our method is significantly better than both the recent research and commercial methods. We also conduct experiments to demonstrate that such erroneous data have a significant impact on developer networks. We hope that the proposed approach will expedite research progress in the domain of software engineering, especially in applications for which graphs of social networks are critical.
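The idea of behavioral fingerprinting can be illustrated with a minimal sketch. The features (files touched, commit hours) and equal weighting are illustrative assumptions, not the dissertation's trained model:

```python
# Hedged sketch: compare two developer IDs by behavioral fingerprints --
# here, the set of files they touch and the hours of day they commit.
# A high similarity suggests a Synonym pair (same person, different IDs);
# a low similarity for IDs sharing a name suggests a Homonym.
def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def fingerprint_similarity(id1, id2):
    """Each ID: dict with 'files' (paths touched) and 'hours' (commit hours)."""
    return 0.5 * jaccard(id1["files"], id2["files"]) + \
           0.5 * jaccard(id1["hours"], id2["hours"])

a = {"files": {"src/net.c", "src/io.c"}, "hours": {9, 10, 23}}
b = {"files": {"src/net.c", "src/io.c", "README"}, "hours": {9, 22, 23}}
print(round(fingerprint_similarity(a, b), 2))  # high overlap: likely the same person
```

A full pipeline would feed many such behavioral features, alongside string similarity of names and emails, into a supervised classifier trained via the active learning procedure the abstract describes.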