25 research outputs found

    Methods of Disambiguating and De-anonymizing Authorship in Large Scale Operational Data

    Get PDF
    Operational data from software development, social networks and other domains are often contaminated with incorrect or missing values. Examples include misspelled or changed names, multiple emails belonging to the same person and user profiles that vary in different systems. Such digital traces are extensively used in research and practice to study collaborating communities of various kinds. To achieve a realistic representation of the networks that represent these communities, accurate identities are essential. In this work, we aim to identify, model, and correct identity errors in data from open-source software repositories, which include more than 23M developer IDs and nearly 1B Git commits (developer activity records). Our investigation into the nature and prevalence of identity errors in software activity data reveals that they are different and occur at much higher rates than other domains. Existing techniques relying on string comparisons can only disambiguate Synonyms, but not Homonyms, which are common in software activity traces. Therefore, we introduce measures of behavioral fingerprinting to improve the accuracy of Synonym resolution, and to disambiguate Homonyms. Fingerprints are constructed from the traces of developers’ activities, such as, the style of writing in commit messages, the patterns in files modified and projects participated in by developers, and the patterns related to the timing of the developers’ activity. Furthermore, to address the lack of training data necessary for the supervised learning approaches that are used in disambiguation, we design a specific active learning procedure that minimizes the manual effort necessary to create training data in the domain of developer identity matching. We extensively evaluate the proposed approach, using over 16,000 OpenStack developers in 1200 projects, against commercial and most recent research approaches, and further on recent research on a much larger sample of over 2,000,000 IDs. Results demonstrate that our method is significantly better than both the recent research and commercial methods. We also conduct experiments to demonstrate that such erroneous data have significant impact on developer networks. We hope that the proposed approach will expedite research progress in the domain of software engineering, especially in applications for which graphs of social networks are critical

    A Dataset and an Approach for Identity Resolution of 38 Million Author IDs extracted from 2B Git Commits

    Full text link
    The data collected from open source projects provide means to model large software ecosystems, but often suffer from data quality issues, specifically, multiple author identification strings in code commits might actually be associated with one developer. While many methods have been proposed for addressing this problem, they are either heuristics requiring manual tweaking, or require too much calculation time to do pairwise comparisons for 38M author IDs in, for example, the World of Code collection. In this paper, we propose a method that finds all author IDs belonging to a single developer in this entire dataset, and share the list of all author IDs that were found to have aliases. To do this, we first create blocks of potentially connected author IDs and then use a machine learning model to predict which of these potentially related IDs belong to the same developer. We processed around 38 million author IDs and found around 14.8 million IDs to have an alias, which belong to 5.4 million different developers, with the median number of aliases being 2 per developer. This dataset can be used to create more accurate models of developer behaviour at the entire OSS ecosystem level and can be used to provide a service to rapidly resolve new author IDs

    Social aspects of collaboration in online software communities

    Get PDF

    Effects of Personality Traits and Emotional Factors in Pull Request Acceptance.

    Get PDF
    Social interactions in the form of discussion are an indispensable part of collaborative software development. The discussions are essential for developers to share their views and to form a strong relationship with other teammates. These discussions invoke both positive and negative emotions such as joy, love, aggression, and disgust. Additionally, developers also exhibit hidden behaviours that dictate their personality. Some developers can be supportive and open to new ideas, whereas others can be conservative. Past research has shown that the personality of the developers has a significant role in determining the success of the task they collaboratively perform. Additionally, previous research has also shown that in online collaborative environments, the developers use signals from comments such as rudeness to determine if they are compatible to work together. Most of these studies use traditional small-scale surveys for their experiments. The transparent nature of online collaborative environments makes it easier to conduct empirical experiments by mining pull request comments. In this thesis, first, we investigate the effect of different personality traits on pull request acceptance. The results of this experiment will provide us with a valuable understanding of the personality traits of developers and help us develop tools to assist developers. We follow it with a second experiment to understand the influence of different emotional factors on pull request decisions. The emotion expressed by a developer on their teammates can be influenced by social statuses, such as the number of followers. Moreover, the teammate's team status, such as team member or outside contributor too, can influence the emotional effect. To understand moderation, we investigate different interaction effects. We start the experiment by replicating Tsay et al.'s work that examined the influence of social factors (e.g., `social distance') and technical factors (e.g., test file inclusion) for evaluating contributions. We extend their work by augmenting it with personality traits of developers and examining the influence of on the pull request evaluation process in GitHub. In particular, we extract the `Big Five' personality traits (Openness, Conscientiousness, Extraversion, Agreeableness, and Neuroticism) of developers from their online digital footprints, such as pull request comments. We analyze the personality traits of 16,935 active developers from 1,860 projects and compare their relative importance to other non-personality factors from past research, in the pull request evaluation process. We find that pull requests from authors (requesters) who are more open and conscientious, but less extroverted, have a higher chance of approval. Furthermore, pull requests that are closed by developers (closers) who are more conscientious, extroverted, and neurotic, have a higher likelihood of acceptance. The larger the difference in personality traits between the requester and the closer, the more positive effect it has on pull request acceptance. Although the effect of personality traits is significant and comparable to technical factors, we find that social factors are still more influential when it comes to the effect in the likelihood of pull request acceptance. We perform a second experiment to analyze the effect of emotions on pull request decisions. To predict emotions in the comments, we develop a generalised, software engineering specific language model that outperforms previous machine learning algorithms on four different standard datasets. We find that the percentage of positive comments from both requester and closer has a positive association with pull request acceptance, whereas the percentage of negative comments has a negative association. Also, the polarity of the emotion associated with the first comment of both requester and closer had a positive association with pull request acceptance, i.e., more positive the emotion, the higher the likelihood of acceptance. Finally, we find that social factors moderate the effects of emotions

    Utilizing traceable software artifacts to improve bug localization

    Get PDF
    Die Entwicklung von Softwaresystemen ist eine komplexe Aufgabe. Qualitätssicherung versucht auftretenden Softwarefehler (bugs) in Systemen zu vermeiden, jedoch können Fehler nie ausgeschlossen werden. Sobald ein Softwarefehler entdeckt wird, wird typischerweise ein Fehlerbericht (bug report) erstellt. Dieser dient als Ausgangspunkt für den Entwickler den Fehler im Quellcode der Software zu finden und zu beheben (bug fixing). Fehlerberichte sowie weitere Softwareartefakte, z.B. Anforderungen und der Quellcode selbst, werden in Software Repositories abgelegt. Diese erlauben die Artefakte mit trace links zur Nachvollziehbarkeit (traceability) zu verknüpfen. Oftmals ist die Erstellung der trace links im Entwicklungsprozess vorgeschrieben. Dazu zählen u.a. die Luftfahrt- und Automobilindustrie, sowie die Entwicklung von medizinischen Geräten. Das Auffinden von Softwarefehlern in großen Systemen mit tausenden Artefakten ist eine anspruchsvolle, zeitintensive und fehleranfällige Aufgabe, welche eine umfangreiche Projektkenntnis erfordert. Deswegen wird seit Jahren aktiv an der Automatisierung dieses Prozesses geforscht. Weiterhin wird die manuelle Erstellung und Pflege von trace links als Belastung empfunden und sollte weitgehend automatisiert werden. In dieser Arbeit wird ein neuartiger Algorithmus zum Auffinden von Softwarefehlern vorgestellt, der aktiv die erstellten trace links ausnutzt. Die Artefakte und deren Beziehungen dienen zur Erstellung eines Nachvollziehbarkeitsgraphen, welcher analysiert wird um fehlerhafte Quellcodedateien anhand eines Fehlerberichtes zu finden. Jedoch muss angenommen werden, dass nicht alle notwendigen trace links zwischen den Softwareartefakten eines Projektes erstellt wurden. Deswegen wird ein vollautomatisierter, projektunabhängiger Ansatz vorgestellt, der diese fehlenden trace links erstellt (augmentation). Die Grundlage zur Entwicklung dieses Algorithmus ist der typische Entwicklungsprozess eines Softwareprojektes. Die entwickelten Ansätze wurden mit mehr als 32.000 Fehlerberichten von 27 Open-Source Projekten evaluiert und die Ergebnisse zeigen, dass die Einbeziehung von traceability signifikant das Auffinden von Fehlern im Quellcode verbessert. Weiterhin kann der entwickelte Augmentation Algorithmus zuverlässig fehlende trace links erstellen.The development of software systems is a very complex task. Quality assurance tries to prevent defects – software bugs – in deployed systems, but it is impossible to avoid bugs all together, especially during development. Once a bug is observed, typically a bug report is written. It guides the responsible developer to locate the bug in the project's source code, and once found to fix it. The bug reports, along with other development artifacts such as requirements and the source code are stored in software repositories. The repositories also allow to create relationships – trace links – among contained artifacts. Establishing this traceability is demanded in many domains, such as safety related ones like the automotive and aviation industry, or in development of medical devices. However, in large software systems with thousands of artifacts, especially source code files, manually locating a bug is time consuming, error-prone, and requires extensive knowledge of the project. Thus, automating the bug localization process is actively researched since many years. Further, manually creating and maintaining trace links is often considered as a burden, and there is the need to automate this task as well. Multiple studies have shown, that traceability is beneficial for many software development tasks. This thesis presents a novel bug localization algorithm utilizing traceability. The project's artifacts and trace links are used to create a traceability graph. This graph is then analyzed to locate defective source code files for a given bug report. Since the existing trace link set of a project is possibly incomplete, another algorithm is prosed to augment missing links. The algorithm is fully automated, project independent, and derived from a project's development workflow. An evaluation on more than 32,000 bug reports from 27 open-source projects shows, that incorporating traceability information into bug localization significantly improves the bug localization performance compared to two state of the art algorithms. Further, the trace link augmentation approach reliably constructs missing links and therefore simplifies the required trace maintenance

    Presentation of self on a decentralised web

    Get PDF
    Self presentation is evolving; with digital technologies, with the Web and personal publishing, and then with mainstream adoption of online social media. Where are we going next? One possibility is towards a world where we log and own vast amounts of data about ourselves. We choose to share - or not - the data as part of our identity, and in interactions with others; it contributes to our day-to-day personhood or sense of self. I imagine a world where the individual is empowered by their digital traces (not imprisoned), but this is a complex world. This thesis examines the many factors at play when we present ourselves through Web technologies. I optimistically look to a future where control over our digital identities are not in the hands of centralised actors, but our own, and both survey and contribute to the ongoing technical work which strives to make this a reality. Decentralisation changes things in unexpected ways. In the context of the bigger picture of our online selves, building on what we already know about self-presentation from decades of Social Science research, I examine what might change as we move towards decentralisation; how people could be affected, and what the possibilities are for a positive change. Finally I explore one possible way of self-presentation on a decentralised social Web through lightweight controls which allow an audience to set their expectations in order for the subject to meet them appropriately. I seek to acknowledge the multifaceted, complicated, messy, socially-shaped nature of the self in a way that makes sense to software developers. Technology may always fall short when dealing with humanness, but the framework outlined in this thesis can provide a foundation for more easily considering all of the factors surrounding individual self-presentation in order to build future systems which empower participants

    Big Code Applications and Approaches

    Get PDF
    The availability of a huge amount of source code from code archives and open-source projects opens up the possibility to merge machine learning, programming languages, and software engineering research fields. This area is often referred to as Big Code where programming languages are treated instead of natural languages while different features and patterns of code can be exploited to perform many useful tasks and build supportive tools. Among all the possible applications which can be developed within the area of Big Code, the work presented in this research thesis mainly focuses on two particular tasks: the Programming Language Identification (PLI) and the Software Defect Prediction (SDP) for source codes. Programming language identification is commonly needed in program comprehension and it is usually performed directly by developers. However, when it comes at big scales, such as in widely used archives (GitHub, Software Heritage), automation of this task is desirable. To accomplish this aim, the problem is analyzed from different points of view (text and image-based learning approaches) and different models are created paying particular attention to their scalability. Software defect prediction is a fundamental step in software development for improving quality and assuring the reliability of software products. In the past, defects were searched by manual inspection or using automatic static and dynamic analyzers. Now, the automation of this task can be tackled using learning approaches that can speed up and improve related procedures. Here, two models have been built and analyzed to detect some of the commonest bugs and errors at different code granularity levels (file and method levels). Exploited data and models’ architectures are analyzed and described in detail. Quantitative and qualitative results are reported for both PLI and SDP tasks while differences and similarities concerning other related works are discussed
    corecore