    Predicting Good Configurations for GitHub and Stack Overflow Topic Models

    Software repositories contain large amounts of textual data, ranging from source code comments and issue descriptions to questions, answers, and comments on Stack Overflow. To make sense of this textual data, topic modelling is frequently used as a text-mining tool for the discovery of hidden semantic structures in text bodies. Latent Dirichlet allocation (LDA) is a commonly used topic model that aims to explain the structure of a corpus by grouping texts. LDA requires multiple parameters to work well, and there are only rough and sometimes conflicting guidelines available on how these parameters should be set. In this paper, we contribute (i) a broad study of parameters to arrive at good local optima for GitHub and Stack Overflow text corpora, (ii) an a-posteriori characterisation of text corpora related to eight programming languages, and (iii) an analysis of corpus feature importance via per-corpus LDA configuration. We find that (1) popular rules of thumb for topic modelling parameter configuration are not applicable to the corpora used in our experiments, (2) corpora sampled from GitHub and Stack Overflow have different characteristics and require different configurations to achieve good model fit, and (3) we can predict good configurations for unseen corpora reliably. These findings support researchers and practitioners in efficiently determining suitable configurations for topic modelling when analysing textual data contained in software repositories. Comment: to appear as a full paper at MSR 2019, the 16th International Conference on Mining Software Repositories

    Anti-Patterns for Automatic Program Repairs

    Automated program repair has been a hot topic in software engineering. In recent years, we have witnessed many successful applications such as Genprog, SPR, and RSRepair. Given a bug and its test suite, which includes both passing and failing test cases, these tools aim to automatically generate a patch that fixes the bug without developer effort. All these tools adopt a "Generate-and-Validate" approach, which assumes a tool-generated patch to be correct as long as it passes all test cases. However, if a test suite is of poor quality and does not cover all situations, incorrect tool-generated patches might pass all test cases and be regarded as correct. We call patches that are incorrect but pass their test suites overfitted patches. In order to investigate why overfitted patches are generated and to reduce their number, we perform a deep analysis of the patches written by developers and the patches (i.e., the correct and the overfitted patches) generated by Genprog and SPR. In this thesis, we propose two orthogonal approaches to filter out overfitted patches: 1) To preserve correct tool-generated patches and filter out only overfitted patches, we propose patterns, named anti-patterns, that can efficiently distinguish correct patches from overfitted patches. We select nine bugs from the Genprog benchmark data set to evaluate the anti-patterns. By embedding the anti-patterns into SPR and filtering out overfitted patches, developers review on average 44.7% fewer tool-generated patches before reaching a correct patch. Meanwhile, by filtering out overfitted patches at runtime, the anti-patterns speed up SPR by 1.34 times on average. 2) We leverage machine learning techniques with meaningful features to predict the correctness of tool-generated patches. Our results show that the machine learning approach cannot preserve correct patches well.
In other words, machine learning techniques would misclassify correct patches as overfitted patches and filter them out. Thus, we believe the machine learning approach requires significant future work, e.g., more representative features and more effective classification algorithms, to be useful in practice. These two orthogonal approaches provide automatic program repair tools with valuable guidance on how to avoid generating overfitted patches.
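
The filtering step described above can be illustrated with a minimal sketch. It assumes a hypothetical patch representation (a dictionary with `deleted` lines and a `passes` flag standing in for a real test run) and a single made-up anti-pattern, `deletes_return`; neither is taken from the thesis, and the actual anti-patterns embedded in SPR may look quite different.

```python
def deletes_return(patch):
    """Hypothetical anti-pattern: the patch removes a return statement."""
    return any(line.strip().startswith("return") for line in patch["deleted"])


def violates_anti_pattern(patch, anti_patterns):
    """True if any anti-pattern rule flags the candidate patch."""
    return any(rule(patch) for rule in anti_patterns)


def generate_and_validate(candidates, run_tests, anti_patterns):
    """Return the first candidate that is not flagged and passes the tests."""
    for patch in candidates:
        if violates_anti_pattern(patch, anti_patterns):
            continue  # filtered out before the (expensive) test run
        if run_tests(patch):
            return patch
    return None


# Illustrative candidates: both pass the test suite, the first is flagged.
candidates = [
    {"id": 1, "deleted": ["return -1;"], "passes": True},
    {"id": 2, "deleted": [], "passes": True},
]
chosen = generate_and_validate(candidates, lambda p: p["passes"], [deletes_return])
```

Because flagged candidates are skipped before validation, the filter can reduce both the number of patches a developer reviews and the number of test runs, which is the kind of effect the abstract reports.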

    Evidence-based defect assessment and prediction for software product lines

    The systematic reuse provided by software product lines provides opportunities to achieve increased quality and reliability as a product line matures. This has led to a widely accepted assumption that as a product line evolves, its reliability improves. However, evidence in terms of empirical investigation of the relationship among change, reuse and reliability in evolving software product lines is lacking. To address this problem, this work investigates: 1) whether reliability as measured by post-deployment failures improves as the products and components in a software product line change over time, and 2) whether the stabilizing effect of shared artifacts enables accurate prediction of failure-prone files in the product line. The first part of this work performs defect assessment and investigates defect trends in Eclipse, an open-source software product line. It analyzes the evolution of the product line over time in terms of the total number of defects, the percentage of severe defects and the relationship between defects and changes. The second part of this work explores prediction of failure-prone files in the Eclipse product line to determine whether prediction improves as the product line evolves over time. In addition, this part investigates the effect of defect and data collection periods on the prediction performance. The main contributions of this work include findings that the majority of files with severe defects are reused files rather than new files, but that common components experience less change than variation components. The work also found that there is a consistent set of metrics that serve as prominent predictors across multiple products and reuse categories over time. Classification of post-release, failure-prone files using change data for the Eclipse product line gives better recall and false positive rates than classification using static code metrics.
The work also found that ongoing change in product lines hinders the ability to predict failure-prone files, and that predicting post-release defects using pre-release change data for the Eclipse case study is difficult. For example, using more data from the past to predict future failure-prone files does not necessarily give better results than using data only from the recent past. The empirical investigation of product line change and defect data leads to an improved understanding of the interplay among change, reuse and reliability as a product line evolves.

    Quality-Aware Learning to Prioritize Test Cases

    Software applications evolve at a rapid rate because of continuous functionality extensions, changes in requirements, optimization of code, and fixes of faults. Moreover, modern software is often composed of components engineered with different programming languages by different internal or external teams. During this evolution, it is crucial to continuously detect unintentionally injected faults and continuously release new features. Software testing aims at reducing this risk by running a certain suite of test cases regularly or at each change of the source code. However, the large number of test cases makes it infeasible to run all of them. Automated test case prioritization and selection techniques have been studied in order to reduce the cost and improve the efficiency of testing tasks. However, the current state-of-the-art techniques remain limited in some aspects. First, the existing test prioritization and selection techniques often assume that faults are equally distributed across the software components, which can lead to spending most of the testing budget on components less likely to fail rather than on the ones highly likely to contain faults. Second, the existing techniques share a scalability problem not only in terms of the size of the selected test suite but also in terms of the round-trip time between code commits and engineer feedback on test case failures in the context of Continuous Integration (CI) development environments. Finally, it is hard to algorithmically capture the domain knowledge of the human testers, which is crucial in testing and release cycles.
This thesis is a new take on the old problem of reducing the cost of software testing in these regards by presenting a data-driven lightweight approach for test case prioritization and execution scheduling that is being used (i) during CI cycles for quick and resource-optimal feedback to engineers, and (ii) during release planning by capturing the testers' domain knowledge and release requirements. Our approach combines software quality metrics with code churn metrics to build a regression model that predicts the fault density of each component and a classification model to discriminate faulty from non-faulty components. Both models are used to guide the testing effort to the components likely to contain the largest number of faults. The predictive models have been validated on eight industrial automotive software applications at Daimler, showing a classification accuracy of 89% and an accuracy of 85.7% for the regression model. The thesis develops a test case prioritization model based on features of the code change, the test execution history and the component development history. The model reduces the cost of CI by predicting whether a particular code change should trigger the individual test suites and their corresponding test cases. In order to algorithmically capture the domain knowledge and the preferences of the tester, our approach developed a test case execution scheduling model that consumes the tester's preferences in the form of a probabilistic graph and solves the optimal test budget allocation problem both online in the context of CI cycles and offline when planning a release. Finally, the thesis presents a theoretical cost model that describes when our prioritization and scheduling approach is worthwhile.
The overall approach is validated on two industrial analytical applications in the area of energy management and predictive maintenance, showing that over 95% of the test failures are still reported back to the engineers while only 43% of the total available test cases are executed.
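
The idea of steering a limited test budget toward components with high predicted fault density can be sketched as follows. This is only a greedy illustration under stated assumptions, not the probabilistic-graph scheduling model of the thesis: the test records, the `fault_density` scores (standing in for the output of a trained regression model) and the `budget` unit are all hypothetical.

```python
def prioritize_tests(tests, fault_density, budget):
    """Greedily pick tests with the highest predicted risk per unit cost.

    tests: list of dicts with 'name', 'components' and 'cost' keys.
    fault_density: predicted fault density per component (assumed input).
    budget: total execution cost that may be spent.
    """
    def risk_per_cost(test):
        risk = sum(fault_density.get(c, 0.0) for c in test["components"])
        return risk / test["cost"]

    selected, spent = [], 0.0
    for test in sorted(tests, key=risk_per_cost, reverse=True):
        if spent + test["cost"] <= budget:
            selected.append(test["name"])
            spent += test["cost"]
    return selected


# Illustrative data only.
tests = [
    {"name": "t_engine", "components": ["engine"], "cost": 2.0},
    {"name": "t_ui", "components": ["ui"], "cost": 1.0},
    {"name": "t_all", "components": ["engine", "ui"], "cost": 5.0},
]
fault_density = {"engine": 0.8, "ui": 0.1}
plan = prioritize_tests(tests, fault_density, budget=3.0)
```

With these numbers the expensive end-to-end test is skipped because it does not fit the remaining budget, while the cheap test on the riskiest component runs first.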

    Software defect prediction using maximal information coefficient and fast correlation-based filter feature selection

    Software quality assurance aims to ensure that the applications that are developed are failure free. Some modern systems are intricate, due to the complexity of their information processes. Software fault prediction is an important quality assurance activity because it predicts the defect proneness of modules and classifies them, which saves resources, time and developers’ efforts. In this study, a model that selects relevant features that can be used in defect prediction was proposed. A review of the literature revealed that process metrics are better predictors of defects in versioned systems and are based on the history of the source code over time. These metrics are extracted from the source-code module and include, for example, the number of additions and deletions from the source code, the number of distinct committers and the number of modified lines. In this research, defect prediction was conducted using open source software (OSS) of software product line(s) (SPL), hence process metrics were chosen. Data sets that are used in defect prediction may contain non-significant and redundant attributes that may affect the accuracy of machine-learning algorithms. In order to improve the prediction accuracy of classification models, features that are significant in the defect prediction process are utilised. In machine learning, feature selection techniques are applied to identify the relevant data. Feature selection is a pre-processing step that helps to reduce the dimensionality of data in machine learning. Feature selection techniques include information theoretic methods that are based on the entropy concept. This study evaluated the efficiency of the feature selection techniques experimentally. It was found that software defect prediction using significant attributes improves the prediction accuracy.
A novel MICFastCR model was developed, which uses the Maximal Information Coefficient (MIC) to select significant attributes and the Fast Correlation-Based Filter (FCBF) to eliminate redundant attributes. Machine learning algorithms were then run to predict software defects. The MICFastCR model achieved the highest prediction accuracy as reported by various performance measures. School of Computing. Ph.D. (Computer Science)
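
The two-stage idea above (keep relevant attributes, then drop redundant ones) can be sketched for discrete features. This is not the MICFastCR model: computing MIC is out of scope here, so the sketch substitutes symmetric uncertainty, the entropy-based measure used by FCBF, for both the relevance and the redundancy step. The feature names, data and `threshold` value are hypothetical.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    total = len(values)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(values).values())


def symmetric_uncertainty(x, y):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), in the range [0, 1]."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 0.0
    mutual_info = hx + hy - entropy(list(zip(x, y)))
    return 2.0 * mutual_info / (hx + hy)


def select_features(features, labels, threshold=0.1):
    """Keep relevant features, then drop those redundant to a better one."""
    relevance = {name: symmetric_uncertainty(col, labels)
                 for name, col in features.items()}
    ranked = sorted((n for n in relevance if relevance[n] >= threshold),
                    key=lambda n: relevance[n], reverse=True)
    selected = []
    for name in ranked:
        # Redundant if an already kept feature explains this one at least
        # as well as this one explains the class label.
        redundant = any(
            symmetric_uncertainty(features[name], features[kept]) >= relevance[name]
            for kept in selected)
        if not redundant:
            selected.append(name)
    return selected


# Illustrative process metrics, discretised to 0/1.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "churn": [0, 0, 0, 0, 1, 1, 1, 1],        # informative
    "churn_copy": [0, 0, 0, 0, 1, 1, 1, 1],   # duplicate of churn
    "noise": [0, 1, 0, 1, 0, 1, 0, 1],        # unrelated to the label
}
kept = select_features(features, labels)
```

In this toy data set the duplicate attribute is removed as redundant and the unrelated attribute is removed as irrelevant, leaving a single feature for the downstream classifier.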

    Leveraging Machine Learning to Improve Software Reliability

    Finding software faults is a critical task during the lifecycle of a software system. However, traditional software quality control practices such as statistical defect prediction, static bug detection, regression testing, and code review are often inefficient and time-consuming, and cannot keep up with the increasing complexity of modern software systems. We argue that machine learning, with its capability in knowledge representation, learning, natural language processing, classification, etc., can be used to extract invaluable information from software artifacts that may be difficult to obtain with other research methodologies, and thereby to improve existing software reliability practices such as statistical defect prediction, static bug detection, regression testing, and code review. This thesis presents a suite of novel machine learning based techniques to improve existing software reliability practices and help developers find software bugs more effectively and efficiently. First, it introduces a deep learning based defect prediction technique to improve existing statistical defect prediction models. To build accurate prediction models, previous studies focused on manually designing features that encode the statistical characteristics of programs. However, these features often fail to capture the semantic differences between programs, and such a capability is needed for building accurate prediction models. To bridge the gap between programs' semantics and defect prediction features, this thesis leverages deep learning techniques to learn a semantic representation of programs automatically from source code and then builds and trains defect prediction models using these semantic features. We examine the effectiveness of the deep learning based prediction models on both open-source and commercial projects. Results show that models built on the learned semantic features can significantly outperform existing defect prediction models.
Second, it introduces an n-gram language model based static bug detection technique, i.e., Bugram, to detect new types of bugs with fewer false positives. Most existing static bug detection techniques are based on programming rules inferred from source code. It is known that if a pattern does not appear frequently enough, rules are not learned, thus missing many bugs. To solve this issue, this thesis proposes Bugram, which leverages n-gram language models instead of rules to detect bugs. Specifically, Bugram models program tokens sequentially, using the n-gram language model. Token sequences from the program are then assessed according to their probability in the learned model, and low probability sequences are marked as potential bugs. The assumption is that low probability token sequences in a program are unusual, which may indicate bugs, bad practices, or unusual/special uses of code of which developers may want to be aware. We examine the effectiveness of our approach on the latest versions of 16 open-source projects. Results show that Bugram detected 25 new bugs, 23 of which cannot be detected by existing rule-based bug detection approaches, which suggests that Bugram complements existing bug detection approaches, detecting more bugs while generating fewer false positives. Third, it introduces a machine learning based regression test prioritization technique, i.e., QTEP, to find and run test cases that could reveal bugs earlier. Existing test case prioritization techniques mainly focus on maximizing coverage information between source code and test cases to schedule test cases for finding bugs earlier, but they often do not consider the likely distribution of faults in the source code. However, software faults are often not equally distributed in source code, e.g., around 80% of faults are located in about 20% of the source code. Intuitively, test cases that cover the faulty source code should have higher priorities, since they are more likely to find faults.
To solve this issue, this thesis proposes QTEP, which leverages machine learning models to evaluate source code quality and then adapts existing test case prioritization algorithms by considering the weighted source code quality. Evaluation on seven open-source projects shows that QTEP can significantly outperform existing test case prioritization techniques in finding failed test cases early. Finally, it introduces a machine learning based approach to identifying risky code review requests. Code review has been widely adopted in the development process of both proprietary and open-source software, where it helps improve the maintainability and quality of software before code changes are merged into the source code repository. Our observation of code review requests from four large-scale projects reveals that around 20% of changes cannot pass the first round of code review and require non-trivial revision effort (i.e., risky changes). In addition, resolving these risky changes requires 3X more time and 1.6X more reviewers than the regular changes (i.e., changes that pass the first code review) on average. This thesis presents the first study to characterize these risky changes and automatically identify them with machine learning classifiers. Evaluation on one proprietary project and three large-scale open-source projects (i.e., Qt, Android, and OpenStack) shows that our approach is effective in identifying risky code review requests. Taken together, the results of the four studies provide evidence that machine learning can help improve traditional software reliability practices such as statistical defect prediction, static bug detection, regression testing, and code review.
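
The n-gram scoring idea behind Bugram, as described in the abstract above, can be sketched in a few lines. This is a minimal illustration, not Bugram itself: the token sequences, the `<s>` padding symbol, add-one smoothing and the function names are all assumptions made for the example.

```python
from collections import Counter


def train_ngram_counts(sequences, n):
    """Count n-grams and their (n-1)-token contexts over token sequences."""
    ngrams, contexts = Counter(), Counter()
    for seq in sequences:
        padded = ["<s>"] * (n - 1) + list(seq)
        for i in range(len(padded) - n + 1):
            gram = tuple(padded[i:i + n])
            ngrams[gram] += 1
            contexts[gram[:-1]] += 1
    return ngrams, contexts


def sequence_probability(seq, ngrams, contexts, n, vocab_size):
    """Probability of a sequence under an add-one smoothed n-gram model."""
    padded = ["<s>"] * (n - 1) + list(seq)
    prob = 1.0
    for i in range(len(padded) - n + 1):
        gram = tuple(padded[i:i + n])
        prob *= (ngrams[gram] + 1) / (contexts[gram[:-1]] + vocab_size)
    return prob


def rank_suspicious(candidates, ngrams, contexts, n, vocab_size):
    """Order candidate sequences from least to most probable."""
    return sorted(
        candidates,
        key=lambda s: sequence_probability(s, ngrams, contexts, n, vocab_size))


# Illustrative corpus: a common call pattern and one unusual variant.
corpus = [["open", "read", "close"]] * 9 + [["open", "read", "read"]]
ngrams, contexts = train_ngram_counts(corpus, n=2)
ranked = rank_suspicious(
    [["open", "read", "close"], ["open", "read", "read"]],
    ngrams, contexts, n=2, vocab_size=3)
```

The rare sequence is ranked first because its final bigram is seen only once in the corpus. A real detector would additionally need a tokenizer for source code and a cut-off for how many low-probability sequences to report.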

