7,852 research outputs found

    Semi-supervised and Active Learning Models for Software Fault Prediction

    Get PDF
    As software continues to insinuate itself into nearly every aspect of our life, the quality of software has been an extremely important issue. Software Quality Assurance (SQA) is a process that ensures the development of high-quality software. It concerns the important problem of maintaining, monitoring, and developing quality software. Accurate detection of fault prone components in software projects is one of the most commonly practiced techniques that offer the path to high quality products without excessive assurance expenditures. This type of quality modeling requires the availability of software modules with known fault content developed in similar environment. However, collection of fault data at module level, particularly in new projects, is expensive and time-consuming. Semi-supervised learning and active learning offer solutions to this problem for learning from limited labeled data by utilizing inexpensive unlabeled data.;In this dissertation, we investigate semi-supervised learning and active learning approaches in the software fault prediction problem. The role of base learner in semi-supervised learning is discussed using several state-of-the-art supervised learners. Our results showed that semi-supervised learning with appropriate base learner leads to better performance in fault proneness prediction compared to supervised learning. In addition, incorporating pre-processing technique prior to semi-supervised learning provides a promising direction to further improving the prediction performance. Active learning, sharing the similar idea as semi-supervised learning in utilizing unlabeled data, requires human efforts for labeling fault proneness in its learning process. Empirical results showed that active learning supplemented by dimensionality reduction technique performs better than the supervised learning on release-based data sets

    On Using Active Learning and Self-Training when Mining Performance Discussions on Stack Overflow

    Full text link
    Abundant data is the key to successful machine learning. However, supervised learning requires annotated data that are often hard to obtain. In a classification task with limited resources, Active Learning (AL) promises to guide annotators to examples that bring the most value for a classifier. AL can be successfully combined with self-training, i.e., extending a training set with the unlabelled examples for which a classifier is the most certain. We report our experiences on using AL in a systematic manner to train an SVM classifier for Stack Overflow posts discussing performance of software components. We show that the training examples deemed as the most valuable to the classifier are also the most difficult for humans to annotate. Despite carefully evolved annotation criteria, we report low inter-rater agreement, but we also propose mitigation strategies. Finally, based on one annotator's work, we show that self-training can improve the classification accuracy. We conclude the paper by discussing implication for future text miners aspiring to use AL and self-training.Comment: Preprint of paper accepted for the Proc. of the 21st International Conference on Evaluation and Assessment in Software Engineering, 201

    Software defect prediction using maximal information coefficient and fast correlation-based filter feature selection

    Get PDF
    Software quality ensures that applications that are developed are failure free. Some modern systems are intricate, due to the complexity of their information processes. Software fault prediction is an important quality assurance activity, since it is a mechanism that correctly predicts the defect proneness of modules and classifies modules that saves resources, time and developers’ efforts. In this study, a model that selects relevant features that can be used in defect prediction was proposed. The literature was reviewed and it revealed that process metrics are better predictors of defects in version systems and are based on historic source code over time. These metrics are extracted from the source-code module and include, for example, the number of additions and deletions from the source code, the number of distinct committers and the number of modified lines. In this research, defect prediction was conducted using open source software (OSS) of software product line(s) (SPL), hence process metrics were chosen. Data sets that are used in defect prediction may contain non-significant and redundant attributes that may affect the accuracy of machine-learning algorithms. In order to improve the prediction accuracy of classification models, features that are significant in the defect prediction process are utilised. In machine learning, feature selection techniques are applied in the identification of the relevant data. Feature selection is a pre-processing step that helps to reduce the dimensionality of data in machine learning. Feature selection techniques include information theoretic methods that are based on the entropy concept. This study experimented the efficiency of the feature selection techniques. It was realised that software defect prediction using significant attributes improves the prediction accuracy. A novel MICFastCR model, which is based on the Maximal Information Coefficient (MIC) was developed to select significant attributes and Fast Correlation Based Filter (FCBF) to eliminate redundant attributes. Machine learning algorithms were then run to predict software defects. The MICFastCR achieved the highest prediction accuracy as reported by various performance measures.School of ComputingPh. D. (Computer Science
    • …
    corecore