Background: Cross-project defect prediction, which makes it feasible to build defect prediction models when local data repositories are lacking, has recently been gaining attention within the research community. Many studies have sought to improve the predictive performance of cross-project defect prediction models by mitigating the challenges associated with cross-project defect prediction. However, there has been no attempt to analyse the empirical evidence on cross-project defect prediction models in a systematic way.
Objective: The objective of this study is to summarise and synthesise the existing cross-project defect prediction studies in order to identify which independent variables, modelling techniques, performance evaluation criteria and approaches are used in building cross-project defect prediction models. Further, this study aims to explore the predictive performance of cross-project defect prediction models compared to within-project defect prediction models.
Method: A systematic literature review was conducted to identify 30 relevant primary studies. The qualitative and quantitative results of those studies were then synthesised to answer the defined research questions.
Results: The majority of the Cross-Project Defect Prediction (CPDP) models have been constructed using combinations of different types of independent variables, and the models that perform well also tend to use such combinations. Models based on Nearest Neighbour (NN) and Decision Tree (DTree) appear to perform well in the CPDP context. Naive Bayes (NB), the most commonly used technique, appears to have average performance among the modelling techniques. Recall, precision, F-measure, probability of false alarm (pf) and Area Under Curve (AUC) are the commonly used performance metrics in the cross-project context. Filtering and data transformation are also frequently used approaches in the cross-project context. The majority of the CPDP approaches address one or more data-related issues using various row and column processing methods. Models appear to perform well when a filtering approach is used and the model is built on NB. Further, models perform well when a data transformation approach is used and the model is built on Support Vector Machine (SVM). There is no significant difference between the performance of CPDP models and that of Within-Project Defect Prediction (WPDP) models. CPDP models perform well in the majority of cases in terms of recall.
Conclusion: Various types of independent variables, modelling techniques and performance evaluation criteria have been used in the cross-project defect prediction context. The performance of a cross-project defect prediction model is influenced by the way the model is built. Cross-project defect prediction remains a challenge, but such models can achieve predictive performance comparable to that of within-project defect prediction models when the factors influencing performance are optimised.
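
For readers unfamiliar with the performance metrics named in the Results, the following is a minimal sketch of how recall, precision, F-measure, pf and AUC are conventionally computed for a binary defect classifier. It is illustrative only and is not taken from any of the reviewed studies; the function names, the label encoding (1 = defective, 0 = clean) and the choice to return 0.0 for undefined ratios are assumptions made for this example.

```python
from typing import Dict, Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (tp, fp, tn, fn), assuming 1 = defective and 0 = clean."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def classification_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Compute recall, precision, pf and F-measure from hard predictions."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    # Undefined ratios (zero denominator) are reported as 0.0 here by assumption.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    pf = fp / (fp + tn) if (fp + tn) else 0.0  # probability of false alarm
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return {"recall": recall, "precision": precision, "pf": pf, "f_measure": f_measure}


def auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a defective module is scored above a clean one.

    Uses the pairwise (Mann-Whitney) formulation; ties count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one defective and one clean module")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Hypothetical labels and predictions, for illustration only.
    y_true = [1, 1, 1, 0, 0, 0, 0, 1]
    y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
    print(classification_metrics(y_true, y_pred))
    print(auc([1, 0, 1, 0], [0.9, 0.8, 0.4, 0.1]))
```

High recall combined with low pf is the desirable combination. AUC differs from the other four metrics in that it is computed from predicted scores rather than from hard class labels, so it does not depend on a particular classification threshold.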