
Is "Better Data" Better than "Better Data Miners"? (On the Benefits of Tuning SMOTE for Defect Prediction)

Author: Brown André EX
Ch'ng Quee-Lim
Currie Michael
Grundy Laura J
Hokanson Jim
Javer Avelino
Kerr Rex
Lee Chee Wai
Li Chris
Li Kezhi
Schafer William R
Yemini Eviatar
Publication date: 20/02/2018

We report and fix an important systematic error in prior studies that ranked classifiers for software analytics. Those studies (a) did not assess classifiers on multiple criteria and (b) did not study how variations in the data affect the results. Hence, this paper applies (a) multi-criteria tests while (b) fixing the weaker regions of the training data (using SMOTUNED, a self-tuning version of SMOTE). This approach leads to large improvements in software defect prediction. When applied in a 5*5 cross-validation study of 3,681 Java classes (containing over a million lines of code) from open source systems, SMOTUNED increased AUC and recall by 60% and 20% respectively. These improvements are independent of the classifier used to predict quality. The same pattern of improvement was observed when SMOTE and SMOTUNED were compared against the most recent class imbalance technique. In conclusion, for software analytics tasks like defect prediction, (1) data pre-processing can be more important than classifier choice, (2) ranking studies are incomplete without such pre-processing, and (3) SMOTUNED is a promising candidate for pre-processing. Comment: 10 pages + 2 references. Accepted to International Conference of Software Engineering (ICSE), 201

arXiv.org e-Print Archive

ZENODO

FigShare

Is "Better Data" Better than "Better Data Miners"? (On the Benefits of Tuning SMOTE for Defect Prediction)

Author: Bennin Kwabena Ebo
Chiha I.
Ghotra Baljinder
Menzies Tim
Omran M.
Pedregosa Fabian
Refaeilzadeh Payam
Tan Ming
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 20/02/2018

arXiv.org e-Print Archive

Crossref

Revisiting supervised and unsupervised methods for effort-aware cross-project defect prediction

Author: CHEN Xiang
GU Qing
LO David
NI Chao
XIA Xin
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/06/2020

Institutional Knowledge at Singapore Management University

A DEEP ENSEMBLE LEARNING METHOD FOR EFFORT-AWARE JUST-IN-TIME DEFECT PREDICTION

Author: ALBAHLI Saleh
Publication venue: Lublin University of Technology
Publication date: 20/11/2019

Nowadays, logistics for the transportation and distribution of merchandise is a key element in increasing the competitiveness of companies. However, the choice of alternative routes outside the planned routes causes logistics companies to provide poor-quality service, with units that endanger the proper delivery of merchandise and negatively affect the way the supply chain works. This paper aims to develop a module for processing, analyzing, and deploying satellite information for pattern analysis, in order to find anomalies in operators' paths by implementing the TODS algorithm and thereby support decision making. The experimental results show that the algorithm detects abnormal routes optimally, using historical data as a base.

Multidisciplinary Digital Publishing Institute

Biblioteka Nauki - repozytorium artykułów

Lublin University of Technology Journals

Connecting Software Metrics across Versions to Predict Defects

Author: Guo Jianbo
Li Yanhui
Liu Yibin
Xu Baowen
Zhou Yuming
Publication date: 28/12/2017

Accurate software defect prediction could help software practitioners allocate test resources to defect-prone modules effectively and efficiently. In recent decades, much effort has been devoted to building accurate defect prediction models, including developing quality defect predictors and modeling techniques. However, widely used defect predictors such as code metrics and process metrics do not describe well how software modules change as a project evolves, which we believe is important for defect prediction. To address this problem, we propose to use the Historical Version Sequence of Metrics (HVSM) in continuous software versions as defect predictors. Furthermore, we use a Recurrent Neural Network (RNN), a popular modeling technique, which takes HVSM as input to build software prediction models. The experimental results show that, in most cases, the proposed HVSM-based RNN model has significantly better effort-aware ranking effectiveness than the commonly used baseline models.

arXiv.org e-Print Archive

Crossref

When Less is More: On the Value of "Co-training" for Semi-Supervised Software Defect Predictors

Author: Chakraborty Joymallya
Majumder Suvodeep
Menzies Tim
Publication date: 15/02/2024

Labeling a module defective or non-defective is an expensive task. Hence, there are often limits on how much labeled data is available for training. Semi-supervised classifiers use far fewer labels for training models. However, there are numerous semi-supervised methods, including self-labeling, co-training, maximal-margin, and graph-based methods, to name a few. Only a handful of these methods have been tested in SE for (e.g.) predicting defects, and even there, those methods have been tested on just a handful of projects. This paper applies a wide range of 55 semi-supervised learners to over 714 projects. We find that semi-supervised "co-training methods" work significantly better than other approaches. Specifically, after labeling just 2.5% of the data, they make predictions that are competitive with those using 100% of the data. That said, co-training needs to be used cautiously, since the specific co-training method must be carefully selected based on a user's specific goals. Also, we warn that a commonly used co-training method ("multi-view", where different learners get different sets of columns) does not improve predictions (while adding too much to the run time costs: 11 hours vs. 1.8 hours). It is an open question, worthy of future work, to test if these reductions can be seen in other areas of software analytics. To assist with exploring other areas, all the code used is available at https://github.com/ai-se/Semi-Supervised. Comment: 36 pages, 10 figures, 5 table

arXiv.org e-Print Archive

500+ Times Faster Than Deep Learning (A Case Study Exploring Faster Methods for Text Mining StackOverflow)

Author: Arthur David
Chen Peter Y
Choetkiertikul M.
Guo Gongde
Mihalcea Rada
Pedregosa Fabian
Rehurek Radim
Ron
Publication date: 14/02/2018

Deep learning methods are useful for high-dimensional data and are becoming widely used in many areas of software engineering. Deep learners utilize extensive computational power and can take a long time to train, making it difficult to widely validate, repeat, and improve their results. Further, they are not the best solution in all domains. For example, recent results show that for finding related Stack Overflow posts, a tuned SVM performs similarly to a deep learner but is significantly faster to train. This paper extends that recent result by clustering the dataset, then tuning the learners within each cluster. This approach is over 500 times faster than deep learning (and over 900 times faster if we use all the cores on a standard laptop computer). Significantly, this faster approach generates classifiers nearly as good (within 2% F1 score) as the much slower deep learning method. Hence we recommend this faster method, since it is much easier to reproduce and utilizes far fewer CPU resources. More generally, we recommend that before researchers release research results, they compare their supposedly sophisticated methods against simpler alternatives (e.g., applying simpler learners to build local models).

arXiv.org e-Print Archive

Crossref

Effort-aware just-in-time defect identification in practice: A case study at Alibaba

Author: FAN Yuanrui
HASSAN Ahmed E.
LO David
XIA Xin
YAN Meng
ZHANG Xindong
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/11/2020

National Research Foundation (NRF) Singapore under its AI Singapore Programm

Crossref

Institutional Knowledge at Singapore Management University

JITO: A tool for just-in-time defect identification and localization

Author: FAN Yuanrui
HASSAN Ahmed E.
LO David
QIU Fangcheng
WANG Xinyu
XIA Xin
YAN Meng
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/11/2020

Australian Research Counci

Crossref

Institutional Knowledge at Singapore Management University