9 research outputs found

    Looking Over the Research Literature on Software Engineering from 2016 to 2018

    This paper carries out a bibliometric analysis to identify (i) the most influential current research on software engineering, (ii) where that research is being published, (iii) the most commonly researched topics, and (iv) where that research is being undertaken (i.e., in which countries and institutions). To that end, 6,365 software engineering articles, published from 2016 to 2018 in a variety of conferences and journals, are examined. This work has been funded by the Spanish Ministry of Science, Innovation, and Universities under Project DPI2016-77677-P, the Community of Madrid under Grant RoboCity2030-DIH-CM P2018/NMT-4331, and grant TIN2016-75850-R from the FEDER funds.

    A New Improved Prediction of Software Defects Using Machine Learning-based Boosting Techniques with NASA Dataset

    Predicting when and where bugs will appear in software can help improve quality and reduce software testing costs. Machine learning methods can be used to predict bugs in individual software modules. There are, however, two major problems with software defect prediction datasets: class imbalance (there are many fewer defective modules than non-defective ones) and noisy features (a result of irrelevant features), both of which make accurate prediction difficult. If these two issues are not addressed, model performance suffers greatly: overfitting occurs and the classification results are biased. In this research, we propose machine learning techniques that improve the effectiveness of the CatBoost and Gradient Boost classifiers for predicting software defects. The Random Over Sampler and Mutual Info Classification methods address the class imbalance and feature selection issues, respectively, inherent in software defect prediction. Eleven datasets from NASA's data repository, "Promise," were used in this study. Using 10-fold cross-validation, we classified these 11 datasets and found that the proposed technique outperformed the baseline by a significant margin. The proposed methods were evaluated on their ability to predict software defects using the most important measures available: Accuracy, Precision, Recall, F1 score, ROC values, RMSE, MSE, and MAE. For all 11 datasets evaluated, the proposed methods outperform the baseline classifiers by a significant margin. We also compared our model with other defect prediction methods and found that it outperformed them all. The experiments show that the computational detection rate of the proposed model is higher than that of conventional models.
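    The class-imbalance step mentioned in this abstract can be illustrated with a minimal sketch of random oversampling. This is not the authors' code: the function name `random_oversample` and the toy module metrics are hypothetical, and the sketch uses only the Python standard library rather than any particular resampling package.

```python
import random
from collections import Counter

def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of each smaller class until every
    class has as many rows as the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        # Candidate rows to duplicate for this class.
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels

# Hypothetical module metrics: (lines of code, cyclomatic complexity).
rows = [(120, 4), (300, 9), (80, 2), (45, 1), (510, 15), (60, 3)]
labels = [0, 0, 0, 0, 1, 0]  # 1 = defective, 0 = non-defective

balanced_rows, balanced_labels = random_oversample(rows, labels)
print(Counter(balanced_labels))  # both classes now have 5 rows
```

    In a cross-validation setting such as the 10-fold scheme described above, oversampling is normally applied to the training folds only, so that duplicated rows do not leak into the test fold.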

    Optimized Deeplearning Algorithm for Software Defects Prediction

    Accurate software defect prediction (SDP) helps to enhance the quality of software by identifying potential flaws early in the development process. However, existing approaches face challenges in achieving reliable predictions. To address this, a novel approach based on a two-tier deep learning framework is proposed. The proposed work includes four major phases: (a) pre-processing, (b) dimensionality reduction, (c) feature extraction and (d) two-fold deep learning-based SDP. The collected raw data is initially pre-processed using a data cleaning approach (handling null values and missing data) and a decimal scaling normalisation approach. The dimensions of the pre-processed data are reduced using the newly developed Incremental Covariance Principal Component Analysis (ICPCA), which helps to address the “curse of dimensionality” issue. Features are then extracted from the dimensionally reduced data using statistical features (standard deviation, skewness, variance, and kurtosis), Mutual Information (MI), and Conditional Entropy (CE). From the extracted features, the relevant ones are selected using the new Euclidean Distance with Mean Absolute Deviation (ED-MAD). Finally, the SDP (decision making) is carried out using the optimized Two-Fold Deep Learning Framework (O-TFDLF), which encapsulates an RBFN and an optimized MLP. The weights of the MLP are fine-tuned using the new Levy Flight Cat Mouse Optimisation (LCMO) method to improve the model's prediction accuracy. The final detected outcome (forecasting the presence/absence of a defect) is obtained from the optimized MLP. The implementation was performed in MATLAB. The proposed model’s performance is compared to that of existing models using performance metrics such as Sensitivity, Accuracy, Precision, Specificity and MSE. The accuracy achieved by the proposed model is 93.37%.
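    The decimal scaling normalisation named in the pre-processing phase can be sketched as follows. The sketch assumes the textbook definition (divide every value by 10^j, where j is the smallest non-negative integer that brings the largest absolute value below 1) and is written in Python for illustration, although the paper reports a MATLAB implementation. The function name `decimal_scale` is hypothetical.

```python
def decimal_scale(values):
    """Return (scaled_values, j), where each value is divided by 10**j and
    j is the smallest non-negative integer with max(|v|) / 10**j < 1.
    Assumes a non-empty sequence of numbers."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j

# Hypothetical raw metric column.
scaled, j = decimal_scale([120, -45, 987])
print(j, scaled)
```

    Because the divisor is a power of ten, the scaled column keeps its digits and only the decimal point moves, which makes the transformation easy to invert.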

    When Less is More: On the Value of "Co-training" for Semi-Supervised Software Defect Predictors

    Labeling a module defective or non-defective is an expensive task. Hence, there are often limits on how much labeled data is available for training. Semi-supervised classifiers use far fewer labels for training models. However, there are numerous semi-supervised methods, including self-labeling, co-training, maximal-margin, and graph-based methods, to name a few. Only a handful of these methods have been tested in SE for (e.g.) predicting defects, and even there, those methods have been tested on just a handful of projects. This paper applies a wide range of 55 semi-supervised learners to over 714 projects. We find that semi-supervised "co-training methods" work significantly better than other approaches. Specifically, after labeling just 2.5% of the data, they make predictions that are competitive with those using 100% of the data. That said, co-training needs to be used cautiously, since the specific co-training method must be carefully selected based on a user's specific goals. Also, we warn that a commonly-used co-training method ("multi-view", where different learners get different sets of columns) does not improve predictions (while adding too much to the run time costs: 11 hours vs. 1.8 hours). It is an open question, worthy of future work, to test if these reductions can be seen in other areas of software analytics. To assist with exploring other areas, all the codes used are available at https://github.com/ai-se/Semi-Supervised. Comment: 36 pages, 10 figures, 5 tables

    Cluster-based oversampling with area extraction from representative points for class imbalance learning

    Class imbalance learning is challenging in various domains where training datasets exhibit disproportionate samples in a specific class. Resampling methods have been used to adjust the class distribution, but they often have limitations for small disjunct minority subsets. This paper introduces AROSS, an adaptive cluster-based oversampling approach that addresses these limitations. AROSS utilizes an optimized agglomerative clustering algorithm with the Cophenetic Correlation Coefficient and the Bayesian Information Criterion to identify representative areas of the minority class. Safe and half-safe areas are obtained using an incremental k-Nearest Neighbor strategy, and oversampling is performed with a truncated hyperspherical Gaussian distribution. Experimental evaluations on 70 binary datasets demonstrate the effectiveness of AROSS in improving class imbalance learning performance, making it a promising solution for mitigating class imbalance challenges, especially for small disjunct minority subsets.
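    The final oversampling step can be pictured with a rough sketch. It assumes that "truncated hyperspherical Gaussian" means an isotropic Gaussian centred on a representative point and restricted to a hypersphere of a given radius, and it draws samples by simple rejection. This is an interpretation for illustration only, not the AROSS procedure; the function name and all parameter values are hypothetical.

```python
import math
import random

def sample_truncated_gaussian(center, sigma, radius, n, seed=0):
    """Draw n points from an isotropic Gaussian around `center`,
    keeping only points within `radius` of it (rejection sampling)."""
    if radius <= 0 or sigma < 0:
        raise ValueError("radius must be positive and sigma non-negative")
    rng = random.Random(seed)
    samples = []
    while len(samples) < n:
        point = [rng.gauss(c, sigma) for c in center]
        # Reject candidates that fall outside the hypersphere.
        if math.dist(point, center) <= radius:
            samples.append(point)
    return samples

# Hypothetical minority-class representative point in 3 dimensions.
center = [1.0, 2.0, 0.5]
synthetic = sample_truncated_gaussian(center, sigma=0.3, radius=0.6, n=20)
print(len(synthetic))
```

    Rejection sampling is simple but becomes slow when the radius is small relative to sigma or when the dimensionality is high, since most candidates then fall outside the sphere.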

    Software defect prediction using maximal information coefficient and fast correlation-based filter feature selection

    Software quality ensures that applications that are developed are failure free. Some modern systems are intricate, due to the complexity of their information processes. Software fault prediction is an important quality assurance activity, since it is a mechanism that correctly predicts the defect proneness of modules and classifies modules, which saves resources, time and developers’ efforts. In this study, a model that selects relevant features that can be used in defect prediction was proposed. A review of the literature revealed that process metrics are better predictors of defects in version systems and are based on historic source code over time. These metrics are extracted from the source-code module and include, for example, the number of additions and deletions from the source code, the number of distinct committers and the number of modified lines. In this research, defect prediction was conducted using open source software (OSS) of software product line(s) (SPL), hence process metrics were chosen. Data sets that are used in defect prediction may contain non-significant and redundant attributes that may affect the accuracy of machine-learning algorithms. In order to improve the prediction accuracy of classification models, features that are significant in the defect prediction process are utilised. In machine learning, feature selection techniques are applied to identify the relevant data. Feature selection is a pre-processing step that helps to reduce the dimensionality of data in machine learning. Feature selection techniques include information theoretic methods that are based on the entropy concept. This study evaluated the efficiency of the feature selection techniques experimentally. It was found that software defect prediction using significant attributes improves the prediction accuracy.
A novel MICFastCR model was developed, which uses the Maximal Information Coefficient (MIC) to select significant attributes and the Fast Correlation Based Filter (FCBF) to eliminate redundant attributes. Machine learning algorithms were then run to predict software defects. The MICFastCR achieved the highest prediction accuracy as reported by various performance measures. School of Computing, Ph. D. (Computer Science)
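    The entropy-based redundancy check behind FCBF can be illustrated with symmetrical uncertainty, the correlation measure FCBF is commonly described as using: SU(X, Y) = 2 · I(X; Y) / (H(X) + H(Y)). The sketch below assumes discrete (already binned) attributes and hypothetical toy data; it does not compute MIC and is not the MICFastCR implementation.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def symmetrical_uncertainty(x, y):
    """SU in [0, 1]: 0 means independent, 1 means each variable
    fully determines the other."""
    h_x, h_y = entropy(x), entropy(y)
    if h_x + h_y == 0:
        return 0.0  # both variables are constant
    h_xy = entropy(list(zip(x, y)))   # joint entropy H(X, Y)
    mutual_info = h_x + h_y - h_xy    # I(X; Y)
    return 2 * mutual_info / (h_x + h_y)

# Hypothetical binned attributes for four modules.
churn = [0, 0, 1, 1]
churn_copy = [0, 0, 1, 1]    # redundant duplicate of churn
committers = [0, 1, 0, 1]    # independent of churn in this toy data

print(symmetrical_uncertainty(churn, churn_copy))
print(symmetrical_uncertainty(churn, committers))
```

    In an FCBF-style filter, an attribute whose SU with an already selected attribute is at least as high as its SU with the defect label would be treated as redundant and dropped.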

    Explanatory and Causality Analysis in Software Engineering

    Software fault proneness and software development efforts are two key areas of software engineering. Improving them will significantly reduce the cost and promote good planning and practice in developing and managing software projects. Traditionally, studies of software fault proneness and software development efforts were focused on analysis and prediction, which can help to answer questions like ‘when’ and ‘where’. The focus of this dissertation is on explanatory and causality studies that address questions like ‘why’ and ‘how’. First, we applied a case-control study to explain software fault proneness. We found that Bugfixes (Prerelease bugs), Developers, Code Churn, and Age of a file are the main contributors to the Postrelease bugs in some of the open-source projects. In terms of the interactions, we found that Bugfixes and Developers reduced the risk of postrelease software faults. The explanatory models were tested for prediction and their performance was either comparable to or better than that of the top-performing classifiers used in related studies. Our results indicate that software project practitioners should pay more attention to the prerelease bug fixing process and the number of Developers assigned, as well as their interaction. They also need to pay more attention to new files (less than one year old), which contributed significantly more to Postrelease bugs than old files. Second, we built a model that explains and predicts multiple levels of software development effort and measured the effects of several metrics and their interactions using categorical regression models. The final models for the three data sets used were statistically fit, and performance was comparable to related studies. We found that project size, duration, the existence of any type of faults, the use of first- or second-generation programming languages, and team size significantly increased the software development effort.
On the other hand, the interactions between duration and defective project, and between duration and team size, reduced the software development effort. These results suggest that software practitioners should pay extra attention to the duration of the project and the team size assigned for every task, because when these increased from a low to a higher level, they significantly increased the software development effort. Third, a structural equation modeling method was applied for causality analysis of software fault proneness. The method combined statistical and regression analysis to find the direct and indirect causes of software faults using the partial least squares path modeling method. We found direct and indirect paths from measurement models that led to software postrelease bugs. Specifically, the highest direct effect came from the change request, while changing the code had a minor impact on software faults. The highest impact of the code change resulted from the change requests (either for bug fixing or refactoring). Interestingly, the indirect impact from code characteristics to software fault proneness was higher than the direct impact. We found a similar level of direct and indirect impact from code characteristics to code change.