Fuzzy document classification using ontology based approach for term weighting
With the surge in web corpora, document classification is a vital issue in information retrieval. Term weighting increases the accuracy of classification for documents represented in the vector space model. This paper proposes an ontoTf-idf term weighting method based on the assessment of semantic similarity between the group label and the term. A comparative analysis of the performance of the traditional Term Frequency-Inverse Document Frequency (Tf-idf) method and the ontoTf-idf method is carried out on the WebKB and Reuters-21578 benchmark datasets. The efficiency of the ontoTf-idf method is validated with kNN (k nearest neighbor) and Fuzzy kNN classifiers on both datasets. In the experiments, the proposed ontoTf-idf method outperforms the Tf-idf method. Distance and similarity measures such as Euclidean distance, cosine similarity, Manhattan distance, and the Jaccard coefficient are applied with the Fuzzy kNN classifier on the WebKB and Reuters-21578 datasets.
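The abstract above does not give the ontoTf-idf formula, so the following is only a minimal sketch of the general idea: a classic tf-idf weight scaled by a term-to-label semantic similarity. The multiplicative combination, the function names, and the toy similarity table (standing in for an ontology-based measure) are all assumptions made for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Classic tf-idf: tf(t, d) * log(N / df(t)) for tokenized documents."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            t: (c / len(doc)) * math.log(n_docs / df[t])
            for t, c in counts.items()
        })
    return weights

def semantic_tf_idf(docs, label, similarity):
    """Scale each tf-idf weight by a term-to-label similarity in [0, 1].

    `similarity` is a hypothetical callable (term, label) -> float; the
    multiplicative combination is an assumption, not the paper's formula.
    """
    return [
        {t: w * similarity(t, label) for t, w in doc_weights.items()}
        for doc_weights in tf_idf(docs)
    ]

# Toy similarity table standing in for an ontology-based measure.
TOY_SIMILARITY = {("professor", "faculty"): 0.9, ("lecture", "faculty"): 0.6}

def toy_similarity(term, label):
    return TOY_SIMILARITY.get((term, label), 0.1)

docs = [["professor", "lecture", "campus"], ["campus", "football", "game"]]
weighted = semantic_tf_idf(docs, "faculty", toy_similarity)
```

In this toy corpus, "campus" occurs in every document and so receives zero weight, while terms that are semantically closer to the label "faculty" are weighted up relative to plain tf-idf.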
Class Imbalance Reduction and Centroid based Relevant Project Selection for Cross Project Defect Prediction
Cross-Project Defect Prediction (CPDP) is the process of predicting defects in a target project using information from other projects. This can assist developers in prioritizing their testing efforts and finding flaws. Transfer Learning (TL) has frequently been used in CPDP to improve prediction performance by reducing the disparity in data distribution between the source and target projects. Software Defect Prediction (SDP) is a common research topic in software engineering that plays a critical role in software quality assurance. In this paper, we use Centroid-based PF-SMOTE to address the cross-project class imbalance problem by balancing the datasets, and centroid-based relevant data selection for Cross Project Defect Prediction. These methods compute the mean of all attributes in each dataset and then calculate the differences between the dataset means. For experimentation, the open source software defect datasets AEEM, Re-Link, and NASA are considered.
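The centroid comparison described above can be sketched as follows. This is an illustrative reading only: each dataset is summarized by its attribute-wise mean, and source projects are ranked by how close their centroid is to the target's. The use of Euclidean distance, the function names, and the toy data are assumptions; the abstract does not specify them.

```python
import math

def centroid(rows):
    """Attribute-wise mean of a dataset given as a list of numeric rows."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def rank_sources(target, sources):
    """Sort source project names by centroid distance to the target.

    Euclidean distance between centroids is an assumed choice.
    """
    t = centroid(target)
    dist = {name: math.dist(t, centroid(rows)) for name, rows in sources.items()}
    return sorted(dist, key=dist.get)

# Hypothetical metric vectors (two attributes per module).
target = [[10.0, 2.0], [12.0, 4.0]]            # centroid (11, 3)
sources = {
    "proj_a": [[11.0, 3.0], [13.0, 5.0]],      # centroid (12, 4)
    "proj_b": [[40.0, 9.0], [60.0, 11.0]],     # centroid (50, 10)
}
ranking = rank_sources(target, sources)
```

Here `ranking` lists `proj_a` first because its centroid is nearest to the target's. In practice the attributes would normally be scaled first so that no single metric dominates the distance.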
DBBRBF- Convalesce optimization for software defect prediction problem using hybrid distribution base balance instance selection and radial basis Function classifier
Software is becoming an integral part of human life and, with the rapid
development of software engineering, is expected to be highly reliable.
Reliability can be checked by efficient software testing methods that use
historical software prediction data to develop a quality software
system. Machine learning plays a vital role in optimizing the prediction of
defect-prone modules in real-life software. Software defect prediction data
suffers from class imbalance, with a low ratio of defective to non-defective
instances, which degrades classification performance unless an efficient
machine learning classification technique is used. To alleviate this problem,
this paper introduces a novel hybrid instance-based classification approach
that combines distribution base balance based instance selection with a
radial basis function neural network classifier (DBBRBF) to obtain better
predictions than existing research. Class-imbalanced data sets from NASA,
Promise and Softlab were used for the experimental analysis. The experimental
results in terms of Accuracy, F-measure, AUC, Recall, Precision, and Balance
show the effectiveness of the proposed approach. Finally, statistical
significance tests are carried out to assess the suitability of the proposed
model.
Comment: 32 pages, 24 Tables, 8 Figures
BiLO-CPDP: Bi-Level Programming for Automated Model Discovery in Cross-Project Defect Prediction
Cross-Project Defect Prediction (CPDP), which borrows data from similar
projects by combining a transfer learner with a classifier, has emerged as a
promising way to predict software defects when the available data about the
target project is insufficient. However, developing such a model is challenging
because it is difficult to determine the right combination of transfer learner
and classifier along with their optimal hyper-parameter settings. In this
paper, we propose a tool, dubbed BiLO-CPDP, which is the first of its kind to
formulate automated CPDP model discovery from the perspective of bi-level
programming. In particular, bi-level programming carries out the optimization
at two nested levels in a hierarchical manner. Specifically, the upper-level
optimization routine is designed to search for the right combination of
transfer learner and classifier, while the nested lower-level optimization
routine aims to optimize the corresponding hyper-parameter settings. To
evaluate BiLO-CPDP, we conduct experiments on 20 projects to compare it with a
total of 21 existing CPDP techniques, along with its single-level optimization
variant and Auto-Sklearn, a state-of-the-art automated machine learning tool.
Empirical results show that BiLO-CPDP achieves better prediction performance
than all 21 existing CPDP techniques on 70% of the projects, while being
overwhelmingly superior to Auto-Sklearn and its single-level optimization
variant in all cases. Furthermore, the unique bi-level formalization
in BiLO-CPDP also permits allocating more budget to the upper level, which
significantly boosts the performance.
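The nested structure described above can be sketched with a toy example. This is not the BiLO-CPDP implementation: the component names, the hyper-parameter grids, the exhaustive search at both levels, and the synthetic scoring function (a stand-in for a validation metric) are all assumptions chosen to keep the sketch deterministic and self-contained.

```python
from itertools import product

# Hypothetical search space: component names and grids are placeholders.
TRANSFER_LEARNERS = {"tl_a": [0.1, 0.5, 1.0], "tl_b": [1, 2, 3]}
CLASSIFIERS = {"clf_x": [10, 50], "clf_y": [0.01, 0.1]}

def toy_score(tl, tl_param, clf, clf_param):
    """Stand-in for a validation metric such as AUC; purely synthetic."""
    base = {("tl_a", "clf_x"): 0.60, ("tl_a", "clf_y"): 0.70,
            ("tl_b", "clf_x"): 0.65, ("tl_b", "clf_y"): 0.55}[(tl, clf)]
    return base - 0.01 * abs(tl_param - 1) - 0.001 * abs(clf_param - 0.1)

def lower_level(tl, clf, score):
    """Tune hyper-parameters for one fixed (transfer learner, classifier) pair."""
    candidates = product(TRANSFER_LEARNERS[tl], CLASSIFIERS[clf])
    return max((score(tl, p, clf, q), (p, q)) for p, q in candidates)

def upper_level(score):
    """Search over pairs, scoring each pair by its best lower-level result."""
    results = {}
    for tl, clf in product(TRANSFER_LEARNERS, CLASSIFIERS):
        results[(tl, clf)] = lower_level(tl, clf, score)
    best_pair = max(results, key=lambda pair: results[pair][0])
    return best_pair, results[best_pair]

best_pair, (best_score, best_params) = upper_level(toy_score)
```

The point of the structure is that a combination is never judged on default settings: each upper-level candidate is evaluated only after its own hyper-parameters have been tuned. A real tool would replace both exhaustive loops with budgeted search procedures.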
Sampling imbalance dataset for software defect prediction using hybrid neuro-fuzzy systems with Naive Bayes classifier
Software defect prediction (SDP) is a difficult task in software projects. The SDP process is useful for identifying and locating defects in modules. The task becomes more costly as the size of the software modules increases and complex testing and evaluation mechanisms are added. Measuring software in a consistent and disciplined manner offers several advantages, such as more accurate estimation of project costs and schedules and improved product and process quality. Detailed analysis of software metric data also gives significant clues about the locations of possible defects in program code. The main goal of this work is to introduce methods for detecting and preventing software defects using machine learning approaches. The work uses imbalanced datasets from NASA's Metrics Data Program (MDP), and the software metrics of the datasets are selected using a Genetic algorithm with Ant Colony Optimization (GACO) method. A sampling process based on the semi-supervised Modified Co Forest method generates balanced labelled datasets from the imbalanced ones, which are then used for software defect detection with Hybrid Neuro-Fuzzy Systems and Naive Bayes methods. The experimental results show that the proposed method is more efficient and predicts software defects better than the other available methods.
A Review Of Training Data Selection In Software Defect Prediction
Publicly available datasets pose a challenge: selecting suitable data to train a defect prediction model that predicts defects in other projects. Using a cross-project training dataset without careful selection will degrade defect prediction performance. Consequently, training data selection is an essential step in developing a defect prediction model. This paper aims to synthesize the state of the art in training data selection methods published from 2009 to 2019. The existing approaches to the training data selection issue fall into three groups: nearest neighbour, cluster-based, and evolutionary methods. According to the results in the literature, the cluster-based method tends to outperform the nearest neighbour method. On the other hand, research on evolutionary techniques gives promising results but is still scarce. The review therefore concludes that there are still open areas for further investigation in training data selection. We also present research directions within this area.
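As an illustration of the nearest neighbour group mentioned above, the sketch below keeps only those source instances that are among the k closest to at least one target instance. It is a generic sketch rather than any specific published filter: the function name, Euclidean distance, and the toy data are assumptions, and class labels are omitted for brevity.

```python
import math

def nn_filter(source, target, k=2):
    """Keep source rows that are among the k nearest to any target row."""
    selected = set()
    for t in target:
        # Indices of source rows ordered by Euclidean distance to this target row.
        by_distance = sorted(range(len(source)),
                             key=lambda i: math.dist(source[i], t))
        selected.update(by_distance[:k])
    # Return the selected rows in their original order, without duplicates.
    return [source[i] for i in sorted(selected)]

# Hypothetical metric vectors from a source project and a target project.
source = [[1.0, 1.0], [1.2, 0.9], [5.0, 5.0], [9.0, 9.0]]
target = [[1.1, 1.0], [5.1, 4.9]]
training = nn_filter(source, target, k=1)
```

With `k=1`, the distant row `[9.0, 9.0]` is dropped from the training set because no target instance resembles it.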
Cost-Sensitive Radial Basis Function Neural Network Classifier for Software Defect Prediction
Effective prediction of defect-prone software modules enables software developers to allocate resources efficiently and to concentrate on quality assurance activities. The software development life cycle basically includes design, analysis, implementation, testing, and release phases. Software testing is a critical task in the development process: it saves time and budget by detecting defects as early as possible, so that a defect-free (bug-free) product is delivered to customers. The testing phase should therefore be carried out carefully and effectively. To improve the software testing process, fault prediction methods identify the software parts that are more likely to be defect-prone. This paper proposes a prediction approach based on a conventional radial basis function neural network (RBFNN) and a novel adaptive dimensional biogeography based optimization (ADBBO) model. The developed ADBBO-based RBFNN model is tested on five publicly available datasets from the NASA data program repository. The computed results show the effectiveness of the proposed ADBBO-RBFNN classifier, with respect to the considered metrics, in comparison with earlier predictors from the literature on the same datasets.