
    Are screening methods useful in feature selection? An empirical study

    Filter or screening methods are often used as a preprocessing step to reduce the number of variables a learning algorithm uses when building a classification or regression model. While many such filter methods exist, an objective evaluation is needed both to compare them with one another and to determine whether they are useful at all, or whether a learning algorithm would do a better job without them. For this purpose, many popular screening methods are partnered in this paper with three regression learners and five classification learners and evaluated on ten real datasets against accuracy criteria such as R-square and area under the ROC curve (AUC). The results are compared through curve plots and comparison tables to determine whether screening methods help improve the performance of learning algorithms and how they fare against each other. Our findings reveal that the screening methods were useful in improving the prediction of the best learner on two regression and two classification datasets out of the ten evaluated. (Comment: 29 pages, 4 figures, 21 tables)
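    The workflow described above, a univariate filter applied before a learner and then scored with AUC, can be illustrated with a short sketch. The dataset, filter and learner below (scikit-learn's ANOVA F-test filter and logistic regression on the Wisconsin breast cancer data) are illustrative assumptions, not the paper's actual experimental setup.

```python
# Hedged sketch: comparing a learner with and without a filter (screening) step.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Learner alone: no screening step.
plain = make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000))

# Learner preceded by a univariate filter keeping the 10 highest-scoring features.
screened = make_pipeline(StandardScaler(),
                         SelectKBest(f_classif, k=10),
                         LogisticRegression(max_iter=5000))

for name, model in [("no screening", plain), ("ANOVA filter + learner", screened)]:
    auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    print(f"{name}: mean AUC = {auc:.3f}")
```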

    Machine Learning Algorithms for Breast Cancer Diagnosis: Challenges, Prospects and Future Research Directions

    Early diagnosis of breast cancer not only increases the chances of survival but also controls the spread of cancerous cells in the body. Researchers have previously applied machine learning algorithms to breast cancer diagnosis, including Support Vector Machine, K-Nearest Neighbor, Convolutional Neural Network, K-means, Fuzzy C-means, Neural Network, Principal Component Analysis (PCA) and Naive Bayes. Unfortunately, these algorithms fall short in one way or another due to their high computational complexity. For instance, the support vector machine employs a feature elimination scheme for eradicating data ambiguity and detecting tumors at an initial stage, but this scheme is expensive in terms of execution time. For its part, the k-means algorithm uses Euclidean distance to determine the distance between cluster centers and data points, but this scheme does not guarantee high accuracy across different iterations. Although the k-nearest neighbor algorithm employs feature reduction, principal component analysis and 10-fold cross-validation to enhance classification accuracy, it is not efficient in terms of processing time. On the other hand, the fuzzy c-means algorithm uses a fuzziness value and a termination criterion that determine its execution time on a dataset, but it proves computationally expensive because of the many iterations and fuzzy-measure calculations involved. Similarly, the convolutional neural network employs back-propagation for classification but proves slow due to frequent retraining, and the neural network achieves low accuracy in its predictions. Since all these algorithms are expensive and time consuming, it is necessary to integrate quantum computing principles with conventional machine learning algorithms, because quantum computing has the potential to accelerate computations by carrying out calculations on many inputs simultaneously. In this paper, a review of current machine learning algorithms for breast cancer prediction is provided. Based on the observed shortcomings, a quantum machine learning based classifier is recommended, and its proposed working mechanisms are elaborated towards the end of the paper.
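    As a minimal illustration of the kind of conventional classifiers this review surveys, the sketch below benchmarks an SVM, k-NN and naive Bayes with 10-fold cross-validation on a public breast cancer dataset. The dataset and configurations are assumptions for illustration; the quantum classifier proposed in the paper is not modelled here.

```python
# Hedged sketch: benchmarking a few of the surveyed classifiers with 10-fold CV.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

models = {
    "SVM (RBF kernel)": make_pipeline(StandardScaler(), SVC()),
    "k-NN (k=5, Euclidean distance)": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "Naive Bayes": GaussianNB(),
}

for name, model in models.items():
    acc = cross_val_score(model, X, y, cv=10, scoring="accuracy").mean()
    print(f"{name}: mean 10-fold accuracy = {acc:.3f}")
```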

    Breast cancer detection using machine learning approaches: a comparative study

    As the cause of breast cancer has not yet been clearly identified and no method to prevent its occurrence has been developed, early detection plays a significant role in improving survival rates. Artificial intelligence approaches have been playing an important role in enhancing the breast cancer diagnosis process. This work investigates eight classification models that are commonly used to predict breast cancer, including both single and ensemble classifiers. A trusted dataset was refined by applying five different feature selection methods to retain only highly weighted features and discard the others, yielding a dataset of only 17 features. In our experiments, three classifiers, the multi-layer perceptron (MLP), the support vector machine (SVM) and a stacking ensemble, compete with each other by attaining high classification accuracy compared to the others. SVM is ranked at the top, obtaining an accuracy of 97.7% with classification error rates of 0.029 false negatives (FN) and 0.019 false positives (FP). It is therefore noteworthy that SVM is the best classifier and outperforms even the stacking classifier.
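    A rough sketch of the comparison described above, feature selection down to 17 features followed by SVM and a stacking ensemble, is given below. The dataset, the feature-scoring function and the base learners in the stack are assumptions; this is not the study's actual pipeline or its 97.7% result.

```python
# Hedged sketch: select 17 features, then compare SVM against a stacking ensemble.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Keep only the 17 highest-scoring features, echoing the reduced dataset above.
def selector():
    return SelectKBest(mutual_info_classif, k=17)

svm = make_pipeline(StandardScaler(), selector(), SVC())
stack = make_pipeline(
    StandardScaler(), selector(),
    StackingClassifier(
        estimators=[("svm", SVC()),
                    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
                    ("mlp", MLPClassifier(max_iter=2000, random_state=0))],
        final_estimator=LogisticRegression(max_iter=1000),
    ),
)

for name, model in [("SVM", svm), ("stacking ensemble", stack)]:
    acc = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    print(f"{name}: mean accuracy = {acc:.3f}")
```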

    A Classification Framework for Imbalanced Data

    As information technology advances, the demand for reliable and highly accurate predictive models is increasing across many domains. Traditional classification algorithms can be limited in their performance on highly imbalanced data sets. In this dissertation, we study two common problems that arise when training data is imbalanced and propose effective algorithms to solve them. First, we investigate the problem of building a multi-class classification model from an imbalanced class distribution. We develop an effective technique that improves model performance by formulating the problem as a multi-class SVM with an objective that maximizes the G-mean value; a ramp loss function is used to simplify and solve the problem. Experimental results on multiple real-world datasets confirm that our new method can effectively solve the multi-class classification problem when the datasets are highly imbalanced. Second, we explore the problem of learning a global classification model from distributed data sources under privacy constraints. In this setting, not only do the data sources have different class distributions, but combining the data into one central dataset is also prohibited. We propose a privacy-preserving framework for building a global SVM from distributed data sources. Our new framework avoids constructing a global kernel matrix by mapping non-linear inputs to a linear feature space and then solving a distributed linear SVM over these virtual points. Our method addresses both the imbalance and privacy problems while achieving the same level of accuracy as a regular SVM. Finally, we extend our framework to handle high-dimensional data by using Generalized Multiple Kernel Learning to select a sparse combination of features and kernels. This new model produces a smaller set of features but yields much higher accuracy.
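    Two ideas from this abstract can be sketched with off-the-shelf tools: evaluating an imbalanced multi-class problem with the G-mean, and mapping non-linear inputs into an approximate linear feature space before training a linear SVM. In the sketch below, Nystroem kernel approximation and class weighting stand in for the dissertation's ramp-loss formulation and distributed setting; this is not the author's method, and the data is synthetic.

```python
# Hedged sketch: G-mean evaluation plus a non-linear -> linear feature-space mapping.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.kernel_approximation import Nystroem
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def g_mean(y_true, y_pred):
    """Geometric mean of per-class recalls."""
    recalls = recall_score(y_true, y_pred, average=None)
    return float(np.prod(recalls) ** (1.0 / len(recalls)))

# Imbalanced three-class problem (roughly 90% / 7% / 3%).
X, y = make_classification(n_samples=3000, n_classes=3, n_informative=8,
                           weights=[0.90, 0.07, 0.03], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Map inputs into an approximate kernel feature space, then train a linear SVM.
model = make_pipeline(Nystroem(gamma=0.1, n_components=200, random_state=0),
                      LinearSVC(class_weight="balanced", max_iter=10000))
model.fit(X_tr, y_tr)
print("G-mean on held-out data:", round(g_mean(y_te, model.predict(X_te)), 3))
```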

    Software defect prediction using maximal information coefficient and fast correlation-based filter feature selection

    Software quality ensures that the applications being developed are failure free. Some modern systems are intricate because of the complexity of their information processes. Software fault prediction is an important quality assurance activity, since it predicts the defect proneness of modules and classifies them, saving resources, time and developers' effort. In this study, a model that selects relevant features for use in defect prediction is proposed. A review of the literature revealed that process metrics are better predictors of defects in versioned systems, since they are based on historical source code over time. These metrics are extracted from the source-code module and include, for example, the number of additions to and deletions from the source code, the number of distinct committers and the number of modified lines. In this research, defect prediction was conducted on open source software (OSS) of software product line(s) (SPL), hence process metrics were chosen. Data sets used in defect prediction may contain non-significant and redundant attributes that can reduce the accuracy of machine-learning algorithms, so features that are significant to the defect prediction process are used to improve the prediction accuracy of classification models. In machine learning, feature selection techniques identify the relevant data; feature selection is a pre-processing step that reduces the dimensionality of the data, and such techniques include information-theoretic methods based on the concept of entropy. This study experimentally evaluated the efficiency of these feature selection techniques and found that software defect prediction using significant attributes improves prediction accuracy. A novel MICFastCR model was developed, which uses the Maximal Information Coefficient (MIC) to select significant attributes and the Fast Correlation Based Filter (FCBF) to eliminate redundant attributes; machine learning algorithms were then run to predict software defects. MICFastCR achieved the highest prediction accuracy as reported by various performance measures. (School of Computing, Ph.D. (Computer Science))
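    The two-stage idea behind MICFastCR, score features for relevance and then drop those that are redundant with already-selected features, can be sketched as below. Mutual information stands in for MIC and a simple greedy correlation filter stands in for FCBF; the thresholds and synthetic data are assumptions, not the thesis's implementation.

```python
# Hedged sketch: relevance scoring followed by greedy redundancy elimination.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

def select_relevant_nonredundant(X, y, relevance_threshold=0.01, redundancy_threshold=0.8):
    relevance = mutual_info_classif(X, y, random_state=0)
    order = np.argsort(relevance)[::-1]            # most relevant first
    selected = []
    for j in order:
        if relevance[j] < relevance_threshold:
            break                                  # remaining features are non-significant
        redundant = any(abs(np.corrcoef(X[:, j], X[:, k])[0, 1]) > redundancy_threshold
                        for k in selected)
        if not redundant:
            selected.append(j)
    return selected

# Synthetic stand-in for a defect dataset built from process metrics.
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           n_redundant=10, random_state=0)
print("kept feature indices:", select_relevant_nonredundant(X, y))
```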