Are screening methods useful in feature selection? An empirical study
Filter or screening methods are often used as a preprocessing step for
reducing the number of variables used by a learning algorithm in obtaining a
classification or regression model. While there are many such filter methods,
there is a need for an objective evaluation of these methods. Such an
evaluation is needed to compare them with each other and also to answer whether
they are at all useful, or whether a learning algorithm could do a better job without
them. For this purpose, many popular screening methods are partnered in this
paper with three regression learners and five classification learners and
evaluated on ten real datasets to obtain accuracy criteria such as R-square and
area under the ROC curve (AUC). The obtained results are compared through curve
plots and comparison tables in order to find out whether screening methods help
improve the performance of learning algorithms and how they fare with each
other. Our findings revealed that the screening methods were useful in
improving the prediction of the best learner on two regression and two
classification datasets out of the ten datasets evaluated.
Comment: 29 pages, 4 figures, 21 tables
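As a concrete illustration of the kind of screening method the study evaluates (an illustrative sketch, not code from the paper), a minimal correlation-based filter for a regression target ranks features by their absolute Pearson correlation with the response and keeps the top-k:

```python
import numpy as np

def correlation_screen(X, y, k):
    """Rank features by absolute Pearson correlation with the target
    and keep the top-k: a typical filter/screening preprocessing step."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    corr = (Xc * yc[:, None]).sum(axis=0) / np.sqrt(
        (Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    return np.argsort(-np.abs(corr))[:k]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 2] + 0.1 * rng.normal(size=200)  # only feature 2 is informative
top = correlation_screen(X, y, 3)               # feature 2 should rank first
```

A downstream learner would then be trained on `X[:, top]`, which is exactly the screening-versus-no-screening comparison the study carries out.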
Machine Learning Algorithms for Breast Cancer Diagnosis: Challenges, Prospects and Future Research Directions
Early diagnosis of breast cancer not only increases the chances of survival but also controls the diffusion of cancerous cells in the body. Previously, researchers have developed machine learning algorithms for breast cancer diagnosis such as Support Vector Machine, K-Nearest Neighbor, Convolutional Neural Network, K-means, Fuzzy C-means, Neural Network, Principal Component Analysis (PCA) and Naive Bayes. Unfortunately, these algorithms fall short in one way or another due to high levels of computational complexity. For instance, the support vector machine employs a feature elimination scheme for eradicating data ambiguity and detecting tumors at an initial stage. However, this scheme is expensive in terms of execution time. For its part, the k-means algorithm employs Euclidean distance to determine the distance between cluster centers and data points. However, this scheme does not guarantee high accuracy when executed in different iterations. Although the K-Nearest Neighbor algorithm employs feature reduction, principal component analysis and 10-fold cross-validation methods for enhancing classification accuracy, it is not efficient in terms of processing time. On the other hand, the fuzzy c-means algorithm employs a fuzziness value and termination criteria to determine the execution time on datasets. However, it proves to be expensive in terms of computational time due to the several iterations and fuzzy measure calculations involved. Similarly, the convolutional neural network employs back propagation and a classification method, but the scheme proves to be slow due to frequent retraining. In addition, the neural network achieves low accuracy in its predictions. Since all these algorithms seem to be expensive and time consuming, it is necessary to integrate quantum computing principles with conventional machine learning algorithms. This is because quantum computing has the potential to accelerate computations by simultaneously carrying out calculations on many inputs.
In this paper, a review of the current machine learning algorithms for breast cancer prediction is provided. Based on the observed shortcomings, a quantum machine learning based classifier is recommended. The proposed working mechanisms of this classifier are elaborated towards the end of this paper.
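To make the k-means behaviour described in the review concrete, one Lloyd iteration with Euclidean-distance assignment can be sketched as follows (an illustrative sketch, not any of the reviewed implementations; the sensitivity to initial centres is exactly why different runs can give different results):

```python
import numpy as np

def kmeans_step(X, centers):
    """One Lloyd iteration: assign each point to the nearest centre by
    Euclidean distance, then recompute each centre as the mean of its
    assigned points."""
    # pairwise Euclidean distances, shape (n_points, n_centers)
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = d.argmin(axis=1)
    new_centers = np.array([X[labels == j].mean(axis=0)
                            for j in range(len(centers))])
    return labels, new_centers

X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
labels, new_centers = kmeans_step(X, X[[0, 2]])  # initialise from two points
```

Iterating this step until the assignments stop changing is the full algorithm; k-means++ style seeding is a common way to reduce the run-to-run variability the review criticises.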
Breast cancer detection using machine learning approaches: a comparative study
As the cause of breast cancer has not yet been clearly identified and a method to prevent its occurrence has not yet been developed, early detection plays a significant role in enhancing the survival rate. In fact, artificial intelligence approaches have been playing an important role in enhancing the diagnosis of breast cancer. This work investigates eight classification models that are commonly used to predict breast cancer, including both single and ensemble classifiers. A trusted dataset has been enhanced by applying five different feature selection methods to retain only highly weighted features and neglect the others. Accordingly, a dataset of only 17 features has been developed. Based on our experimental work, three classifiers, multi-layer perceptron (MLP), support vector machine (SVM) and stacking, compete with each other by attaining high classification accuracy compared to the others. However, SVM is ranked on top, obtaining an accuracy of 97.7% with classification errors of 0.029 false negative (FN) and 0.019 false positive (FP). Therefore, it is noteworthy that SVM is the best classifier and it outperforms even the stacking classifier.
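The FN and FP error rates reported alongside accuracy above can be computed from a classifier's predictions as follows (a generic sketch, not the study's code):

```python
import numpy as np

def fn_fp_rates(y_true, y_pred):
    """False-negative rate (positives predicted negative, i.e. missed
    cancers) and false-positive rate (negatives predicted positive,
    i.e. false alarms) for a binary classifier."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    fn = float(np.mean(y_pred[y_true == 1] == 0))
    fp = float(np.mean(y_pred[y_true == 0] == 1))
    return fn, fp

fn, fp = fn_fp_rates([1, 1, 1, 0, 0], [1, 0, 1, 0, 1])
```

In a screening setting the FN rate is usually the more critical of the two, since a missed positive delays treatment, which is why studies like this one report FN and FP separately rather than accuracy alone.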
A Classification Framework for Imbalanced Data
As information technology advances, the demand from many domains for reliable and highly accurate predictive models is increasing. Traditional classification algorithms can be limited in their performance on highly imbalanced data sets. In this dissertation, we study two common problems that arise when training data is imbalanced, and propose effective algorithms to solve them.
Firstly, we investigate the problem in building a multi-class classification model from imbalanced class distribution. We develop an effective technique to improve the performance of the model by formulating the problem as a multi-class SVM with an objective to maximize G-mean value. A ramp loss function is used to simplify and solve the problem. Experimental results on multiple real-world datasets confirm that our new method can effectively solve the multi-class classification problem when the datasets are highly imbalanced.
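The two ingredients named here, the G-mean objective and the ramp loss, can be sketched as follows (illustrative definitions, not the dissertation's solver):

```python
import numpy as np

def g_mean(y_true, y_pred):
    """Geometric mean of per-class recalls; unlike plain accuracy it
    collapses to 0 if any single class is entirely misclassified,
    which is why it suits imbalanced problems."""
    classes = np.unique(y_true)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in classes]
    return float(np.prod(recalls) ** (1.0 / len(classes)))

def ramp_loss(margin, s=-1.0):
    """Hinge loss clipped at 1 - s: once the margin drops below s the
    loss stops growing, bounding the influence of badly misclassified
    (possibly noisy) points on the objective."""
    return np.clip(1.0 - np.asarray(margin), 0.0, 1.0 - s)

yt = np.array([0, 0, 0, 1])
yp = np.array([0, 0, 0, 0])   # majority-class-only predictor
score = g_mean(yt, yp)        # 75% accurate, but G-mean is 0
```

The example shows the point of the metric: a degenerate predictor that ignores the minority class gets 75% accuracy but a G-mean of zero, so maximising G-mean forces the model to attend to every class.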
Secondly, we explore the problem of learning a global classification model from distributed data sources under privacy constraints. In this problem, not only do the data sources have different class distributions, but combining the data into one central dataset is also prohibited. We propose a privacy-preserving framework for building a global SVM from distributed data sources. Our new framework avoids constructing a global kernel matrix by mapping non-linear inputs to a linear feature space and then solving a distributed linear SVM over these virtual points. Our method solves both the imbalance and privacy problems while achieving the same level of accuracy as a regular SVM.
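One standard way to realise the "map non-linear inputs to a linear feature space" step is random Fourier features, which replace an RBF kernel with an explicit map so that a plain linear SVM can be trained on the mapped points. This is an illustrative stand-in, not necessarily the mapping used in the dissertation:

```python
import numpy as np

def rff_map(X, n_features, gamma, rng):
    """Random Fourier features z(x): inner products z(x)·z(y) approximate
    the RBF kernel exp(-gamma * ||x - y||^2), so a linear SVM on z(X)
    behaves like a kernel SVM without ever forming a kernel matrix."""
    d = X.shape[1]
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, n_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

rng = np.random.default_rng(0)
Z = rff_map(np.array([[0.0, 0.0], [1.0, 0.0]]), 4000, 0.5, rng)
approx = Z[0] @ Z[1]   # close to exp(-0.5 * 1) = exp(-0.5)
```

Because each site only needs the shared `(W, b)` to compute its own `z(x)` vectors, the mapped "virtual points" can be used in a distributed linear solver without exchanging raw data, which matches the privacy motivation above.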
Finally, we extend our framework to handle high-dimensional data by utilizing Generalized Multiple Kernel Learning to select a sparse combination of features and kernels. This new model produces a smaller set of features, but yields much higher accuracy.
Software defect prediction using maximal information coefficient and fast correlation-based filter feature selection
Software quality ensures that applications that are developed are failure free. Some modern systems are intricate, due to the complexity of their information processes. Software fault prediction is an important quality assurance activity, since it is a mechanism that correctly predicts the defect proneness of modules and classifies them, saving resources, time and developers' efforts. In this study, a model that selects relevant features that can be used in defect prediction was proposed. The literature was reviewed, and it revealed that process metrics are better predictors of defects in versioning systems and are based on historic source code over time. These metrics are extracted from the source-code module and include, for example, the number of additions and deletions from the source code, the number of distinct committers and the number of modified lines. In this research, defect prediction was conducted using open source software (OSS) of software product line(s) (SPL), hence process metrics were chosen. Data sets that are used in defect prediction may contain non-significant and redundant attributes that may affect the accuracy of machine-learning algorithms. In order to improve the prediction accuracy of classification models, features that are significant in the defect prediction process are utilised. In machine learning, feature selection techniques are applied in the identification of the relevant data. Feature selection is a pre-processing step that helps to reduce the dimensionality of data in machine learning. Feature selection techniques include information-theoretic methods that are based on the entropy concept. This study evaluated the efficiency of these feature selection techniques, and it was found that software defect prediction using significant attributes improves the prediction accuracy.
A novel MICFastCR model, based on the Maximal Information Coefficient (MIC), was developed to select significant attributes, with the Fast Correlation-Based Filter (FCBF) used to eliminate redundant attributes. Machine learning algorithms were then run to predict software defects. The MICFastCR model achieved the highest prediction accuracy as reported by various performance measures.
School of Computing, Ph. D. (Computer Science)
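The entropy-based scores underlying filters such as FCBF can be sketched with the symmetrical uncertainty measure that FCBF uses to rate discrete attributes (an illustrative sketch, not the MICFastCR code):

```python
from collections import Counter
from math import log2

def entropy(xs):
    """Shannon entropy H(X) of a sequence of discrete values, in bits."""
    counts = Counter(xs)
    n = len(xs)
    return -sum(c / n * log2(c / n) for c in counts.values())

def symmetrical_uncertainty(x, y):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)): 1 when X determines Y
    and vice versa, 0 when they are independent. FCBF keeps features
    with high SU against the class and drops features whose SU against
    an already-kept feature exceeds their SU against the class."""
    hx, hy = entropy(x), entropy(y)
    hxy = entropy(list(zip(x, y)))      # joint entropy H(X, Y)
    mi = hx + hy - hxy                  # mutual information I(X; Y)
    return 2.0 * mi / (hx + hy) if hx + hy else 0.0
```

Because SU is normalised to [0, 1], the same threshold can be used for both the relevance test (feature vs. class) and the redundancy test (feature vs. feature), which is what makes the FCBF search fast.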
Compressive Sampling and Feature Ranking Framework for Bearing Fault Classification with Vibration Signals
Failures of rolling element bearings are amongst the main causes of machine breakdowns. To
prevent such breakdowns, bearing health monitoring is performed by collecting data from rotating machines,
extracting features from the collected data, and applying a classifier to classify faults. To avoid the heavy
storage requirements and processing time of a tremendously large amount of vibration data, the present
paper proposes a combined Compressive Sampling (CS) based on Multiple Measurement Vectors (MMV) and
Feature Ranking (FR) framework to optimally learn fewer features from a large amount of vibration data
from which bearing health conditions can be classified. The MMV-based CS model is the first step in this
framework and provides compressively-sampled signals based on compressed sampling rates. In the second
step, the search for the most important features of these compressively-sampled signals is performed using
feature ranking and selection techniques. For that purpose, we have investigated the following: (1) two
compressible representations of vibration signals that can be used within the CS framework, namely, Fast Fourier
Transform (FFT) based coefficients and thresholded Wavelet Transform (WT) based coefficients, and (2)
several feature ranking and selection techniques, namely, three similarity-based techniques, Fisher Score
(FS), Laplacian Score (LS), and Relief-F; one correlation-based technique, the Pearson Correlation Coefficient
(PCC); and one independence test technique, Chi-Square (Chi-2) to select fewer features that can sufficiently
represent the original vibration signals. These selected features, in combination with three of the popular
classifiers - multinomial Logistic Regression classifier (LRC), Artificial Neural Networks (ANNs), and
Support Vector Machines (SVMs), have been evaluated for the classification of bearing faults. Results show
that the proposed framework achieves high classification accuracies with a limited amount of data using
various combinations of methods, which outperform recently published results.
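Of the ranking techniques listed, the Fisher Score is the simplest to sketch: it rewards features whose class means are far apart relative to their pooled within-class spread (an illustrative implementation, not the paper's code):

```python
import numpy as np

def fisher_score(X, y):
    """Fisher Score per feature: sum over classes of n_c * (class mean -
    overall mean)^2, divided by the pooled within-class variance.
    Higher scores mean better class separation on that feature."""
    classes = np.unique(y)
    mean_all = X.mean(axis=0)
    num = np.zeros(X.shape[1])
    den = np.zeros(X.shape[1])
    for c in classes:
        Xc = X[y == c]
        num += len(Xc) * (Xc.mean(axis=0) - mean_all) ** 2
        den += len(Xc) * Xc.var(axis=0)
    return num / den

# feature 0 separates the two classes; feature 1 is pure noise
X = np.array([[0.0, 5.0], [0.1, -3.0], [1.0, 4.0], [1.1, -2.0]])
y = np.array([0, 0, 1, 1])
scores = fisher_score(X, y)
```

Ranking the compressively-sampled coefficients by such a score and keeping only the top few is the kind of FR step the framework applies before handing the features to LRC, ANN, or SVM classifiers.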