Are screening methods useful in feature selection? An empirical study
Filter or screening methods are often used as a preprocessing step for
reducing the number of variables used by a learning algorithm in obtaining a
classification or regression model. While there are many such filter methods,
there is a need for an objective evaluation of these methods. Such an
evaluation is needed to compare them with each other and also to answer whether
they are at all useful, or whether a learning algorithm could do a better job without
them. For this purpose, many popular screening methods are partnered in this
paper with three regression learners and five classification learners and
evaluated on ten real datasets to obtain accuracy criteria such as R-square and
area under the ROC curve (AUC). The obtained results are compared through curve
plots and comparison tables in order to find out whether screening methods help
improve the performance of learning algorithms and how they fare with each
other. Our findings revealed that the screening methods were useful in
improving the prediction of the best learner on two regression and two
classification datasets out of the ten datasets evaluated.
Comment: 29 pages, 4 figures, 21 tables
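As a concrete illustration of the kind of screening method the study evaluates (an illustrative sketch, not code from the paper), a minimal correlation-based filter for a regression target ranks features by their absolute Pearson correlation with the response and keeps the top-k:

```python
import numpy as np

def correlation_screen(X, y, k):
    """Rank features by absolute Pearson correlation with the target
    and keep the top-k: a typical filter/screening preprocessing step."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    corr = (Xc * yc[:, None]).sum(axis=0) / np.sqrt(
        (Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    return np.argsort(-np.abs(corr))[:k]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 2] + 0.1 * rng.normal(size=200)  # only feature 2 is informative
top = correlation_screen(X, y, 3)               # feature 2 should rank first
```

A downstream learner would then be trained on `X[:, top]`, which is exactly the screening-versus-no-screening comparison the study carries out.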
Machine Learning Algorithms for Breast Cancer Diagnosis: Challenges, Prospects and Future Research Directions
Early diagnosis of breast cancer not only increases the chances of survival but also controls the diffusion of cancerous cells in the body. Previously, researchers have developed machine learning algorithms for breast cancer diagnosis such as Support Vector Machine, K-Nearest Neighbor, Convolutional Neural Network, K-means, Fuzzy C-means, Neural Network, Principal Component Analysis (PCA) and Naive Bayes. Unfortunately, these algorithms fall short in one way or another due to high levels of computational complexity. For instance, the support vector machine employs a feature elimination scheme for eradicating data ambiguity and detecting tumors at an initial stage. However, this scheme is expensive in terms of execution time. For its part, the k-means algorithm employs Euclidean distance to determine the distance between cluster centers and data points. However, this scheme does not guarantee high accuracy when executed in different iterations. Although the K-Nearest Neighbor algorithm employs feature reduction, principal component analysis and 10-fold cross-validation methods for enhancing classification accuracy, it is not efficient in terms of processing time. On the other hand, the fuzzy c-means algorithm employs a fuzziness value and termination criteria to determine the execution time on datasets. However, it proves to be expensive in terms of computational time due to the several iterations and fuzzy measure calculations involved. Similarly, the convolutional neural network employs back propagation and a classification method, but the scheme proves to be slow due to frequent retraining. In addition, the neural network achieves low accuracy in its predictions. Since all these algorithms seem to be expensive and time consuming, it is necessary to integrate quantum computing principles with conventional machine learning algorithms. This is because quantum computing has the potential to accelerate computations by simultaneously carrying out calculations on many inputs.
In this paper, a review of the current machine learning algorithms for breast cancer prediction is provided. Based on the observed shortcomings, a quantum machine learning based classifier is recommended. The proposed working mechanisms of this classifier are elaborated towards the end of this paper.
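To make the k-means behaviour described in the review concrete, one Lloyd iteration with Euclidean-distance assignment can be sketched as follows (an illustrative sketch, not any of the reviewed implementations; the sensitivity to initial centres is exactly why different runs can give different results):

```python
import numpy as np

def kmeans_step(X, centers):
    """One Lloyd iteration: assign each point to the nearest centre by
    Euclidean distance, then recompute each centre as the mean of its
    assigned points."""
    # pairwise Euclidean distances, shape (n_points, n_centers)
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = d.argmin(axis=1)
    new_centers = np.array([X[labels == j].mean(axis=0)
                            for j in range(len(centers))])
    return labels, new_centers

X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
labels, new_centers = kmeans_step(X, X[[0, 2]])  # initialise from two points
```

Iterating this step until the assignments stop changing is the full algorithm; k-means++ style seeding is a common way to reduce the run-to-run variability the review criticises.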
Breast cancer detection using machine learning approaches: a comparative study
As the cause of breast cancer has not yet been clearly identified and a method to prevent its occurrence has not yet been developed, early detection plays a significant role in enhancing the survival rate. In fact, artificial intelligence approaches have been playing an important role in enhancing the diagnosis of breast cancer. This work investigates eight classification models that are commonly used to predict breast cancer, including both single and ensemble classifiers. A trusted dataset has been enhanced by applying five different feature selection methods to retain only highly weighted features and neglect the others. Accordingly, a dataset of only 17 features has been developed. Based on our experimental work, three classifiers, multi-layer perceptron (MLP), support vector machine (SVM) and stacking, compete with each other by attaining high classification accuracy compared to the others. However, SVM is ranked on top, obtaining an accuracy of 97.7% with classification errors of 0.029 false negative (FN) and 0.019 false positive (FP). Therefore, it is noteworthy that SVM is the best classifier and it outperforms even the stacking classifier.
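The FN and FP error rates reported alongside accuracy above can be computed from a classifier's predictions as follows (a generic sketch, not the study's code):

```python
import numpy as np

def fn_fp_rates(y_true, y_pred):
    """False-negative rate (positives predicted negative, i.e. missed
    cancers) and false-positive rate (negatives predicted positive,
    i.e. false alarms) for a binary classifier."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    fn = float(np.mean(y_pred[y_true == 1] == 0))
    fp = float(np.mean(y_pred[y_true == 0] == 1))
    return fn, fp

fn, fp = fn_fp_rates([1, 1, 1, 0, 0], [1, 0, 1, 0, 1])
```

In a screening setting the FN rate is usually the more critical of the two, since a missed positive delays treatment, which is why studies like this one report FN and FP separately rather than accuracy alone.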
A Classification Framework for Imbalanced Data
As information technology advances, the demand from many domains for reliable and highly accurate predictive models is increasing. Traditional classification algorithms can be limited in their performance on highly imbalanced data sets. In this dissertation, we study two common problems that arise when training data is imbalanced, and propose effective algorithms to solve them.
Firstly, we investigate the problem in building a multi-class classification model from imbalanced class distribution. We develop an effective technique to improve the performance of the model by formulating the problem as a multi-class SVM with an objective to maximize G-mean value. A ramp loss function is used to simplify and solve the problem. Experimental results on multiple real-world datasets confirm that our new method can effectively solve the multi-class classification problem when the datasets are highly imbalanced.
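The two ingredients named here, the G-mean objective and the ramp loss, can be sketched as follows (illustrative definitions, not the dissertation's solver):

```python
import numpy as np

def g_mean(y_true, y_pred):
    """Geometric mean of per-class recalls; unlike plain accuracy it
    collapses to 0 if any single class is entirely misclassified,
    which is why it suits imbalanced problems."""
    classes = np.unique(y_true)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in classes]
    return float(np.prod(recalls) ** (1.0 / len(classes)))

def ramp_loss(margin, s=-1.0):
    """Hinge loss clipped at 1 - s: once the margin drops below s the
    loss stops growing, bounding the influence of badly misclassified
    (possibly noisy) points on the objective."""
    return np.clip(1.0 - np.asarray(margin), 0.0, 1.0 - s)

yt = np.array([0, 0, 0, 1])
yp = np.array([0, 0, 0, 0])   # majority-class-only predictor
score = g_mean(yt, yp)        # 75% accurate, but G-mean is 0
```

The example shows the point of the metric: a degenerate predictor that ignores the minority class gets 75% accuracy but a G-mean of zero, so maximising G-mean forces the model to attend to every class.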
Secondly, we explore the problem of learning a global classification model from distributed data sources under privacy constraints. In this problem, not only do the data sources have different class distributions, but combining the data into one central dataset is also prohibited. We propose a privacy-preserving framework for building a global SVM from distributed data sources. Our new framework avoids constructing a global kernel matrix by mapping non-linear inputs to a linear feature space and then solving a distributed linear SVM over these virtual points. Our method solves both the imbalance and privacy problems while achieving the same level of accuracy as a regular SVM.
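One standard way to realise the "map non-linear inputs to a linear feature space" step is random Fourier features, which replace an RBF kernel with an explicit map so that a plain linear SVM can be trained on the mapped points. This is an illustrative stand-in, not necessarily the mapping used in the dissertation:

```python
import numpy as np

def rff_map(X, n_features, gamma, rng):
    """Random Fourier features z(x): inner products z(x)·z(y) approximate
    the RBF kernel exp(-gamma * ||x - y||^2), so a linear SVM on z(X)
    behaves like a kernel SVM without ever forming a kernel matrix."""
    d = X.shape[1]
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, n_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

rng = np.random.default_rng(0)
Z = rff_map(np.array([[0.0, 0.0], [1.0, 0.0]]), 4000, 0.5, rng)
approx = Z[0] @ Z[1]   # close to exp(-0.5 * 1) = exp(-0.5)
```

Because each site only needs the shared `(W, b)` to compute its own `z(x)` vectors, the mapped "virtual points" can be used in a distributed linear solver without exchanging raw data, which matches the privacy motivation above.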
Finally, we extend our framework to handle high-dimensional data by utilizing Generalized Multiple Kernel Learning to select a sparse combination of features and kernels. This new model produces a smaller set of features, but yields much higher accuracy.
Software defect prediction using maximal information coefficient and fast correlation-based filter feature selection
Software quality ensures that applications that are developed are failure free. Some modern systems are intricate, due to the complexity of their information processes. Software fault prediction is an important quality assurance activity, since it is a mechanism that correctly predicts the defect proneness of modules and classifies them, saving resources, time and developers' efforts. In this study, a model that selects relevant features that can be used in defect prediction was proposed. The literature was reviewed, and it revealed that process metrics are better predictors of defects in versioning systems and are based on historic source code over time. These metrics are extracted from the source-code module and include, for example, the number of additions and deletions from the source code, the number of distinct committers and the number of modified lines. In this research, defect prediction was conducted using open source software (OSS) of software product line(s) (SPL), hence process metrics were chosen. Data sets that are used in defect prediction may contain non-significant and redundant attributes that may affect the accuracy of machine-learning algorithms. In order to improve the prediction accuracy of classification models, features that are significant in the defect prediction process are utilised. In machine learning, feature selection techniques are applied in the identification of the relevant data. Feature selection is a pre-processing step that helps to reduce the dimensionality of data in machine learning. Feature selection techniques include information-theoretic methods that are based on the entropy concept. This study evaluated the efficiency of these feature selection techniques, and it was found that software defect prediction using significant attributes improves the prediction accuracy.
A novel MICFastCR model, based on the Maximal Information Coefficient (MIC), was developed to select significant attributes, with the Fast Correlation-Based Filter (FCBF) used to eliminate redundant attributes. Machine learning algorithms were then run to predict software defects. The MICFastCR model achieved the highest prediction accuracy as reported by various performance measures.
School of Computing, Ph. D. (Computer Science)
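The entropy-based scores underlying filters such as FCBF can be sketched with the symmetrical uncertainty measure that FCBF uses to rate discrete attributes (an illustrative sketch, not the MICFastCR code):

```python
from collections import Counter
from math import log2

def entropy(xs):
    """Shannon entropy H(X) of a sequence of discrete values, in bits."""
    counts = Counter(xs)
    n = len(xs)
    return -sum(c / n * log2(c / n) for c in counts.values())

def symmetrical_uncertainty(x, y):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)): 1 when X determines Y
    and vice versa, 0 when they are independent. FCBF keeps features
    with high SU against the class and drops features whose SU against
    an already-kept feature exceeds their SU against the class."""
    hx, hy = entropy(x), entropy(y)
    hxy = entropy(list(zip(x, y)))      # joint entropy H(X, Y)
    mi = hx + hy - hxy                  # mutual information I(X; Y)
    return 2.0 * mi / (hx + hy) if hx + hy else 0.0
```

Because SU is normalised to [0, 1], the same threshold can be used for both the relevance test (feature vs. class) and the redundancy test (feature vs. feature), which is what makes the FCBF search fast.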
Compressive Sampling and Feature Ranking Framework for Bearing Fault Classification with Vibration Signals
Failures of rolling element bearings are amongst the main causes of machine breakdowns. To
prevent such breakdowns, bearing health monitoring is performed by collecting data from rotating machines,
extracting features from the collected data, and applying a classifier to classify faults. To avoid the heavy
storage requirements and processing time of a tremendously large amount of vibration data, the present
paper proposes a combined Compressive Sampling (CS) based on Multiple Measurement Vectors (MMV) and
Feature Ranking (FR) framework to optimally learn fewer features from a large amount of vibration data
from which bearing health conditions can be classified. The MMV-based CS model is the first step in this
framework and provides compressively-sampled signals based on compressed sampling rates. In the second
step, the search for the most important features of these compressively-sampled signals is performed using
feature ranking and selection techniques. For that purpose, we have investigated the following: (1) two
compressible representations of vibration signals that can be used within the CS framework, namely, Fast Fourier
Transform (FFT) based coefficients and thresholded Wavelet Transform (WT) based coefficients, and (2)
several feature ranking and selection techniques, namely, three similarity-based techniques, Fisher Score
(FS), Laplacian Score (LS), and Relief-F; one correlation-based technique, the Pearson Correlation Coefficient
(PCC); and one independence test technique, Chi-Square (Chi-2) to select fewer features that can sufficiently
represent the original vibration signals. These selected features, in combination with three of the popular
classifiers - multinomial Logistic Regression classifier (LRC), Artificial Neural Networks (ANNs), and
Support Vector Machines (SVMs), have been evaluated for the classification of bearing faults. Results show
that the proposed framework achieves high classification accuracies with a limited amount of data using
various combinations of methods, which outperform recently published results.
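Of the ranking techniques listed, the Fisher Score is the simplest to sketch: it rewards features whose class means are far apart relative to their pooled within-class spread (an illustrative implementation, not the paper's code):

```python
import numpy as np

def fisher_score(X, y):
    """Fisher Score per feature: sum over classes of n_c * (class mean -
    overall mean)^2, divided by the pooled within-class variance.
    Higher scores mean better class separation on that feature."""
    classes = np.unique(y)
    mean_all = X.mean(axis=0)
    num = np.zeros(X.shape[1])
    den = np.zeros(X.shape[1])
    for c in classes:
        Xc = X[y == c]
        num += len(Xc) * (Xc.mean(axis=0) - mean_all) ** 2
        den += len(Xc) * Xc.var(axis=0)
    return num / den

# feature 0 separates the two classes; feature 1 is pure noise
X = np.array([[0.0, 5.0], [0.1, -3.0], [1.0, 4.0], [1.1, -2.0]])
y = np.array([0, 0, 1, 1])
scores = fisher_score(X, y)
```

Ranking the compressively-sampled coefficients by such a score and keeping only the top few is the kind of FR step the framework applies before handing the features to LRC, ANN, or SVM classifiers.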