Comparative Study of Classification Techniques on Breast Cancer FNA Biopsy Data
Accurate diagnostic detection of cancerous cells in a patient is critical and may alter the subsequent treatment and increase the chances of survival. Machine learning techniques have been instrumental in disease detection and are currently being used in various classification problems due to their accurate prediction performance. Different techniques may provide different accuracies, and it is therefore imperative to use the most suitable method, the one that provides the best results. This research provides a comparative analysis of the Support Vector Machine, a Bayesian classifier, and other classifiers (backpropagation neural networks, linear programming, learning vector quantization, and K-nearest neighbor) on the Wisconsin breast cancer classification problem.
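The comparison described above can be sketched with scikit-learn stand-ins for the named classifiers; this is a minimal illustration on scikit-learn's copy of the Wisconsin data, not the study's own implementations or parameter settings.

```python
# Hedged sketch: compare several classifiers on the Wisconsin breast cancer
# data. Model choices and hyperparameters are illustrative assumptions.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "SVM": make_pipeline(StandardScaler(), SVC()),
    "Bayesian": GaussianNB(),
    "kNN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "Backprop ANN": make_pipeline(StandardScaler(),
                                  MLPClassifier(max_iter=1000, random_state=0)),
}
scores = {name: m.fit(X_tr, y_tr).score(X_te, y_te) for name, m in models.items()}
for name, acc in scores.items():
    print(f"{name}: {acc:.3f}")
```

Accuracies on a single held-out split like this vary with the split; the abstract's point is precisely that such differences motivate a systematic comparison.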
Design of Machine Learning Algorithms with Applications to Breast Cancer Detection
Machine learning is concerned with the design and development of algorithms and
techniques that allow computers to 'learn' from experience with respect to some class
of tasks and performance measure. One application of machine learning is to improve
the accuracy and efficiency of computer-aided diagnosis systems to assist physicians, radiologists, cardiologists, neuroscientists, and health-care technologists. This thesis
focuses on machine learning and the applications to breast cancer detection. Emphasis
is laid on preprocessing of features, pattern classification, and model selection.
Before the classification task, feature selection and feature transformation may be
performed to reduce the dimensionality of the features and to improve the classification
performance. A genetic algorithm (GA) can be employed for feature selection based
on different measures of data separability or the estimated risk of a chosen classifier.
A separate nonlinear transformation can be performed by applying kernel principal
component analysis and kernel partial least squares.
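The kernel-PCA transformation step mentioned above can be sketched as follows; the kernel, its width, and the number of components retained are assumptions, and the GA-driven feature selection is not reproduced.

```python
# Sketch of the nonlinear feature transformation step: kernel PCA reduces
# the feature dimensionality before classification.
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import KernelPCA

X, y = load_breast_cancer(return_X_y=True)
X_std = StandardScaler().fit_transform(X)

# RBF kernel and 5 components are illustrative choices, not the thesis's.
kpca = KernelPCA(n_components=5, kernel="rbf", gamma=0.01)
X_kpca = kpca.fit_transform(X_std)
print(X.shape, "->", X_kpca.shape)   # (569, 30) -> (569, 5)
```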
Different classifiers are proposed in this work: The SOM-RBF network combines
self-organizing maps (SOMs) and radial basis function (RBF) networks, with the RBF
centers set as the weight vectors of neurons from the competitive layer of a trained SOM. The pairwise Rayleigh quotient (PRQ) classifier seeks one discriminating boundary
by maximizing an unconstrained optimization objective, named as the PRQ criterion,
formed with a set of pairwise constraints instead of individual training samples.
The strict 2-surface proximal (S2SP) classifier seeks two proximal planes that are not necessarily parallel to fit the distribution of the samples in the original feature space or a kernel-defined feature space, by maximizing two strict optimization objectives with
a 'square of sum' optimization factor. Two variations of the support vector data description
(SVDD) with negative samples (NSVDD) are proposed by involving different
forms of slack vectors, which learn a closed spherically shaped boundary, named as the
supervised compact hypersphere (SCH), around a set of samples in the target class. We
extend the NSVDDs to solve the multi-class classification problems based on distances
between the samples and the centers of the learned SCHs in a kernel-defined feature
space, using a combination of linear discriminant analysis and the nearest-neighbor rule.
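The NSVDD and SCH formulations are specific to the thesis, but the underlying idea of a closed, spherically shaped boundary around the target class can be approximated with scikit-learn's OneClassSVM (with an RBF kernel, SVDD and the one-class SVM are closely related up to parametrization); all parameter values here are assumptions.

```python
# Stand-in sketch for an SVDD-style closed boundary: OneClassSVM learns a
# region enclosing the target class; samples outside it are flagged.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
target = rng.normal(0.0, 1.0, size=(200, 2))    # target-class samples
outliers = rng.normal(6.0, 1.0, size=(20, 2))   # samples far from the class

boundary = OneClassSVM(kernel="rbf", nu=0.05, gamma=0.2).fit(target)
inside = boundary.predict(target)     # +1 = inside the learned region
outside = boundary.predict(outliers)  # -1 = outside

print((inside == 1).mean(), (outside == -1).mean())
```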
The problem of model selection is studied to pick the best values of the hyperparameters
for a parametric classifier. To choose the optimal kernel or regularization
parameters of a classifier, we investigate different criteria, such as the validation error
estimate and the leave-one-out bound, as well as different optimization methods, such
as grid search, gradient descent, and GA. By viewing the tuning problem of the multiple
parameters of a 2-norm support vector machine (SVM) as an identification problem
of a nonlinear dynamic system, we design a tuning system by employing the extended
Kalman filter based on cross validation. Independent kernel optimization based on different measures of data separability is also investigated for different kernel-based
classifiers.
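Of the model-selection strategies listed above, the simplest, grid search over kernel and regularization parameters with cross-validated error as the criterion, can be sketched as follows; the parameter grid is an illustrative assumption.

```python
# Hedged sketch of hyperparameter selection by grid search with cross
# validation (the validation-error criterion discussed above).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
pipe = Pipeline([("scale", StandardScaler()), ("svm", SVC(kernel="rbf"))])
grid = {"svm__C": [0.1, 1, 10, 100], "svm__gamma": [1e-3, 1e-2, 1e-1]}

search = GridSearchCV(pipe, grid, cv=5).fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Gradient descent, GA, or the extended-Kalman-filter tuning described in the thesis would replace the exhaustive grid with a guided search over the same objective.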
Numerous computer experiments using benchmark datasets verify the theoretical results and compare the techniques in terms of classification accuracy or area under the receiver operating characteristic (ROC) curve. Computational
requirements, such as the computing time and the number of hyper-parameters, are
also discussed.
All of the presented methods are applied to breast cancer detection from fine-needle
aspiration and in mammograms, as well as screening of knee-joint vibroarthrographic
signals and automatic monitoring of roller bearings with vibration signals. Experimental
results demonstrate the excellence of these methods with improved classification
performance.
For breast cancer detection, instead of only providing a binary diagnostic decision
of 'malignant' or 'benign', we propose methods to assign a measure of confidence
of malignancy to an individual mass, by calculating probabilities of being benign and
malignant with a single classifier or a set of classifiers.
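The idea of reporting a confidence of malignancy rather than a hard binary label can be sketched with any probabilistic classifier's predict_proba output; the classifier choice here is an assumption, not the thesis's own probability model.

```python
# Sketch: attach a probability of malignancy to each test sample instead of
# only a 'malignant'/'benign' decision. In scikit-learn's breast cancer data,
# target 0 = malignant and 1 = benign.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_tr, y_tr)

p_malignant = clf.predict_proba(X_te)[:, 0]  # column 0 = class 0 = malignant
print(p_malignant[:3].round(3))
```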
Internet addiction disorder detection of Chinese college students using several personality questionnaire data and support vector machine
With the unprecedented development of the Internet comes the challenge of Internet Addiction (IA), which is hard to diagnose and cure according to state-of-the-art research. In this study, we explored the feasibility of machine learning methods to detect IA. We acquired a dataset consisting of 2397 Chinese college students from the University (Age: 19.17 ± 0.70, Male: 64.17%) who completed the Brief Self Control Scale (BSCS), the 11th version of the Barratt Impulsiveness Scale (BIS-11), the Chinese Big Five Personality Inventory (CBF-PI) and the Chen Internet Addiction Scale (CIAS), where CBF-PI includes five sub-features (Openness, Extraversion, Conscientiousness, Agreeableness, and Neuroticism) and BSCS includes three sub-features (Attention, Motor and Non-planning). We applied Student's t-test on the dataset for feature selection, and Support Vector Machines (SVMs), including C-SVM and ν-SVM with grid search, for classification and parameter optimization. This work illustrates that the SVM is a reliable method for the assessment of IA and for questionnaire data analysis. The best detection performance, 96.32%, was obtained by the C-SVM on the 6-feature dataset without normalization. Finally, BIS-11, BSCS, Motor, Neuroticism, Non-planning, and Conscientiousness are shown to be promising features for the detection of IA. Funding: Fundamental Research Funds for the Central Universities of Tongji University (22120170043, 22120180542) and the Natural Science Foundation of Shanghai, grant number 16JC140130.
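The pipeline this abstract describes, t-test feature screening followed by a C-SVM tuned by grid search, can be sketched on synthetic data, since the questionnaire dataset is not public; the significance threshold and parameter grid are assumptions.

```python
# Hedged sketch: Student's t-test feature selection, then C-SVM with grid
# search, on synthetic stand-in data.
import numpy as np
from scipy.stats import ttest_ind
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=10, n_informative=4,
                           random_state=0)

# Keep features whose class-wise means differ significantly (p < 0.05).
pvals = np.array([ttest_ind(X[y == 0, i], X[y == 1, i]).pvalue
                  for i in range(X.shape[1])])
X_sel = X[:, pvals < 0.05]

search = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "gamma": ["scale", 0.1]},
                      cv=5).fit(X_sel, y)
print(X_sel.shape[1], "features kept; CV accuracy:", round(search.best_score_, 3))
```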
FAULT DETECTION FRAMEWORK FOR IMBALANCED AND SPARSELY-LABELED DATA SETS USING SELF-ORGANIZING MAPS
While machine learning techniques developed for fault detection usually assume that the classes in the training data are balanced, in real-world applications this is seldom the case. These techniques also usually require labeled training data, which is costly and time-consuming to obtain. In this context, a data-driven framework is developed to detect faults in systems where the condition monitoring data is either imbalanced or consists of mostly unlabeled observations. To mitigate the problem of class imbalance, self-organizing maps (SOMs) are trained in a supervised manner, using the same map size for both classes of data, prior to performing classification. The optimal SOM size for balancing the classes in the data, the size of the neighborhood function, and the learning rate are determined by performing multiobjective optimization on SOM quality measures such as quantization error and information entropy, and performance measures such as training time and classification error. For training data sets which contain a majority of unlabeled observations, the transductive semi-supervised approach is used to label the neurons of an unsupervised SOM, before performing supervised SOM classification on the test data set. The developed framework is validated using artificial and real-world fault detection data sets.
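A plain unsupervised SOM, the building block of the framework above, can be sketched in a few lines of NumPy; the supervised training, map-size optimization, and semi-supervised neuron labeling of the paper are not reproduced, and the grid size and learning schedule are assumptions.

```python
# Minimal SOM sketch: iteratively pull the best-matching unit (BMU) and its
# grid neighbors toward each sample, with decaying learning rate and radius.
import numpy as np

def train_som(X, grid=(5, 5), n_iter=2000, lr0=0.5, sigma0=2.0, seed=0):
    rng = np.random.default_rng(seed)
    rows, cols = grid
    W = rng.normal(size=(rows * cols, X.shape[1]))       # neuron weights
    coords = np.array([(r, c) for r in range(rows) for c in range(cols)])
    for t in range(n_iter):
        x = X[rng.integers(len(X))]
        bmu = np.argmin(((W - x) ** 2).sum(axis=1))      # best-matching unit
        frac = t / n_iter
        lr = lr0 * (1 - frac)                            # decaying learning rate
        sigma = sigma0 * (1 - frac) + 0.5                # decaying radius
        d2 = ((coords - coords[bmu]) ** 2).sum(axis=1)   # grid distances to BMU
        h = np.exp(-d2 / (2 * sigma ** 2))               # neighborhood function
        W += lr * h[:, None] * (x - W)
    return W

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 0.5, (100, 3)), rng.normal(2, 0.5, (100, 3))])
W = train_som(X)
# Quantization error: mean distance from each sample to its BMU.
qerr = np.mean([np.min(np.linalg.norm(W - x, axis=1)) for x in X])
print(round(qerr, 3))
```

The quantization error computed at the end is one of the SOM quality measures the abstract names as an optimization objective.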
Implementing decision tree-based algorithms in medical diagnostic decision support systems
As a branch of healthcare, medical diagnosis can be defined as finding the disease based on the signs and symptoms of the patient. To this end, the required information is gathered from different sources like physical examination, medical history and general information of the patient. Development of smart classification models for medical diagnosis is of great interest amongst the researchers. This is mainly owing to the fact that the machine learning and data mining algorithms are capable of detecting the hidden trends between features of a database. Hence, classifying the medical datasets using smart techniques paves the way to design more efficient medical diagnostic decision support systems.
Several databases have been provided in the literature to investigate different aspects of diseases. As an alternative to the available diagnosis tools/methods, this research involves machine learning algorithms called Classification and Regression Tree (CART), Random Forest (RF) and Extremely Randomized Trees or Extra Trees (ET) for the development of classification models that can be implemented in computer-aided diagnosis systems. As a decision tree (DT), CART is fast to create, and it applies to both the quantitative and qualitative data. For classification problems, RF and ET employ a number of weak learners like CART to develop models for classification tasks.
We employed the Wisconsin Breast Cancer Database (WBCD), the Z-Alizadeh Sani dataset for coronary artery disease (CAD), and the databanks gathered in Ghaem Hospital’s dermatology clinic for the response of patients having common and/or plantar warts to the cryotherapy and/or immunotherapy methods. To classify the breast cancer type based on the WBCD, the RF and ET methods were employed. It was found that the developed RF and ET models forecast the WBCD type with 100% accuracy in all cases. To choose the proper treatment approach for warts, as well as for the CAD diagnosis, the CART methodology was employed. The findings of the error analysis revealed that the proposed CART models for the applications of interest attain the highest precision, and no model in the literature rivals them. The outcome of this study supports the idea that methods like CART, RF and ET not only improve the diagnosis precision, but also reduce the time and expense needed to reach a diagnosis. However, since these strategies are highly sensitive to the quality and quantity of the introduced data, more extensive databases with a greater number of independent parameters might be required for further practical implications of the developed models.
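The three tree-based learners this abstract compares have direct scikit-learn counterparts; the sketch below runs them on scikit-learn's copy of the Wisconsin data with default parameters, so the study's exact splits, other datasets, and reported accuracies are not reproduced.

```python
# Hedged sketch: CART, Random Forest, and Extra Trees compared by 5-fold
# cross-validated accuracy on the Wisconsin breast cancer data.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier            # CART
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier

X, y = load_breast_cancer(return_X_y=True)
models = [("CART", DecisionTreeClassifier(random_state=0)),
          ("RF", RandomForestClassifier(random_state=0)),
          ("ET", ExtraTreesClassifier(random_state=0))]
accs = {name: cross_val_score(m, X, y, cv=5).mean() for name, m in models}
for name, acc in accs.items():
    print(f"{name}: {acc:.3f}")
```

Under cross-validation the ensembles (RF, ET) typically edge out the single tree, consistent with the abstract's framing of RF and ET as combinations of weak CART-like learners.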
Towards Efficient Intrusion Detection using Hybrid Data Mining Techniques
The enormous development in the connectivity among different types of networks poses significant concerns in terms of privacy and security. As such, the exponential expansion in the deployment of cloud technology has produced a massive amount of data from a variety of applications, resources and platforms. In turn, the rapid rate and volume of data creation in high dimension has begun to pose significant challenges for data management and security. Handling redundant and irrelevant features in high-dimensional space has caused a long-term challenge for network anomaly detection. Eliminating such features with spectral information not only speeds up the classification process, but also helps classifiers make accurate decisions during attack recognition time, especially when coping with large-scale and heterogeneous data such as network traffic data. Furthermore, the continued evolution of network attack patterns has resulted in the emergence of zero-day cyber attacks, which are nowadays considered a major challenge in cyber security. In this threat environment, traditional security protections like firewalls, anti-virus software, and virtual private networks are not always sufficient. With this in mind, most of the current intrusion detection systems (IDSs) are either signature-based, which has been proven to be insufficient in identifying novel attacks, or developed based on obsolete datasets. Hence, a robust mechanism for detecting intrusions, i.e. anomaly-based IDS, in the big data setting has become a topic of importance. In this dissertation, an empirical study has been conducted at the initial stage to identify the challenges and limitations in the current IDSs, providing a systematic treatment of methodologies and techniques. Next, a comprehensive IDS framework has been proposed to overcome the aforementioned shortcomings.
First, a novel hybrid dimensionality reduction technique is proposed combining information gain (IG) and principal component analysis (PCA) methods with an ensemble classifier based on three different classification techniques, named IG-PCA-Ensemble. Experimental results show that the proposed dimensionality reduction method contributes more critical features and reduces the detection time significantly. The results also show that the proposed IG-PCA-Ensemble approach exhibits better performance than the majority of the existing state-of-the-art approaches.
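The IG-PCA-Ensemble idea can be sketched as a three-stage pipeline: an information-gain-style filter (mutual information here), PCA, and a voting ensemble of three classifiers. The base classifiers, component counts, and synthetic data are assumptions standing in for the dissertation's setup.

```python
# Hedged sketch of the IG -> PCA -> three-classifier ensemble pipeline.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.ensemble import VotingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=600, n_features=30, n_informative=8,
                           random_state=0)
pipe = Pipeline([
    ("ig", SelectKBest(mutual_info_classif, k=15)),   # information-gain filter
    ("pca", PCA(n_components=8)),                     # dimensionality reduction
    ("ens", VotingClassifier([("lr", LogisticRegression(max_iter=1000)),
                              ("nb", GaussianNB()),
                              ("rf", RandomForestClassifier(random_state=0))])),
])
acc = cross_val_score(pipe, X, y, cv=5).mean()
print(round(acc, 3))
```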
A New Design of Multiple Classifier System and its Application to Classification of Time Series Data
To solve the challenging pattern classification problem, machine learning researchers have extensively studied Multiple Classifier Systems (MCSs). The motivations for combining classifiers are found in the literature from the statistical, computational and representational perspectives. Although the results of classifier combination do not always outperform the best individual classifier in the ensemble, empirical studies have demonstrated its superiority for various applications.
A number of viable methods to design MCSs have been developed including bagging, adaboost, rotation forest, and random subspace. They have been successfully applied to solve various tasks. Currently, most of the research is being conducted on the behavior patterns of the base classifiers in the ensemble. However, a discussion from the learning point of view may provide insights into the robust design of MCSs. In this thesis, Generalized Exhaustive Search and Aggregation (GESA) method is developed for this objective. Robust performance is achieved using GESA by dynamically adjusting the trade-off between fitting the training data adequately and preventing the overfitting problem. Besides its learning algorithm, GESA is also distinguished from traditional designs by its architecture and level of decision-making. GESA generates a collection of ensembles and dynamically selects the most appropriate ensemble for decision-making at the local level.
Although GESA provides a good improvement over traditional approaches, it is not very data-adaptive. A data-adaptive design of MCSs demands that the system can adaptively select representations and classifiers to generate effective decisions for aggregation. Another weakness of GESA is its high computation cost, which prevents it from being scaled to large ensembles. Generalized Adaptive Ensemble Generation and Aggregation (GAEGA) is an extension of GESA to overcome these two difficulties. GAEGA employs a greedy algorithm to adaptively select the most effective representations and classifiers while excluding noisy ones as much as possible. Consequently, GAEGA can generate fewer ensembles and significantly reduce the computation cost. Bootstrapped Adaptive Ensemble Generation and Aggregation (BAEGA) is another extension of GESA, which is similar to GAEGA in the ensemble generation and decision aggregation. BAEGA adopts a different data manipulation strategy to improve the diversity of the generated ensembles and utilize the information in the data more effectively.
As a specific application, the classification of time series data is chosen for the research reported in this thesis. This type of data contains dynamic information and proves to be more complex than others. Multiple Input Representation-Adaptive Ensemble Generation and Aggregation (MIR-AEGA) is derived from GAEGA for the classification of time series data. MIR-AEGA involves some novel representation methods that proved to be effective for time series data.
All the proposed methods, including GESA, GAEGA, MIR-AEGA, and BAEGA, are tested on simulated and benchmark data sets from popular data repositories. The experimental results confirm that the newly developed methods are effective and efficient.
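GESA and its variants are specific to this thesis, but the underlying MCS principle, training many base learners on resampled data and aggregating their decisions, can be sketched with a standard bagging ensemble; the base learner, ensemble size, and synthetic data are all assumptions, not the thesis's algorithms.

```python
# Hedged sketch of the multiple-classifier-system principle via bagging:
# a single tree versus an aggregated ensemble of 50 bootstrapped trees.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
single = DecisionTreeClassifier(random_state=0)
ensemble = BaggingClassifier(single, n_estimators=50, random_state=0)

acc_single = cross_val_score(single, X, y, cv=5).mean()
acc_ens = cross_val_score(ensemble, X, y, cv=5).mean()
print(round(acc_single, 3), round(acc_ens, 3))
```

The thesis's contribution lies in how the ensembles are generated and selected (exhaustively in GESA, greedily in GAEGA, via bootstrapped data manipulation in BAEGA), not in the aggregation step sketched here.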
Study On Clustering Techniques And Application To Microarray Gene Expression Bioinformatics Data
With the explosive growth of publicly available genomic data, a new field of computer science, bioinformatics, has emerged, focusing on the use of computing systems for efficiently deriving, storing, and analyzing the character strings of the genome to help solve problems in molecular biology. The flood of data from biology, mainly in the form of DNA, RNA and protein sequences, puts heavy demands on computers and computational scientists. At the same time, it demands a transformation of the basic ethos of the biological sciences. Hence, data mining techniques can be used to efficiently explore hidden patterns underlying biological data. Unsupervised classification, also known as clustering, one of the branches of data mining, can be applied to biological data, and this can result in a better era of rapid medical development and drug discovery. In the past decade, the advent of efficient genome sequencing tools has led to enormous progress in the life sciences. Among the most important innovations, microarray technology allows the expression of thousands of genes to be quantified simultaneously. The characteristics of these data which make them different from typical machine-learning/pattern-recognition data include a fair amount of random noise, missing values, a dimension in the range of thousands, and a sample size of a few dozen. A particular application of microarray technology is in the area of cancer research, where the goal is precise and early detection of tumorous cells with high accuracy. The challenge for biologists and computer scientists is to provide solutions in terms of automation, quality and efficiency.
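The clustering setting this abstract describes, few samples with thousands of noisy gene-expression features, can be sketched on synthetic data, since no specific dataset is named; the number of clusters, the simulated effect size, and the use of k-means are assumptions.

```python
# Hedged sketch: unsupervised clustering of microarray-like data (two
# simulated tissue types, ~30 samples each, 2000 noisy gene features).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
n_genes = 2000
class_a = rng.normal(0.0, 1.0, (30, n_genes))   # tissue type A profiles
class_b = rng.normal(0.6, 1.0, (30, n_genes))   # tissue type B, shifted means
X = np.vstack([class_a, class_b])
truth = np.array([0] * 30 + [1] * 30)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
ari = adjusted_rand_score(truth, labels)        # 1.0 = perfect agreement
print(round(ari, 3))
```

Even with heavy per-gene noise, the many weakly informative dimensions add up, which is why clustering can recover sample groups in this high-dimension, low-sample-size regime.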