53 research outputs found

    Gene selection and classification for cancer microarray data based on machine learning and similarity measures

    Background: Microarray data have a high dimension of variables and a small sample size. In microarray data analyses, two important issues are how to choose genes, which provide reliable and good prediction for disease status, and how to determine the final gene set that is best for classification. Associations among genetic markers mean one can exploit information redundancy to potentially reduce classification cost in terms of time and money. Results: To deal with redundant information and improve classification, we propose a gene selection method, Recursive Feature Addition, which combines supervised learning and statistical similarity measures. To determine the final optimal gene set for prediction and classification, we propose an algorithm, Lagging Prediction Peephole Optimization. By using six benchmark microarray gene expression data sets, we compared Recursive Feature Addition with recently developed gene selection methods: Support Vector Machine Recursive Feature Elimination, Leave-One-Out Calculation Sequential Forward Selection and several others. Conclusions: On average, with the use of popular learning machines including Nearest Mean Scaled Classifier, Support Vector Machine, Naive Bayes Classifier and Random Forest, Recursive Feature Addition outperformed other methods. Our studies also showed that Lagging Prediction Peephole Optimization is superior to random strategy; Recursive Feature Addition with Lagging Prediction Peephole Optimization obtained better testing accuracies than the gene selection method varSelRF.

    Fault analysis using state-of-the-art classifiers

    Fault analysis is the detection and diagnosis of malfunction in machine operation or process control. Early fault analysis techniques were reserved for highly critical plants, such as those in the nuclear or chemical industries, where prevention of abnormal events is given the utmost importance. The techniques developed were the result of decades of technical research and of models based on extensive characterization of equipment behavior. Applying these methods to a given application requires in-depth knowledge of the system and expert analysis. Since machine learning algorithms depend on past process data for creating a system model, a generic autonomous diagnostic system can be developed for use in common industrial setups. In this thesis, we look into some of the techniques used for fault detection and diagnosis using multi-class and one-class classifiers. First we study feature selection techniques, and classifier performance is analyzed against the number of selected features. The aim of feature selection is to reduce the impact of irrelevant variables and to reduce the computational burden on the learning algorithm. We introduce the feature selection algorithms as a literature survey; only a few algorithms are implemented to obtain the results. Fault data from a Radio Frequency (RF) generator is used to perform fault detection and diagnosis. Continuous and discrete fault data are compared for the Support Vector Machine (SVM) and Radial Basis Function Network (RBF) classifiers. In the second part we look into one-class classification techniques and their application to fault detection. One-class techniques were primarily developed to identify one class of objects from all other possible objects. Since not all fault occurrences in a system can be simulated or recorded, one-class techniques help in identifying abnormal events. We introduce four one-class classifiers and analyze them using the Receiver Operating Characteristic (ROC) curve. We also develop a feature extraction method for the RF generator data, which is used to obtain results for the one-class classifiers and for two-class classification with the Radial Basis Function Network. To apply these techniques to real-time verification, the RIT Fault Prediction software is built. The LabVIEW environment is used to build basic data management and fault detection using the Radial Basis Function Network. This software is standalone and acts as a foundation for future implementations.
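
The one-class idea in this abstract (model only fault-free data, then flag anything that does not fit) can be illustrated with a minimal sketch. This is not one of the four one-class classifiers studied in the thesis, which the abstract does not name. It is a simple mean and standard deviation threshold, and the function names, readings and the threshold of 3.0 are all assumptions made for illustration.

```python
import statistics


def fit_normal_model(normal_readings):
    """Summarize fault-free readings by their mean and sample standard deviation."""
    return statistics.fmean(normal_readings), statistics.stdev(normal_readings)


def is_abnormal(reading, model, threshold=3.0):
    """Flag a reading that lies more than `threshold` standard deviations from the mean."""
    mean, std = model
    return abs(reading - mean) > threshold * std


# Hypothetical fault-free sensor readings; no fault examples are needed for training.
normal = [10.0, 10.2, 9.8, 10.1, 9.9]
model = fit_normal_model(normal)

# The first reading is close to the training data, the second is far from it.
flags = [is_abnormal(r, model) for r in (10.3, 12.0)]
```

Sweeping `threshold` over a range of values and recording the resulting true and false positive rates is one way to trace out an ROC curve for such a detector.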

    Ensemble classification and signal image processing for genus Gyrodactylus (Monogenea)

    This thesis presents an investigation into Gyrodactylus species recognition, making use of machine learning classification and feature selection techniques, and explores image feature extraction to demonstrate proof of concept for an envisaged rapid, consistent and secure initial identification of pathogens by field workers and non-expert users. The proposed cognitively inspired framework is designed to discriminate the pathogen confidently from its non-pathogenic congeners, which is sought in order to assist diagnostics during periods of a suspected outbreak. Accurate identification of pathogens is key to their control in an aquaculture context, and the monogenean worm genus Gyrodactylus provides an ideal test-bed for the selected techniques. In the proposed algorithm, the concept of classification using a single model is extended to include more than one model. To classify multiple species of Gyrodactylus, experiments were performed using 557 specimens of nine different species, two classifiers and three feature sets. To combine these models, an ensemble-based majority voting approach was adopted. Experimental results with a database of Gyrodactylus species show the superior performance of the ensemble system. Comparison with single classification approaches indicates that the proposed framework produces a marked improvement in classification performance. The second contribution of this thesis is the exploration of image processing techniques. Active Shape Model (ASM) and Complex Network methods are applied to images of the attachment hooks of several species of Gyrodactylus to classify each specimen according to its true species. ASM is used to provide landmark points to segment the contour of the image, while the Complex Network model is used to extract information from that contour. The current system aims to confidently assign the species that is a notifiable pathogen of Atlantic salmon to its true class with a high degree of accuracy. Finally, some concluding remarks are made along with proposals for future work.
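
Majority voting over several models, as mentioned in this abstract, can be sketched as follows. The thesis's own voting rule, classifiers and feature sets are not given in the abstract, so the tie-breaking rule (first label in model order), the function names and the labels below are assumptions for illustration only.

```python
from collections import Counter


def majority_vote(votes):
    """Return the most frequent label; ties go to the label that appears first."""
    counts = Counter(votes)
    top = max(counts.values())
    for label in votes:
        if counts[label] == top:
            return label


def ensemble_predict(per_model_predictions):
    """Combine aligned prediction lists (one list per model) specimen by specimen."""
    return [majority_vote(list(votes)) for votes in zip(*per_model_predictions)]


# Three hypothetical models, each predicting a species label for three specimens.
predictions = [
    ["species_a", "species_b", "species_c"],
    ["species_a", "species_b", "species_b"],
    ["species_b", "species_b", "species_c"],
]
combined = ensemble_predict(predictions)
```

Each specimen receives the label that most models agree on, so a single model's error is outvoted whenever the other models are correct.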

    A Comparative Analysis for Filter-Based Feature Selection Techniques with Tree-based Classification

    Feature selection is an essential pre-processing step in research areas such as data mining, text mining, and image processing. Raw datasets used for machine learning predictions combine many multidimensional attributes and can be very large. When such datasets are used for classification, the many inconsistent and redundant features consume additional time and resources and degrade the classification results. To improve the efficiency and performance of classification, these features have to be eliminated. A variety of feature subset selection methods have been proposed to find and eliminate as many redundant and useless features as feasible. This work presents a comparative analysis of filter-based feature selection techniques with tree-based classification. Several feature selection techniques and classifiers are applied to different datasets using the Weka tool. We evaluated the performance of six feature selection techniques and their effects on decision tree classifiers using 10-fold cross-validation on three datasets. The analysis found that the ChiSquaredAttributeEval feature selection method with Ranker search, combined with the Random Forest classifier, outperforms the other methods in effectiveness and efficiency, and that it is applicable to numerous real datasets in several application domains.
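
The filter step named in this abstract scores each attribute by its chi-squared statistic with respect to the class and then ranks the attributes by that score. The sketch below shows the idea in plain Python rather than Weka. It handles nominal attributes only, and the attribute names and toy data are hypothetical.

```python
from collections import Counter


def chi_squared(feature, labels):
    """Chi-squared statistic of a nominal feature against the class labels."""
    n = len(labels)
    feature_counts = Counter(feature)
    class_counts = Counter(labels)
    joint_counts = Counter(zip(feature, labels))
    stat = 0.0
    for value, value_count in feature_counts.items():
        for cls, class_count in class_counts.items():
            expected = value_count * class_count / n
            observed = joint_counts.get((value, cls), 0)
            stat += (observed - expected) ** 2 / expected
    return stat


def rank_features(columns, labels):
    """Return attribute names ordered from most to least class-dependent."""
    return sorted(columns, key=lambda name: chi_squared(columns[name], labels), reverse=True)


# Toy data: 'informative' mirrors the class, 'noise' is independent of it.
labels = [0, 0, 1, 1]
columns = {"noise": [0, 1, 0, 1], "informative": [0, 0, 1, 1]}
ranking = rank_features(columns, labels)
```

A classifier would then be trained on the top-ranked attributes only, which is where the tree-based learners in the study come in.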

    Design and implementation of a pattern recognition system for the classification of proteomic mass spectrometry signals (MS-spectra) of patients with prostate cancer

    The aim of this thesis was to implement a pattern recognition system for discriminating among healthy, benign and malignant prostate tumors in proteomic mass spectrometry samples, and to identify m/z intervals that may contain biomarkers associated with prostate cancer. For this purpose, we used two different data sets, one from the National Cancer Institute of America and one from the Eastern Virginia Medical School, both of which have been used repeatedly in studies of prostate cancer. Owing to the particular nature of the spectra under examination, a pre-processing stage (smoothing, noise estimation, peak detection and peak alignment) was required first to make them suitable for further analysis. At this stage we experimented thoroughly to find the optimal parameters for pre-processing the spectra. We then developed five different classifiers (MDC, KNN, Bayesian, PNN, SVM) and a system combining them so as to achieve maximum performance. To find the optimal combination of features we implemented exhaustive search, sequential forward selection (SFS), sequential backward selection (SBS), sequential forward floating selection (SFFS) and sequential backward floating selection (SBFS). After continued experimentation with these techniques and machine learning models, we achieved, in some cases, accuracy of 95-98% for the first data set and of 92-93% for the second data set. Furthermore, based on the features that the classifiers used most often when they achieved their optimal performance, we arrived at 6 m/z intervals as likely to contain biomarkers related to prostate cancer. After cross-referencing with previous studies, biomarkers proposed by other research groups were found to lie inside our proposed m/z intervals, which strengthens our case for these intervals as candidates.
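
Sequential forward selection (SFS), one of the search strategies listed in this abstract, starts from an empty set and greedily adds the feature that most improves a subset score. The sketch below is a minimal illustration: in the thesis the score would come from classifier performance, whereas here `toy_score`, its weights and its redundancy penalty are hypothetical stand-ins.

```python
def sequential_forward_selection(n_features, score, k):
    """Greedily grow a feature subset of size k, maximizing score(subset) at each step."""
    selected = []
    remaining = list(range(n_features))
    while remaining and len(selected) < k:
        # Try each remaining feature and keep the one that gives the best subset score.
        best = max(remaining, key=lambda f: score(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical per-feature usefulness; features 1 and 3 are assumed to be redundant.
WEIGHTS = {0: 0.1, 1: 0.9, 2: 0.5, 3: 0.8}


def toy_score(subset):
    """Stand-in for classifier accuracy: summed usefulness minus a redundancy penalty."""
    total = sum(WEIGHTS[f] for f in subset)
    if 1 in subset and 3 in subset:
        total -= 0.6
    return total


chosen = sequential_forward_selection(4, toy_score, k=2)
```

Feature 3 has the second-highest individual weight, yet the search picks feature 2 second because feature 3 adds little once feature 1 is in the set. SBS runs the same loop in reverse, starting from all features and removing one at a time; the floating variants additionally allow steps in the opposite direction.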

    Computer-aided Diagnosis of Pulmonary Nodules in Thoracic Computed Tomography.

    Lung cancer is the leading cause of cancer death in the United States. The five-year survival rate is 15% because most patients present with advanced disease. If lung cancer is detected and treated at its earliest stage, the five-year survival rate has been reported as high as 92%. Computed tomography (CT) has been shown to be more sensitive than chest radiography in detecting abnormal lung lesions (nodules), especially when they are small. However, each thin-slice thoracic CT scan can contain hundreds of images. Large amounts of image data, radiologist fatigue, and diagnostic uncertainty may lead to missed cancers or unnecessary biopsies. We address these issues by developing a computer-aided diagnosis (CAD) system that would serve as a second reader for radiologists by analyzing nodules and providing a malignancy estimate using computer vision and machine learning techniques. To segment the nodules, we extended the active contour (AC) model to 3D by adding new energy terms. The classification accuracy, quantified by the area (Az) under the receiver operating characteristic curve, was used as the figure-of-merit to guide segmentation parameter optimization. The effect of CT acquisition parameters on 3DAC segmentation was systematically studied by imaging simulated nodules in chest phantoms. We conducted simulation studies to compare the relative performance of feature selection and classification methods and to examine the bias and variance introduced due to limited training sample sizes. We also designed new feature descriptors to describe the nodule surface, which were combined with texture and morphological features extracted from the nodule volume and the surrounding tissue to characterize the nodule. Stepwise feature selection was used to search for the subset of most effective features to be used in the linear discriminant analysis classifier. 
The CAD system achieved a test Az of 0.86±0.02 in a leave-one-case-out resampling scheme for 256 nodules from 152 patients. We conducted an observer study with six thoracic radiologists and found that their average Az in assessing nodule malignancy increased significantly (p<0.05) from 0.83±0.03 without CAD to 0.85±0.04 with CAD. These results indicate the potential usefulness of CAD as a second reader for radiologists in characterizing lung nodules. Ph.D., Biomedical Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/60814/1/tway_1.pd
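
The figure of merit used throughout this abstract is the area under the ROC curve. A minimal sketch of the empirical area follows, computed as the fraction of (malignant, benign) pairs in which the malignant case receives the higher score, with ties counted as half. The abstract does not say how Az was estimated in the thesis, so this pairwise estimate is an assumption, and the scores are invented for illustration.

```python
def empirical_auc(positive_scores, negative_scores):
    """Fraction of positive/negative pairs ranked correctly; ties count as half."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Hypothetical malignancy estimates for three malignant and two benign nodules.
malignant = [0.9, 0.8, 0.4]
benign = [0.7, 0.3]
area = empirical_auc(malignant, benign)
```

Five of the six pairs are ordered correctly here, so the area is 5/6. A value of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation.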

    Pattern Recognition

    Pattern recognition is a very wide research field. It involves factors as diverse as sensors, feature extraction, pattern classification, decision fusion, applications and others. The signals processed are commonly one-, two- or three-dimensional; the processing may run in real time or take hours or days; some systems look for one narrow object class, while others search huge databases for entries with at least a small amount of similarity. No single person can claim expertise across the whole field, which develops rapidly, updates its paradigms and encompasses several philosophical approaches. This book reflects this diversity by presenting a selection of recent developments within the area of pattern recognition and related fields. It covers theoretical advances in classification and feature extraction as well as application-oriented works. The authors of these 25 works present and advocate recent achievements of their research related to the field of pattern recognition.

    Developing Novel Computer Aided Diagnosis Schemes for Improved Classification of Mammography Detected Masses

    Mammography is a population-based breast cancer screening tool that has greatly aided the decrease in breast cancer mortality over time. Although mammography is the most frequently employed breast imaging modality, its performance is often unsatisfactory, with low sensitivity and high false positive rates. Reading and interpreting mammography images remains difficult because of the heterogeneity of breast tumors and dense overlapping fibroglandular tissue. To help overcome these clinical challenges, researchers have made great efforts to develop computer-aided detection and/or diagnosis (CAD) schemes to provide radiologists with decision-making support tools. In this dissertation, I investigate several novel methods for improving the performance of a CAD system in distinguishing between malignant and benign masses. In the first study, we test the hypothesis that handcrafted radiomics features and deep learning features contain complementary information, so that the fusion of these two types of features will increase the feature representation of each mass and improve the performance of the CAD system in distinguishing malignant and benign masses. Regions of interest (ROI) surrounding suspicious masses are extracted and two types of features are computed. The first set consists of 40 radiomic features and the second set includes deep learning (DL) features computed from a pretrained VGG16 network. DL features are extracted from two pseudo color image sets, producing a total of three feature vectors after feature extraction, namely: handcrafted, DL-stacked, DL-pseudo. Linear support vector machines (SVM) are trained using each feature set alone and in combinations. Results show that the fusion CAD system significantly outperforms the systems using either feature type alone (AUC=0.756±0.042, p<0.05). 
This study demonstrates that both handcrafted and DL features contain useful complementary information and that fusion of these two types of features increases the CAD classification performance. In the second study, we expand upon our first study and develop a novel CAD framework that fuses information extracted from ipsilateral views of bilateral mammograms using both DL and radiomics feature extraction methods. Each case in this study is represented by four images, which include the craniocaudal (CC) and mediolateral oblique (MLO) views of the left and right breasts. First, we extract matching ROIs from each of the four views using an ipsilateral matching and bilateral registration scheme to ensure masses are appropriately matched. Next, the handcrafted radiomics features and VGG16 model-generated features are extracted from each ROI, resulting in eight feature vectors. Then, after reducing feature dimensionality and quantifying the bilateral asymmetry, we test four fusion methods. Results show that multi-view CAD systems significantly outperform single-view systems (AUC = 0.876±0.031 vs AUC = 0.817±0.026 for CC view and 0.792±0.026 for MLO view, p<0.001). The study demonstrates that the shift from single-view CAD to four-view CAD and the inclusion of both deep transfer learning and radiomics features increase the feature representation of the mass and thus improve CAD performance in distinguishing between malignant and benign breast lesions. In the third study, we build upon the first and second studies and investigate the effects of pseudo color image generation in classifying suspicious mammography-detected breast lesions as malignant or benign using deep transfer learning in a multi-view CAD scheme. Seven pseudo color image sets are created through a combination of the original grayscale image, a histogram equalized image, a bilaterally filtered image, and a segmented mass image. 
Using the multi-view CAD framework developed in the previous study, we observe that the two pseudo-color sets created using a segmented mass in one of the three image channels performed significantly better than all other pseudo-color sets (AUC=0.882, p<0.05 for all comparisons and AUC=0.889, p<0.05 for all comparisons). The results of this study support our hypothesis that pseudo color images generated with a segmented mass optimize the mammogram image feature representation by providing increased complementary information to the CADx scheme, which results in increased performance in classifying suspicious mammography-detected breast lesions as malignant or benign. In summary, each of the studies presented in this dissertation aims to increase the accuracy of a CAD system in classifying suspicious mammography-detected masses. Each of these studies takes a novel approach to increasing the feature representation of the mass that needs to be classified. The results of each study demonstrate the potential utility of these CAD schemes as an aid to radiologists in the clinical workflow.