    Kernel-Based Data Mining Approach with Variable Selection for Nonlinear High-Dimensional Data

    In statistical data mining research, datasets are often nonlinear and high-dimensional, which makes them difficult to analyze comprehensively with traditional statistical methodologies. Kernel-based data mining is one of the most effective statistical methodologies for investigating problems in areas including pattern recognition, machine learning, bioinformatics, chemometrics, and statistics. In particular, the analysis of high-dimensional data requires statistically sophisticated procedures that emphasize the reliability of results and computational efficiency. In this dissertation, first, a novel wrapper method called SVM-ICOMP-RFE is introduced and developed. It hybridizes the support vector machine (SVM) and recursive feature elimination (RFE) with the information-theoretic measure of complexity (ICOMP) to classify high-dimensional data sets and to select the subset of variables in the original data space that best discriminates between groups. RFE ranks the variables using the ICOMP criterion. Second, a dual variables functional support vector machine approach is proposed. The proposed approach uses both the first and second derivatives of the degradation profiles. A modified floating search algorithm for repeated variable selection, with newly added degradation path points, is presented to find a few good variables while reducing the computation time for on-line implementation. Third, a two-stage scheme for the classification of near infrared (NIR) spectral data is proposed. In the first stage, the proposed multi-scale vertical energy thresholding (MSVET) procedure is used to reduce the dimension of the high-dimensional spectral data. In the second stage, a few important wavelet coefficients are selected using the proposed SVM gradient-recursive feature elimination (RFE). Fourth, a novel methodology for discriminant analysis based on a human decision making process, called PDCM, is proposed. The proposed methodology consists of three basic steps emulating the thinking process: perception, decision, and cognition. In these steps, support vector machines for classification and information complexity are integrated to evaluate learning models.
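The recursive feature elimination loop mentioned in this abstract can be sketched as follows. This is only a minimal illustration of the generic elimination skeleton: the scoring function `mean_gap_score` is a hypothetical stand-in (an absolute class-mean difference), not the ICOMP criterion or an SVM-based ranking, and all names and the toy data are assumptions made for the example.

```python
def recursive_feature_elimination(X, y, score_features, n_keep):
    """Backward elimination: rescore the remaining features, drop the weakest."""
    remaining = list(range(len(X[0])))   # original column indices still in play
    eliminated = []                      # columns in the order they were dropped
    while len(remaining) > n_keep:
        X_sub = [[row[j] for j in remaining] for row in X]
        scores = score_features(X_sub, y)          # one score per remaining column
        worst = min(range(len(remaining)), key=lambda k: scores[k])
        eliminated.append(remaining.pop(worst))
    return remaining, eliminated


def mean_gap_score(X, y):
    """Stand-in criterion (NOT ICOMP): absolute difference of class means."""
    pos = [row for row, label in zip(X, y) if label == 1]
    neg = [row for row, label in zip(X, y) if label != 1]
    return [
        abs(sum(r[j] for r in pos) / len(pos) - sum(r[j] for r in neg) / len(neg))
        for j in range(len(X[0]))
    ]


# Toy data: column 0 separates the classes, column 1 weakly, column 2 not at all.
X = [[5.0, 1.0, 0.2], [6.0, 1.2, 0.2], [1.0, 1.0, 0.2], [0.0, 0.8, 0.2]]
y = [1, 1, 0, 0]
kept, dropped = recursive_feature_elimination(X, y, mean_gap_score, n_keep=1)
```

In a wrapper method of the kind described, `score_features` would instead refit the classifier on the remaining variables at every round, which is what makes the elimination recursive rather than a one-shot filter.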

    A Surface-based Approach for Classification of 3D Neuroanatomic Structures

    We present a new framework for 3D surface object classification that combines a powerful shape description method with suitable pattern classification techniques. Spherical harmonic parameterization and normalization techniques are used to describe a surface shape and derive a dual high-dimensional landmark representation. A point distribution model is applied to reduce the dimensionality. Fisher's linear discriminants and support vector machines are used for classification. Several feature selection schemes are proposed for learning better classifiers. After showing the effectiveness of this framework using simulated shape data, we apply it to real hippocampal data in schizophrenia and perform extensive experimental studies by examining different combinations of techniques. We achieve best leave-one-out cross-validation accuracies of 93% (whole set, N=56) and 90% (right-handed males, N=39), respectively, which are competitive with the best results in previous studies using different techniques on similar types of data. Furthermore, to help medical diagnosis in practice, we employ a threshold-free receiver operating characteristic (ROC) approach as an alternative evaluation of classification results and propose a new method for visualizing discriminative patterns.

    Wind turbine multi-fault detection and classification based on SCADA data

    Due to the increasing installation of wind turbines in remote locations, both onshore and offshore, advanced fault detection and classification strategies have become crucial to achieve the required levels of reliability and availability. In this work, a data-driven multi-fault detection and classification strategy is developed without condition-monitoring devices tailored to the task, only by increasing the sampling frequency of the sensors already available in the Supervisory Control and Data Acquisition (SCADA) system of all commercial wind turbines. An advanced wind turbine benchmark is used, in which the turbine is subject to different types of faults on actuators and sensors. The main challenges of wind turbine fault detection lie in the non-linearity of the system, unknown disturbances, and significant measurement noise at each sensor. First, the SCADA measurements are pre-processed by group scaling and feature transformation (from the original high-dimensional feature space to a new space with reduced dimensionality) based on multiway principal component analysis through sample-wise unfolding. Then, support vector machine-based classification with 10-fold cross-validation is applied. Support vector machines were used as a first choice for fault detection because they have proven robust for some particular faults, although they had not previously achieved detection and classification of all the faults considered in this work. To this end, the choice of features and the selection of data are of primary importance. Simulation results showed that all studied faults were detected and classified with an overall accuracy of 98.2%. Finally, it is noteworthy that the prediction speed allows this strategy to be deployed for online (real-time) condition monitoring in wind turbines.
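The 10-fold cross-validation step named in this abstract can be illustrated with a short sketch. It is a generic k-fold procedure, not the authors' pipeline: the `fit` and `predict` callables, the fixed-threshold toy classifier, and the toy data are all assumptions introduced for the example.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs; every sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]     # k roughly equal folds
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, test_idx


def cross_validated_accuracy(fit, predict, X, y, k=10):
    """Fraction of samples classified correctly when held out of training."""
    correct = 0
    for train_idx, test_idx in k_fold_indices(len(X), k):
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        correct += sum(predict(model, X[i]) == y[i] for i in test_idx)
    return correct / len(X)


# Toy usage with a hypothetical fixed-threshold "classifier".
X = [[i / 10] for i in range(20)]
y = [int(row[0] > 0.5) for row in X]
accuracy = cross_validated_accuracy(
    fit=lambda X_train, y_train: 0.5,              # "model" is just a threshold
    predict=lambda threshold, x: int(x[0] > threshold),
    X=X,
    y=y,
)
```

Because each sample appears in exactly one test fold, the reported accuracy is always computed on data the model did not see during fitting.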

    Boosting Support Vector Machine on Imbalanced Microarray Data

    Microarray data play an important role in the classification of almost all types of cancer tissue. Two problems that often arise when classifying microarray data are high dimensionality and class imbalance. High dimensionality can be addressed with Fast Correlation-Based Filter (FCBF) feature selection. In this study, a Support Vector Machine (SVM) classifier is used because of its advantages; however, SVMs are sensitive to class imbalance. SMOTE is a preprocessing method that handles class imbalance through sampling, by increasing the number of samples in the minority class. It often works well but can suffer from overfitting. An alternative way to improve classification performance on imbalanced data is boosting, which constructs a strong final classifier by combining a set of SVMs as base classifiers over the iterations. This study compares the performance of SMOTEBoost-SVM with AdaBoost-SVM in classifying microarray data at several imbalance ratios designed in a simulation study, and applies the classifiers to public microarray datasets (colon cancer and myeloma). In the simulation study, the g-mean of all classifiers decreased as the imbalance ratio increased, but SMOTEBoost-SVM tended to be superior, and its performance decreased less (was more stable) than that of AdaBoost-SVM, SMOTE-SVM and SVM. On the public datasets, SMOTEBoost-SVM outperformed the other three methods in terms of g-mean and sensitivity. The effect of feature selection was also examined: using the informative features obtained by feature selection gave better performance than using all features in SVM classification.
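The g-mean metric reported in this abstract is commonly defined as the geometric mean of sensitivity and specificity; the sketch below assumes that standard binary definition, since the abstract does not spell it out. The function name and the toy labels are illustrative only.

```python
import math


def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of sensitivity and specificity for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0   # recall on the positive class
    specificity = tn / (tn + fp) if tn + fp else 0.0   # recall on the negative class
    return math.sqrt(sensitivity * specificity)


# Toy imbalanced data: 2 minority (positive) samples and 8 majority samples.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
always_majority = [0] * 10                      # 80% accurate, misses every positive
one_hit = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]        # finds one of the two positives

score_majority = g_mean(y_true, always_majority)
score_one_hit = g_mean(y_true, one_hit)
```

A classifier that always predicts the majority class reaches 80% accuracy on this toy set but a g-mean of zero, which is why g-mean is preferred to plain accuracy when classes are imbalanced.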

    A comparison of machine learning algorithms for chemical toxicity classification using a simulated multi-scale data model

    Background: Bioactivity profiling using high-throughput in vitro assays can reduce the cost and time required for toxicological screening of environmental chemicals and can also reduce the need for animal testing. Several public efforts are aimed at discovering patterns or classifiers in high-dimensional bioactivity space that predict tissue, organ or whole animal toxicological endpoints. Supervised machine learning is a powerful approach to discover combinatorial relationships in complex in vitro/in vivo datasets. We present a novel model to simulate complex chemical-toxicology data sets and use this model to evaluate the relative performance of different machine learning (ML) methods. Results: The classification performance of Artificial Neural Networks (ANN), K-Nearest Neighbors (KNN), Linear Discriminant Analysis (LDA), Naïve Bayes (NB), Recursive Partitioning and Regression Trees (RPART), and Support Vector Machines (SVM) in the presence and absence of filter-based feature selection was analyzed using K-way cross-validation testing and independent validation on simulated in vitro assay data sets with varying levels of model complexity, number of irrelevant features and measurement noise. While the prediction accuracy of all ML methods decreased as non-causal (irrelevant) features were added, some ML methods performed better than others. In the limit of using a large number of features, ANN and SVM were always in the top performing set of methods while RPART and KNN (k = 5) were always in the poorest performing set. The addition of measurement noise and irrelevant features decreased the classification accuracy of all ML methods, with LDA suffering the greatest performance degradation. LDA performance is especially sensitive to the use of feature selection. Filter-based feature selection generally improved performance, most strikingly for LDA. Conclusion: We have developed a novel simulation model to evaluate machine learning methods for the analysis of data sets in which in vitro bioassay data is being used to predict in vivo chemical toxicology. From our analysis, we can recommend that several ML methods, most notably SVM and ANN, are good candidates for use in real world applications in this area.

    Kernel methods in genomics and computational biology

    Support vector machines and kernel methods are increasingly popular in genomics and computational biology, due to their good performance in real-world applications and strong modularity that makes them suitable to a wide range of problems, from the classification of tumors to the automatic annotation of proteins. Their ability to work in high dimension, to process non-vectorial data, and the natural framework they provide to integrate heterogeneous data are particularly relevant to various problems arising in computational biology. In this chapter we survey some of the most prominent applications published so far, highlighting the particular developments in kernel methods triggered by problems in biology, and mention a few promising research directions likely to expand in the future

    Bounded Coordinate-Descent for Biological Sequence Classification in High Dimensional Predictor Space

    We present a framework for discriminative sequence classification where the learner works directly in the high dimensional predictor space of all subsequences in the training set. This is possible by employing a new coordinate-descent algorithm coupled with bounding the magnitude of the gradient for selecting discriminative subsequences fast. We characterize the loss functions for which our generic learning algorithm can be applied and present concrete implementations for logistic regression (binomial log-likelihood loss) and support vector machines (squared hinge loss). Application of our algorithm to protein remote homology detection and remote fold recognition results in performance comparable to that of state-of-the-art methods (e.g., kernel support vector machines). Unlike state-of-the-art classifiers, the resulting classification models are simply lists of weighted discriminative subsequences and can thus be interpreted and related to the biological problem
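The squared hinge loss and a greedy coordinate update of the kind this abstract refers to can be sketched as follows. This is a simplified illustration on an explicit, small feature matrix: it picks the coordinate with the largest gradient magnitude by exhaustive search and does not reproduce the paper's gradient bounding over the space of all subsequences. The function names, fixed step size, and toy data are assumptions.

```python
def squared_hinge_loss(beta, X, y):
    """Sum over samples of max(0, 1 - y_i * <beta, x_i>)^2, labels in {-1, +1}."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        margin = y_i * sum(b * v for b, v in zip(beta, x_i))
        total += max(0.0, 1.0 - margin) ** 2
    return total


def gradient(beta, X, y):
    """Gradient of the squared hinge loss with respect to each coordinate."""
    grad = [0.0] * len(beta)
    for x_i, y_i in zip(X, y):
        margin = y_i * sum(b * v for b, v in zip(beta, x_i))
        slack = max(0.0, 1.0 - margin)
        for j, v in enumerate(x_i):
            grad[j] += -2.0 * slack * y_i * v
    return grad


def greedy_coordinate_descent(X, y, n_steps=50, step_size=0.1):
    """Each step updates only the coordinate with the largest |gradient|."""
    beta = [0.0] * len(X[0])
    for _ in range(n_steps):
        grad = gradient(beta, X, y)
        j = max(range(len(beta)), key=lambda idx: abs(grad[idx]))
        beta[j] -= step_size * grad[j]
    return beta


# Toy data: feature 0 occurs only in positive samples; feature 1 occurs in both.
X = [[1, 0], [1, 1], [0, 1], [0, 0]]
y = [1, 1, -1, -1]
initial_loss = squared_hinge_loss([0.0, 0.0], X, y)
beta = greedy_coordinate_descent(X, y)
final_loss = squared_hinge_loss(beta, X, y)
```

Updating one coordinate at a time keeps the model a sparse list of weighted features, which matches the interpretability argument made at the end of the abstract.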