2,557 research outputs found

    Intrusion Detection Systems Using Adaptive Regression Splines

    Full text link
    The past few years have witnessed growing recognition of intelligent techniques for the construction of efficient and reliable intrusion detection systems. Due to the increasing incidence of cyber attacks, building effective intrusion detection systems (IDS) is essential for protecting information system security, yet it remains an elusive goal and a great challenge. In this paper, we report a performance analysis of Multivariate Adaptive Regression Splines (MARS), neural networks and support vector machines. The MARS procedure builds flexible regression models by fitting separate splines to distinct intervals of the predictor variables. A brief comparison of different neural network learning algorithms is also given.
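The core of the MARS procedure mentioned above is the hinge (truncated linear) basis function. A minimal sketch follows, assuming NumPy and a single fixed knot; real MARS selects knots adaptively and prunes terms, so the function names and data here are illustrative only:

```python
import numpy as np

def hinge_features(x, knots):
    """Expand a 1-D predictor into MARS-style hinge basis functions:
    max(0, x - k) and max(0, k - x) for each knot k."""
    cols = []
    for k in knots:
        cols.append(np.maximum(0.0, x - k))
        cols.append(np.maximum(0.0, k - x))
    return np.column_stack(cols)

# Synthetic piecewise-linear data with a bend at x = 0.5 (illustrative only)
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 200)
y = np.where(x < 0.5, x, 0.5 + 3.0 * (x - 0.5))

# Ordinary least squares on [intercept, hinge terms]
X = np.column_stack([np.ones_like(x), hinge_features(x, knots=[0.5])])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
rmse = float(np.sqrt(np.mean((X @ coef - y) ** 2)))
```

Because the synthetic response is exactly piecewise linear at the chosen knot, the least-squares fit on the hinge expansion recovers it essentially exactly.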

    Improving Database Quality through Eliminating Duplicate Records

    Get PDF
    Redundant or duplicate data are among the most troublesome problems in database management and applications. Approximate field matching is the key solution to this problem: it identifies semantically equivalent string values in syntactically different representations. This paper considers token-based solutions and proposes a general field matching framework that generalizes the field matching problem across different domains. By introducing the concept of String Matching Points (SMP) in string comparison, string matching accuracy and efficiency are improved compared with other commonly applied field matching algorithms. The paper discusses the development of field matching algorithms from the proposed general framework. The framework and corresponding algorithm are tested on a public data set from the NASA publication abstract database. The approach can be applied to similar problems in other databases.
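Token-based field matching of the kind the abstract describes can be approximated with a simple token-set similarity. This is a generic baseline sketch, not the paper's SMP algorithm; the function names and the 0.6 threshold are illustrative assumptions:

```python
import re

def tokenize(field):
    """Lowercase a field value and split it into alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", field.lower()))

def jaccard(a, b):
    """Token-set Jaccard similarity between two field values, in [0, 1]."""
    ta, tb = tokenize(a), tokenize(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

def is_duplicate(a, b, threshold=0.6):
    """Flag two records as probable duplicates above a similarity threshold."""
    return jaccard(a, b) >= threshold

# Word order and punctuation differences are ignored by the token-set view:
# is_duplicate("NASA Ames Research Center", "Ames Research Center, NASA") -> True
```

The threshold trades precision against recall and would normally be tuned on labeled duplicate pairs.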

    Supervised learning-based tagSNP selection for genome-wide disease classifications

    Get PDF
    The article was originally published by BMC Genomics. doi:10.1186/1471-2164-9-S1-S6
    Comprehensive evaluation of common genetic variations through association of single nucleotide polymorphisms (SNPs) with complex human diseases on the genome-wide scale is an active area in human genome research. One of the fundamental questions in a SNP-disease association study is to find an optimal subset of SNPs with predictive power for disease status. To find that subset while reducing study burden in terms of time and costs, one can potentially exploit information redundancy from associations between SNP markers.
    Research support received from ICASA (Institute for Complex Additive Systems Analysis, a division of New Mexico Tech) and the Radiology Department of Brigham and Women's Hospital (BWH) is gratefully acknowledged. The authors highly appreciate Dr. Liang at SUNY-Buffalo for her invaluable help and insightful discussion during this study, and Ms. Kim Lawson at the BWH Radiology Department for her manuscript editing and very constructive comments.
    Keywords: Supervised Recursive Feature Addition; Support Vector based Recursive Feature Addition; complex disease; genetics; disease prediction

    Influence of Machine Learning vs. Ranking Algorithm on the Critical Dimension

    Get PDF
    Article originally published in the International Journal of Future Computer and Communication.
    The critical dimension is the minimum number of features required for a learning machine to perform with "high" accuracy, which for a specific dataset is dependent upon the learning machine and the ranking algorithm. Discovering the critical dimension, if one exists for a dataset, can help to reduce the feature size while maintaining the learning machine's performance. It is important to understand the influence of learning machines and ranking algorithms on the critical dimension in order to reduce the feature size effectively. In this paper we experiment with three ranking algorithms and three learning machines on several datasets to study their combined effect on the critical dimension. Results show the ranking algorithm has greater influence on the critical dimension than the learning machine.
    Supported by ICASA (Institute for Complex Additive Systems Analysis) of New Mexico Tech and the National Institute of Justice, U.S. Department of Justice (Award No. 2010-DN-BX-K223).
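Operationally, the critical dimension can be read off an accuracy-versus-top-k curve. A minimal sketch, assuming accuracies have already been measured for each prefix of a feature ranking; the accuracy values and target threshold here are hypothetical, and the paper's exact criterion may differ:

```python
def critical_dimension(accuracies, target):
    """Given accuracies[k-1] = accuracy of a learning machine trained on the
    top-k ranked features, return the smallest k reaching the target accuracy,
    or None if no prefix of the ranking reaches it."""
    for k, acc in enumerate(accuracies, start=1):
        if acc >= target:
            return k
    return None

# Hypothetical accuracy curve over the top-k ranked features
accs = [0.61, 0.72, 0.88, 0.90, 0.91, 0.91]
cd = critical_dimension(accs, target=0.88)  # smallest k at or above target
```

Under a different ranking algorithm the curve, and hence the critical dimension, changes, which is the effect the paper studies.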

    Feature Selection and Classification of MAQC-II Breast Cancer and Multiple Myeloma Microarray Gene Expression Data

    Get PDF
    Microarray data have a high dimension of variables, but available datasets usually contain only a small number of samples, making the study of such datasets interesting and challenging. In the task of analyzing microarray data for the purpose of, e.g., predicting gene-disease associations, feature selection is very important because it provides a way to handle the high dimensionality by exploiting information redundancy induced by associations among genetic markers. Judicious feature selection in microarray data analysis can result in significant reduction of cost while maintaining or improving the classification or prediction accuracy of the learning machines employed to sort out the datasets. In this paper, we propose a gene selection method called Recursive Feature Addition (RFA), which combines supervised learning and statistical similarity measures. We compare our method with the following gene selection methods.
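The greedy core of a feature-addition scheme like RFA can be sketched as follows, here paired with a nearest-mean classifier and plain NumPy. The published RFA additionally uses statistical similarity measures to break ties among candidate genes; this sketch omits that step, and the demo data are synthetic:

```python
import numpy as np

def nmc_accuracy(X, y):
    """Training accuracy of a nearest-mean classifier on feature matrix X."""
    classes = np.unique(y)
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    d = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    pred = classes[d.argmin(axis=1)]
    return (pred == y).mean()

def recursive_feature_addition(X, y, n_select):
    """Greedily add, one at a time, the feature whose inclusion yields the
    highest classifier accuracy on the features selected so far."""
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < n_select:
        best = max(remaining,
                   key=lambda j: nmc_accuracy(X[:, selected + [j]], y))
        selected.append(best)
        remaining.remove(best)
    return selected

# Synthetic demo: feature 2 carries the class signal, the rest are noise
rng = np.random.default_rng(1)
y = np.repeat([0, 1], 50)
X = rng.normal(size=(100, 5))
X[:, 2] += 3 * y
sel = recursive_feature_addition(X, y, n_select=2)
```

On this toy problem the informative feature is picked first; on real microarray data the evaluation would use cross-validated rather than training accuracy.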

    Elevated fasting insulin predicts the future incidence of metabolic syndrome: a 5-year follow-up study

    Get PDF
    BACKGROUND: There is controversy about the specific pathophysiology of metabolic syndrome (MS), but several authors have argued that hyperinsulinemia is a key feature of the cluster. We aimed to assess whether baseline insulin levels could predict the development of MS in a well-characterised cohort of otherwise healthy adults who were followed over a five-year period. METHODS: We identified 2,350 Korean subjects who did not have MS in 2003 and who were followed up in 2008. The subjects were divided into 4 groups according to the baseline quartiles of fasting insulin, and the predictors of the incidence of MS were analyzed using multivariate regression analysis. RESULTS: Over the follow-up period, 8.5% of the cohort developed MS; however, 16.4% of the subjects in the highest quartile of insulin levels developed MS. In a model that included gender, age, smoking status, exercise level, alcohol consumption and systolic blood pressure, the subjects in the highest quartile of insulin levels had more than 5 times the risk of developing MS compared with that of the subjects in the lowest quartile. This predictive importance remained significant even after correcting for all the individual features of MS. CONCLUSIONS: These data suggest that high baseline fasting insulin levels are an independent determinant of the future development of MS.

    Comparison of feature selection and classification for MALDI-MS data

    Get PDF
    INTRODUCTION: In the classification of Mass Spectrometry (MS) proteomics data, peak detection, feature selection, and learning classifiers are critical to classification accuracy. To better understand which methods are more accurate when classifying data, some publicly available peak detection algorithms for Matrix-Assisted Laser Desorption/Ionization Mass Spectrometry (MALDI-MS) data were recently compared; however, the issue of different feature selection methods and different classification models as they relate to classification performance has not been addressed. With the application of intelligent computing, much progress has been made in the development of feature selection methods and learning classifiers for the analysis of high-throughput biological data. The main objective of this paper is to compare methods of feature selection and different learning classifiers when applied to MALDI-MS data, and to provide a subsequent reference for the analysis of MS proteomics data. RESULTS: We compared a well-known method of feature selection, Support Vector Machine Recursive Feature Elimination (SVMRFE), and a recently developed method, Gradient based Leave-one-out Gene Selection (GLGS), which effectively performs microarray data analysis. We also compared several learning classifiers, including the K-Nearest Neighbor Classifier (KNNC), Naïve Bayes Classifier (NBC), Nearest Mean Scaled Classifier (NMSC), the uncorrelated normal-based quadratic Bayes Classifier (UDC), Support Vector Machines, and a distance metric learning for Large Margin Nearest Neighbor classifier (LMNN) based on Mahalanobis distance. For comparison, we conducted a comprehensive experimental study using three types of MALDI-MS data. CONCLUSION: Regarding feature selection, SVMRFE outperformed GLGS in classification. As for the learning classifiers, when classification models derived from the best training were compared, SVMs performed best with respect to the expected testing accuracy. However, the distance metric learning classifier LMNN outperformed SVMs and the other classifiers on the best testing accuracy. In such cases, the optimum classification model based on LMNN is worth investigating in future study.
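The elimination loop behind SVMRFE can be sketched generically: fit a linear model on the surviving features, drop the feature with the smallest absolute weight, and repeat. In this illustrative version an ordinary least-squares fit stands in for the linear SVM (so it is not SVMRFE proper), and the demo data are synthetic:

```python
import numpy as np

def rfe_ranking(X, y):
    """Recursive feature elimination: repeatedly fit a linear model on the
    surviving features and drop the one with the smallest absolute weight.
    An ordinary least-squares fit stands in for the linear SVM used by
    SVMRFE; y must be a binary label vector."""
    yy = np.where(y == y[0], -1.0, 1.0)        # encode the two classes as -1/+1
    active = list(range(X.shape[1]))
    eliminated = []
    while active:
        A = np.column_stack([X[:, active], np.ones(len(yy))])  # add intercept
        w, *_ = np.linalg.lstsq(A, yy, rcond=None)
        worst = int(np.argmin(np.abs(w[:-1])))  # ignore the intercept weight
        eliminated.append(active.pop(worst))
    return eliminated[::-1]                     # most important feature first

# Synthetic demo: feature 3 carries the class signal, the rest are noise
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 100)
X = rng.normal(size=(200, 4))
X[:, 3] += 2.0 * np.where(y == 0, -1.0, 1.0)
rank = rfe_ranking(X, y)
```

Refitting after every elimination is what distinguishes RFE from a one-shot weight ranking: weights are re-estimated as correlated features drop out.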

    Gene selection and classification for cancer microarray data based on machine learning and similarity measures

    Get PDF
    BACKGROUND: Microarray data have a high dimension of variables and a small sample size. In microarray data analyses, two important issues are how to choose genes, which provide reliable and good prediction for disease status, and how to determine the final gene set that is best for classification. Associations among genetic markers mean one can exploit information redundancy to potentially reduce classification cost in terms of time and money. RESULTS: To deal with redundant information and improve classification, we propose a gene selection method, Recursive Feature Addition, which combines supervised learning and statistical similarity measures. To determine the final optimal gene set for prediction and classification, we propose an algorithm, Lagging Prediction Peephole Optimization. By using six benchmark microarray gene expression data sets, we compared Recursive Feature Addition with recently developed gene selection methods: Support Vector Machine Recursive Feature Elimination, Leave-One-Out Calculation Sequential Forward Selection and several others. CONCLUSIONS: On average, with the use of popular learning machines including Nearest Mean Scaled Classifier, Support Vector Machine, Naive Bayes Classifier and Random Forest, Recursive Feature Addition outperformed other methods. Our studies also showed that Lagging Prediction Peephole Optimization is superior to a random strategy; Recursive Feature Addition with Lagging Prediction Peephole Optimization obtained better testing accuracies than the gene selection method varSelRF.