On the Relationship between Dependence Tree Classification Error and Bayes Error Rate

Author: Kiran S Balagani
Vir V Phoha (Senior Member, IEEE)
Publication date: 03/04/2020

Abstract-Wong and Poo

CiteSeerX

K-Means+ID3 and dependence tree methods for supervised anomaly detection

Author: Balagani Kiran S.
Publication venue: Louisiana Tech Digital Commons
Publication date: 01/04/2008

In this dissertation, we present two novel methods for supervised anomaly detection. The first method, K-Means+ID3, performs supervised anomaly detection by partitioning the training data instances into k clusters using Euclidean distance similarity. Then, on each cluster, which represents a density region of normal or anomalous instances, an ID3 decision tree is built. The ID3 decision tree on each cluster refines the decision boundaries by learning the subgroups within the cluster. To obtain a final decision on detection, the k-Means and ID3 decision trees are combined using two rules: (1) the nearest neighbor rule; and (2) the nearest consensus rule. The performance of K-Means+ID3 is demonstrated over three data sets: (1) network anomaly data, (2) Duffing equation data, and (3) mechanical system data, which contain measurements drawn from three distinct application domains: computer networks, an electronic circuit implementing a forced Duffing equation, and a mechanical mass beam system subjected to fatigue stress, respectively. Results show that the detection accuracy of the K-Means+ID3 method is as high as 96.24 percent on network anomaly data, and that the total accuracy is as high as 80.01 percent on mechanical system data and 79.9 percent on Duffing equation data. Further, the performance of K-Means+ID3 is compared with that of the individual k-Means and ID3 methods implemented for anomaly detection. The second method, dependence tree based anomaly detection, performs supervised anomaly detection using the Bayes classification rule. The class conditional probability densities in the Bayes classification rule are approximated by dependence trees, which represent second-order product approximations of probability densities. We derive the theoretical relationship between dependence tree classification error and Bayes error rate and show that the dependence tree approximation minimizes an upper bound on the Bayes error rate.
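As a rough illustration of the dependence tree idea, the sketch below builds a tree over discrete features by taking a maximum-weight spanning tree on pairwise mutual information. It assumes the standard Chow-Liu construction, which the abstract does not spell out; the function names and the toy data are hypothetical and are not taken from the dissertation.

```python
from collections import Counter
from itertools import combinations
from math import log


def mutual_information(xs, ys):
    """Empirical mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * log(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def dependence_tree_edges(rows):
    """Edges of a maximum-weight spanning tree over the features,
    weighted by pairwise mutual information (Prim's algorithm)."""
    d = len(rows[0])
    cols = [[r[j] for r in rows] for j in range(d)]
    mi = {
        (i, j): mutual_information(cols[i], cols[j])
        for i, j in combinations(range(d), 2)
    }
    in_tree, edges = {0}, []
    while len(in_tree) < d:
        # Pick the heaviest edge that connects the tree to a new feature.
        i, j = max(
            (e for e in mi if (e[0] in in_tree) != (e[1] in in_tree)),
            key=lambda e: mi[e],
        )
        edges.append((i, j))
        in_tree.update((i, j))
    return edges


# Toy data (hypothetical): features 0 and 1 are identical, feature 2 is independent.
rows = [(0, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1)]
edges = dependence_tree_edges(rows)
```

In a Bayes classifier of the kind the abstract describes, one such tree would be fitted per class and used to factor the class conditional density into pairwise terms; that step is omitted here.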
To improve the classification performance of dependence tree based anomaly detection, we use supervised and unsupervised variants of the Maximum Relevance Minimum Redundancy (MRMR) feature selection method to select a set of features that optimally characterize class information. We derive the theoretical relationship between the Bayes error rate and the MRMR feature selection criterion and show that the MRMR feature selection criterion minimizes an upper bound on the Bayes error rate. The performance of the dependence tree based anomaly detection method is demonstrated on the benchmark KDD Cup 1999 intrusion detection data set. Results show that the detection accuracies of the dependence tree based anomaly detection method are as high as 99.76 percent in detecting normal traffic, 93.88 percent in detecting denial-of-service attacks, 94.88 percent in detecting probing attacks, 86.40 percent in detecting user-to-root attacks, and 24.44 percent in detecting remote-to-login attacks. Further, the performance of the dependence tree based anomaly detection method is compared with the performance of naïve Bayes and ID3 decision tree methods, as well as with the performance of two anomaly detection methods reported in recent literature.
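The MRMR idea can be sketched as a greedy loop that trades relevance to the class against redundancy with already selected features. The version below uses the mutual-information-difference form of the criterion for discrete features, which is an assumption: the abstract does not state which form the dissertation uses. The helper names and the toy data are hypothetical.

```python
from collections import Counter
from math import log


def mutual_information(xs, ys):
    """Empirical mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * log(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def mrmr_select(columns, labels, k):
    """Greedily pick k feature indices maximizing relevance minus mean redundancy."""
    relevance = [mutual_information(col, labels) for col in columns]
    # Start from the single most relevant feature.
    selected = [max(range(len(columns)), key=lambda j: relevance[j])]
    while len(selected) < k:
        def score(j):
            redundancy = sum(
                mutual_information(columns[j], columns[s]) for s in selected
            ) / len(selected)
            return relevance[j] - redundancy

        remaining = [j for j in range(len(columns)) if j not in selected]
        selected.append(max(remaining, key=score))
    return selected


# Toy data (hypothetical): feature 1 duplicates feature 0, feature 2 adds new information.
a = [0, 0, 1, 1, 0, 0, 1, 1]
b = [0, 1, 0, 1, 0, 1, 0, 1]
labels = [2 * x + y for x, y in zip(a, b)]
chosen = mrmr_select([a, list(a), b], labels, k=2)
```

On this toy data the duplicate feature is skipped in favor of the non-redundant one, which is the behavior the criterion is designed to produce.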

Louisiana Tech Digital Commons

Bagging and boosting classification trees to predict churn.

Author: Croux Christophe
Lemmens Aurélie

Bagging; Boosting; Classification; Churn;

Research Papers in Economics

Statistical methods for tissue array images - algorithmic scoring and co-training

Author: Knudsen Beatrice
Linden Michael
Randolph Timothy
Wang Pei
Yan Donghui
Publication venue: Institute of Mathematical Statistics
Publication date: 01/01/2011

Recent advances in tissue microarray technology have allowed immunohistochemistry to become a powerful medium-to-high throughput analysis tool, particularly for the validation of diagnostic and prognostic biomarkers. However, as study size grows, the manual evaluation of these assays becomes a prohibitive limitation; it vastly reduces throughput and greatly increases variability and expense. We propose an algorithm, Tissue Array Co-Occurrence Matrix Analysis (TACOMA), for quantifying cellular phenotypes based on textural regularity summarized by local inter-pixel relationships. The algorithm can be easily trained for any staining pattern, has no sensitive tuning parameters, and can report salient pixels in an image that contribute to its score. Pathologists' input via informative training patches is an important aspect of the algorithm that allows training for any specific marker or cell type. With co-training, the error rate of TACOMA can be reduced substantially for a very small training sample (e.g., with size 30). We give theoretical insights into the success of co-training via thinning of the feature set in a high-dimensional setting when there is "sufficient" redundancy among the features. TACOMA is flexible, transparent and provides a scoring process that can be evaluated with clarity and confidence. In a study based on an estrogen receptor (ER) marker, we show that TACOMA is comparable to, or outperforms, pathologists' performance in terms of accuracy and repeatability. Comment: Published at http://dx.doi.org/10.1214/12-AOAS543 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org
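For readers unfamiliar with co-occurrence matrices, the sketch below counts how often one gray level appears next to another at a fixed pixel offset, which is one way of summarizing local inter-pixel relationships. It is a generic gray-level co-occurrence matrix and not the TACOMA implementation; the function name, the offset convention, and the toy image are assumptions made for illustration.

```python
def cooccurrence_matrix(image, levels, offset=(0, 1)):
    """Count pairs (g1, g2) where a pixel with level g1 has a neighbor with
    level g2 at the given (row, column) offset. Pixels whose neighbor falls
    outside the image are skipped."""
    d_row, d_col = offset
    n_rows, n_cols = len(image), len(image[0])
    counts = [[0] * levels for _ in range(levels)]
    for r in range(n_rows):
        for c in range(n_cols):
            r2, c2 = r + d_row, c + d_col
            if 0 <= r2 < n_rows and 0 <= c2 < n_cols:
                counts[image[r][c]][image[r2][c2]] += 1
    return counts


# Toy 2-level image (hypothetical): vertical stripes.
stripes = [
    [0, 1],
    [0, 1],
]
horizontal = cooccurrence_matrix(stripes, levels=2, offset=(0, 1))
vertical = cooccurrence_matrix(stripes, levels=2, offset=(1, 0))
```

The two offsets give different matrices for the striped image, which shows how the counts capture directional texture.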

arXiv.org e-Print Archive

CiteSeerX

Crossref