6 research outputs found

    Machine learning for network based intrusion detection: an investigation into discrepancies in findings with the KDD cup '99 data set and multi-objective evolution of neural network classifier ensembles from imbalanced data.

    For the last decade it has become commonplace to evaluate machine learning techniques for network based intrusion detection on the KDD Cup '99 data set. This data set has served well to demonstrate that machine learning can be useful in intrusion detection. However, it has undergone some criticism in the literature, and it is out of date. Therefore, some researchers question the validity of the findings reported based on this data set. Furthermore, as identified in this thesis, there are also discrepancies in the findings reported in the literature. In some cases the results are contradictory. Consequently, it is difficult to analyse the current body of research to determine the value in the findings. This thesis reports on an empirical investigation to determine the underlying causes of the discrepancies. Several methodological factors, such as choice of data subset, validation method and data preprocessing, are identified and are found to affect the results significantly. These findings have also enabled a better interpretation of the current body of research. Furthermore, the criticisms in the literature are addressed and future use of the data set is discussed, which is important since researchers continue to use it due to a lack of better publicly available alternatives. Due to the nature of the intrusion detection domain, there is an extreme imbalance among the classes in the KDD Cup '99 data set, which poses a significant challenge to machine learning. In other domains, researchers have demonstrated that well known techniques such as Artificial Neural Networks (ANNs) and Decision Trees (DTs) often fail to learn the minor class(es) due to class imbalance. However, this has not been recognized as an issue in intrusion detection previously. This thesis reports on an empirical investigation that demonstrates that it is the class imbalance that causes the poor detection of some classes of intrusion reported in the literature. 
An alternative approach to training ANNs is proposed in this thesis, using Genetic Algorithms (GAs) to evolve the weights of the ANNs, referred to as an Evolutionary Neural Network (ENN). When employing evaluation functions that calculate the fitness proportionally to the instances of each class, thereby avoiding a bias towards the major class(es) in the data set, significantly improved true positive rates are obtained whilst maintaining a low false positive rate. These findings demonstrate that the issues of learning from imbalanced data are not due to limitations of the ANNs, but rather to the training algorithm. Moreover, the ENN is capable of detecting a class of intrusion that has been reported in the literature to be undetectable by ANNs. One limitation of the ENN is a lack of control over the classification trade-off the ANNs obtain. This is identified as a general issue with current approaches to creating classifiers. Striving to create a single best classifier that obtains the highest accuracy may give an unfruitful classification trade-off, which is demonstrated clearly in this thesis. Therefore, an extension of the ENN is proposed, using a Multi-Objective GA (MOGA), which treats the classification rate on each class as a separate objective. This approach produces a Pareto front of non-dominated solutions that exhibit different classification trade-offs, from which the user can select one with the desired properties. The multi-objective approach is also utilised to evolve classifier ensembles, which yields an improved Pareto front of solutions. Furthermore, the selection of classifier members for the ensembles is investigated, demonstrating how this affects the performance of the resultant ensembles. This is key to explaining why some classifier combinations fail to give fruitful solutions.
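    The abstract above does not give the thesis's actual evaluation functions, so the following is only a minimal sketch of the two ideas it names: a fitness that weights every class equally regardless of its size, and the selection of non-dominated (Pareto-optimal) solutions when each class's classification rate is a separate objective. All function names, the class labels and the sample numbers are assumptions made for illustration.

```python
def per_class_rates(y_true, y_pred, classes):
    """Fraction of the instances of each class that were classified correctly."""
    rates = []
    for c in classes:
        idx = [i for i, y in enumerate(y_true) if y == c]
        correct = sum(1 for i in idx if y_pred[i] == c)
        rates.append(correct / len(idx) if idx else 0.0)
    return rates


def balanced_fitness(y_true, y_pred, classes):
    """Mean of the per-class rates, so a rare class counts as much as a common one."""
    rates = per_class_rates(y_true, y_pred, classes)
    return sum(rates) / len(rates)


def accuracy(y_true, y_pred):
    """Plain accuracy, which is dominated by the largest class."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def dominates(a, b):
    """True if a is at least as good as b on every objective and better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(points):
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical imbalanced data: 98 instances of one class, 2 of another.
y_true = ["normal"] * 98 + ["attack"] * 2
y_pred = ["normal"] * 100  # a classifier that ignores the rare class

plain = accuracy(y_true, y_pred)
balanced = balanced_fitness(y_true, y_pred, ["normal", "attack"])

# Hypothetical (rate on class 1, rate on class 2) pairs for four candidates.
candidates = [(0.99, 0.0), (0.90, 0.5), (0.80, 0.4), (0.70, 0.9)]
front = pareto_front(candidates)
```

    In this made-up case plain accuracy rewards the classifier that never detects the rare class, while the balanced fitness does not; the Pareto filter then leaves the user a set of trade-offs to choose from rather than a single "best" classifier.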

    Combination Methods for Automatic Document Organization

    Automatic document classification and clustering are useful for a wide range of applications such as organizing Web, intranet, or portal pages into topic directories, filtering news feeds or mail, focused crawling on the Web or in intranets, and many more. This thesis presents ensemble-based meta methods for supervised learning (i.e., classification based on a small amount of hand-annotated training documents). In addition, we show how these techniques can be carried forward to clustering based on unsupervised learning (i.e., automatic structuring of document corpora without training data). The algorithms are applied in a restrictive manner, i.e., by leaving out some 'uncertain' documents (rather than assigning them to inappropriate topics or clusters with low confidence). We show how restrictive meta methods can be used to combine different document representations in the context of Web document classification and author recognition. As another application for meta methods we study the combination of different information sources in distributed environments, such as peer-to-peer information systems. Furthermore, we address the problem of semi-supervised classification on document collections using retraining. A possible application is focused Web crawling, which may start with very few manually selected training documents but can be enhanced by automatically adding initially unlabeled, positively classified Web pages for retraining. The results of our systematic evaluation on real-world data show the viability of the proposed approaches.
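    The abstract does not specify how the restrictive meta methods decide which documents to leave out. As a rough sketch of the general idea, assuming a simple agreement threshold over the labels proposed by several base classifiers, a combiner can return a label only when enough of them agree and abstain otherwise. The function name, the `min_agreement` parameter and the topic labels are hypothetical.

```python
from collections import Counter


def restrictive_vote(votes, min_agreement=1.0):
    """Return the most common label if its share of the votes reaches
    min_agreement; otherwise return None to abstain on the document."""
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    return label if count / len(votes) >= min_agreement else None


# Three hypothetical base classifiers label the same document.
unanimous = restrictive_vote(["sports", "sports", "sports"])
split = restrictive_vote(["sports", "politics", "sports"])
relaxed = restrictive_vote(["sports", "politics", "sports"], min_agreement=0.6)
```

    With the default threshold the combiner requires unanimity, so the split vote is left unclassified; lowering the threshold trades fewer abstentions for a higher risk of wrong assignments.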

    Diseño, análisis y evaluación de conjuntos de clasificadores basados en redes de neuronas (Design, analysis and evaluation of classifier ensembles based on neural networks)

    The design of ensembles of classifiers has been one of the most active research areas in machine learning over the last decades. Many algorithms have been proposed to build a robust classifier out of simpler ones, called base classifiers. Although the use of classifier ensembles can be justified from several perspectives, the most obvious justification lies in human decision making: before making an important decision, it is common to ask several experts in order to be more certain that the chosen option is the most suitable one. Many studies have shown that the success of any ensemble of classifiers is determined by the accuracy and diversity of its base classifiers. In other words, for an ensemble to improve on the accuracy of any of its individual members, those members must be accurate and diverse. However, obtaining base classifiers that satisfy both requirements simultaneously is not an easy task. For this reason, this work presents two new ensemble architectures: one promotes the accuracy of the base classifiers without neglecting diversity, and the other promotes diversity over accuracy. The differences and complementarity between the two architectures make it possible to analyze how prioritizing one of these properties over the other influences the overall behavior of the ensemble. Although most real-world classification problems involve more than two categories, many of the ensembles proposed in the literature were originally conceived for binary problems. Some of these algorithms extend naturally to the multiclass case; in many other cases, the multiclass problem can only be solved by decomposing it into binary subproblems. In addition, most of the proposed models have been evaluated on artificial domains in which the examples are described by a relatively small number of features, even though there are many real domains in which the examples are described by hundreds or even thousands of features. The need for new classification methods capable of solving real problems is one of the goals of this doctoral thesis. Thus, the architectures proposed in this work have been designed explicitly for problems in which the number of categories is finite and greater than two and in which the examples are described by a large number of features. Given these two characteristics, the aim is to limit, as far as possible, the computational complexity and cost inherent in solving this type of problem. The viability of the proposed architectures has been determined experimentally: the study includes an exhaustive analysis in which their behavior on different domains is examined and compared with that of some of the most widely cited classification models in the literature.
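    The abstract mentions reducing a multiclass problem to binary subproblems without saying which decomposition is used. One common choice, shown here purely as an assumed illustration and not as the thesis's method, is one-vs-rest: each class gets its own binary labelling, one binary classifier is trained per class, and the class whose classifier is most confident wins. The function names and the sample scores are hypothetical.

```python
def one_vs_rest_labels(labels):
    """Build one binary label vector per class: 1 for that class, 0 for the rest."""
    classes = sorted(set(labels))
    return {c: [1 if y == c else 0 for y in labels] for c in classes}


def predict_one_vs_rest(scores):
    """Given each binary classifier's confidence for one example,
    return the class whose classifier is most confident."""
    return max(scores, key=scores.get)


# A hypothetical three-class labelling of four examples.
binary_problems = one_vs_rest_labels(["a", "b", "c", "a"])

# Hypothetical confidences from the three binary classifiers for a new example.
predicted = predict_one_vs_rest({"a": 0.2, "b": 0.7, "c": 0.1})
```

    A problem with k classes yields k binary subproblems under this scheme, which is one reason the cost of decomposition matters when the number of categories and features is large.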

    Machine learning for network based intrusion detection: an investigation into discrepancies in findings with the KDD cup '99 data set and multi-objective evolution of neural network classifier ensembles from imbalanced data

    For the last decade it has become commonplace to evaluate machine learning techniques for network based intrusion detection on the KDD Cup '99 data set. This data set has served well to demonstrate that machine learning can be useful in intrusion detection. However, it has undergone some criticism in the literature, and it is out of date. Therefore, some researchers question the validity of the findings reported based on this data set. Furthermore, as identified in this thesis, there are also discrepancies in the findings reported in the literature. In some cases the results are contradictory. Consequently, it is difficult to analyse the current body of research to determine the value in the findings. This thesis reports on an empirical investigation to determine the underlying causes of the discrepancies. Several methodological factors, such as choice of data subset, validation method and data preprocessing, are identified and are found to affect the results significantly. These findings have also enabled a better interpretation of the current body of research. Furthermore, the criticisms in the literature are addressed and future use of the data set is discussed, which is important since researchers continue to use it due to a lack of better publicly available alternatives. Due to the nature of the intrusion detection domain, there is an extreme imbalance among the classes in the KDD Cup '99 data set, which poses a significant challenge to machine learning. In other domains, researchers have demonstrated that well known techniques such as Artificial Neural Networks (ANNs) and Decision Trees (DTs) often fail to learn the minor class(es) due to class imbalance. However, this has not been recognized as an issue in intrusion detection previously. This thesis reports on an empirical investigation that demonstrates that it is the class imbalance that causes the poor detection of some classes of intrusion reported in the literature. 
An alternative approach to training ANNs is proposed in this thesis, using Genetic Algorithms (GAs) to evolve the weights of the ANNs, referred to as an Evolutionary Neural Network (ENN). When employing evaluation functions that calculate the fitness proportionally to the instances of each class, thereby avoiding a bias towards the major class(es) in the data set, significantly improved true positive rates are obtained whilst maintaining a low false positive rate. These findings demonstrate that the issues of learning from imbalanced data are not due to limitations of the ANNs, but rather to the training algorithm. Moreover, the ENN is capable of detecting a class of intrusion that has been reported in the literature to be undetectable by ANNs. One limitation of the ENN is a lack of control over the classification trade-off the ANNs obtain. This is identified as a general issue with current approaches to creating classifiers. Striving to create a single best classifier that obtains the highest accuracy may give an unfruitful classification trade-off, which is demonstrated clearly in this thesis. Therefore, an extension of the ENN is proposed, using a Multi-Objective GA (MOGA), which treats the classification rate on each class as a separate objective. This approach produces a Pareto front of non-dominated solutions that exhibit different classification trade-offs, from which the user can select one with the desired properties. The multi-objective approach is also utilised to evolve classifier ensembles, which yields an improved Pareto front of solutions. Furthermore, the selection of classifier members for the ensembles is investigated, demonstrating how this affects the performance of the resultant ensembles. This is key to explaining why some classifier combinations fail to give fruitful solutions.
    EThOS - Electronic Theses Online Service, United Kingdom