170 research outputs found
Machine learning for network based intrusion detection: an investigation into discrepancies in findings with the KDD cup '99 data set and multi-objective evolution of neural network classifier ensembles from imbalanced data.
For the last decade it has become commonplace to evaluate machine learning techniques for network based intrusion detection on the KDD Cup '99 data set. This data set has served well to demonstrate that machine learning can be useful in intrusion detection. However, it has undergone some criticism in the literature, and it is out of date. Therefore, some researchers question the validity of the findings reported based on this data set. Furthermore, as identified in this thesis, there are also discrepancies in the findings reported in the literature. In some cases the results are contradictory. Consequently, it is difficult to analyse the current body of research to determine the value in the findings. This thesis reports on an empirical investigation to determine the underlying causes of the discrepancies. Several methodological factors, such as choice of data subset, validation method and data preprocessing, are identified and are found to affect the results significantly. These findings have also enabled a better interpretation of the current body of research. Furthermore, the criticisms in the literature are addressed and future use of the data set is discussed, which is important since researchers continue to use it due to a lack
of better publicly available alternatives. Due to the nature of the intrusion detection domain, there is an extreme imbalance among the classes in the KDD Cup '99 data set, which poses a significant challenge to machine learning. In other domains, researchers have demonstrated that well known techniques such as Artificial Neural Networks (ANNs) and Decision Trees (DTs) often fail to learn the minor class(es) due to class imbalance. However, this has not been recognized as an issue in intrusion detection previously. This thesis reports on an empirical
investigation that demonstrates that it is the class imbalance that causes the poor detection of some classes
of intrusion reported in the literature. An alternative approach to training ANNs is proposed in this thesis, using Genetic Algorithms (GAs) to evolve the weights of the ANNs, referred to as an Evolutionary Neural Network (ENN). When employing evaluation functions that calculate the fitness proportionally to the instances of each class, thereby avoiding a bias towards the major class(es) in the data set, significantly improved true positive rates are obtained
whilst maintaining a low false positive rate. These findings demonstrate that the issues of learning from
imbalanced data are not due to limitations of the ANNs; rather the training algorithm. Moreover, the ENN is capable of detecting a class of intrusion that has been reported in the literature to be undetectable by ANNs. One limitation of the ENN is a lack of control of the classification trade-off the ANNs obtain. This is identified as a general issue with current approaches to creating classifiers. Striving to create a single best classifier that obtains the highest accuracy may give an unfruitful classification trade-off, which is demonstrated clearly in this thesis. Therefore, an extension of the ENN is proposed, using a Multi-Objective
GA (MOGA), which treats the classification rate on each class as a separate objective. This approach produces a Pareto front of non-dominated solutions that exhibit different classification trade-offs, from which the user can select one with the desired properties. The multi-objective approach is also utilised to evolve classifier ensembles, which yields an improved Pareto front of solutions. Furthermore, the selection of classifier members for the ensembles is investigated, demonstrating how this affects the performance of the resultant ensembles. This is a key to explaining why some classifier combinations fail to give fruitful solutions
Machine learning for network based intrusion detection : an investigation into discrepancies in findings with the KDD cup '99 data set and multi-objective evolution of neural network classifier ensembles from imbalanced data
For the last decade it has become commonplace to evaluate machine learning techniques for network based intrusion detection on the KDD Cup '99 data set. This data set has served well to demonstrate that machine learning can be useful in intrusion detection. However, it has undergone some criticism in the literature, and it is out of date. Therefore, some researchers question the validity of the findings reported based on this data set. Furthermore, as identified in this thesis, there are also discrepancies in the findings reported in the literature. In some cases the results are contradictory. Consequently, it is difficult to analyse the current body of research to determine the value in the findings. This thesis reports on an empirical investigation to determine the underlying causes of the discrepancies. Several methodological factors, such as choice of data subset, validation method and data preprocessing, are identified and are found to affect the results significantly. These findings have also enabled a better interpretation of the current body of research. Furthermore, the criticisms in the literature are addressed and future use of the data set is discussed, which is important since researchers continue to use it due to a lack of better publicly available alternatives. Due to the nature of the intrusion detection domain, there is an extreme imbalance among the classes in the KDD Cup '99 data set, which poses a significant challenge to machine learning. In other domains, researchers have demonstrated that well known techniques such as Artificial Neural Networks (ANNs) and Decision Trees (DTs) often fail to learn the minor class(es) due to class imbalance. However, this has not been recognized as an issue in intrusion detection previously. This thesis reports on an empirical investigation that demonstrates that it is the class imbalance that causes the poor detection of some classes of intrusion reported in the literature. An alternative approach to training ANNs is proposed in this thesis, using Genetic Algorithms (GAs) to evolve the weights of the ANNs, referred to as an Evolutionary Neural Network (ENN). When employing evaluation functions that calculate the fitness proportionally to the instances of each class, thereby avoiding a bias towards the major class(es) in the data set, significantly improved true positive rates are obtained whilst maintaining a low false positive rate. These findings demonstrate that the issues of learning from imbalanced data are not due to limitations of the ANNs; rather the training algorithm. Moreover, the ENN is capable of detecting a class of intrusion that has been reported in the literature to be undetectable by ANNs. One limitation of the ENN is a lack of control of the classification trade-off the ANNs obtain. This is identified as a general issue with current approaches to creating classifiers. Striving to create a single best classifier that obtains the highest accuracy may give an unfruitful classification trade-off, which is demonstrated clearly in this thesis. Therefore, an extension of the ENN is proposed, using a Multi-Objective GA (MOGA), which treats the classification rate on each class as a separate objective. This approach produces a Pareto front of non-dominated solutions that exhibit different classification trade-offs, from which the user can select one with the desired properties. The multi-objective approach is also utilised to evolve classifier ensembles, which yields an improved Pareto front of solutions. Furthermore, the selection of classifier members for the ensembles is investigated, demonstrating how this affects the performance of the resultant ensembles. This is a key to explaining why some classifier combinations fail to give fruitful solutions.EThOS - Electronic Theses Online ServiceGBUnited Kingdo
A novel clustering methodology based on modularity optimisation for detecting authorship affinities in Shakespearean era plays
© 2016 Naeni et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. In this study we propose a novel, unsupervised clustering methodology for analyzing large datasets. This new, efficient methodology converts the general clustering problem into the community detection problem in graph by using the Jensen-Shannon distance, a dissimilarity measure originating in Information Theory. Moreover, we use graph theoretic concepts for the generation and analysis of proximity graphs. Our methodology is based on a newly proposed memetic algorithm (iMA-Net) for discovering clusters of data elements by maximizing the modularity function in proximity graphs of literary works. To test the effectiveness of this general methodology, we apply it to a text corpus dataset, which contains frequencies of approximately 55,114 unique words across all 168 written in the Shakespearean era (16th and 17th centuries), to analyze and detect clusters of similar plays. Experimental results and comparison with state-of-the-art clustering methods demonstrate the remarkable performance of our new method for identifying high quality clusters which reflect the commonalities in the literary style of the plays
Development of Context-Aware Recommenders of Sequences of Touristic Activities
En els últims anys, els sistemes de recomanació s'han fet omnipresents a la xarxa. Molts serveis web, inclosa la transmissió de pel·lÃcules, la cerca web i el comerç electrònic, utilitzen sistemes de recomanació per facilitar la presa de decisions. El turisme és una indústria molt representada a la xarxa. Hi ha diversos serveis web (e.g. TripAdvisor, Yelp) que es beneficien de la integració de sistemes recomanadors per ajudar els turistes a explorar destinacions turÃstiques. Això ha augmentat la investigació centrada en la millora dels recomanadors turÃstics per resoldre els principals problemes als quals s'enfronten. Aquesta tesi proposa nous algorismes per a sistemes recomanadors turÃstics que aprenen les preferències dels turistes a partir dels seus missatges a les xarxes socials per suggerir una seqüència d'activitats turÃstiques que s'ajustin a diversos contextes i incloguin activitats afins. Per aconseguir-ho, proposem mètodes per identificar els turistes a partir de les seves publicacions a Twitter, identificant les activitats experimentades en aquestes publicacions i perfilant turistes similars en funció dels seus interessos, informació contextual i perÃodes d'activitat. Aleshores, els perfils d'usuari es combinen amb un algorisme de mineria de regles d'associació per capturar relacions implÃcites entre els punts d'interès de cada perfil. Finalment, es fa un rà nquing de regles i un procés de selecció d'un conjunt d'activitats recomanables. Es va avaluar la precisió de les recomanacions i l'efecte del perfil d'usuari. A més, ordenem el conjunt d'activitats mitjançant un algorisme multi-objectiu per enriquir l'experiència turÃstica. També realitzem una segona fase d'anà lisi dels fluxos turÃstics a les destinacions que és beneficiós per a les organitzacions de gestió de destinacions, que volen entendre la mobilitat turÃstica. En general, els mètodes i algorismes proposats en aquesta tesi es mostren útils en diversos aspectes dels sistemes de recomanació turÃstica.En los últimos años, los sistemas de recomendación se han vuelto omnipresentes en la web. Muchos servicios web, incluida la transmisión de pelÃculas, la búsqueda en la web y el comercio electrónico, utilizan sistemas de recomendación para ayudar a la toma de decisiones. El turismo es una industria altament representada en la web. Hay varios servicios web (e.g. TripAdvisor, Yelp) que se benefician de la inclusión de sistemas recomendadores para ayudar a los turistas a explorar destinos turÃsticos. Esto ha aumentado la investigación centrada en mejorar los recomendadores turÃsticos y resolver los principales problemas a los que se enfrentan. Esta tesis propone nuevos algoritmos para sistemas recomendadores turÃsticos que aprenden las preferencias de los turistas a partir de sus mensajes en redes sociales para sugerir una secuencia de actividades turÃsticas que se alinean con diversos contextos e incluyen actividades afines. Para lograr esto, proponemos métodos para identificar a los turistas a partir de sus publicaciones en Twitter, identificar las actividades experimentadas en estas publicaciones y perfilar turistas similares en función de sus intereses, contexto información y periodos de actividad. Luego, los perfiles de usuario se combinan con un algoritmo de minerÃa de reglas de asociación para capturar relaciones entre los puntos de interés que aparecen en cada perfil. Finalmente, un proceso de clasificación de reglas y selección de actividades produce un conjunto de actividades recomendables. Se evaluó la precisión de las recomendaciones y el efecto de la elaboración de perfiles de usuario. Ordenamos además el conjunto de actividades utilizando un algoritmo multi-objetivo para enriquecer la experiencia turÃstica. También llevamos a cabo un análisis de los flujos turÃsticos en los destinos, lo que es beneficioso para las organizaciones de gestión de destinos, que buscan entender la movilidad turÃstica. En general, los métodos y algoritmos propuestos en esta tesis se muestran útiles en varios aspectos de los sistemas de recomendación turÃstica.In recent years, recommender systems have become ubiquitous on the web. Many web services, including movie streaming, web search and e-commerce, use recommender systems to aid human decision-making. Tourism is one industry that is highly represented on the web. There are several web services (e.g. TripAdvisor, Yelp) that benefit from integrating recommender systems to aid tourists in exploring tourism destinations. This has increased research focused on improving tourism recommender systems and solving the main issues they face. This thesis proposes new algorithms for tourism recommender systems that learn tourist preferences from their social media data to suggest a sequence of touristic activities that align with various contexts and include affine activities. To accomplish this, we propose methods for identifying tourists from their frequent Twitter posts, identifying the activities experienced in these posts, and profiling similar tourists based on their interests, contextual information, and activity periods. User profiles are then combined with an association rule mining algorithm for capturing implicit relationships between points of interest apparent in each profile. Finally, a rule ranking and activity selection process produces a set of recommendable activities. The recommendations were evaluated for accuracy and the effect of user profiling. We further order the set of activities using a multi-objective algorithm to enrich the tourist experience. We also carry out a second-stage analysis of tourist flows at destinations which is beneficial to destination management organisations
seeking to understand tourist mobility. Overall, the methods and algorithms proposed in this thesis are shown to be useful in various aspects of tourism recommender systems
Network anomaly detection research: a survey
Data analysis to identifying attacks/anomalies is a crucial task in anomaly detection and network anomaly detection itself is an important issue in network security. Researchers have developed methods and algorithms for the improvement of the anomaly detection system. At the same time, survey papers on anomaly detection researches are available. Nevertheless, this paper attempts to analyze futher and to provide alternative taxonomy on anomaly detection researches focusing on methods, types of anomalies, data repositories, outlier identity and the most used data type. In addition, this paper summarizes information on application network categories of the existing studies
Evolutionary Computation for Overlapping Community Detection in Social and Graph-based Information
Tesis doctoral inédita leÃda en la Universidad Autónoma de Madrid, Escuela Politécnica Superior, Departamento de IngenierÃa Informática. Fecha de lectura : 26-06-2017Esta tesis tiene embargado el acceso al texto completo hasta el 26-12-201
A survey on handling computationally expensive multiobjective optimization problems with evolutionary algorithms
This is the author accepted manuscript. The final version is available from Springer Verlag via the DOI in this record.Evolutionary algorithms are widely used for solving multiobjective optimization problems but are often criticized because of a large number of function evaluations needed. Approximations, especially function approximations, also referred to as surrogates or metamodels are commonly used in the literature to reduce the computation time. This paper presents a survey of 45 different recent algorithms proposed in the literature between 2008 and 2016 to handle computationally expensive multiobjective optimization problems. Several algorithms are discussed based on what kind of an approximation such as problem, function or fitness approximation they use. Most emphasis is given to function approximation-based algorithms. We also compare these algorithms based on different criteria such as metamodeling technique and evolutionary algorithm used, type and dimensions of the problem solved, handling constraints, training time and the type of evolution control. Furthermore, we identify and discuss some promising elements and major issues among algorithms in the literature related to using an approximation and numerical settings used. In addition, we discuss selecting an algorithm to solve a given computationally expensive multiobjective optimization problem based on the dimensions in both objective and decision spaces and the computation budget available.The research of Tinkle Chugh was funded by the COMAS Doctoral Program (at the University of Jyväskylä) and FiDiPro Project DeCoMo (funded by Tekes, the Finnish Funding Agency for Innovation), and the research of Dr. Karthik Sindhya was funded by SIMPRO project funded by Tekes as well as DeCoMo
Unsupervised feature selection for anomaly-based network intrusion detection using cluster validity indices.
Master of Science in Computer Engineering. University of KwaZulu-Natal, Durban 2016.In recent years, there has been a rapid increase in Internet usage, which has in turn led to a
rise in malicious network activity. Network Intrusion Detection Systems (NIDS) are tools
that monitor network traffic with the purpose of rapidly and accurately detecting malicious
activity. These systems provide a time window for responding to emerging threats and
attacks aimed at exploiting vulnerabilities that arise from issues such as misconfigured
firewalls and outdated software.
Anomaly-based network intrusion detection systems construct a profile of legitimate or
normal traffic patterns using machine learning techniques, and monitor network traffic for
deviations from the profile, which are subsequently classified as threats or intrusions. Due
to the richness of information contained in network traffic, it is possible to define large
feature vectors from network packets. This often leads to redundant or irrelevant features
being used in network intrusion detection systems, which typically reduces the detection
performance of the system.
The purpose of feature selection is to remove unnecessary or redundant features in a feature
space, thereby improving the performance of learning algorithms and as a result the
classification accuracy. Previous approaches have performed feature selection via optimization
techniques, using the classification accuracy of the NIDS on a subset of the data
as an objective function. While this approach has been shown to improve the performance
of the system, it is unrealistic to assume that labelled training data is available in operational
networks, which precludes the use of classification accuracy as an objective function
in a practical system.
This research proposes a method for feature selection in network intrusion detection that
does not require any access to labelled data. The algorithm uses normalized cluster validity
indices as an objective function that is optimized over the search space of candidate
feature subsets via a genetic algorithm. Feature subsets produced by the algorithm are
shown to improve the classification performance of an anomaly{based network intrusion
detection system over the NSL-KDD dataset. Despite not requiring access to labelled
data, the classification performance of the proposed system approaches that of efective
feature subsets that were derived using labelled training data
- …