    Boosted Hidden Markov Models for Malware Detection

    Digital security is an important issue today, and efficient malware detection is at the forefront of research into building secure digital systems. As with many other fields, malware detection research has seen a dramatic increase in the application of machine learning algorithms. One machine learning technique that has found widespread application in the field of pattern matching and malware detection is hidden Markov models (HMMs). Since HMM training is a hill-climbing technique, we can often significantly improve a model by training multiple times with different initial values. In this research, we compare boosted HMMs (using AdaBoost) to HMMs trained with multiple random restarts, in the context of malware detection. These techniques are applied to a variety of challenging malware datasets, and we analyze the results in terms of effectiveness and efficiency.

    Hidden Markov Models with Random Restarts vs Boosting for Malware Detection

    Effective and efficient malware detection is at the forefront of research into building secure digital systems. As with many other fields, malware detection research has seen a dramatic increase in the application of machine learning algorithms. One machine learning technique that has been used widely in pattern matching in general, and in malware detection in particular, is the hidden Markov model (HMM). HMM training is based on a hill climb, and hence we can often improve a model by training multiple times with different initial values. In this research, we compare boosted HMMs (using AdaBoost) to HMMs trained with multiple random restarts, in the context of malware detection. These techniques are applied to a variety of challenging malware datasets. We find that random restarts perform surprisingly well in comparison to boosting. Only in the most difficult "cold start" cases (where training data is severely limited) does boosting appear to offer sufficient improvement to justify its higher computational cost in the scoring phase.
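
    The random-restart idea in the abstract above can be illustrated with a minimal sketch. This is not the authors' HMM training code: the `score` function is a hypothetical multimodal objective standing in for an HMM's training log-likelihood, and `hill_climb` and `random_restarts` are illustrative names. The point is only that a greedy climb stops at a local optimum, so keeping the best of several randomly initialised runs can help.

```python
import math
import random


def score(x):
    # Hypothetical multimodal objective (an assumption for illustration);
    # it stands in for the training log-likelihood of an HMM.
    return math.sin(3 * x) + 0.5 * math.sin(7 * x) - 0.1 * (x - 2) ** 2


def hill_climb(x, step=0.01, max_iters=10_000):
    """Greedy local search: move to a strictly better neighbour until none exists."""
    for _ in range(max_iters):
        # x is listed first so that ties keep the current point.
        best = max((x, x - step, x + step), key=score)
        if best == x:
            break
        x = best
    return x


def random_restarts(n_restarts, rng, low=-3.0, high=7.0):
    """Run hill_climb from several random starting points and keep the best result."""
    results = [hill_climb(rng.uniform(low, high)) for _ in range(n_restarts)]
    return max(results, key=score), results


rng = random.Random(0)
best_x, all_x = random_restarts(20, rng)
print(f"best local optimum: x={best_x:.2f}, score={score(best_x):.3f}")
```

    Boosting, by contrast, keeps every trained model and combines them with weights, which is why the abstract notes a higher cost in the scoring phase.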

    Reduction of False Positives in Intrusion Detection Based on Extreme Learning Machine with Situation Awareness

    Protecting computer networks from intrusions is more important than ever for our privacy, economy, and national security. Seemingly a month does not pass without news of a major data breach involving sensitive personal identity, financial, medical, trade secret, or national security data. Democratic processes can now be potentially compromised through breaches of electronic voting systems. As ever more devices, including medical machines, automobiles, and control systems for critical infrastructure are increasingly networked, human life is also more at risk from cyber-attacks. Research into Intrusion Detection Systems (IDSs) began several decades ago and IDSs are still a mainstay of computer and network protection and continue to evolve. However, detecting previously unseen, or zero-day, threats is still an elusive goal. Many commercial IDS deployments still use misuse detection based on known threat signatures. Systems utilizing anomaly detection have shown great promise to detect previously unseen threats in academic research. But their success has been limited in large part due to the excessive number of false positives that they produce. This research demonstrates that false positives can be better minimized, while maintaining detection accuracy, by combining Extreme Learning Machine (ELM) and Hidden Markov Models (HMM) as classifiers within the context of a situation awareness framework. This research was performed using the University of New South Wales - Network Based 2015 (UNSW-NB15) data set which is more representative of contemporary cyber-attack and normal network traffic than older data sets typically used in IDS research. It is shown that this approach provides better results than either HMM or ELM alone and with a lower False Positive Rate (FPR) than other comparable approaches that also used the UNSW-NB15 data set

    Imbalanced data classification and its application in cyber security

    Cyber security, also known as information technology security or simply as information security, aims to protect government organizations, companies and individuals by defending their computers, servers, electronic systems, networks, and data from malicious attacks. With the advancement of client-side, on-the-fly web content generation techniques, it has become easier for attackers to modify the content of a website dynamically and gain access to valuable information. The impact of cybercrime on the global economy is greater than ever, and it is growing day by day. Among the various types of cybercrime, financial attacks are widespread, and the financial sector is among the most targeted. Both corporations and individuals are losing a huge amount of money each year. The majority of financial attacks are carried out through banking malware and web-based attacks. End users are not always skilled enough to differentiate between injected content and the actual content of a webpage. Designing a real-time security system for ensuring a safe browsing experience is a challenging task. Some existing solutions are designed for the client side, so every user has to install them on their own system, which is very difficult to implement in practice. In addition, organizations and individuals use a variety of platforms and tools, so different solutions need to be designed. Existing server-side solutions often focus on sanitizing and filtering inputs, and fail to detect obfuscated and hidden scripts. Because such a security system operates in real time, any significant delay will hamper the user experience. Therefore, finding the most optimized and efficient solution is very important. Ensuring that a solution is easy to install and integrate with existing systems is also a critical factor. If the solution is efficient but difficult to integrate, then it may not be a feasible solution for practical use.
Unsupervised and supervised data classification techniques have been widely applied to design algorithms for solving cyber security problems. The performance of these algorithms varies depending on the type of cyber security problem and the size of the dataset. To date, existing algorithms do not achieve high accuracy in detecting malware activities. Datasets in cyber security, especially those from the financial sector, are predominantly imbalanced, as the number of malware activities is significantly smaller than the number of normal activities. This means that classifiers for imbalanced datasets can be used to develop supervised data classification algorithms to detect malware activities. The development of classifiers for imbalanced datasets has been a subject of research over the last decade. Most of these classifiers are based on oversampling and undersampling techniques and are not efficient in many situations, as such techniques are applied globally. In this thesis, we develop two new algorithms for solving supervised data classification problems in imbalanced datasets and then apply them to solve malware detection problems. The first algorithm is designed using piecewise linear classifiers by formulating the problem as an optimization problem and applying the penalty function method. More specifically, we add more penalty to the objective function for misclassified points from minority classes. The second method is based on a combination of supervised and unsupervised (clustering) algorithms. Such an approach allows one to identify areas in the input space where minority classes are located and to apply local oversampling or undersampling. This approach leads to the design of more efficient and accurate classifiers. The proposed algorithms are tested using real-world datasets. Results clearly demonstrate the superiority of the newly introduced algorithms. Then we apply these algorithms to design classifiers to detect malware. Doctor of Philosophy
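
    The idea of adding more penalty for misclassified minority points can be sketched as a class-weighted loss. This is a simplification under stated assumptions, not the thesis's formulation: it uses a single linear classifier with a hinge loss rather than piecewise linear classifiers, and `weighted_hinge_loss`, `minority_label` and `minority_penalty` are hypothetical names chosen for illustration.

```python
def weighted_hinge_loss(w, b, points, labels, minority_label=1, minority_penalty=5.0):
    """Sum of hinge losses, with errors on the minority class weighted more heavily.

    Labels are +1 or -1. The single linear classifier sign(w . x + b) is an
    assumption made to keep the sketch short.
    """
    total = 0.0
    for x, y in zip(points, labels):
        margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
        loss = max(0.0, 1.0 - margin)  # zero once the point is classified with margin >= 1
        weight = minority_penalty if y == minority_label else 1.0
        total += weight * loss
    return total


# The same size of mistake costs more when it falls on the minority class (+1).
minority_error = weighted_hinge_loss([1.0], 0.0, [[-1.0]], [1])
majority_error = weighted_hinge_loss([1.0], 0.0, [[1.0]], [-1])
print(minority_error, majority_error)
```

    An optimizer minimizing this objective is pushed toward decision boundaries that sacrifice a few majority points rather than minority ones.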

    An Efficient Intrusion Detection Approach Utilizing Various WEKA Classifiers

    Intrusion detection is an essential business capability as well as an active area of research and development, driven by its necessity. Modern intrusion detection systems are still limited in their time sensitivity. The main requirement is to develop a system that is capable of handling large volumes of network data to detect attacks more accurately and proactively. Prior research on the KDDCUP99 dataset resulted in a varied set of attributes for each of the four major attack types. Without reducing the number of features, detecting attack patterns within the data is more difficult for rule generation, forecasting, or classification. The goal of this research is to present a new method and to compare results in terms of the proportions of correctly and incorrectly classified instances and the features chosen. In this paper we explain our approach, “An Efficient Intrusion Detection Approach Utilizing Various WEKA Classifiers”, which is proposed to improve intrusion detection performance by employing different WEKA classifiers on a processed KDDCUP99 dataset. During the experiment we employed AdaBoost, J48, JRip, NaiveBayes and Random Tree classifiers to categorize the different attacks in the processed KDDCUP99 dataset. Keywords: Classifier, Data Mining, IDS, Network Security, Attacks, Cyber Security

    Machine learning for network based intrusion detection: an investigation into discrepancies in findings with the KDD cup '99 data set and multi-objective evolution of neural network classifier ensembles from imbalanced data.

    For the last decade it has become commonplace to evaluate machine learning techniques for network based intrusion detection on the KDD Cup '99 data set. This data set has served well to demonstrate that machine learning can be useful in intrusion detection. However, it has undergone some criticism in the literature, and it is out of date. Therefore, some researchers question the validity of the findings reported based on this data set. Furthermore, as identified in this thesis, there are also discrepancies in the findings reported in the literature. In some cases the results are contradictory. Consequently, it is difficult to analyse the current body of research to determine the value in the findings. This thesis reports on an empirical investigation to determine the underlying causes of the discrepancies. Several methodological factors, such as choice of data subset, validation method and data preprocessing, are identified and are found to affect the results significantly. These findings have also enabled a better interpretation of the current body of research. Furthermore, the criticisms in the literature are addressed and future use of the data set is discussed, which is important since researchers continue to use it due to a lack of better publicly available alternatives. Due to the nature of the intrusion detection domain, there is an extreme imbalance among the classes in the KDD Cup '99 data set, which poses a significant challenge to machine learning. In other domains, researchers have demonstrated that well known techniques such as Artificial Neural Networks (ANNs) and Decision Trees (DTs) often fail to learn the minor class(es) due to class imbalance. However, this has not been recognized as an issue in intrusion detection previously. This thesis reports on an empirical investigation that demonstrates that it is the class imbalance that causes the poor detection of some classes of intrusion reported in the literature. 
An alternative approach to training ANNs is proposed in this thesis, using Genetic Algorithms (GAs) to evolve the weights of the ANNs, referred to as an Evolutionary Neural Network (ENN). When employing evaluation functions that calculate the fitness proportionally to the instances of each class, thereby avoiding a bias towards the major class(es) in the data set, significantly improved true positive rates are obtained whilst maintaining a low false positive rate. These findings demonstrate that the issues of learning from imbalanced data are not due to limitations of the ANNs; rather the training algorithm. Moreover, the ENN is capable of detecting a class of intrusion that has been reported in the literature to be undetectable by ANNs. One limitation of the ENN is a lack of control of the classification trade-off the ANNs obtain. This is identified as a general issue with current approaches to creating classifiers. Striving to create a single best classifier that obtains the highest accuracy may give an unfruitful classification trade-off, which is demonstrated clearly in this thesis. Therefore, an extension of the ENN is proposed, using a Multi-Objective GA (MOGA), which treats the classification rate on each class as a separate objective. This approach produces a Pareto front of non-dominated solutions that exhibit different classification trade-offs, from which the user can select one with the desired properties. The multi-objective approach is also utilised to evolve classifier ensembles, which yields an improved Pareto front of solutions. Furthermore, the selection of classifier members for the ensembles is investigated, demonstrating how this affects the performance of the resultant ensembles. This is a key to explaining why some classifier combinations fail to give fruitful solutions
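
    Two ideas from the abstract above lend themselves to a short sketch: a fitness function that weighs each class equally instead of each instance, and the selection of non-dominated solutions for a Pareto front. The code below is an illustration under assumptions, not the thesis's implementation: mean per-class recall is used as one plausible class-proportional fitness, and the function names are hypothetical.

```python
def accuracy(predicted, actual):
    """Plain accuracy: every instance counts equally, so the major class dominates."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


def class_balanced_fitness(predicted, actual):
    """Mean per-class recall: every class counts equally, whatever its size."""
    recalls = []
    for cls in sorted(set(actual)):
        idx = [i for i, a in enumerate(actual) if a == cls]
        recalls.append(sum(predicted[i] == cls for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


def pareto_front(candidates):
    """Keep candidates that no other candidate dominates.

    Each candidate is a tuple of per-class classification rates (higher is better).
    """
    def dominates(a, b):
        return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

    return [c for c in candidates if not any(dominates(o, c) for o in candidates)]


# A classifier that always predicts the major class looks good on accuracy only.
actual = [0] * 9 + [1]
always_major = [0] * 10
plain = accuracy(always_major, actual)
balanced = class_balanced_fitness(always_major, actual)

# Each tuple is (rate on class 0, rate on class 1) for one hypothetical classifier.
front = pareto_front([(0.9, 0.1), (0.5, 0.5), (0.4, 0.4), (0.1, 0.9)])
print(plain, balanced, front)
```

    The front keeps three different trade-offs and drops only the classifier that is worse on both classes, which is the set a user would choose from.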

    Towards adaptive anomaly detection systems using boolean combination of hidden Markov models

    Anomaly detection monitors for significant deviations from normal system behavior. Hidden Markov Models (HMMs) have been successfully applied in many intrusion detection applications, including anomaly detection from sequences of operating system calls. In practice, anomaly detection systems (ADSs) based on HMMs typically generate false alarms because they are designed using limited representative training data and prior knowledge. However, since new data may become available over time, an important feature of an ADS is the ability to accommodate newly-acquired data incrementally, after it has originally been trained and deployed for operations. Incremental re-estimation of HMM parameters raises several challenges. HMM parameters should be updated from new data without requiring access to the previously-learned training data, and without corrupting previously-learned models of normal behavior. Standard techniques for training HMM parameters involve iterative batch learning, and hence must observe the entire training data prior to updating HMM parameters. Given new training data, these techniques must restart the training procedure using all (new and previously-accumulated) data. Moreover, a single HMM system for incremental learning may not adequately approximate the underlying data distribution of the normal process, due to the many local maxima in the solution space. Ensemble methods have been shown to alleviate knowledge corruption, by combining the outputs of classifiers trained independently on successive blocks of data. This thesis makes contributions at the HMM and decision levels towards improved accuracy, efficiency and adaptability of HMM-based ADSs. It first presents a survey of techniques found in literature that may be suitable for incremental learning of HMM parameters, and assesses the challenges faced when these techniques are applied to incremental learning scenarios in which the new training data is limited and abundant. 
Consequently, an efficient alternative to the Forward-Backward algorithm is first proposed to reduce the memory complexity without increasing the computational overhead of HMM parameter estimation from fixed-size abundant data. Improved techniques for incremental learning of HMM parameters are then proposed to accommodate new data over time, while maintaining a high level of performance. However, knowledge corruption caused by a single HMM with a fixed number of states remains an issue. To overcome such limitations, this thesis presents an efficient system to accommodate new data using a learn-and-combine approach at the decision level. When a new block of training data becomes available, a new pool of base HMMs is generated from the data using a different number of HMM states and random initializations. The responses from the newly-trained HMMs are then combined with those of the previously-trained HMMs in receiver operating characteristic (ROC) space using novel Boolean combination (BC) techniques. The learn-and-combine approach makes it possible to select a diversified ensemble of HMMs (EoHMMs) from the pool, and adapts the Boolean fusion functions and thresholds for improved performance, while it prunes redundant base HMMs. The proposed system is capable of changing its desired operating point during operations, and this point can be adjusted to changes in prior probabilities and costs of errors. During simulations conducted for incremental learning from successive data blocks using both synthetic and real-world system call data sets, the proposed learn-and-combine approach has been shown to achieve a higher level of accuracy than all related techniques. In particular, it can sustain a significantly higher level of accuracy than when the parameters of a single best HMM are re-estimated for each new block of data, using the reference batch learning and the proposed incremental learning techniques.
It also outperforms static fusion techniques such as majority voting for combining the responses of new and previously-generated pools of HMMs. Ensemble selection techniques have been shown to form compact EoHMMs for operations, by selecting diverse and accurate base HMMs from the pool while maintaining or improving the overall system accuracy. Pruning has been shown to prevent pool sizes from increasing indefinitely with the number of data blocks acquired over time. Therefore, the storage space for accommodating HMM parameters and the computational costs of the selection techniques are reduced, without negatively affecting the overall system performance. The proposed techniques are general in that they can be employed to adapt HMM-based systems to new data, within a wide range of application domains. More importantly, the proposed Boolean combination techniques can be employed to combine diverse responses from any set of crisp or soft one- or two-class classifiers trained on different data or features or trained according to different parameters, or from different detectors trained on the same data. In particular, they can be effectively applied when training data is limited and test data is imbalanced.
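
    Combining detector responses with Boolean functions in ROC space can be sketched in a few lines. This is a heavily reduced illustration, not the thesis's BC technique: it thresholds the scores of two hypothetical detectors, fuses the crisp decisions with AND and OR, and reports the resulting (false positive rate, true positive rate) points. The scores, labels and function names are made up for the example.

```python
import operator


def roc_point(decisions, labels):
    """Return (false positive rate, true positive rate) for crisp decisions."""
    tp = sum(1 for d, y in zip(decisions, labels) if d and y == 1)
    fp = sum(1 for d, y in zip(decisions, labels) if d and y == 0)
    positives = sum(labels)
    negatives = len(labels) - positives
    return fp / negatives, tp / positives


def boolean_combine(scores_a, scores_b, labels, thr_a, thr_b, fn):
    """Threshold two detectors' scores and fuse the decisions with a Boolean function."""
    decisions = [fn(a >= thr_a, b >= thr_b) for a, b in zip(scores_a, scores_b)]
    return roc_point(decisions, labels)


# Hypothetical anomaly scores from two detectors; label 1 marks an anomaly.
labels = [1, 1, 1, 0, 0, 0]
scores_a = [0.9, 0.8, 0.2, 0.7, 0.1, 0.1]
scores_b = [0.9, 0.3, 0.8, 0.2, 0.6, 0.1]

and_point = boolean_combine(scores_a, scores_b, labels, 0.5, 0.5, operator.and_)
or_point = boolean_combine(scores_a, scores_b, labels, 0.5, 0.5, operator.or_)
print("AND:", and_point, "OR:", or_point)
```

    Here AND removes all false alarms at the price of fewer detections, while OR detects every anomaly with more false alarms. Sweeping thresholds and Boolean functions yields a set of such points from which an operating point can be chosen.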

    Unsupervised Intrusion Detection with Cross-Domain Artificial Intelligence Methods

    Cybercrime is a major concern for corporations, business owners, governments and citizens, and it continues to grow in spite of increasing investments in security and fraud prevention. The main challenges in this research field are: being able to detect unknown attacks, and reducing the false positive ratio. The aim of this research work was to target both problems by leveraging four artificial intelligence techniques. The first technique is a novel unsupervised learning method based on skip-gram modeling. It was designed, developed and tested against a public dataset with popular intrusion patterns. A high accuracy and a low false positive rate were achieved without prior knowledge of attack patterns. The second technique is a novel unsupervised learning method based on topic modeling. It was applied to three related domains (network attacks, payments fraud, IoT malware traffic). A high accuracy was achieved in the three scenarios, even though the malicious activity significantly differs from one domain to the other. The third technique is a novel unsupervised learning method based on deep autoencoders, with feature selection performed by a supervised method, random forest. Obtained results showed that this technique can outperform other similar techniques. The fourth technique is based on an MLP neural network, and is applied to alert reduction in fraud prevention. This method automates manual reviews previously done by human experts, without significantly impacting accuracy

    Performance Evaluation of Network Anomaly Detection Systems

    Nowadays, there is a huge and growing concern about security in information and communication technology (ICT) among the scientific community, because any attack or anomaly in the network can greatly affect many domains such as national security, private data storage, social welfare, economic issues, and so on. Therefore, the anomaly detection domain is a broad research area, and many different techniques and approaches for this purpose have emerged through the years. Attacks, problems, and internal failures, when not detected early, may badly harm an entire network system. Thus, this thesis presents an autonomous profile-based anomaly detection system based on the statistical method Principal Component Analysis (PCADS-AD). This approach creates a network profile called Digital Signature of Network Segment using Flow Analysis (DSNSF) that denotes the predicted normal behavior of a network traffic activity through historical data analysis. That digital signature is used as a threshold for volume anomaly detection to detect disparities in the normal traffic trend. The proposed system uses seven traffic flow attributes: Bits, Packets and Number of Flows to detect problems, and Source and Destination IP addresses and Ports to provide the network administrator with the information necessary to solve them. Via evaluation techniques, addition of a different anomaly detection approach, and comparisons to other methods performed in this thesis using real network traffic data, results showed good traffic prediction by the DSNSF and encouraging false alarm rates and detection accuracy for the detection scheme.
With the observed results, this work seeks to contribute to advancing the state of the art in methods and strategies for anomaly detection, aiming to overcome some of the challenges that emerge from the constant growth in complexity, speed and size of today's large-scale networks, while also providing high-value results. In addition, the low complexity and agility of the proposed system make it applicable to real-time detection.