290 research outputs found

    Machine Learning and Big Data Methodologies for Network Traffic Monitoring

    Get PDF
    Over the past 20 years, the Internet saw an exponential grown of traffic, users, services and applications. Currently, it is estimated that the Internet is used everyday by more than 3.6 billions users, who generate 20 TB of traffic per second. Such a huge amount of data challenge network managers and analysts to understand how the network is performing, how users are accessing resources, how to properly control and manage the infrastructure, and how to detect possible threats. Along with mathematical, statistical, and set theory methodologies machine learning and big data approaches have emerged to build systems that aim at automatically extracting information from the raw data that the network monitoring infrastructures offer. In this thesis I will address different network monitoring solutions, evaluating several methodologies and scenarios. I will show how following a common workflow, it is possible to exploit mathematical, statistical, set theory, and machine learning methodologies to extract meaningful information from the raw data. Particular attention will be given to machine learning and big data methodologies such as DBSCAN, and the Apache Spark big data framework. The results show that despite being able to take advantage of mathematical, statistical, and set theory tools to characterize a problem, machine learning methodologies are very useful to discover hidden information about the raw data. Using DBSCAN clustering algorithm, I will show how to use YouLighter, an unsupervised methodology to group caches serving YouTube traffic into edge-nodes, and latter by using the notion of Pattern Dissimilarity, how to identify changes in their usage over time. By using YouLighter over 10-month long races, I will pinpoint sudden changes in the YouTube edge-nodes usage, changes that also impair the end users’ Quality of Experience. I will also apply DBSCAN in the deployment of SeLINA, a self-tuning tool implemented in the Apache Spark big data framework to autonomously extract knowledge from network traffic measurements. By using SeLINA, I will show how to automatically detect the changes of the YouTube CDN previously highlighted by YouLighter. Along with these machine learning studies, I will show how to use mathematical and set theory methodologies to investigate the browsing habits of Internauts. By using a two weeks dataset, I will show how over this period, the Internauts continue discovering new websites. Moreover, I will show that by using only DNS information to build a profile, it is hard to build a reliable profiler. Instead, by exploiting mathematical and statistical tools, I will show how to characterize Anycast-enabled CDNs (A-CDNs). I will show that A-CDNs are widely used either for stateless and stateful services. That A-CDNs are quite popular, as, more than 50% of web users contact an A-CDN every day. And that, stateful services, can benefit of A-CDNs, since their paths are very stable over time, as demonstrated by the presence of only a few anomalies in their Round Trip Time. Finally, I will conclude by showing how I used BGPStream an open-source software framework for the analysis of both historical and real-time Border Gateway Protocol (BGP) measurement data. By using BGPStream in real-time mode I will show how I detected a Multiple Origin AS (MOAS) event, and how I studies the black-holing community propagation, showing the effect of this community in the network. Then, by using BGPStream in historical mode, and the Apache Spark big data framework over 16 years of data, I will show different results such as the continuous growth of IPv4 prefixes, and the growth of MOAS events over time. All these studies have the aim of showing how monitoring is a fundamental task in different scenarios. In particular, highlighting the importance of machine learning and of big data methodologies

    A Survey on Big Data for Network Traffic Monitoring and Analysis

    Get PDF
    Network Traffic Monitoring and Analysis (NTMA) represents a key component for network management, especially to guarantee the correct operation of large-scale networks such as the Internet. As the complexity of Internet services and the volume of traffic continue to increase, it becomes difficult to design scalable NTMA applications. Applications such as traffic classification and policing require real-time and scalable approaches. Anomaly detection and security mechanisms require to quickly identify and react to unpredictable events while processing millions of heterogeneous events. At last, the system has to collect, store, and process massive sets of historical data for post-mortem analysis. Those are precisely the challenges faced by general big data approaches: Volume, Velocity, Variety, and Veracity. This survey brings together NTMA and big data. We catalog previous work on NTMA that adopt big data approaches to understand to what extent the potential of big data is being explored in NTMA. This survey mainly focuses on approaches and technologies to manage the big NTMA data, additionally briefly discussing big data analytics (e.g., machine learning) for the sake of NTMA. Finally, we provide guidelines for future work, discussing lessons learned, and research directions

    Detecting IP prefix hijack events using BGP activity and AS connectivity analysis

    Get PDF
    The Border Gateway Protocol (BGP), the main component of core Internet connectivity, suffers vulnerability issues related to the impersonation of the ownership of IP prefixes for Autonomous Systems (ASes). In this context, a number of studies have focused on securing the BGP through several techniques, such as monitoring-based, historical-based and statistical-based behavioural models. In spite of the significant research undertaken, the proposed solutions cannot detect the IP prefix hijack accurately or even differentiate it from other types of attacks that could threaten the performance of the BGP. This research proposes three novel detection methods aimed at tracking the behaviour of BGP edge routers and detecting IP prefix hijacks based on statistical analysis of variance, the attack signature approach and a classification-based technique. The first detection method uses statistical analysis of variance to identify hijacking behaviour through the normal operation of routing information being exchanged among routers and their behaviour during the occurrence of IP prefix hijacking. However, this method failed to find any indication of IP prefix hijacking because of the difficulty of having raw BGP data hijacking-free. The research also proposes another detection method that parses BGP advertisements (announcements) and checks whether IP prefixes are announced or advertised by more than one AS. If so, events are selected for further validation using Regional Internet Registry (RIR) databases to determine whether the ASes announcing the prefixes are owned by the same organisation or different organisations. Advertisements for the same IP prefix made by ASes owned by different organisations are subsequently identified as hijacking events. The proposed algorithm of the detection method was validated using the 2008 YouTube Pakistan hijack event; the analysis demonstrates that the algorithm qualitatively increases the accuracy of detecting IP prefix hijacks. The algorithm is very accurate as long as the RIRs (Regional Internet Registries) are updated concurrently with hijacking detection. The detection method and can be integrated and work with BGP routers separately. Another detection method is proposed to detect IP prefix hijacking using a combination of signature-based (parsing-based) and classification-based techniques. The parsing technique is used as a pre-processing phase before the classification-based method. Some features are extracted based on the connectivity behaviour of the suspicious ASes given by the parsing technique. In other words, this detection method tracks the behaviour of the suspicious ASes and follows up with an analysis of their interaction with directly and indirectly connected neighbours based on a set of features extracted from the ASPATH information about the suspicious ASes. Before sending the extracted feature values to the best five classifiers that can work with the specifications of an implemented classification dataset, the detection method computes the similarity between benign and malicious behaviours to determine to what extent the classifiers can distinguish suspicious behaviour from benign behaviour and then detect the hijacking. Evaluation tests of the proposed algorithm demonstrated that the detection method was able to detect the hijacks with 96% accuracy and can be integrated and work with BGP routers separately.Saudi Cultural Burea

    Evaluation of machine learning techniques for intrusion detection in software defined networking

    Get PDF
    Abstract. The widespread growth of the Internet paved the way for the need of a new network architecture which was filled by Software Defined Networking (SDN). SDN separated the control and data planes to overcome the challenges that came along with the rapid growth and complexity of the network architecture. However, centralizing the new architecture also introduced new security challenges and created the demand for stronger security measures. The focus is on the Intrusion Detection System (IDS) for a Distributed Denial of Service (DDoS) attack which is a serious threat to the network system. There are several ways of detecting an attack and with the rapid growth of machine learning (ML) and artificial intelligence, the study evaluates several ML algorithms for detecting DDoS attacks on the system. Several factors have an effect on the performance of ML based IDS in SDN. Feature selection, training dataset, and implementation of the classifying models are some of the important factors. The balance between usage of resources and the performance of the implemented model is important. The model implemented in the thesis uses a dataset created from the traffic flow within the system and models being used are Support Vector Machine (SVM), Naive-Bayes, Decision Tree and Logistic Regression. The accuracy of the models has been over 95% apart from Logistic Regression which has 90% accuracy. The ML based algorithm has been more accurate than the non-ML based algorithm. It learns from different features of the traffic flow to differentiate between normal traffic and attack traffic. Most of the previously implemented ML based IDS are based on public datasets. Using a dataset created from the flow of the experimental environment allows training of the model from a real-time dataset. However, the experiment only detects the traffic and does not take any action. However, these promising results can be used for further development of the model

    On the cyber security issues of the internet infrastructure

    Get PDF
    The Internet network has received huge attentions by the research community. At a first glance, the network optimization and scalability issues dominate the efforts of researchers and vendors. Many results have been obtained in the last decades: the Internet’s architecture is optimized to be cheap, robust and ubiquitous. In contrast, such a network has never been perfectly secure. During all its evolution, the security threats of the Internet persist as a transversal and endless topic. Nowadays, the Internet network hosts a multitude of mission critical activities. The electronic voting systems and financial services are carried out through it. Governmental institutions, financial and business organizations depend on the performance and the security of the Internet. This role confers to the Internet network a critical characterization. At the same time, the Internet network is a vector of malicious activities, like Denial of Service attacks; many reports of attacks can be found in both academic outcomes and daily news. In order to mitigate this wide range of issues, many research efforts have been carried out in the past decades; unfortunately, the complex architecture and the scale of the Internet make hard the evaluation and the adoption of such proposals. In order to improve the security of the Internet, the research community can benefit from sharing real network data. Unfortunately, privacy and security concerns inhibit the release of these data: its suffices to imagine the big amount of private information (e.g., political preferences or religious belief) it is possible to get while reading the Internet packets exchanged between users and web services. This scenario motivates my research, and represents the context of this dissertation which contributes to the analysis of the security issues of the Internet infrastructures and describes relevant security proposals. In particular, the main outcomes described in this dissertation are: • the definition of a secure routing protocol for the Internet network able to provide cryptographic guarantees against false route announcement and invalid path attack; • the definition of a new obfuscation technique that allow the research community to publicly release their real network flows with formal guarantees of security and privacy; • the evidence of a new kind of leakage of sensitive informations obtained hacking the models used by sundry Machine Learning Algorithms

    Comparing a Hybrid Multi-layered Machine Learning Intrusion Detection System to Single-layered and Deep Learning Models

    Get PDF
    Advancements in computing technology have created additional network attack surface, allowed the development of new attack types, and increased the impact caused by an attack. Researchers agree, current intrusion detection systems (IDSs) are not able to adapt to detect these new attack forms, so alternative IDS methods have been proposed. Among these methods are machine learning-based intrusion detection systems. This research explores the current relevant studies related to intrusion detection systems and machine learning models and proposes a new hybrid machine learning IDS model consisting of the Principal Component Analysis (PCA) and Support Vector Machine (SVM) learning algorithms. The NSL-KDD Dataset, benchmark dataset for IDSs, is used for comparing the models’ performance. The performance accuracy and false-positive rate of the hybrid model are compared to the results of the model’s individual algorithmic components to determine which components most impact attack prediction performance. The performance metrics of the hybrid model are also compared to two deep learning Autoencoder Neuro Network models and the results found that the complexity of the model does not add to the performance accuracy. The research showed that pre-processing and feature selection impact the predictive accuracy across models. Future research recommendations were to implement the proposed hybrid IDS model into a live network for testing and analysis, and to focus research into the pre-processing algorithms that improve performance accuracy, and lower false-positive rate. This research indicated that pre-processing and feature selection/feature extraction can increase model performance accuracy and decrease false-positive rate helping businesses to improve network security

    Attack2vec: Leveraging temporal word embeddings to understand the evolution of cyberattacks

    Full text link
    Despite the fact that cyberattacks are constantly growing in complexity, the research community still lacks effective tools to easily monitor and understand them. In particular, there is a need for techniques that are able to not only track how prominently certain malicious actions, such as the exploitation of specific vulnerabilities, are exploited in the wild, but also (and more importantly) how these malicious actions factor in as attack steps in more complex cyberattacks. In this paper we present ATTACK2VEC, a system that uses temporal word embeddings to model how attack steps are exploited in the wild, and track how they evolve. We test ATTACK2VEC on a dataset of billions of security events collected from the customers of a commercial Intrusion Prevention System over a period of two years, and show that our approach is effective in monitoring the emergence of new attack strategies in the wild and in flagging which attack steps are often used together by attackers (e.g., vulnerabilities that are frequently exploited together). ATTACK2VEC provides a useful tool for researchers and practitioners to better understand cyberattacks and their evolution, and use this knowledge to improve situational awareness and develop proactive defenses.Accepted manuscrip

    ATTACK2VEC: Leveraging Temporal Word Embeddings to Understand the Evolution of Cyberattacks

    Full text link
    Despite the fact that cyberattacks are constantly growing in complexity, the research community still lacks effective tools to easily monitor and understand them. In particular, there is a need for techniques that are able to not only track how prominently certain malicious actions, such as the exploitation of specific vulnerabilities, are exploited in the wild, but also (and more importantly) how these malicious actions factor in as attack steps in more complex cyberattacks. In this paper we present ATTACK2VEC, a system that uses temporal word embeddings to model how attack steps are exploited in the wild, and track how they evolve. We test ATTACK2VEC on a dataset of billions of security events collected from the customers of a commercial Intrusion Prevention System over a period of two years, and show that our approach is effective in monitoring the emergence of new attack strategies in the wild and in flagging which attack steps are often used together by attackers (e.g., vulnerabilities that are frequently exploited together). ATTACK2VEC provides a useful tool for researchers and practitioners to better understand cyberattacks and their evolution, and use this knowledge to improve situational awareness and develop proactive defenses

    Novel graph analytics for enhancing data insight

    No full text
    Graph analytics is a fast growing and significant field in the visualization and data mining community, which is applied on numerous high-impact applications such as, network security, finance, and health care, providing users with adequate knowledge across various patterns within a given system. Although a series of methods have been developed in the past years for the analysis of unstructured collections of multi-dimensional points, graph analytics has only recently been explored. Despite the significant progress that has been achieved recently, there are still many open issues in the area, concerning not only the performance of the graph mining algorithms, but also producing effective graph visualizations in order to enhance human perception. The current thesis deals with the investigation of novel methods for graph analytics, in order to enhance data insight. Towards this direction, the current thesis proposes two methods so as to perform graph mining and visualization. Based on previous works related to graph mining, the current thesis suggests a set of novel graph features that are particularly efficient in identifying the behavioral patterns of the nodes on the graph. The specific features proposed, are able to capture the interaction of the neighborhoods with other nodes on the graph. Moreover, unlike previous approaches, the graph features introduced herein, include information from multiple node neighborhood sizes, thus capture long-range correlations between the nodes, and are able to depict the behavioral aspects of each node with high accuracy. Experimental evaluation on multiple datasets, shows that the use of the proposed graph features for the graph mining procedure, provides better results than the use of other state-of-the-art graph features. Thereafter, the focus is laid on the improvement of graph visualization methods towards enhanced human insight. In order to achieve this, the current thesis uses non-linear deformations so as to reduce visual clutter. Non-linear deformations have been previously used to magnify significant/cluttered regions in data or images for reducing clutter and enhancing the perception of patterns. Extending previous approaches, this work introduces a hierarchical approach for non-linear deformation that aims to reduce visual clutter by magnifying significant regions, and leading to enhanced visualizations of one/two/three-dimensional datasets. In this context, an energy function is utilized, which aims to determine the optimal deformation for every local region in the data, taking the information from multiple single-layer significance maps into consideration. The problem is subsequently transformed into an optimization problem for the minimization of the energy function under specific spatial constraints. Extended experimental evaluation provides evidence that the proposed hierarchical approach for the generation of the significance map surpasses current methods, and manages to effectively identify significant regions and deliver better results. The thesis is concluded with a discussion outlining the major achievements of the current work, as well as some possible drawbacks and other open issues of the proposed approaches that could be addressed in future works.Open Acces
    • …
    corecore