6 research outputs found

    SeLINA: a Self-Learning Insightful Network Analyzer

    Get PDF
    Understanding the behavior of a network from a large scale traffic dataset is a challenging problem. Big data frameworks offer scalable algorithms to extract information from raw data, but often require a sophisticated fine-tuning and a detailed knowledge of machine learning algorithms. To streamline this process, we propose SeLINA (Self-Learning Insightful Network Analyzer), a generic, self-tuning, simple tool to extract knowledge from network traffic measurements. SeLINA includes different data analytics techniques providing self-learning capabilities to state-of-the-art scalable approaches, jointly with parameter auto-selection to off-load the network expert from parameter tuning. We combine both unsupervised and supervised approaches to mine data with a scalable approach. SeLINA embeds mechanisms to check if the new data fits the model, to detect possible changes in the traffic, and to, possibly automatically, trigger model rebuilding. The result is a system that offers human-readable models of the data with minimal user intervention, supporting domain experts in extracting actionable knowledge and highlighting possibly meaningful interpretations. SeLINA's current implementation runs on Apache Spark. We tested it on large collections of realworld passive network measurements from a nationwide ISP, investigating YouTube and P2P traffic. The experimental results confirmed the ability of SeLINA to provide insights and detect changes in the data that suggest further analyse

    SeLINA: a Self-Learning Insightful Network Analyzer

    Get PDF
    Understanding the behavior of a network from a large scale traffic dataset is a challenging problem. Big data frameworks offer scalable algorithms to extract information from raw data, but often require a sophisticated fine-tuning and a detailed knowledge of machine learning algorithms. To streamline this process, we propose SeLINA (Self-Learning Insightful Network Analyzer), a generic, self-tuning, simple tool to extract knowledge from network traffic measurements. SeLINA includes different data analytics techniques providing self-learning capabilities to state-of-the-art scalable approaches, jointly with parameter auto-selection to off-load the network expert from parameter tuning. We combine both unsupervised and supervised approaches to mine data with a scalable approach. SeLINA embeds mechanisms to check if the new data fits the model, to detect possible changes in the traffic, and to, possibly automatically, trigger model rebuilding. The result is a system that offers human-readable models of the data with minimal user intervention, supporting domain experts in extracting actionable knowledge and highlighting possibly meaningful interpretations. SeLINA’s current implementation runs on Apache Spark. We tested it on large collections of realworld passive network measurements from a nationwide ISP, investigating YouTube and P2P traffic. The experimental results confirmed the ability of SeLINA to provide insights and detect changes in the data that suggest further analyses

    Machine Learning and Big Data Methodologies for Network Traffic Monitoring

    Get PDF
    Over the past 20 years, the Internet saw an exponential grown of traffic, users, services and applications. Currently, it is estimated that the Internet is used everyday by more than 3.6 billions users, who generate 20 TB of traffic per second. Such a huge amount of data challenge network managers and analysts to understand how the network is performing, how users are accessing resources, how to properly control and manage the infrastructure, and how to detect possible threats. Along with mathematical, statistical, and set theory methodologies machine learning and big data approaches have emerged to build systems that aim at automatically extracting information from the raw data that the network monitoring infrastructures offer. In this thesis I will address different network monitoring solutions, evaluating several methodologies and scenarios. I will show how following a common workflow, it is possible to exploit mathematical, statistical, set theory, and machine learning methodologies to extract meaningful information from the raw data. Particular attention will be given to machine learning and big data methodologies such as DBSCAN, and the Apache Spark big data framework. The results show that despite being able to take advantage of mathematical, statistical, and set theory tools to characterize a problem, machine learning methodologies are very useful to discover hidden information about the raw data. Using DBSCAN clustering algorithm, I will show how to use YouLighter, an unsupervised methodology to group caches serving YouTube traffic into edge-nodes, and latter by using the notion of Pattern Dissimilarity, how to identify changes in their usage over time. By using YouLighter over 10-month long races, I will pinpoint sudden changes in the YouTube edge-nodes usage, changes that also impair the end users’ Quality of Experience. I will also apply DBSCAN in the deployment of SeLINA, a self-tuning tool implemented in the Apache Spark big data framework to autonomously extract knowledge from network traffic measurements. By using SeLINA, I will show how to automatically detect the changes of the YouTube CDN previously highlighted by YouLighter. Along with these machine learning studies, I will show how to use mathematical and set theory methodologies to investigate the browsing habits of Internauts. By using a two weeks dataset, I will show how over this period, the Internauts continue discovering new websites. Moreover, I will show that by using only DNS information to build a profile, it is hard to build a reliable profiler. Instead, by exploiting mathematical and statistical tools, I will show how to characterize Anycast-enabled CDNs (A-CDNs). I will show that A-CDNs are widely used either for stateless and stateful services. That A-CDNs are quite popular, as, more than 50% of web users contact an A-CDN every day. And that, stateful services, can benefit of A-CDNs, since their paths are very stable over time, as demonstrated by the presence of only a few anomalies in their Round Trip Time. Finally, I will conclude by showing how I used BGPStream an open-source software framework for the analysis of both historical and real-time Border Gateway Protocol (BGP) measurement data. By using BGPStream in real-time mode I will show how I detected a Multiple Origin AS (MOAS) event, and how I studies the black-holing community propagation, showing the effect of this community in the network. Then, by using BGPStream in historical mode, and the Apache Spark big data framework over 16 years of data, I will show different results such as the continuous growth of IPv4 prefixes, and the growth of MOAS events over time. All these studies have the aim of showing how monitoring is a fundamental task in different scenarios. In particular, highlighting the importance of machine learning and of big data methodologies

    Classification algorithms for Big Data with applications in the urban security domain

    Get PDF
    A classification algorithm is a versatile tool, that can serve as a predictor for the future or as an analytical tool to understand the past. Several obstacles prevent classification from scaling to a large Volume, Velocity, Variety or Value. The aim of this thesis is to scale distributed classification algorithms beyond current limits, assess the state-of-practice of Big Data machine learning frameworks and validate the effectiveness of a data science process in improving urban safety. We found in massive datasets with a number of large-domain categorical features a difficult challenge for existing classification algorithms. We propose associative classification as a possible answer, and develop several novel techniques to distribute the training of an associative classifier among parallel workers and improve the final quality of the model. The experiments, run on a real large-scale dataset with more than 4 billion records, confirmed the quality of the approach. To assess the state-of-practice of Big Data machine learning frameworks and streamline the process of integration and fine-tuning of the building blocks, we developed a generic, self-tuning tool to extract knowledge from network traffic measurements. The result is a system that offers human-readable models of the data with minimal user intervention, validated by experiments on large collections of real-world passive network measurements. A good portion of this dissertation is dedicated to the study of a data science process to improve urban safety. First, we shed some light on the feasibility of a system to monitor social messages from a city for emergency relief. We then propose a methodology to mine temporal patterns in social issues, like crimes. Finally, we propose a system to integrate the findings of Data Science on the citizenry’s perception of safety and communicate its results to decision makers in a timely manner. We applied and tested the system in a real Smart City scenario, set in Turin, Italy
    corecore