77 research outputs found

    Machine Learning and Big Data Methodologies for Network Traffic Monitoring

    Get PDF
    Over the past 20 years, the Internet saw an exponential grown of traffic, users, services and applications. Currently, it is estimated that the Internet is used everyday by more than 3.6 billions users, who generate 20 TB of traffic per second. Such a huge amount of data challenge network managers and analysts to understand how the network is performing, how users are accessing resources, how to properly control and manage the infrastructure, and how to detect possible threats. Along with mathematical, statistical, and set theory methodologies machine learning and big data approaches have emerged to build systems that aim at automatically extracting information from the raw data that the network monitoring infrastructures offer. In this thesis I will address different network monitoring solutions, evaluating several methodologies and scenarios. I will show how following a common workflow, it is possible to exploit mathematical, statistical, set theory, and machine learning methodologies to extract meaningful information from the raw data. Particular attention will be given to machine learning and big data methodologies such as DBSCAN, and the Apache Spark big data framework. The results show that despite being able to take advantage of mathematical, statistical, and set theory tools to characterize a problem, machine learning methodologies are very useful to discover hidden information about the raw data. Using DBSCAN clustering algorithm, I will show how to use YouLighter, an unsupervised methodology to group caches serving YouTube traffic into edge-nodes, and latter by using the notion of Pattern Dissimilarity, how to identify changes in their usage over time. By using YouLighter over 10-month long races, I will pinpoint sudden changes in the YouTube edge-nodes usage, changes that also impair the end users’ Quality of Experience. I will also apply DBSCAN in the deployment of SeLINA, a self-tuning tool implemented in the Apache Spark big data framework to autonomously extract knowledge from network traffic measurements. By using SeLINA, I will show how to automatically detect the changes of the YouTube CDN previously highlighted by YouLighter. Along with these machine learning studies, I will show how to use mathematical and set theory methodologies to investigate the browsing habits of Internauts. By using a two weeks dataset, I will show how over this period, the Internauts continue discovering new websites. Moreover, I will show that by using only DNS information to build a profile, it is hard to build a reliable profiler. Instead, by exploiting mathematical and statistical tools, I will show how to characterize Anycast-enabled CDNs (A-CDNs). I will show that A-CDNs are widely used either for stateless and stateful services. That A-CDNs are quite popular, as, more than 50% of web users contact an A-CDN every day. And that, stateful services, can benefit of A-CDNs, since their paths are very stable over time, as demonstrated by the presence of only a few anomalies in their Round Trip Time. Finally, I will conclude by showing how I used BGPStream an open-source software framework for the analysis of both historical and real-time Border Gateway Protocol (BGP) measurement data. By using BGPStream in real-time mode I will show how I detected a Multiple Origin AS (MOAS) event, and how I studies the black-holing community propagation, showing the effect of this community in the network. Then, by using BGPStream in historical mode, and the Apache Spark big data framework over 16 years of data, I will show different results such as the continuous growth of IPv4 prefixes, and the growth of MOAS events over time. All these studies have the aim of showing how monitoring is a fundamental task in different scenarios. In particular, highlighting the importance of machine learning and of big data methodologies

    A Systems Approach to Minimize Wasted Work in Blockchains

    Get PDF
    Blockchain systems and distributed ledgers are getting increasing attention since the release of Bitcoin. Everyday they make headlines in the news involving economists, scientists, and technologists. The technology invented by Satoshi Nakamoto gave to the world a quantum leap in the fields of distributed systems and digital currencies. Even so, there are still some problems regarding the architecture in most existing blockchain systems. One of the main challenges in these systems is the structure of the network topology and how peers disseminate messages between them, this leads to problems regarding the system scalability and the efficiency of the transaction and blocks propagation, wasting computational power, energy and network resources. In this work we propose a novel solution to tackle these limitations. We propose the design of membership and message dissemination protocols, based on the state-ofart, that will boost the efficiency of the overlay network that support the interactions between miners, reducing the number of exchanged messages and the used bandwidth. This solution also reduces the computational power and energy consumed across all nodes in the network, since the nodes avoid to process redundant network messages, and, becoming aware of mined blocks faster, avoid to perform computations over an outdated chain configuration

    Methods for High-Dimensional Spatial Data: Dimension Reduction and Covariance Approximation

    Get PDF
    In spatial statistics, because quantities are correlated based on their relative positions in space, data is modeled as a single realization of a multivariate stochastic process. Spatial data can be high-dimensional either through a large number of observed variables per location, or through a large number of observed locations. The two are often handled differently, with the former addressed through dimension reduction and the latter addressed through appropriate modeling of the spatial correlation between locations. The main body of this dissertation is a three-part work. Parts 2 and 3 pertain to the many variables problem, proposing novel methods of dimension reduction for spatial data. Part 4 pertains to the many locations problem, using state-of-the-art techniques to analyze a massive satellite data set, improving on the current usage of the data

    Mining interesting events on large and dynamic data

    Get PDF
    Nowadays, almost every human interaction produces some form of data. These data are available either to every user, e.g.~images uploaded on Flickr or to users with specific privileges, e.g.~transactions in a bank. The huge amount of these produced data can easily overwhelm humans that try to make sense out of it. The need for methods that will analyse the content of the produced data, identify emerging topics in it and present the topics to the users has emerged. In this work, we focus on emerging topics identification over large and dynamic data. More specifically, we analyse two types of data: data published in social networks like Twitter, Flickr etc.~and structured data stored in relational databases that are updated through continuous insertion queries. In social networks, users post text, images or videos and annotate each of them with a set of tags describing its content. We define sets of co-occurring tags to represent topics and track the correlations of co-occurring tags over time. We split the tags to multiple nodes and make each node responsible of computing the correlations of its assigned tags. We implemented our approach in Storm, a distributed processing engine, and conducted a user study to estimate the quality of our results. In structured data stored in relational databases, top-k group-by queries are defined and an emerging topic is considered to be a change in the top-k results. We maintain the top-k result sets in the presence of updates minimising the interaction with the underlying database. We implemented and experimentally tested our approach.Heutzutage entstehen durch fast jede menschliche Aktion und Interaktion Daten. Fotos werden auf Flickr bereitgestellt, Neuigkeiten über Twitter verbreitet und Kontakte in Linkedin und Facebook verwaltet; neben traditionellen Vorgängen wie Banktransaktionen oder Flugbuchungen, die Änderungen in Datenbanken erzeugen. Solch eine riesige Menge an Daten kann leicht überwältigend sein bei dem Versuch die Essenz dieser Daten zu extrahieren. Neue Methoden werden benötigt, um Inhalt der Daten zu analysieren, neu entstandene Themen zu identifizieren und die so gewonnenen Erkenntnisse dem Benutzer in einer übersichtlichen Art und Weise zu präsentieren. In dieser Arbeit werden Methoden zur Identifikation neuer Themen in großen und dynamischen Datenmengen behandelt. Dabei werden einerseits die veröffentlichten Daten aus sozialen Netzwerken wie Twitter und Flickr und andererseits strukturierte Daten aus relationalen Datenbanken, welche kontinuierlich aktualisiert werden, betrachtet. In sozialen Netzwerken stellen die Benutzer Texte, Bilder oder Videos online und beschreiben diese für andere Nutzer mit Schlagworten, sogenannten Tags. Wir interpretieren Gruppen von zusammen auftretenden Tags als eine Art Thema und verfolgen die Beziehung bzw. Korrelation dieser Tags über einen gewissen Zeitraum. Abrupte Anstiege in der Korrelation werden als Hinweis auf Trends aufgefasst. Die eigentlich Aufgabe, das Zählen von zusammen auftretenden Tags zur Berechnung von Korrelationsmaßen, wird dabei auf eine Vielzahl von Computerknoten verteilt. Die entwickelten Algorithmen wurden in Storm, einem neuartigen verteilten Datenstrommanagementsystem, implementiert und bzgl. Lastbalancierung und anfallender Netzwerklast sorgfältig evaluiert. Durch eine Benutzerstudie wird darüber hinaus gezeigt, dass die Qualität der gewonnenen Trends höher ist als die Qualität der Ergebnisse bestehender Systeme. In strukturierten Daten von relationalen Datenbanksystemen werden Beste-k Ergebnislisten durch Aggregationsanfragen in SQL definiert. Interessant dabei sind eintretende Änderungen in diesen Listen, was als Ereignisse (Trends) aufgefasst wird. In dieser Arbeit werden Methoden präsentiert diese Ergebnislisten möglichst effizient instand zu halten, um Interaktionen mit der eigentlichen Datenbank zu minimieren

    Monitoring Internet censorship: the case of UBICA

    Get PDF
    As a consequence of the recent debate about restrictions in the access to content on the Internet, a strong motivation has arisen for censorship monitoring: an independent, publicly available and global watch on Internet censorship activities is a necessary goal to be pursued in order to guard citizens' right of access to information. Several techniques to enforce censorship on the Internet are known in literature, differing in terms of transparency towards the user, selectivity in blocking specific resources or whole groups of services, collateral effects outside the administrative borders of their intended application. Monitoring censorship is also complicated by the dynamic nature of multiple aspects of this phenomenon, the number and diversity of resources targeted by censorship and its global scale. In the present Thesis an analysis of literature on internet censorship and available solutions for censorship detection has been performed, characterizing censorship enforcement techniques and censorship detection techniques and tools. The available platforms and tools for censorship detection have been found falling short of providing a comprehensive monitoring platform able to manage a diverse set of measurement vantage points and a reporting interface continuously updated with the results of automated censorship analysis. The candidate proposes a design of such a platform, UBICA, along with a prototypical implementation whose effectiveness has been experimentally validated in global monitoring campaigns. The results of the validation are discussed, confirming the effectiveness of the proposed design and suggesting future enhancements and research

    On the dynamics of interdomain routing in the Internet

    Full text link
    The routes used in the Internet's interdomain routing system are a rich information source that could be exploited to answer a wide range of questions.  However, analyzing routes is difficult, because the fundamental object of study is a set of paths. In this dissertation, we present new analysis tools -- metrics and methods -- for analyzing paths, and apply them to study interdomain routing in the Internet over long periods of time. Our contributions are threefold. First, we build on an existing metric (Routing State Distance) to define a new metric that allows us to measure the similarity between two prefixes with respect to the state of the global routing system. Applying this metric over time yields a measure of how the set of paths to each prefix varies at a given timescale. Second, we present PathMiner, a system to extract large scale routing events from background noise and identify the AS (Autonomous System) or AS-link most likely responsible for the event. PathMiner is distinguished from previous work in its ability to identify and analyze large-scale events that may re-occur many times over long timescales. We show that it is scalable, being able to extract significant events from multiple years of routing data at a daily granularity. Finally, we equip Routing State Distance with a new set of tools for identifying and characterizing unusually-routed ASes. At the micro level, we use our tools to identify clusters of ASes that have the most unusual routing at each time. We also show that analysis of individual ASes can expose business and engineering strategies of the organizations owning the ASes.  These strategies are often related to content delivery or service replication. At the macro level, we show that the set of ASes with the most unusual routing defines discernible and interpretable phases of the Internet's evolution. Furthermore, we show that our tools can be used to provide a quantitative measure of the "flattening" of the Internet

    The case of Ferbritas Cadastre Information System

    Get PDF
    The processes of mobilization of land for infrastructures of public and private domain are developed according to proper legal frameworks and systematically confronted with the impoverished national situation as regards the cadastral identification and regularization, which leads to big inefficiencies, sometimes with very negative impact to the overall effectiveness. This project report describes Ferbritas Cadastre Information System (FBSIC) project and tools, which in conjunction with other applications, allow managing the entire life-cycle of Land Acquisition and Cadastre, including support to field activities with the integration of information collected in the field, the development of multi-criteria analysis information, monitoring all information in the exploration stage, and the automated generation of outputs. The benefits are evident at the level of operational efficiency, including tools that enable process integration and standardization of procedures, facilitate analysis and quality control and maximize performance in the acquisition, maintenance and management of registration information and expropriation (expropriation projects). Therefore, the implemented system achieves levels of robustness, comprehensiveness, openness, scalability and reliability suitable for a structural platform. The resultant solution, FBSIC, is a fit-for-purpose cadastre information system rooted in the field of railway infrastructures. FBSIC integrating nature of allows: to accomplish present needs and scale to meet future services; to collect, maintain, manage and share all information in one common platform, and transform it into knowledge; to relate with other platforms; to increase accuracy and productivity of business processes related with land property management
    • …
    corecore