6 research outputs found

    Hypersparse Neural Network Analysis of Large-Scale Internet Traffic

    Full text link
    The Internet is transforming our society, necessitating a quantitative understanding of Internet traffic. Our team collects and curates the largest publicly available Internet traffic data containing 50 billion packets. Utilizing a novel hypersparse neural network analysis of "video" streams of this traffic using 10,000 processors in the MIT SuperCloud reveals a new phenomena: the importance of otherwise unseen leaf nodes and isolated links in Internet traffic. Our neural network approach further shows that a two-parameter modified Zipf-Mandelbrot distribution accurately describes a wide variety of source/destination statistics on moving sample windows ranging from 100,000 to 100,000,000 packets over collections that span years and continents. The inferred model parameters distinguish different network streams and the model leaf parameter strongly correlates with the fraction of the traffic in different underlying network topologies. The hypersparse neural network pipeline is highly adaptable and different network statistics and training models can be incorporated with simple changes to the image filter functions.Comment: 11 pages, 10 figures, 3 tables, 60 citations; to appear in IEEE High Performance Extreme Computing (HPEC) 201

    Inferring hidden features in the Internet (PhD thesis)

    Full text link
    The Internet is a large-scale decentralized system that is composed of thousands of independent networks. In this system, there are two main components, interdomain routing and traffic, that are vital inputs for many tasks such as traffic engineering, security, and business intelligence. However, due to the decentralized structure of the Internet, global knowledge of both interdomain routing and traffic is hard to come by. In this dissertation, we address a set of statistical inference problems with the goal of extending the knowledge of the interdomain-level Internet. In the first part of this dissertation we investigate the relationship between the interdomain topology and an individual network’s inference ability. We first frame the questions through abstract analysis of idealized topologies, and then use actual routing measurements and topologies to study the ability of real networks to infer traffic flows. In the second part, we study the ability of networks to identify which paths flow through their network. We first discuss that answering this question is surprisingly hard due to the design of interdomain routing systems where each network can learn only a limited set of routes. Therefore, network operators have to rely on observed traffic. However, observed traffic can only identify that a particular route passes through its network but not that a route does not pass through its network. In order to solve the routing inference problem, we propose a nonparametric inference technique that works quite accurately. The key idea behind our technique is measuring the distances between destinations. In order to accomplish that, we define a metric called Routing State Distance (RSD) to measure distances in terms of routing similarity. Finally, in the third part, we study our new metric, RSD in detail. Using RSD we address an important and difficult problem of characterizing the set of paths between networks. The collection of the paths across networks is a great source to understand important phenomena in the Internet as path selections are driven by the economic and performance considerations of the networks. We show that RSD has a number of appealing properties that can discover these hidden phenomena

    Centrality measures and analyzing dot-product graphs

    Full text link
    In this thesis we investigate two topics in data mining on graphs; in the first part we investigate the notion of centrality in graphs, in the second part we look at reconstructing graphs from aggregate information. In many graph related problems the goal is to rank nodes based on an importance score. This score is in general referred to as node centrality. In Part I. we start by giving a novel and more efficient algorithm for computing betweenness centrality. In many applications not an individual node but rather a set of nodes is chosen to perform some task. We generalize the notion of centrality to groups of nodes. While group centrality was first formally defined by Everett and Borgatti (1999), we are the first to pose it as a combinatorial optimization problem; find a group of k nodes with largest centrality. We give an algorithm for solving this optimization problem for a general notion of centrality that subsumes various instantiations of centrality that find paths in the graph. We prove that this problem is NP-hard for specific centrality definitions and we provide a universal algorithm for this problem that can be modified to optimize the specific measures. We also investigate the problem of increasing node centrality by adding or deleting edges in the graph. We conclude this part by solving the optimization problem for two specific applications; one for minimizing redundancy in information propagation networks and one for optimizing the expected number of interceptions of a group in a random navigational network. In the second part of the thesis we investigate what we can infer about a bipartite graph if only some aggregate information -- the number of common neighbors among each pair of nodes -- is given. First, we observe that the given data is equivalent to the dot-product of the adjacency vectors of each node. Based on this knowledge we develop an algorithm that is based on SVD-decomposition, that is capable of almost perfectly reconstructing graphs from such neighborhood data. We investigate two versions of this problem, in the versions the dot-product of nodes with themselves, e.g. the node degrees, are either known or hidden

    Macro- and microscopic analysis of the internet economy from network measurements

    Get PDF
    The growth of the Internet impacts multiple areas of the world economy, and it has become a permanent part of the economic landscape both at the macro- and at microeconomic level. On-line traffic and information are currently assets with large business value. Even though commercial Internet has been a part of our lives for more than two decades, its impact on global, and everyday, economy still holds many unknowns. In this work we analyse important macro- and microeconomic aspects of the Internet. First we investigate the characteristics of the interdomain traffic, which is an important part of the macroscopic economy of the Internet. Finally, we investigate the microeconomic phenomena of price discrimination in the Internet. At the macroscopic level, we describe quantitatively the interdomain traffic matrix (ITM), as seen from the perspective of a large research network. The ITM describes the traffic flowing between autonomous systems (AS) in the Internet. It depicts the traffic between the largest Internet business entities, therefore it has an important impact on the Internet economy. In particular, we analyse the sparsity and statistical distribution of the traffic, and observe that the shape of the statistical distribution of the traffic sourced from an AS might be related to congestion within the network. We also investigate the correlations between rows in the ITM. Finally, we propose a novel method to model the interdomain traffic, that stems from first-principles and recognizes the fact that the traffic is a mixture of different Internet applications, and can have regional artifacts. We present and evaluate a tool to generate such matrices from open and available data. Our results show that our first-principles approach is a promising alternative to the existing solutions in this area, which enables the investigation of what-if scenarios and their impact on the Internet economy. At the microscopic level, we investigate the rising phenomena of price discrimination (PD). We find empirical evidences that Internet users can be subject to price and search discrimination. In particular, we present examples of PD on several ecommerce websites and uncover the information vectors facilitating PD. Later we show that crowd-sourcing is a feasible method to help users to infer if they are subject to PD. We also build and evaluate a system that allows any Internet user to examine if she is subject to PD. The system has been deployed and used by multiple users worldwide, and uncovered more examples of PD. The methods presented in the following papers are backed with thorough data analysis and experiments.Internet es hoy en día un elemento crucial en la economía mundial, su constante crecimiento afecta directamente múltiples aspectos tanto a nivel macro- como a nivel microeconómico. Entre otros aspectos, el tráfico de red y la información que transporta se han convertido en un producto de gran valor comercial para cualquier empresa. Sin embargo, más de dos decadas después de su introducción en nuestras vidas y siendo un elemento de vital importancia, el impacto de Internet en la economía global y diaria es un tema que alberga todavía muchas incógnitas que resolver. En esta disertación analizamos importantes aspectos micro y macroeconómicos de Internet. Primero, investigamos las características del tráfico entre Sistemas Autónomos (AS), que es un parte decisiva de la macroeconomía de Internet. A continuacin, estudiamos el controvertido fenómeno microeconómico de la discriminación de precios en Internet. A nivel macroeconómico, mostramos cuantitatívamente la matriz del tráfico entre AS ("Interdomain Traffic Matrix - ITM"), visto desde la perspectiva de una gran red científica. La ITM obtenida empíricamente muestra la cantidad de tráfico compartido entre diferentes AS, las entidades más grandes en Internet, siendo esto uno de los principales aspectos a evaluar en la economiá de Internet. Esto nos permite por ejemplo, analizar diferentes propiedades estadísticas del tráfico para descubrir si la distribución del tráfico producido por un AS está directamente relacionado con la congestión dentro de la red. Además, este estudio también nos permite investigar las correlaciones entre filas de la ITM, es decir, entre diferentes AS. Por último, basándonos en el estudio empírico, proponemos una innovadora solución para modelar el tráfico en una ITM, teniendo en cuenta que el tráfico modelado es dependiente de las particularidades de cada escenario (e.g., distribución de apliaciones, artefactos). Para obtener resultados representativos, la herramienta propuesta para crear estas matrices es evaluada a partir de conjuntos de datos abiertos, disponibles para toda la comunidad científica. Los resultados obtenidos muestran que el método propuesto es una prometedora alternativa a las soluciones de la literatura. Permitiendo así, la nueva investigación de escenarios desconocidos y su impacto en la economía de Internet. A nivel microeconómico, en esta tesis investigamos el fenómeno de la discriminación de precios en Internet ("price discrimination" - PD). Nuestros estudios permiten mostrar pruebas empíricas de que los usuarios de Internet están expuestos a discriminación de precios y resultados de búsquedas. En particular, presentamos ejemplos de PD en varias páginas de comercio electrónico y descubrimos que informacin usan para llevarlo a cabo. Posteriormente, mostramos como una herramienta crowdsourcing puede ayudar a la comunidad de usuarios a inferir que páginas aplican prácticas de PD. Con el objetivo de mitigar esta cada vez más común práctica, publicamos y evaluamos una herramienta que permite al usuario deducir si está siendo víctima de PD. Esta herramienta, con gran repercusión mediática, ha sido usada por multitud de usuarios alrededor del mundo, descubriendo así más ejemplos de discriminación. Por último remarcar que todos los metodos presentados en esta disertación están respaldados por rigurosos análisis y experimentos

    Inferring invisible traffic

    No full text
    corecore